Command reference

This command launches freeports from the command line. To get contextual help, use the --help option, shortened to -h:

freeports -h

Important

Before using the command, please read the validation section.

The behaviour of the script is controlled by options. These can be specified in three different ways that overwrite each other, listed here from highest to lowest priority:

  1. command line options

  2. environment variables

  3. configuration file

In this way, options specified on the command line have the highest priority of the three. An option that is not specified defaults to the value defined in the conf_parse submodule. The defaults are overwritten by the options specified in the configuration file, then the result is overwritten by the environment variables, then by the command line options and finally, in BATCH MODE, by the job contextual options. The options available to be overwritten, and how, are documented in the respective reference pages.

Once specified, the options are overwritten as described above and then checked as described in the section about validation. Each method of overwriting also has its own validation mechanism, documented in the respective page and applied before the validation of the resulting configuration. Each of the three methods modifies the value of the same option in the program.
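The precedence described above can be sketched as a layered lookup. This is a minimal illustration, not the program's actual implementation: the function name resolve_options and the sample values are hypothetical, and the real defaults live in the conf_parse submodule.

```python
from collections import ChainMap

# Hypothetical values for illustration only; the real defaults are in conf_parse.
defaults = {"VERBOSITY": 2, "SAVE_PDF": True, "SEPARATE_OUT_FILES": False}
config_file = {"VERBOSITY": 3}
environment = {"VERBOSITY": 1, "SAVE_PDF": False}
command_line = {"VERBOSITY": 4}


def resolve_options(defaults, config_file, environment, command_line, job=None):
    """Merge the option layers; earlier maps in the ChainMap win over later ones."""
    layers = [command_line, environment, config_file, defaults]
    if job:
        # Job contextual options apply only in BATCH MODE and are applied last.
        layers.insert(0, job)
    return dict(ChainMap(*layers))


resolved = resolve_options(defaults, config_file, environment, command_line)
print(resolved["VERBOSITY"])  # 4, taken from the command line
print(resolved["SAVE_PDF"])   # False, taken from the environment
```

A ChainMap is used here only because it makes the lookup order explicit; successive dict updates would give the same result.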

The options

Option              Type        Description                                                 Default
VERBOSITY           int         Sets how verbose the program is in the terminal             2
BATCH               Path        If set to the path of a batch file, it triggers BATCH MODE
N_WORKERS           int         Number of parallel processes in BATCH MODE                  os.process_cpu_count()
OUT_CSV             Path        File where to output the csv                                /dev/stdout
SAVE_PDF            bool        If set and URL is specified, it saves the input pdf         True
URL                 str         URL of the pdf to take as input
PDF                 Path        Path to a local pdf
FORMAT              PdfFormats  Format used to parse the pdf document
CONFIG_FILE         Path        Custom config file location                                 Calculated dynamically
SEPARATE_OUT_FILES  bool        In BATCH_MODE, do not merge the results of the batch        False
PREFIX_OUT          str         In BATCH_MODE, define an id for the different outputs

VERBOSITY

This value goes from 0 to 5: 0 indicates the minimum verbosity, called CRITICAL VERBOSITY, 4 indicates the maximum verbosity, also called DEBUG VERBOSITY, and 5 is the NOTSET VERBOSITY. The other levels have the meaning used by the Python logging package:

freeports  logging
0          logging.CRITICAL
1          logging.ERROR
2          logging.WARNING
3          logging.INFO
4          logging.DEBUG
5          logging.NOTSET
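The correspondence between VERBOSITY values and logging levels can be written as a lookup table. The helper name level_for is hypothetical and only illustrates the mapping; it is not taken from the freeports source.

```python
import logging

# freeports VERBOSITY value -> level of the Python logging package.
VERBOSITY_TO_LEVEL = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
    5: logging.NOTSET,
}


def level_for(verbosity: int) -> int:
    """Return the logging level for a VERBOSITY value (hypothetical helper)."""
    if verbosity not in VERBOSITY_TO_LEVEL:
        raise ValueError(f"VERBOSITY must be between 0 and 5, got {verbosity}")
    return VERBOSITY_TO_LEVEL[verbosity]


# Configure the root logger with the default VERBOSITY of 2 (WARNING).
logging.basicConfig(level=level_for(2))
```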

URL, PDF and SAVE_PDF

Either URL or PDF, or both, has to be specified, either directly or, in BATCH_MODE, through the job contextual options. If URL is specified, the program uses the pdf resource corresponding to the URL; if PDF is specified, it loads a pdf file from the local filesystem. If both are specified, it tries to load the file from local storage and then falls back to the URL. If both are specified and SAVE_PDF is True, and the file is not present locally, the program downloads it and saves it on disk with the name indicated by the PDF option.
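The decision between the local file and the URL can be sketched as a small pure function. The name choose_pdf_source and its return labels are hypothetical, and the function only decides what would happen, without downloading anything. Two points are assumptions, since the text does not cover them: a missing local file with no URL raises an error, and the save location when only URL is given is not modelled.

```python
from pathlib import Path
from typing import Optional


def choose_pdf_source(url: Optional[str], pdf: Optional[Path], save_pdf: bool) -> str:
    """Return a label describing where the pdf would be read from."""
    if url is None and pdf is None:
        raise ValueError("at least one of URL or PDF must be specified")
    if pdf is not None and pdf.exists():
        # A local file is always preferred when it is present.
        return "local"
    if url is None:
        # Assumption: with no URL to fall back to, a missing file is an error.
        raise FileNotFoundError(f"{pdf} does not exist and no URL was given")
    if pdf is not None and save_pdf:
        # Both options set, file missing locally: download and save it as PDF.
        return "download-and-save"
    return "download"
```

For example, choose_pdf_source("https://example.org/report.pdf", Path("report.pdf"), True) returns "local" if report.pdf already exists and "download-and-save" otherwise.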

OUT_CSV

When not in BATCH MODE it indicates where to output the resulting csv file parsed from the pdf document.

Note

The OUT_CSV default on Windows systems is CON

FORMAT

It indicates which algorithm to use to parse the pdf; these algorithms are called the ‘formats’ of the pdf reports. This variable is mandatory if no URL is provided. If a URL is provided, the format is inferred using a mapping file that maps URL regular expressions to formats. The file is called format_url_mapping.yaml in the source code.
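Inferring the format from the URL amounts to trying each regular expression in turn. The sketch below is only an illustration: the patterns and the format names FORMAT_A and FORMAT_B are made up, the mapping is an in-memory dict rather than the real format_url_mapping.yaml, and the layout of that file is not assumed here.

```python
import re
from typing import Optional

# Hypothetical patterns and format names, for illustration only; the real ones
# are defined in format_url_mapping.yaml and are not reproduced here.
URL_FORMAT_MAPPING = {
    r"^https://example\.org/reports/.*\.pdf$": "FORMAT_A",
    r"^https://example\.com/annual/.*$": "FORMAT_B",
}


def infer_format(url: str) -> Optional[str]:
    """Return the format of the first pattern that matches url, or None."""
    for pattern, pdf_format in URL_FORMAT_MAPPING.items():
        if re.search(pattern, url):
            return pdf_format
    return None
```

When no pattern matches, this sketch returns None, in which case FORMAT would have to be given explicitly.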

CONFIG_FILE

This option indicates the config file loaded to overwrite the default options. It can only be specified through an environment variable or a command line argument, and it is evaluated before any other option.

N_WORKERS

Integer that represents the number of processes spawned (if not set, it defaults to the number of available CPUs). In BATCH_MODE it indicates the number of processes to spawn concurrently to parallelize the processing of different files. When not in BATCH_MODE, the program divides the pdf document into sections of pages and parallelizes the processing within the document.
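How the pages are divided into sections is not specified here. As one possible scheme, assumed purely for illustration, the pages could be split into contiguous ranges of nearly equal size, one per worker. The function name split_pages is hypothetical.

```python
def split_pages(n_pages: int, n_workers: int) -> list[range]:
    """Split page indices 0..n_pages-1 into at most n_workers contiguous ranges."""
    if n_pages < 0 or n_workers < 1:
        raise ValueError("n_pages must be >= 0 and n_workers must be >= 1")
    n_chunks = min(n_workers, n_pages)
    if n_chunks == 0:
        return []
    base, extra = divmod(n_pages, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

With 10 pages and 4 workers this gives sections of 3, 3, 2 and 2 pages. Each range could then be handed to a separate process.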

Validation of resulting configuration

Each way of specifying options has its own algorithm to validate the user’s choice, but after those checks a consistency check is performed on the resulting configuration. The most important checks are:

  • In BATCH_MODE OUT_CSV is the name of an archive or of a directory

  • After the job contextual options are applied, at least one of PDF and URL is defined

BATCH_MODE

This mode processes several files at once, in parallel. It is characterized by the BATCH variable being set to a batch csv file and by the possibility of setting SEPARATE_OUT_FILES to True (in this case OUT_CSV should be a directory name or the name of a .tar.gz archive to create). The batch csv file is a csv file whose headers indicate which options of the resulting configuration to overwrite. These options are called job contextual options, and each row of the csv file is called a job. The available overwritable options are:

Header      Overwritten option
url         URL
save pdf    SAVE_PDF
pdf         PDF
format      FORMAT
prefix out  See below

The header is case insensitive, so for example url, URL and Url are considered the same header. Boolean cells are cast to True if the csv value is one of (case insensitive) true, on, yes, y, t, 1, and to False if it is one of false, off, no, n, f, 0.
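The header matching and boolean casting rules can be sketched as follows. The names parse_bool and parse_job are hypothetical, and three behaviours are assumptions of this sketch rather than documented facts: surrounding whitespace is stripped, unknown headers and empty cells are ignored, and an unrecognised boolean raises an error.

```python
import csv
import io

TRUE_VALUES = {"true", "on", "yes", "y", "t", "1"}
FALSE_VALUES = {"false", "off", "no", "n", "f", "0"}

# Batch file header (lower case) -> overwritten option.
HEADER_TO_OPTION = {
    "url": "URL",
    "save pdf": "SAVE_PDF",
    "pdf": "PDF",
    "format": "FORMAT",
    "prefix out": "PREFIX_OUT",
}


def parse_bool(cell: str) -> bool:
    """Cast a csv cell to bool using the case-insensitive value lists above."""
    value = cell.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean value: {cell!r}")


def parse_job(row: dict) -> dict:
    """Turn one batch csv row into job contextual options."""
    job = {}
    for header, cell in row.items():
        option = HEADER_TO_OPTION.get(header.strip().lower())
        if option is None or not cell:
            continue  # assumption: unknown headers and empty cells are skipped
        job[option] = parse_bool(cell) if option == "SAVE_PDF" else cell
    return job


# Made-up batch file content with mixed-case headers.
batch_text = "URL,Save PDF,Prefix Out\nhttps://example.org/a.pdf,Yes,2024-q1\n"
jobs = [parse_job(row) for row in csv.DictReader(io.StringIO(batch_text))]
```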

OUT_CSV and prefix out

In BATCH_MODE there are two output profiles: the standard one is a single csv, and the non-standard one is separate files (the choice is made by setting SEPARATE_OUT_FILES to False or True, respectively). The prefix out cell of the batch file sets the PREFIX_OUT option. When writing to a single file, a Format column indicates the format used to parse each pdf report. To identify precisely which line of the batch file generated the data, set PREFIX_OUT to a string, which is added to a column called Report identifier.

Tip

Set PREFIX_OUT to something meaningful that distinguishes the input document, for example the publication date of the pdf and the institution that created the report.

When writing to separate files, OUT_CSV has to be a directory or a .tar.gz archive. The program creates, if it doesn’t exist, a directory named OUT_CSV if it is not an archive, or named after the archive without the .tar.gz extension otherwise. For each job it saves an output file named {PREFIX_OUT}-{FORMAT}.csv, or just {FORMAT}.csv if the prefix is absent or empty. If OUT_CSV was specified as an archive, the directory is compressed into a .tar.gz archive. If the directory didn’t exist and an archive is created, the directory is deleted from the filesystem after the archive has been created.
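The naming rules for the output directory and the per-job files can be sketched with two small helpers. The function names output_directory and job_file_name are hypothetical and not taken from the freeports source; the sketch only computes names and does not create, compress or delete anything.

```python
from pathlib import Path
from typing import Optional

ARCHIVE_SUFFIX = ".tar.gz"


def output_directory(out_csv: str) -> Path:
    """Directory holding the per-job files: OUT_CSV without a .tar.gz suffix."""
    if out_csv.endswith(ARCHIVE_SUFFIX):
        return Path(out_csv[: -len(ARCHIVE_SUFFIX)])
    return Path(out_csv)


def job_file_name(pdf_format: str, prefix_out: Optional[str] = None) -> str:
    """File name for one job: {PREFIX_OUT}-{FORMAT}.csv, or {FORMAT}.csv without prefix."""
    if prefix_out:
        return f"{prefix_out}-{pdf_format}.csv"
    return f"{pdf_format}.csv"
```

For instance, with OUT_CSV set to results.tar.gz, a job with PREFIX_OUT 2024-q1 and a format named FORMAT_A (a made-up name) would be written to results/2024-q1-FORMAT_A.csv before the directory is compressed.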