Command reference
This command launches freeports from the command line. For contextual help, use the --help option, which can be shortened to -h:
freeports -h
Important
Before using the command, please read the section on validation.
The behaviour of the script is controlled by options, which can be specified in three different ways. These overwrite each other according to the following hierarchy, from highest to lowest priority:
command line options
environment variables
configuration file
An option that is not specified defaults to the value defined in the conf_parse submodule. The defaults are overwritten by the options specified in the configuration file, the result is then overwritten by the environment variables, then by the command line options and finally, when in BATCH_MODE, by the job contextual options. Options given on the command line therefore have the highest priority of the three methods and are overwritten only by job contextual options. The options that can be overwritten, and how, are documented in the respective reference pages:
Reference config pages:
Once specified, the options are overwritten as described in the section about validation. Each method of overwriting also has its own validation mechanism, documented in the respective page and applied before the resulting configuration is validated. Each of these three ways of setting parameters modifies the value of the option in the program.
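The overwriting order described above can be illustrated with a minimal sketch. It is not the code of the conf_parse submodule: the function name resolve_options, the dictionaries and the example values are hypothetical, and treating None as "not specified" is an assumption.

```python
from collections import ChainMap


def resolve_options(defaults, config_file, environment, command_line, job=None):
    """Merge the option layers; layers listed first take precedence."""
    layers = [command_line, environment, config_file, defaults]
    if job is not None:
        # In BATCH_MODE the job contextual options are applied last,
        # so they win over every other layer.
        layers.insert(0, job)
    # Assumption: a value of None means "not specified", so it is dropped
    # and the lookup falls through to the next layer.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    return dict(ChainMap(*cleaned))


# Hypothetical values, used only to show the precedence.
defaults = {"VERBOSITY": 2, "FORMAT": None, "URL": None}
resolved = resolve_options(
    defaults,
    config_file={"VERBOSITY": 3, "FORMAT": "EXAMPLE_FORMAT"},
    environment={"VERBOSITY": 4},
    command_line={"URL": "https://example.com/report.pdf"},
)
print(resolved["VERBOSITY"])  # the environment variable beats the config file
```

ChainMap returns the value from the first layer that defines a key, which matches the idea that higher-priority sources hide lower-priority ones without modifying them.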
The options
| Option | Type | Description | Default |
|---|---|---|---|
| | | Sets how verbose the program is in the terminal | 2 |
| | | If set to path of batch file, it triggers | |
| | | Number of parallel processes in | |
| | | File where to output | |
| | | If set and | |
| | | Url of the pdf to take as input | |
| | | Path to local pdf | |
| | | Format to parse the pdf document | |
| | | Custom config file location | Calculated dynamically |
| | | In | |
| | | In | |
VERBOSITY
This value goes from 0 to 5: 0 indicates the minimum verbosity, called CRITICAL VERBOSITY, 4 indicates the maximum verbosity, also called DEBUG VERBOSITY, and 5 is the NOTSET VERBOSITY.
The meanings of the other levels are those used by the Python logging package:
| freeports | |
|---|---|
| 0 | |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
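As an illustration, the correspondence with the Python logging package could look like the sketch below. Only levels 0, 4 and 5 are named in the text above; the entries for 1, 2 and 3 are an assumption based on the standard ordering of the logging levels, and the names VERBOSITY_TO_LOGGING and logging_level are hypothetical.

```python
import logging

# Levels 0, 4 and 5 are named in the text; 1 to 3 are assumed to follow
# the standard ordering of the logging package.
VERBOSITY_TO_LOGGING = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
    5: logging.NOTSET,
}


def logging_level(verbosity: int) -> int:
    """Translate a verbosity between 0 and 5 into a logging level."""
    if verbosity not in VERBOSITY_TO_LOGGING:
        raise ValueError(f"VERBOSITY must be between 0 and 5, got {verbosity!r}")
    return VERBOSITY_TO_LOGGING[verbosity]


# The documented default verbosity is 2.
logging.basicConfig(level=logging_level(2))
```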
URL, PDF and SAVE_PDF
At least one of URL and PDF has to be specified, either directly or, in BATCH_MODE, through job contextual options overwriting.
If URL is specified, the program uses the pdf resource at that url; if PDF is specified, it loads a pdf file from the local filesystem. If both are specified, it tries to load the file from local storage first and falls back to the url.
If both are specified and SAVE_PDF is True, a file that is not present locally is downloaded and saved to disk under the name indicated by the PDF option.
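The lookup order for the input document can be sketched as follows. This is an illustration under assumptions, not the program's implementation: load_pdf, fetch_url and the download parameter are hypothetical names, and the error raised when nothing can be loaded is a guess.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Download the resource at url and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def load_pdf(url=None, pdf=None, save_pdf=False, download=fetch_url) -> bytes:
    """Return the pdf bytes, preferring the local file over the url."""
    if url is None and pdf is None:
        raise ValueError("at least one of URL and PDF must be specified")
    # A local file, when present, always wins.
    if pdf is not None and Path(pdf).is_file():
        return Path(pdf).read_bytes()
    if url is None:
        raise FileNotFoundError(pdf)
    # Fall back to the url.
    data = download(url)
    # With SAVE_PDF set, keep a local copy under the name given by PDF.
    if pdf is not None and save_pdf:
        Path(pdf).write_bytes(data)
    return data
```

Passing the downloader as a parameter keeps the sketch testable without network access.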
OUT_CSV
When not in BATCH_MODE, it indicates where to write the csv file produced by parsing the pdf document.
Note
The OUT_CSV default on Windows systems is CON
FORMAT
It indicates which algorithm to use to parse the pdf; these algorithms are called the ‘formats’ of the pdf reports.
This variable is mandatory if no URL is provided. If a URL is provided, the format is inferred using
a mapping file that maps url regular expressions to formats. The file is called format_url_mapping.yaml in the source code.
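The inference of the format from the url can be pictured with the sketch below. The structure of format_url_mapping.yaml is not described here, so the mapping is written as an in-memory dictionary with hypothetical patterns and format names. That an explicitly specified FORMAT takes precedence over the inferred one is also an assumption.

```python
import re

# Hypothetical patterns and format names, standing in for the content of
# format_url_mapping.yaml, whose real structure is not shown here.
URL_FORMAT_MAPPING = {
    r"^https://example\.com/reports/.*\.pdf$": "EXAMPLE_FORMAT_A",
    r"^https://example\.org/": "EXAMPLE_FORMAT_B",
}


def infer_format(url=None, explicit_format=None, mapping=URL_FORMAT_MAPPING):
    """Return the format to use, inferring it from the url when needed."""
    # Assumption: an explicitly specified FORMAT is used as given.
    if explicit_format is not None:
        return explicit_format
    if url is None:
        raise ValueError("FORMAT is mandatory when no URL is provided")
    # The first regular expression that matches the url decides the format.
    for pattern, fmt in mapping.items():
        if re.search(pattern, url):
            return fmt
    raise ValueError(f"no format could be inferred for {url!r}")
```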
CONFIG_FILE
This option indicates the config file that is loaded to overwrite the default options. It can only be specified through an environment variable or a command line argument, and it is evaluated before any other option.
N_WORKERS
Integer that represents the number of processes spawned (if not set, it defaults to the number of available CPUs).
When in BATCH_MODE, it indicates the number of processes to spawn concurrently to parallelize the
processing of different files. When not in BATCH_MODE, the program divides the pdf document into
sections of pages and parallelizes the processing within the document.
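One possible way to divide a document into sections of pages is sketched below. How the program actually splits the pages is not documented here, so the contiguous, near-equal split and the name split_pages are assumptions made for illustration.

```python
import os


def split_pages(n_pages: int, n_workers=None):
    """Split page indices 0..n_pages-1 into contiguous, near-equal sections."""
    # As described above, default to the number of available CPUs.
    workers = n_workers or os.cpu_count() or 1
    # Never create more sections than pages, and always at least one.
    workers = max(1, min(workers, n_pages))
    base, extra = divmod(n_pages, workers)
    sections = []
    start = 0
    for index in range(workers):
        # The first `extra` sections take one additional page.
        size = base + (1 if index < extra else 0)
        sections.append(range(start, start + size))
        start += size
    return sections


print(split_pages(10, 3))  # [range(0, 4), range(4, 7), range(7, 10)]
```

Each section could then be handed to one worker process, for example through multiprocessing.Pool.map.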
Validation of resulting configuration
Each way of specifying options has its own algorithm to validate the user’s choice, but after those checks a consistency check is performed on the resulting configuration. Notably, the most important checks are:
In BATCH_MODE, OUT_CSV is the name of an archive or of a directory
After job contextual options overwriting, at least one of PDF and URL is defined
BATCH_MODE
This mode makes it possible to process different files all at once, in parallel. It is characterized by the BATCH
variable being set to a batch csv file and by the possibility of setting SEPARATE_OUT_FILES to True
(in this case OUT_CSV should be a directory name or the name of a .tar.gz archive to create).
The batch csv file is a csv file whose headers indicate the options to overwrite in the
resulting configuration. These options are called job contextual options, and each row of the csv file is called a job.
The available overwritable options are:
| Header | Overwritten option |
|---|---|
| | |
| | |
| | |
| | |
| | See below |
The header is case insensitive, so for example url, URL and Url are considered the same header.
Boolean values are matched case-insensitively: a csv value is cast to True if it is one of
true, on, yes, y, t, 1, and to False if it is one of false, off, no, n, f, 0.
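The boolean matching rule can be sketched in a few lines. The function name parse_bool is hypothetical, and both stripping surrounding whitespace and raising an error on unrecognized values are assumptions.

```python
TRUE_VALUES = {"true", "on", "yes", "y", "t", "1"}
FALSE_VALUES = {"false", "off", "no", "n", "f", "0"}


def parse_bool(cell: str) -> bool:
    """Cast a csv cell to a bool using the case-insensitive lists above."""
    # Assumption: surrounding whitespace is ignored.
    value = cell.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Assumption: anything else is rejected rather than guessed.
    raise ValueError(f"cannot interpret {cell!r} as a boolean")
```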
OUT_CSV and prefix out
When in BATCH_MODE there are two output profiles: the standard one is a single csv file and
the non-standard one is separate files (the choice is made by setting SEPARATE_OUT_FILES to False or True, respectively).
The prefix out cell of the batch file sets the PREFIX_OUT option.
When all output goes to the same file, the data is distinguished by a Format column, which indicates the format used to parse the
pdf report. To identify precisely the line of the batch file that generated
the data, set PREFIX_OUT to a string, which is added to a column called Report identifier.
Tip
Set PREFIX_OUT to something meaningful that distinguishes the input document, for example
the publication date of the pdf and the institution that created the report.
When the output is written to separate files, OUT_CSV has to be a directory or a .tar.gz archive.
The program creates, if it doesn’t exist, a directory named OUT_CSV if that is not an archive,
or named after the archive without the .tar.gz extension otherwise. For each job it saves an output file
named {PREFIX_OUT}-{FORMAT}.csv, or just {FORMAT}.csv if the prefix is absent or empty.
If OUT_CSV was specified as an archive, the directory
is compressed into a .tar.gz archive. If the directory didn’t exist before the run and an archive is created,
the directory is deleted from the filesystem after the archive has been created.
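The naming rules for separate output files can be sketched as below. The helper names output_directory and job_output_path, as well as the example values, are hypothetical; the sketch only computes paths and does not create directories or archives.

```python
from pathlib import Path

ARCHIVE_SUFFIX = ".tar.gz"


def output_directory(out_csv: str) -> Path:
    """Directory that receives the per-job files for a given OUT_CSV."""
    if out_csv.endswith(ARCHIVE_SUFFIX):
        # For an archive, the directory is the archive name without .tar.gz.
        return Path(out_csv[: -len(ARCHIVE_SUFFIX)])
    return Path(out_csv)


def job_output_path(out_csv: str, fmt: str, prefix_out=None) -> Path:
    """Path of the csv file written for one job."""
    # An absent or empty prefix yields just {FORMAT}.csv.
    name = f"{prefix_out}-{fmt}.csv" if prefix_out else f"{fmt}.csv"
    return output_directory(out_csv) / name


print(job_output_path("results.tar.gz", "EXAMPLE_FORMAT", "2024-01-ACME"))
```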