=================
Command reference
=================

``freeports`` can be launched from the command line:

.. code-block:: console

   freeports

For contextual help, use the ``--help`` option (``-h`` for short):

.. code-block:: console

   freeports -h

.. important:: Before using the command please read the :doc:`validation section `

The behaviour of the script is controlled by options, which can be specified in three different ways. The sources overwrite each other in the following order of priority, from highest to lowest:

1. **command line options**
2. **environment variables**
3. **configuration file**

Options specified on the command line are therefore never overwritten by the environment variables or by the configuration file.

An option that is not specified takes the default value defined in the :ref:`conf_parse submodule `. The defaults are overwritten by the options in the **configuration file**; the result is then overwritten by the **environment variables**, by the **command line options** and finally, when in :ref:`BATCH MODE `, by the *job contextual options*.

The options that can be overwritten, and how, are documented in the respective reference pages:

.. toctree::
   :maxdepth: 1
   :caption: Reference config pages:

   config/cmd_args.rst
   config/env_variables.rst
   config/config_file.rst

Once the options have been specified and overwritten, the resulting configuration is validated as described in :ref:`the section about validation `. Each method of overwriting also has its own validation mechanism, documented in the respective page and applied before the validation of the resulting configuration.

Each of these three ways of setting parameters modifies the value of the option in the program.
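
The layering described above can be sketched with the standard library ``collections.ChainMap``. This is only an illustration of the precedence rules, not the code freeports uses: the function ``resolve_options`` and its parameters are hypothetical, and each source is assumed to be a plain dictionary in which an unset option is ``None``.

.. code-block:: python

   from collections import ChainMap


   def resolve_options(defaults, config_file, env_vars, cli_args, job_options=None):
       """Merge option sources; hypothetical helper, not part of freeports."""
       # The first mapping in a ChainMap wins, so list the sources from
       # highest to lowest priority.
       layers = [cli_args, env_vars, config_file, defaults]
       if job_options is not None:
           # Job contextual options are applied last, so they go first here.
           layers.insert(0, job_options)
       # Drop unset values so that they do not shadow a lower layer.
       cleaned = [
           {key: value for key, value in layer.items() if value is not None}
           for layer in layers
       ]
       return dict(ChainMap(*cleaned))


   resolved = resolve_options(
       defaults={"VERBOSITY": 2, "SAVE_PDF": True},
       config_file={"VERBOSITY": 3},
       env_vars={"VERBOSITY": 4},
       cli_args={"SAVE_PDF": False},
   )

Here ``VERBOSITY`` ends up taken from the environment, because the command line does not set it, while ``SAVE_PDF`` is taken from the command line.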

-----------
The options
-----------

+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| Option                 | Type                    | Description                                              | Default                    |
+========================+=========================+==========================================================+============================+
| ``VERBOSITY``          | ``int``                 | Sets how verbose the program is in the terminal          | 2                          |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``BATCH``              | ``Path``                | If set to path of batch file, it triggers ``BATCH MODE`` |                            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``N_WORKERS``          | ``int``                 | Number of parallel processes in ``BATCH MODE``           | ``os.process_cpu_count()`` |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``OUT_CSV``            | ``Path``                | File where to output ``csv``                             | ``/dev/stdout``            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``SAVE_PDF``           | ``bool``                | If set and ``URL`` is specified, it saves the input pdf  | ``True``                   |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``URL``                | ``str``                 | URL of the pdf to take as input                          |                            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``PDF``                | ``Path``                | Path to local pdf                                        |                            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``FORMAT``             | :py:class:`PdfFormats`  | Format to parse the pdf document                         |                            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``CONFIG_FILE``        | ``Path``                | Custom config file location                              | Calculated dynamically     |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``SEPARATE_OUT_FILES`` | ``bool``                | In ``BATCH_MODE`` do not merge the results of the batch  | ``False``                  |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+
| ``PREFIX_OUT``         | ``str``                 | In ``BATCH_MODE`` define an id for the different outputs |                            |
+------------------------+-------------------------+----------------------------------------------------------+----------------------------+

"""""""""""""
``VERBOSITY``
"""""""""""""

This value ranges from 0 to 5. 0 is the minimum verbosity, called ``CRITICAL VERBOSITY``; 4 is the maximum verbosity, called ``DEBUG VERBOSITY``; 5 is the ``NOTSET VERBOSITY``.
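
As a rough illustration, the scale can be translated into constants of the standard ``logging`` module. The sketch assumes the correspondence documented in this section; the names ``VERBOSITY_TO_LOGGING`` and ``logging_level`` are hypothetical and are not taken from the freeports source.

.. code-block:: python

   import logging

   # Assumed correspondence between the freeports scale and logging levels.
   VERBOSITY_TO_LOGGING = {
       0: logging.CRITICAL,
       1: logging.ERROR,
       2: logging.WARNING,
       3: logging.INFO,
       4: logging.DEBUG,
       5: logging.NOTSET,
   }


   def logging_level(verbosity: int) -> int:
       """Return the logging level for a verbosity value between 0 and 5."""
       if verbosity not in VERBOSITY_TO_LOGGING:
           raise ValueError(f"VERBOSITY must be between 0 and 5, got {verbosity!r}")
       return VERBOSITY_TO_LOGGING[verbosity]


   # The documented default verbosity is 2.
   logging.getLogger("example").setLevel(logging_level(2))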

The levels correspond to those used by the Python `logging package `_:

+-----------+----------------------------------------------------------------------------+
| freeports | `logging `_                                                                |
+===========+============================================================================+
| 0         | ``logging.CRITICAL``                                                       |
+-----------+----------------------------------------------------------------------------+
| 1         | ``logging.ERROR``                                                          |
+-----------+----------------------------------------------------------------------------+
| 2         | ``logging.WARNING``                                                        |
+-----------+----------------------------------------------------------------------------+
| 3         | ``logging.INFO``                                                           |
+-----------+----------------------------------------------------------------------------+
| 4         | ``logging.DEBUG``                                                          |
+-----------+----------------------------------------------------------------------------+
| 5         | ``logging.NOTSET``                                                         |
+-----------+----------------------------------------------------------------------------+

"""""""""""""""""""""""""""""""""
``URL``, ``PDF`` and ``SAVE_PDF``
"""""""""""""""""""""""""""""""""

Either ``URL`` or ``PDF``, or both, must be specified, either directly or, when in ``BATCH_MODE``, through *job contextual options* overwriting.

If ``URL`` is specified, the program uses the pdf resource at that url; if ``PDF`` is specified, it loads a pdf file from the local filesystem. If both are specified, it tries to load the file from local storage first and falls back to the url.

If both are specified and ``SAVE_PDF`` is ``True``, and the file is not present locally, the program downloads it and saves it to disk under the name given by the ``PDF`` option.

"""""""""""
``OUT_CSV``
"""""""""""

When not in ``BATCH MODE`` it indicates where to output the resulting ``csv`` file parsed from the pdf document.

.. note:: The ``OUT_CSV`` default on ``Windows`` systems is ``CON``

""""""""""
``FORMAT``
""""""""""

It indicates which algorithm to use to parse the pdf; these algorithms are called the 'formats' of the pdf reports. This variable is mandatory if no ``URL`` is provided. If a ``URL`` is provided, the format is inferred using a mapping file that maps url regular expressions to formats. The file is called ``format_url_mapping.yaml`` in the source code.

"""""""""""""""
``CONFIG_FILE``
"""""""""""""""

This option indicates the config file loaded to overwrite the default options. It can only be specified through an environment variable or a command line argument, and it is evaluated before any other option.

"""""""""""""
``N_WORKERS``
"""""""""""""

Integer that represents the number of processes spawned (if not set, it defaults to the number of available CPUs). When in ``BATCH_MODE`` it is the number of processes spawned concurrently to parallelize the processing of different files. When not in ``BATCH_MODE`` the program divides the pdf document into sections of pages and parallelizes the processing within the document.

.. _conf_validation:

-------------------------------------
Validation of resulting configuration
-------------------------------------

Each way of specifying options has its own algorithm to validate the user's choice, but after those checks a consistency check is performed on the resulting configuration. Notably, the most important checks performed are:

* In ``BATCH_MODE``, ``OUT_CSV`` is the name of an archive or of a directory
* After *job contextual options* overwriting, at least one of ``PDF`` and ``URL`` is defined

.. _batch_mode:

--------------
``BATCH_MODE``
--------------

This mode makes it possible to process several files at once, in parallel.
This mode is characterized by the ``BATCH`` variable being set to a *batch csv file* and by the possibility of setting ``SEPARATE_OUT_FILES`` to ``True`` (in which case ``OUT_CSV`` should be a directory name or the name of a ``.tar.gz`` archive to create).

The *batch csv file* is a csv file whose headers indicate which options to overwrite in the resulting configuration. These options are called *job contextual options* and each row of the csv file is called a *job*. The options that can be overwritten are:

+----------------+--------------------+
| Header         | Overwritten option |
+================+====================+
| ``url``        | ``URL``            |
+----------------+--------------------+
| ``save pdf``   | ``SAVE_PDF``       |
+----------------+--------------------+
| ``pdf``        | ``PDF``            |
+----------------+--------------------+
| ``format``     | ``FORMAT``         |
+----------------+--------------------+
| ``prefix out`` | *See below*        |
+----------------+--------------------+

The header is case insensitive, so for example *url*, *URL* and *Url* are considered the same header.

Boolean values are matched case insensitively: a csv value is cast to ``True`` if it is one of *true, on, yes, y, t, 1* and to ``False`` if it is one of *false, off, no, n, f, 0*.

""""""""""""""""""""""""""""""
``OUT_CSV`` and ``prefix out``
""""""""""""""""""""""""""""""

When in ``BATCH_MODE`` there are two output profiles: the standard one is a single *csv* file and the non-standard one is separate files (the choice is made by setting ``SEPARATE_OUT_FILES`` to ``False`` or ``True`` respectively).

The ``prefix out`` cell of the batch file sets the ``PREFIX_OUT`` option.

When outputting to a single file, the data is distinguished by the *Format* column, which indicates the format used to parse the pdf report. To identify precisely the line of the batch file that generated the data, set ``PREFIX_OUT`` to a string, which is added to a column called *Report identifier*.

.. tip:: Set ``PREFIX_OUT`` to something meaningful that distinguishes the input document, for example the publication date of the pdf and the institution that created the report

When outputting to separate files, ``OUT_CSV`` has to be a directory or a ``.tar.gz`` archive. The program creates, if it doesn't exist, a directory named ``OUT_CSV`` (or, for an archive, named after the archive without the ``.tar.gz`` extension) and, for each *job*, saves an output file named ``{PREFIX_OUT}-{FORMAT}.csv``, or just ``{FORMAT}.csv`` if the prefix is absent or empty. If ``OUT_CSV`` was specified as an archive, the directory is compressed into a ``.tar.gz``. If the directory did not exist beforehand and an archive is created, the directory is deleted from the filesystem once the archive has been created.
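
The naming rules for separate output files can be sketched as follows. This is a minimal illustration under stated assumptions, not the freeports implementation: the helpers ``output_directory`` and ``job_file_name`` are hypothetical, and the format name ``EXAMPLE_FORMAT`` is a placeholder rather than a real member of ``PdfFormats``.

.. code-block:: python

   from pathlib import Path
   from typing import Optional

   ARCHIVE_SUFFIX = ".tar.gz"


   def output_directory(out_csv: str) -> Path:
       """Directory that receives the per-job files (hypothetical helper)."""
       if out_csv.endswith(ARCHIVE_SUFFIX):
           # For an archive, the directory is the archive name without ".tar.gz".
           return Path(out_csv[: -len(ARCHIVE_SUFFIX)])
       return Path(out_csv)


   def job_file_name(fmt: str, prefix_out: Optional[str] = None) -> str:
       """File name for one job (hypothetical helper)."""
       if prefix_out:
           return f"{prefix_out}-{fmt}.csv"
       # An absent or empty prefix leaves only the format name.
       return f"{fmt}.csv"


   target = output_directory("results.tar.gz") / job_file_name(
       "EXAMPLE_FORMAT", "2024-01-report"
   )

With these assumptions ``target`` points to ``results/2024-01-report-EXAMPLE_FORMAT.csv``; compressing the directory and removing it afterwards are left out of the sketch.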