================ Code development ================ The focus of this guide is the development of the code related specific to the parsing of a `pdf format`. With `pdf format` we identify the series of operations and the code necessary to parse to `csv` a series of financial reports. We decided to implement a structure flexible enough to be able to parse different reports, but making some hypotesys in order to semplify and standardize the parsing process. The aim is to make to not repeat all the code common to all the format so that the contributions can focus on enlarge the database of the supported pdf. The general schema for parsing a pdf can be rappresented with the following: .. image:: ./assets/schema_algorithm.svg The document is firstly divided in different pages and converted in a ``xml`` string using the `PyMuPDF `_ package. The assumption is that each page of the document contain the necessary information to get the context and to understand the meaning of the data relevant to construct the final `csv file`. This assumption seems quite resrtictive, but it is very essential in order to simplify and design reasonably fast pasrsing algorithms. This assumption makes possible to implement data level parallelism on the pdf pages; now the current implementation run on different batch of pages in parallel spawning different processes (when in :ref:`BATCH MODE ` the parallelism is done at document level and the different pages are processed sequentially). The idea behind the assumption is that a well designed pdf page should contain all the necessary information for a human being to understand the context. If this assumption will show to be too resrtictive we could not assume that the context of the page would be the previous pages, but we would be forced to parse the document subsequent times in order to get the relevant context. .. warning:: Sometimes the information doesn't seems to be self contained but this dependence is only a one to one, one to many or many to many relation between the different objects. An example of this can be the case of the parsing of a pdf document containing some data related to some customers and in other pages, about orders and products. In this example it seems that the `Customer`, the `Product` and the `Order` records are interdependent, but depending of the report design the algorithm can parse the self contained information of `Customers`, `Products` and `Orders` and just right before the csv output relate the different record with an ``id`` field (for uderstanding better the proposed approach for how to treat these kind of cases think at the concept of `foreign key `_ in the database realm). The mechanism for supporting this degree of interdependance it is by the use of ``Promises`` and the output of ``PromisesResolutionContexs``. The assumption breaks in the presence of a true dependence of the information. The developer that wants to add a series of reports to the supported ones, has to provide the specific implementation of the ``PdfFilter``, ``TextExtract`` and ``Deserialize`` functions (with them of the ``PdfBlock`` and ``TextBlock`` classes). Each ``xml`` page will be parsed by a ``PdfFilter`` function that will output a list of relevant ``PdfBlocks``. The content of these blocks is related and parsed by a function called ``TextExtract`` and output a list of blocks that have a one to one relation with the information that will populate a csv row called ``TextBlock``. Each block is parsed independently into an object (an abstract ``FinancialData``) that as to be one between the one reconnaissed by the system (for now ``Equity`` and ``Bond``) or a ``PromisesResolutionContext``. The ``FinancialData`` can contain the actual data or deferred values called in this context 'promises' of value. In the most common case the financial data contain all the relevant data and is serialized in a python dict in a predefined way and output to a the resulting csv file. ------------- ``PdfFilter`` ------------- The aim of this function is to filter from the ``xml`` rappresentation of the pdf page the relevant information using the layout or the typografic signature of them. A constraint that we choose to try to respect is to use only information related to the grafical appearance of the pdf without looking at the textual content and parse it semantically or evaluating the meaning. This constraint is important to divide the complexity of parsing a pdf in more straight forward and standardized steps. The typical ``PdfFilter`` function will look at the font, the position of the different parts of the pdf and eventually part of the text that is fixed and is considered part of the page layout (for example a header present in all pages), from this will compute some bit of information and add converge to some ``PdfBlocks`` with their custom metadata. --------------- ``TextExtract`` --------------- The aim of this function is to extract from the ``PdfBlocks`` the relevant blocks of text and add semantic information. In particular this function is the one that takes in input the list of the target companies and that look at the content of the variable text present in the ``PdfBlocks``. The typical ``TextExtract`` function will use `regular expressions `_ to understand the role of each part of test and to extract the relevant information of the financial data related with the companies of interest. --------------- ``Deserialize`` --------------- The aim of this function is to transform the parsed information into a python object with some standard fields and validate them. This function will cast the data into the python types that more reflect the nature of the data. In this steps no filtering is done, each ``TextBlock`` will correspond to a python object. This constraint impose the focus of this function to the cast operation of the different field. The abstract class that this function output is called ``FinancialData``. Each class field is read only and consists in some core information and some additional infos. The concrete classes that inherit from ``FinancialData`` for now are the ``Equity`` and ``Bond`` classes (the bond class has the interest rate and the maturity date as additional infos). This objects are then casted in a predefined way in a python dict and then with `Pandas `_ on a dataframe dumped directly in a `csv file`. The ``FinancialData`` usually contain all the concrete values but sometimes can happen that some fields are resolvables only after complete parsing; for cases like that refear to the :ref:`Promises section ` ------------------------------ ``PdfBlock`` and ``TextBlock`` ------------------------------ These block consists of three part: - A ``BlockType`` (``PdfBlockType`` and ``TextBlockType``), that is a required label that identify the grafical and the semantic role of the information bit. These two fieds are implemented as enums and the possible values has to be chosen by the format developer (it is part of the format specification) - A ``metadata`` field, that should contain custom data related to graphical and semantic information. The ``BlockType`` should be related to the expected keys in the metadata field. The implementation is a free and optional python dict - An optional ``content`` field, that should contain the textual information in the pdf related to the block in addition to these fields, the ``TextBlock`` has a ``PdfBlock`` related to him from which is is taken from (if it is computed from more, the most important). .. _promises: ----------- ``Promise`` ----------- This class is a generic class that exists in more concrete forms like ``SubfundPromise`` or ``CurrencyPromise`` and it represent a placeholder to an information that can be resolved only after the pdf is completely parsed. When a ``FinancialData`` is istantiated the developer can `'promise'` that the data will be provided putting a promise as a value. The promise is caraterized by an ``id`` field that identify it in order to resolved to a specific value after the parsing. The value will be resolved using a ``PromisesResolutionMap`` that link ``id`` to values or to other ``Promises`` in a chain that has to close with a concrete value. In this way right before generating the ``csv`` the sobstitution are made and the actual field are computed using the map. This map is built merging all the ``PromisesResolutionContexts`` in output of the ``deserialize`` function. The ``PromisesResolutionContexs`` should be something in the form of a python dict with a key (that will be matched on the ``Promises`` ids) and a value (concrete or itself a ``Promise``). .. warning:: The ``id`` is expected to be provided one and only one time in the whole parsing process. If the same id is provided in two different ``PromisesResolutionContexs`` the value is overwritten in a unpredictable way. ----------- In practice ----------- If you want to create the support for a new format you should choose a name and create a directory with same name into ``src/freeports_analysis/formats``. In this directory you can put the number of files and subdirectory that you want, the only constraint is that when imported as a submodule, the package expect to find the three function ``PdfFilter``, ``TextExtract``, ``Deserialize`` with their respective signatures and the definition of the two enums ``PdfBlockType`` and ``TextBlockType``. You can take as reference the format called ``_default`` as a template. For helping the support for new formats we provide a series of utilities into the submodule ``src/freeports_analysis/formats_utils``. These utilities divide in four different kind of helpers: - useful function to create the parsing algorithm - decorators for adding standard functionalities to the parsing functions - decorators that provide a standard implementation of ``PdfFilter``, ``TextExtract`` and ``Deserialize`` - enums containing common ``PdfBlockType`` and ``TextBlockType`` implementations the aim is to see the repeatitive parsing patterns and create some routines and implementation that can be parametrized and reused. To activate the support for a certain format, his name in capslock has to be added to the ``src/freeports_analysis/data/format_url_mapping.yaml`` file; this file is composed with some format names (this name lowered will be the same of the module loaded to parse the pdf) and a list of regular expressions used against url provided (``URL``) as a program argument to infer the correct format to use upon remote resource parsing. .. note:: the order of the regular expressions and of the formats matters, because the inferred one is the first match of this list. Last but not least there is a directory in the repository called ``devtools/`` whit a template `Jupyter Notebook `_ and some utilities for examining the pdf, developing the format specification and tests. This directory is intended as the directory where to design and experiment with the different component before integrating in the package. In order to use the notebook, copy the template and rename it as you want (deleting the ``.template.`` part).