=========================
How the data is validated
=========================

We think that an important part of developing an instrument that parses human
input and aims to extract homogeneous information from a heterogeneous set of
PDFs is guaranteeing, to the best of its contextual possibilities, the
correctness and accuracy of the output.

We identify two opposite approaches to address this need:

* take no responsibility and force the user to accept an **EULA** that implies
  their duty to check any output
* slow down development and limit the scope of the project, outputting only
  data that is already present in third-party databases, and take implicit
  responsibility for nonexistent or delayed work

We decided to place ourselves near the second end, trying not to slow down the
workflow too much while still providing some kind of guarantee to the end user.

For this reason we are committed to researching and actively reasoning about
protocols that let us share responsibility with the end user for the data that
we output. The user takes on the duty of knowing the protocols used, and should
be aware of and realistic about the limitations of any tool used for extracting
data. We take on the duty of staying transparent about the protocols used, of
developing them as well as we can, and of guaranteeing that they are actually
applied.

We know that understanding and forming an opinion on the methodologies used for
validation is a job in itself, and for this reason we will try to be available
for explanations and receptive to criticism. Our aim is to develop a system
that tries to be democratic in practice, taking into account the different
backgrounds and expertise of the users and of the community.

We take responsibility for our mistakes, because denying them would mean
denying that we are human. At the same time we know how difficult it is to
admit an error, especially in the context we live in.
For this reason it is in our best interest to make reasonable evaluations, both
for the health of the project and for delivering reliable data.

********************
The general approach
********************

:doc:`This page <general_methodology>` of the documentation explains the
general protocol used to give the tests some degree of reliability and, in
particular, how we provide accountability for mistakes. The aim is to give the
user a transparent view of the validation pipeline so that they can evaluate
its limits.

The content of the page is condensed into a hexadecimal-encoded identifier,
calculated by applying the ``SHA256`` hashing function to the content of the
``.rst`` file used to generate it. This file is compiled into the
:doc:`HTML page <general_methodology>` that you can read on the documentation
website.

The source file is in the repository at
``docs/source/validation/general_methodology.rst``. It contains the guide to,
and the explanation of, how we performed the tests and how the end user can
track and reconstruct the whole validation pipeline.

In various files this file is referenced through its *hash*, or through the
beginning of it (enough to distinguish it from other *hashes*).

.. note::
   The hash
   ``8b1ba204bb69a0ade2bfcf65ef294a920f6bb361b317dba43c7ef29d96332b9b`` can be
   shortened to ``8b1ba204`` if that is enough to identify the full hash.

.. tip::
   The ``SHA256`` of a file can be calculated on Linux with the command
   ``sha256sum <file>``, where ``<file>`` is the path of the file to hash.

.. note::
   **What The FHash?!?**

   A hash function is a mathematical function that takes data of arbitrary
   size as input and outputs a number of fixed size (in this case in
   hexadecimal format). These functions are used to map the content of a file
   to a code that is easy to verify and parse.

   The deterministic nature of the hash function guarantees that the same
   input always generates the same *hash*. Conversely, the length of the output
   and the procedure for calculating it mean that different files generate
   different *hashes*. This last statement is not strictly true analytically,
   but it is a reasonable approximation under the assumption that no hash
   collisions occur, and it is one of the reasons why hashing functions are
   widely used in computer science.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   general_methodology
   methodologies/basic_check
   methodologies/golden_standard
   methodologies/agreement_and_good_faith
   assertions/validation_algorithm_trustworthiness
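
The hash-based referencing described in this page can also be reproduced
without ``sha256sum``. The following is a minimal sketch, not the project's own
tooling: it uses only Python's standard ``hashlib`` module, and the helper names
``sha256_of_file`` and ``resolve_short_hash`` are hypothetical names chosen for
this illustration. It computes the ``SHA256`` of a file and checks whether a
shortened hash identifies exactly one full hash in a list that you supply.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at ``path``.

    Hypothetical helper: equivalent in result to ``sha256sum <file>``.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_short_hash(prefix, known_hashes):
    """Return the only hash in ``known_hashes`` that starts with ``prefix``.

    Hypothetical helper: raises ValueError when the prefix matches no hash
    or more than one, i.e. when it is not enough to identify the full hash.
    """
    prefix = prefix.lower()
    matches = [full for full in known_hashes if full.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"prefix {prefix!r} matches {len(matches)} hashes")
    return matches[0]
```

Calling ``sha256_of_file`` on your local copy of
``docs/source/validation/general_methodology.rst`` gives the identifier of the
version you have. Passing a shortened hash found in another file to
``resolve_short_hash``, together with the list of full hashes you know about,
tells you whether that shortened form is unambiguous.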