How the data is validated

We believe that a tool which parses human input and aims to extract homogeneous information from a heterogeneous set of PDFs should guarantee, as far as its context allows, the correctness and accuracy of its output.

We identify two opposite approaches to address this need:

Take no responsibility, and force the user to accept an EULA that makes them responsible for checking every output.
Slow down development and limit the scope of the project, outputting only data that is already present in third-party databases, and take implicit responsibility for work that is delayed or never delivered.

We decided to place ourselves near the second end, trying not to slow the workflow down too much while still providing some kind of guarantee to the end user. For this reason we are committed to researching and actively reflecting on protocols that let us share responsibility with the end user for the data we output. The user takes on the duty of knowing the protocols used and of being aware and realistic about the limitations of any data-extraction tool. We take on the duty of remaining transparent about the protocols used, of developing them as well as we can, and of guaranteeing that they are applied.

We know that understanding and forming an opinion on the methodologies used for validation is a job in itself; for this reason we will try to be available for explanations and receptive to criticism.

Our aim is to develop a system that tries to be democratic in practice, taking into account the different backgrounds and expertise of the users and of the community.

We take responsibility for our mistakes, because denying them would mean denying that we are human. At the same time, we know how difficult it is to admit errors, especially in the context we live in. For this reason it is in our best interest to make reasonable evaluations, both for the health of the project and for the delivery of reliable data.

The general approach

This page of the documentation explains the general protocol used to give the tests some degree of reliability and, in particular, how we provide accountability for mistakes. The aim is to give users a transparent picture of the validation pipeline so that they can evaluate its limits.

The content of this page is condensed into a hexadecimal identifier, calculated by applying the SHA256 hash function to the content of the .rst file used to generate it. That file is compiled into the HTML page that you can read on the documentation website; the source file is in the repository at docs/source/validation/general_methodology.rst. It contains the guide to, and the explanation of, how we performed the tests and how the end user can track and reconstruct the whole validation pipeline. In various files, this file is referenced through its hash or through the initial part of it (enough to distinguish it from other hashes).

Note

The hash 8b1ba204bb69a0ade2bfcf65ef294a920f6bb361b317dba43c7ef29d96332b9b can be shortened to 8b1ba204, provided that this is enough to identify the full hash.
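As an illustration of what "enough to identify the full hash" means in practice, the following is a minimal sketch, not part of the project's code. The function name resolve_short_hash and the list of known hashes are hypothetical; the second digest in the list is just a stand-in for some other document.

```python
import hashlib


def resolve_short_hash(prefix, known_hashes):
    """Return the single full hash that starts with `prefix`.

    Raises ValueError when the prefix matches no hash or more than one,
    i.e. when it is not enough to identify a full hash.
    """
    prefix = prefix.lower()
    matches = [h for h in known_hashes if h.lower().startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"prefix {prefix!r} matches {len(matches)} hashes, expected exactly 1"
        )
    return matches[0]


# Full hash quoted in the note above.
full_hash = "8b1ba204bb69a0ade2bfcf65ef294a920f6bb361b317dba43c7ef29d96332b9b"
# Hypothetical registry: the second entry is an unrelated stand-in digest.
known = [full_hash, hashlib.sha256(b"some other document").hexdigest()]

print(resolve_short_hash("8b1ba204", known))
```

A prefix that is too short (for example an empty string) matches several hashes and is rejected, which is the situation where a longer prefix has to be used.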

Tip

The SHA256 of a file can be calculated on Linux with the command sha256sum <filename>, where <filename> is the path of the file to hash.
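On systems without sha256sum, the same digest can be obtained with Python's standard hashlib module. This is a sketch under our own assumptions: the helper name sha256_of_file and the chunk size are illustrative choices, not something defined by the project.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`.

    The file is read in binary mode and in chunks, so large files
    do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("docs/source/validation/general_methodology.rst") from the root of the repository should return the same hexadecimal string that sha256sum prints for that file.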

Note

What The FHash?!? A hash function is a mathematical function that takes data of arbitrary size as input and outputs a number (in this case in hexadecimal format) of fixed size. These functions are used to map the content of a file to a code that is easy to verify and parse. The deterministic nature of the hash function guarantees that the same input always generates the same hash. Conversely, the length of the output and the procedure used to calculate it ensure that different files generate different hashes. (This last statement is not strictly true, but it is a reasonable approximation under the assumption that hash collisions do not occur, and it is one of the reasons why hash functions are widely used in computer science.)
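The properties described in the note can be observed directly. The short sketch below uses Python's hashlib; the helper sha256_hex and the sample inputs are made up for the demonstration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


a = sha256_hex(b"validation protocol")
b = sha256_hex(b"validation protocol")    # same input as `a`
c = sha256_hex(b"validation protocol.")   # one extra character
long_digest = sha256_hex(b"x" * 1_000_000)  # much larger input

print(a == b)                     # True: same input, same hash
print(a != c)                     # True: a small change gives a different hash
print(len(a), len(long_digest))   # 64 64: the output size is fixed
```

A SHA256 digest is always 256 bits long, which is 64 hexadecimal characters, regardless of how large the input is.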

Contents: