freeports_analysis.formats.data

Data management for PDF format definitions and URL mappings.

This module handles the loading and validation of format definitions and URL-to-format mappings used in document processing.

Functions

get_formats()

Load and validate the list of formats from formats.csv.

get_url_mapping()

Get URL mappings grouped by format name.

url_to_format(url)

Associate a URL with a format name.

freeports_analysis.formats.data.get_formats() → DataFrame

Load and validate the list of formats from formats.csv.

Returns:

Validated DataFrame of formats with ‘Format name’ as index

Return type:

pd.DataFrame

Raises:

pa.errors.SchemaError – If the format data does not conform to the expected schema

Notes

Format names are constructed as: Name-LocaleYear[Country][Version]. For example: ‘Amundi-IT23’ or ‘Eurizon-IT24@IT.v2’.
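The naming convention can be illustrated with a small parser. This is only a sketch: the regular expression is inferred from the two example names above and is an assumption, not the schema the module actually validates against. In particular, the two-letter locale, two-digit year, `@`-prefixed country, and `.v`-prefixed version are guesses. `parse_format_name` is a hypothetical helper that is not part of this module.

```python
import re
from typing import Dict, Optional

# Assumed grammar, inferred from 'Amundi-IT23' and 'Eurizon-IT24@IT.v2':
#   Name - Locale(2 letters) Year(2 digits) [@Country] [.vVersion]
_FORMAT_NAME = re.compile(
    r"^(?P<name>[A-Za-z0-9]+)"
    r"-(?P<locale>[A-Z]{2})(?P<year>\d{2})"
    r"(?:@(?P<country>[A-Z]{2}))?"
    r"(?:\.v(?P<version>\d+))?$"
)


def parse_format_name(format_name: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a format name into its parts, or return None if it does not match.

    Hypothetical helper for illustration only.
    """
    match = _FORMAT_NAME.match(format_name)
    return match.groupdict() if match else None


simple = parse_format_name("Amundi-IT23")
full = parse_format_name("Eurizon-IT24@IT.v2")
```

Optional parts that are absent from a name come back as None, so `simple["country"]` and `simple["version"]` are both None, while `full` carries all five fields.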

freeports_analysis.formats.data.get_url_mapping() → DataFrame

Get URL mappings grouped by format name.

Returns:

DataFrame with format names as index and lists of URLs as values

Return type:

pd.DataFrame

Notes

The returned DataFrame aggregates all URLs associated with each format name into lists, allowing multiple URLs to map to the same format.
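The aggregation described above can be sketched with a pandas group-by. The column names (`Format name`, `URL`) and the example URLs are assumptions made for illustration; the layout of the underlying mapping file is not documented here, and this is not necessarily how `get_url_mapping()` is implemented.

```python
import pandas as pd

# Hypothetical flat mapping: one row per (format name, URL) pair.
rows = pd.DataFrame(
    {
        "Format name": ["Amundi-IT23", "Eurizon-IT24@IT.v2", "Amundi-IT23"],
        "URL": [
            "https://example.com/amundi/2023/",
            "https://example.com/eurizon/2024/",
            "https://example.org/mirror/amundi/",
        ],
    }
)

# Collect every URL of a format into one list, indexed by format name.
mapping = rows.groupby("Format name")["URL"].agg(list).to_frame()

amundi_urls = mapping.loc["Amundi-IT23", "URL"]
```

Here `amundi_urls` is a list holding both Amundi URLs in their original row order, which shows how several URLs can map to a single format.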

freeports_analysis.formats.data.url_to_format(url: str) → str | None

Associate a URL with a format name.

Parameters:

url (str) – URL to match against known format URLs

Returns:

Format name if a match is found, None otherwise

Return type:

Optional[str]

Notes

This function uses prefix matching to determine the format: among the known URLs that are a prefix of the given URL, it selects the format associated with the longest one. More specific URLs therefore override more general ones.
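A minimal sketch of longest-prefix matching, assuming the known URLs are available as a plain dictionary from URL prefix to format name. The helper `longest_prefix_format`, the dictionary, and the example URLs are hypothetical; the real function works from the module's own mapping data and may differ in detail.

```python
from typing import Dict, Optional


def longest_prefix_format(url: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Return the format whose URL prefix is the longest match for `url`.

    Hypothetical stand-in for illustration; returns None if nothing matches.
    """
    best: Optional[str] = None
    for prefix in prefixes:
        # Keep the prefix only if it matches and beats the current best length.
        if url.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return prefixes[best] if best is not None else None


# Assumed example data: a general prefix and a more specific one beneath it.
known = {
    "https://example.com/funds/": "Amundi-IT23",
    "https://example.com/funds/2024/": "Eurizon-IT24@IT.v2",
}

specific = longest_prefix_format("https://example.com/funds/2024/report.pdf", known)
general = longest_prefix_format("https://example.com/funds/2023/report.pdf", known)
unknown = longest_prefix_format("https://other.example/report.pdf", known)
```

Both prefixes match the first URL, but the longer one wins, so `specific` is 'Eurizon-IT24@IT.v2'. The second URL matches only the general prefix and yields 'Amundi-IT23', and the third matches nothing and yields None.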