freeports_analysis.formats.data
Data management for PDF format definitions and URL mappings.
This module handles the loading and validation of format definitions and URL-to-format mappings used in document processing.
Functions
| get_formats() | Load and validate the list of formats from formats.csv. |
| get_url_mapping() | Get URL mappings grouped by format name. |
| url_to_format(url) | Associate a URL with a format name. |
- freeports_analysis.formats.data.get_formats() → DataFrame
Load and validate the list of formats from formats.csv.
- Returns:
Validated DataFrame of formats with ‘Format name’ as index
- Return type:
pd.DataFrame
- Raises:
pa.errors.SchemaError – If the format data does not conform to the expected schema
Notes
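Format names are constructed as Name-LocaleYear[Country][Version], where the bracketed parts are optional. For example: ‘Amundi-IT23’ or ‘Eurizon-IT24@IT.v2’ (in the second example the country appears as ‘@IT’ and the version as ‘.v2’).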
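The naming scheme above can be illustrated by splitting a format name into its parts. This is only a sketch: the helper `parse_format_name` and its regular expression are hypothetical and not part of this module, and they assume a two-letter uppercase locale, a two-digit year, a name without hyphens, a country introduced by `@`, and a version introduced by `.v`, as suggested by the two examples.

```python
import re
from typing import Optional

# Hypothetical pattern inferred from the examples 'Amundi-IT23' and
# 'Eurizon-IT24@IT.v2'; the real validation rules may differ.
FORMAT_NAME_RE = re.compile(
    r"(?P<name>[^-]+)"            # provider name, assumed to contain no hyphen
    r"-(?P<locale>[A-Z]{2})"      # locale, assumed two uppercase letters
    r"(?P<year>\d{2})"            # year, assumed two digits
    r"(?:@(?P<country>[A-Z]{2}))?"  # optional country, e.g. '@IT'
    r"(?:\.v(?P<version>\d+))?"     # optional version, e.g. '.v2'
)


def parse_format_name(format_name: str) -> Optional[dict]:
    """Split a format name into its parts, or return None if it does not fit."""
    match = FORMAT_NAME_RE.fullmatch(format_name)
    return match.groupdict() if match else None


simple = parse_format_name("Amundi-IT23")
detailed = parse_format_name("Eurizon-IT24@IT.v2")
```

Optional parts that are absent come back as `None`, so callers can tell a missing country or version apart from an empty one.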
- freeports_analysis.formats.data.get_url_mapping() → DataFrame
Get URL mappings grouped by format name.
- Returns:
DataFrame with format names as index and lists of URLs as values
- Return type:
pd.DataFrame
Notes
The returned DataFrame aggregates all URLs associated with each format name into lists, allowing multiple URLs to map to the same format.
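The aggregation described above can be sketched with a pandas group-by. The column names `Format name` and `URL`, the sample rows, and the helper `group_urls_by_format` are assumptions made for illustration; the actual source data and implementation may be organized differently.

```python
import pandas as pd

# Hypothetical flat mapping: one row per (format name, URL) pair.
rows = pd.DataFrame(
    {
        "Format name": ["Amundi-IT23", "Amundi-IT23", "Eurizon-IT24@IT.v2"],
        "URL": [
            "https://example.com/amundi/a",
            "https://example.com/amundi/b",
            "https://example.com/eurizon",
        ],
    }
)


def group_urls_by_format(flat: pd.DataFrame) -> pd.DataFrame:
    """Collect all URLs of each format name into a single list per format."""
    return flat.groupby("Format name")["URL"].agg(list).to_frame()


grouped = group_urls_by_format(rows)
```

In this sketch, `grouped.loc["Amundi-IT23", "URL"]` is a list holding both Amundi URLs, which is how several URLs can map to the same format.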
- freeports_analysis.formats.data.url_to_format(url: str) → str | None
Associate a URL with a format name.
- Parameters:
url (str) – URL to match against known format URLs
- Returns:
Format name if a match is found, None otherwise
- Return type:
Optional[str]
Notes
This function uses prefix matching to determine the format: among all known URLs that are a prefix of the given URL, it selects the format with the longest one. This lets more specific URLs override more general ones.
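Longest-prefix matching as described above can be sketched in a few lines. The helper `match_longest_prefix`, the dictionary layout, and the format names and URLs in the example are hypothetical; this is not the module's actual implementation.

```python
from typing import Dict, List, Optional


def match_longest_prefix(
    url: str, prefixes_by_format: Dict[str, List[str]]
) -> Optional[str]:
    """Return the format whose known URL is the longest prefix of `url`."""
    best_format: Optional[str] = None
    best_length = -1
    for format_name, prefixes in prefixes_by_format.items():
        for prefix in prefixes:
            # Keep the candidate only if it matches and beats the current best.
            if url.startswith(prefix) and len(prefix) > best_length:
                best_format, best_length = format_name, len(prefix)
    return best_format


# Hypothetical mapping: a general prefix and a more specific one.
known = {
    "Generic-IT23": ["https://example.com/"],
    "Specific-IT24": ["https://example.com/funds/"],
}

specific = match_longest_prefix("https://example.com/funds/report.pdf", known)
general = match_longest_prefix("https://example.com/other.pdf", known)
unknown = match_longest_prefix("https://elsewhere.org/report.pdf", known)
```

Here the first URL matches both prefixes, and the longer `https://example.com/funds/` wins; a URL matching no prefix yields `None`.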