freeports_analysis.formats.algorithms.semistructured

Semi-structured algorithm pipeline management.

This module handles the loading and configuration of semi-structured PDF processing algorithms, including PDF filtering, text extraction, and deserialization functions for formats that have some structure but require flexible parsing.

Functions

get_formats_mapping()

Load and validate the formats mapping configuration.

get_pipes(format_name)

Get processing pipelines for a specific format.

freeports_analysis.formats.algorithms.semistructured.get_formats_mapping() → DataFrame

Load and validate the formats mapping configuration.

Returns:

Validated DataFrame with format-pipeline mappings

Return type:

pd.DataFrame

Notes

The mapping defines which PDF filter, text extraction, and deserialization functions should be used for each format and pipeline combination.
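As a rough illustration of what a format and pipeline combination could look like, the sketch below models the mapping as plain rows rather than a DataFrame. The column names (`format`, `pipeline`, `pdf_filter`, `text_extract`, `deserialize`), the row values, and the `lookup` helper are all assumptions made for illustration; the real schema is not documented here.

```python
# Hypothetical rows standing in for the validated DataFrame.
# Column names and values are illustrative assumptions, not the real schema.
MAPPING_ROWS = [
    {
        "format": "FORMAT_A",
        "pipeline": "main",
        "pdf_filter": "filter_one",
        "text_extract": "extract_one",
        "deserialize": "deserialize_one",
    },
    {
        "format": "FORMAT_A",
        "pipeline": "fallback",
        "pdf_filter": "filter_two",
        "text_extract": "extract_one",
        "deserialize": "deserialize_two",
    },
]

SEGMENTS = ("pdf_filter", "text_extract", "deserialize")


def lookup(rows, format_name, pipeline):
    """Return the segment function names for one format/pipeline pair, or None."""
    for row in rows:
        if row["format"] == format_name and row["pipeline"] == pipeline:
            return {segment: row[segment] for segment in SEGMENTS}
    return None


# Each (format, pipeline) pair selects one function name per segment.
selected = lookup(MAPPING_ROWS, "FORMAT_A", "fallback")
```

Under these assumptions, one format may carry several pipelines, and each pipeline names its own function for every segment.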

freeports_analysis.formats.algorithms.semistructured.get_pipes(format_name: str) → Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]

Get processing pipelines for a specific format.

Parameters:

format_name (str) – Name of the format to get pipelines for

Returns:

Tuple containing three dictionaries for pdf_filter, text_extract, and deserialize segments. Each dictionary maps pipeline names to lists of processing functions.

Return type:

Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]

Notes

Returns empty dictionaries if the format name is not found in the mapping.
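Because an unknown format yields empty dictionaries rather than an exception, callers may want to check for that case explicitly. The sketch below shows one possible way to consume the documented return shape. `fake_get_pipes` is a stand-in written for this example, not the real function, and `run_pipeline`, the format name `KNOWN_FORMAT`, and the pipeline name `main` are hypothetical. Applying each list's functions in order is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

Pipes = Dict[str, List[Callable]]


def fake_get_pipes(format_name: str) -> Tuple[Pipes, Pipes, Pipes]:
    """Stand-in with the documented return shape; not the real implementation."""
    if format_name != "KNOWN_FORMAT":
        # Mirrors the documented behaviour for an unknown format.
        return {}, {}, {}
    return (
        {"main": [str.strip]},   # pdf_filter segment (placeholder function)
        {"main": [str.lower]},   # text_extract segment (placeholder function)
        {"main": [str.split]},   # deserialize segment (placeholder function)
    )


def run_pipeline(format_name: str, pipeline: str, data):
    """Apply the three segments of one pipeline in order (assumed semantics)."""
    pdf_filter, text_extract, deserialize = fake_get_pipes(format_name)
    if not (pdf_filter or text_extract or deserialize):
        # All three dictionaries empty: the format is not in the mapping.
        raise KeyError(f"no pipelines configured for format {format_name!r}")
    for segment in (pdf_filter, text_extract, deserialize):
        for func in segment.get(pipeline, []):
            data = func(data)
    return data


result = run_pipeline("KNOWN_FORMAT", "main", "  Hello World ")
```

Turning the empty result into an error is a design choice of this sketch; silently skipping unknown formats may suit batch processing better.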

Modules

deserialize

Deserializing algorithms for semi-structured document processing.

pdf_filter

PDF filtering algorithms for semi-structured document processing.

text_extract

Text extraction algorithms for semi-structured document processing.