freeports_analysis.formats.algorithms.semistructured
Semi-structured algorithm pipeline management.
This module handles the loading and configuration of semi-structured PDF processing algorithms, including PDF filtering, text extraction, and deserialization functions for formats that have some structure but require flexible parsing.
Functions
get_formats_mapping() | Load and validate the formats mapping configuration.
get_pipes(format_name) | Get processing pipelines for a specific format.
- freeports_analysis.formats.algorithms.semistructured.get_formats_mapping() -> DataFrame
Load and validate the formats mapping configuration.
- Returns:
Validated DataFrame with format-pipeline mappings
- Return type:
pd.DataFrame
Notes
The mapping defines which PDF filter, text extraction, and deserialization functions should be used for each format and pipeline combination.
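To illustrate what such a mapping might look like, the sketch below models a few rows as plain dictionaries rather than a DataFrame and runs a simple validation over them. The column names (`format`, `pipeline`, `pdf_filter`, `text_extract`, `deserialize`), the row values, and the `validate_rows` helper are all hypothetical; the actual schema and validation rules are defined by the package.

```python
from collections import Counter

# Hypothetical rows: one per (format, pipeline) combination.
# Column names and values are assumptions, not the package's real schema.
MAPPING_ROWS = [
    {"format": "FORMAT_A", "pipeline": "default",
     "pdf_filter": "filter_a", "text_extract": "extract_a",
     "deserialize": "deserialize_a"},
    {"format": "FORMAT_A", "pipeline": "fallback",
     "pdf_filter": "filter_b", "text_extract": "extract_a",
     "deserialize": "deserialize_a"},
]

REQUIRED_COLUMNS = {"format", "pipeline", "pdf_filter",
                    "text_extract", "deserialize"}


def validate_rows(rows):
    """Check that every row has all columns and that each
    (format, pipeline) pair appears only once."""
    for index, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"row {index} is missing columns: {sorted(missing)}")
    counts = Counter((row["format"], row["pipeline"]) for row in rows)
    duplicates = [pair for pair, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate format/pipeline pairs: {duplicates}")
    return rows
```

In this sketch each row names one function per segment; whether the real mapping stores names, callables, or several functions per cell is not stated here.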
- freeports_analysis.formats.algorithms.semistructured.get_pipes(format_name: str) -> Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]
Get processing pipelines for a specific format.
- Parameters:
format_name (str) – Name of the format to get pipelines for
- Returns:
Tuple containing three dictionaries for pdf_filter, text_extract, and deserialize segments. Each dictionary maps pipeline names to lists of processing functions.
- Return type:
Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]
Notes
Returns empty dictionaries if the format name is not found in the mapping.
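The return shape described above can be consumed as in the following minimal sketch. Because the real function depends on the package's configuration, `get_pipes_stub` is a hypothetical stand-in that only mirrors the documented return type; the format name `FORMAT_A`, the pipeline name `default`, and the three processing functions are likewise invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pipes = Dict[str, List[Callable]]


# Hypothetical processing functions, one per segment.
def keep_nonempty(blocks):
    return [block for block in blocks if block.strip()]


def strip_text(block):
    return block.strip()


def to_record(text):
    return {"text": text}


def get_pipes_stub(format_name: str) -> Tuple[Pipes, Pipes, Pipes]:
    """Stand-in with the documented return shape of get_pipes."""
    if format_name != "FORMAT_A":
        # Unknown formats yield three empty dictionaries.
        return {}, {}, {}
    return (
        {"default": [keep_nonempty]},
        {"default": [strip_text]},
        {"default": [to_record]},
    )


def summarize(format_name: str) -> Dict[str, Dict[str, List[str]]]:
    """List the function names per pipeline and segment."""
    pdf_filter, text_extract, deserialize = get_pipes_stub(format_name)
    segments = {
        "pdf_filter": pdf_filter,
        "text_extract": text_extract,
        "deserialize": deserialize,
    }
    summary: Dict[str, Dict[str, List[str]]] = {}
    for segment_name, pipes in segments.items():
        for pipeline_name, functions in pipes.items():
            names = [fn.__name__ for fn in functions]
            summary.setdefault(pipeline_name, {})[segment_name] = names
    return summary
```

Since an unknown format produces empty dictionaries rather than an exception, callers that need to distinguish "no pipelines" from "format not configured" would have to check for emptiness themselves, as `summarize("UNKNOWN")` returning `{}` shows.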
Modules
Deserializing algorithms for semi-structured document processing.
PDF filtering algorithms for semi-structured document processing.
Text extraction algorithms for semi-structured document processing.