freeports_analysis.formats.algorithms.structured
Structured algorithm pipeline management.
This module handles the loading and configuration of structured PDF processing algorithms for formats with well-defined layouts and consistent data structures.
Functions
get_additional_args() | Gets and validates the additional args table
get_additional_headers() | Gets and validates the additional headers table
get_args() | Gets and validates the args table
get_deselection_lists() | Gets and validates the deselection list table
get_partial_pipes() | Gets and validates the partial pipes table
get_pipes(format_name) | Get processing pipelines for a specific structured format.
get_structured_formats() | Get complete structured formats configuration with all parameters.
validate_partial_pipes(segment, columns) | Create a validation function for partial pipeline configurations.
- freeports_analysis.formats.algorithms.structured.get_additional_args() → DataFrame
Gets and validates the additional args table.
- Returns:
Validated DataFrame
- Return type:
pd.DataFrame
- freeports_analysis.formats.algorithms.structured.get_additional_headers() → DataFrame
Gets and validates the additional headers table.
- Returns:
Validated DataFrame
- Return type:
pd.DataFrame
- freeports_analysis.formats.algorithms.structured.get_args() → DataFrame
Gets and validates the args table.
- Returns:
Validated DataFrame
- Return type:
pd.DataFrame
- freeports_analysis.formats.algorithms.structured.get_deselection_lists() → DataFrame
Gets and validates the deselection list table.
- Returns:
Validated DataFrame
- Return type:
pd.DataFrame
- freeports_analysis.formats.algorithms.structured.get_partial_pipes() → DataFrame
Gets and validates the partial pipes table.
- Returns:
Validated DataFrame
- Return type:
pd.DataFrame
- freeports_analysis.formats.algorithms.structured.get_pipes(format_name: str) → Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]
Get processing pipelines for a specific structured format.
- Parameters:
format_name (str) – Name of the format to get pipelines for
- Returns:
Tuple containing three dictionaries for pdf_filter, text_extract, and deserialize segments. Each dictionary maps pipeline names to lists of processing functions.
- Return type:
Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]
Notes
Returns empty dictionaries if the format name is not found in the mapping.
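Example
The sketch below only illustrates the documented return shape and how a caller might consume it. It does not import the library: fake_get_pipes, run_steps, the "demo" format, the "main" pipeline name, and the lambda steps are all hypothetical, and the idea that each function in a list receives the previous function's result is an assumption, not something this page states.

```python
from typing import Any, Callable, Dict, List, Tuple

Pipes = Dict[str, List[Callable]]


def fake_get_pipes(format_name: str) -> Tuple[Pipes, Pipes, Pipes]:
    """Hypothetical stand-in that mimics the documented return shape."""
    known = {
        "demo": (
            # pdf_filter segment: drop blank blocks
            {"main": [lambda blocks: [b for b in blocks if b.strip()]]},
            # text_extract segment: trim surrounding whitespace
            {"main": [lambda blocks: [b.strip() for b in blocks]]},
            # deserialize segment: split each row into fields
            {"main": [lambda rows: [r.split(";") for r in rows]]},
        )
    }
    # Documented behaviour: an unknown format yields three empty dictionaries.
    return known.get(format_name, ({}, {}, {}))


def run_steps(steps: List[Callable], value: Any) -> Any:
    """Apply each step to the previous result (the chaining is an assumption)."""
    for step in steps:
        value = step(value)
    return value


pdf_filter, text_extract, deserialize = fake_get_pipes("demo")
blocks = run_steps(pdf_filter["main"], [" ACME;10 ", "  ", "Foo;3"])
rows = run_steps(text_extract["main"], blocks)
records = run_steps(deserialize["main"], rows)

# An unknown format name gives empty dictionaries rather than raising.
missing = fake_get_pipes("no_such_format")
```

Because an unknown format produces empty dictionaries instead of an exception, callers that need to distinguish "no pipelines configured" from "format not found" may want to check for emptiness explicitly.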
- freeports_analysis.formats.algorithms.structured.get_structured_formats() → DataFrame
Get complete structured formats configuration with all parameters.
- Returns:
DataFrame containing all structured format configurations
- Return type:
pd.DataFrame
Notes
This function combines multiple configuration tables into a single comprehensive DataFrame with all parameters needed for structured PDF processing algorithms.
- freeports_analysis.formats.algorithms.structured.validate_partial_pipes(segment: str, columns: List[str]) → Callable[[DataFrame], Series]
Create a validation function for partial pipeline configurations.
This function generates a validator that ensures when a pipeline segment is disabled, the corresponding configuration columns are also empty.
- Parameters:
segment (str) – Name of the pipeline segment (‘pdf_filter’, ‘text_extract’, or ‘deserialize’)
columns (List[str]) – List of column names that should be empty when the segment is disabled
- Returns:
Validation function that returns a boolean Series indicating valid rows
- Return type:
Callable[[pd.DataFrame], pd.Series]
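Example
A validator factory of this kind could look roughly like the sketch below. It is not the library's implementation: make_partial_pipe_validator and the pdf_filter_func column are hypothetical, and it assumes that a boolean column named after the segment marks it as enabled and that "empty" means a missing value (NaN or None).

```python
from typing import Callable, List

import pandas as pd


def make_partial_pipe_validator(
    segment: str, columns: List[str]
) -> Callable[[pd.DataFrame], pd.Series]:
    """Build a row-wise validator (illustrative stand-in, not the library code)."""

    def _validate(df: pd.DataFrame) -> pd.Series:
        # Assumption: a boolean column named after the segment marks it as enabled.
        enabled = df[segment].astype(bool)
        # The dependent columns count as empty when every one of them is NaN/None.
        all_empty = df[columns].isna().all(axis=1)
        # A row is valid if the segment is enabled, or it is disabled and empty.
        return enabled | all_empty

    return _validate


table = pd.DataFrame(
    {
        "pdf_filter": [True, False, False],
        "pdf_filter_func": ["by_font", None, "by_font"],  # hypothetical column
    }
)
validator = make_partial_pipe_validator("pdf_filter", ["pdf_filter_func"])
valid_rows = validator(table)
```

In this sketch the third row is flagged as invalid because the segment is disabled while its configuration column still holds a value.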