freeports_analysis.formats.algorithms.structured

Structured algorithm pipeline management.

This module handles the loading and configuration of structured PDF processing algorithms for formats with well-defined layouts and consistent data structures.

Functions

get_additional_args()

Gets and validates the additional args table.

get_additional_headers()

Gets and validates the additional headers table.

get_args()

Gets and validates the args table.

get_deselection_lists()

Gets and validates the deselection lists table.

get_partial_pipes()

Gets and validates the partial pipes table.

get_pipes(format_name)

Get processing pipelines for a specific structured format.

get_structured_formats()

Get complete structured formats configuration with all parameters.

validate_partial_pipes(segment, columns)

Create a validation function for partial pipeline configurations.

freeports_analysis.formats.algorithms.structured.get_additional_args() → DataFrame

Gets and validates the additional args table.

Returns:

Validated DataFrame

Return type:

pd.DataFrame

freeports_analysis.formats.algorithms.structured.get_additional_headers() → DataFrame

Gets and validates the additional headers table.

Returns:

Validated DataFrame

Return type:

pd.DataFrame

freeports_analysis.formats.algorithms.structured.get_args() → DataFrame

Gets and validates the args table.

Returns:

Validated DataFrame

Return type:

pd.DataFrame

freeports_analysis.formats.algorithms.structured.get_deselection_lists() → DataFrame

Gets and validates the deselection lists table.

Returns:

Validated DataFrame

Return type:

pd.DataFrame

freeports_analysis.formats.algorithms.structured.get_partial_pipes() → DataFrame

Gets and validates the partial pipes table.

Returns:

Validated DataFrame

Return type:

pd.DataFrame

freeports_analysis.formats.algorithms.structured.get_pipes(format_name: str) → Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]

Get processing pipelines for a specific structured format.

Parameters:

format_name (str) – Name of the format to get pipelines for

Returns:

Tuple containing three dictionaries for pdf_filter, text_extract, and deserialize segments. Each dictionary maps pipeline names to lists of processing functions.

Return type:

Tuple[Dict[str, List[Callable]], Dict[str, List[Callable]], Dict[str, List[Callable]]]

Notes

Returns empty dictionaries if the format name is not found in the mapping.
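
The return shape described above can be illustrated with a small stand-in. This is only a sketch: get_pipes_stub, the "EXAMPLE" format, the "main" pipeline name, and the idea that the functions in each list are applied one after another are all assumptions made for illustration, not details taken from the package.

```python
from typing import Any, Callable, Dict, List, Tuple

Pipes = Dict[str, List[Callable]]


def get_pipes_stub(format_name: str) -> Tuple[Pipes, Pipes, Pipes]:
    """Hypothetical stand-in with the same return shape as get_pipes()."""
    known = {
        "EXAMPLE": (
            {"main": [str.strip]},                 # pdf_filter segment
            {"main": [str.upper]},                 # text_extract segment
            {"main": [lambda s: s.split(",")]},    # deserialize segment
        )
    }
    # Mirror the documented behaviour: unknown formats give empty dictionaries.
    return known.get(format_name, ({}, {}, {}))


def run_pipeline(steps: List[Callable], value: Any) -> Any:
    """Apply each step in order, feeding each result to the next step (assumed)."""
    for step in steps:
        value = step(value)
    return value


# Unpack the three segment dictionaries and run the "main" pipeline of each.
pdf_filter, text_extract, deserialize = get_pipes_stub("EXAMPLE")
value = " a,b "
for segment in (pdf_filter, text_extract, deserialize):
    value = run_pipeline(segment.get("main", []), value)

# An unknown format name yields three empty dictionaries rather than an error.
missing = get_pipes_stub("NOT_A_FORMAT")
```

Because an unknown format yields empty dictionaries instead of raising, callers that need to distinguish "no such format" from "format with no pipelines" would have to check for emptiness themselves.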

freeports_analysis.formats.algorithms.structured.get_structured_formats() → DataFrame

Get complete structured formats configuration with all parameters.

Returns:

DataFrame containing all structured format configurations

Return type:

pd.DataFrame

Notes

This function combines multiple configuration tables into a single comprehensive DataFrame with all parameters needed for structured PDF processing algorithms.
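
How the tables are combined is not documented here. As a rough sketch of the general idea, two configuration tables could be joined on a shared key with pandas. The column names ("format", "threshold", "extra") and the choice of a left join are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical configuration tables keyed by an assumed "format" column.
args = pd.DataFrame({"format": ["A", "B"], "threshold": [1, 2]})
additional_args = pd.DataFrame({"format": ["A"], "extra": ["x"]})

# A left join keeps every format from the base table; formats without
# additional arguments get missing values in the added columns.
combined = args.merge(additional_args, on="format", how="left")
```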

freeports_analysis.formats.algorithms.structured.validate_partial_pipes(segment: str, columns: List[str]) → Callable[[DataFrame], Series]

Create a validation function for partial pipeline configurations.

This function generates a validator that ensures that, whenever a pipeline segment is disabled, the corresponding configuration columns are also empty.

Parameters:
  • segment (str) – Name of the pipeline segment (‘pdf_filter’, ‘text_extract’, or ‘deserialize’)

  • columns (List[str]) – List of column names that should be empty when the segment is disabled

Returns:

Validation function that returns a boolean Series indicating valid rows

Return type:

Callable[[pd.DataFrame], pd.Series]
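
The rule described above (a row is valid if the segment is enabled, or if every listed column is empty) can be sketched as a closure. This is not the package's implementation: validate_partial_pipes_sketch is a hypothetical name, and the sketch assumes the table has a boolean column named after the segment and that "empty" means a missing value (None/NaN).

```python
from typing import Callable, List

import pandas as pd


def validate_partial_pipes_sketch(
    segment: str, columns: List[str]
) -> Callable[[pd.DataFrame], pd.Series]:
    """Build a row-wise check for one pipeline segment (illustrative only)."""

    def check(df: pd.DataFrame) -> pd.Series:
        # Assumption: the segment's on/off flag is a column named after it.
        enabled = df[segment].astype(bool)
        # Assumption: "empty" means a missing value in every listed column.
        all_empty = df[columns].isna().all(axis=1)
        # Valid when the segment is enabled, or disabled with nothing configured.
        return enabled | all_empty

    return check


table = pd.DataFrame(
    {
        "pdf_filter": [True, False, False],
        "pdf_filter_arg": ["x", None, "y"],
    }
)
check = validate_partial_pipes_sketch("pdf_filter", ["pdf_filter_arg"])
valid = check(table)  # third row is invalid: disabled but still configured
```

Returning a function rather than a result lets the same rule be built once per segment and then applied to any table that has the matching columns.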