freeports_analysis.formats.algorithms

Core algorithms module for PDF document processing pipelines.

This module provides the main execution functions for the three-stage processing pipeline:

1. PDF filtering - extract relevant blocks from PDF XML
2. Text extraction - convert PDF blocks to text blocks with company matching
3. Deserialization - convert text blocks to structured financial data

The module also handles pipeline composition and execution coordination.
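The three stages can be pictured as plain function calls, each consuming the output of the previous one. The sketch below is an illustration only: `Block`, `Text`, and the three stage functions are hypothetical stand-ins, not the package's `PdfBlock` and `TextBlock` types or its real filter functions, and the XML layout is invented for the example.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Block:  # hypothetical stand-in for PdfBlock
    content: str


@dataclass
class Text:  # hypothetical stand-in for TextBlock
    content: str
    company: str


def pdf_filter(xml_page: str) -> List[Block]:
    """Stage 1: keep only <line> elements that contain text."""
    root = ET.fromstring(xml_page)
    return [Block(el.text) for el in root.iter("line") if el.text]


def text_extract(blocks: List[Block], targets: List[str]) -> List[Text]:
    """Stage 2: keep blocks that mention a target company."""
    matched = []
    for block in blocks:
        for target in targets:
            if target.lower() in block.content.lower():
                matched.append(Text(block.content, target))
    return matched


def deserialize(text: Text) -> dict:
    """Stage 3: turn a matched text block into a structured record."""
    return {"company": text.company, "raw": text.content}


# Invented page layout, used only to drive the sketch.
page = "<page><line>ACME Corp 1,000 shares</line><line>Total</line></page>"
records = [deserialize(t) for t in text_extract(pdf_filter(page), ["ACME"])]
```

Here only the first line mentions a target, so `records` holds a single dictionary for it.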

Functions

deserialize_exec(i_batch_page, n_pages, ...)

Execute deserialization functions to convert TextBlocks to financial data objects.

get_pipelines(format_name[, ...])

Get processing pipelines for a specific format.

pdf_filter_exec(i_batch_page, n_pages, ...)

Execute PDF filtering functions to extract relevant blocks from PDF XML.

text_extract_exec(i_batch_page, n_pages, ...)

Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.

Classes

LogFormatterWithPage(old_formatter)

Log formatter that adds page number context to log messages.

class freeports_analysis.formats.algorithms.LogFormatterWithPage(old_formatter: Formatter)

Log formatter that adds page number context to log messages.

This formatter wraps an existing formatter and inserts page number information into formatted log records.

_parent_fmt

The original formatter to wrap

Type:

log.Formatter

page

Current page number for context

Type:

Optional[int]

format(record: LogRecord) → str

Format a log record with page number context.

Parameters:

record (log.LogRecord) – The log record to format

Returns:

Formatted log message with page number inserted

Return type:

str
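A wrapping formatter of this kind can be sketched with the standard `logging` module. The class below is a hypothetical illustration, not the package's implementation: the name `PageFormatter` and the choice to prefix the message with the page number are assumptions, since the documentation does not say where the page number is inserted.

```python
import logging
from typing import Optional


class PageFormatter(logging.Formatter):
    """Hypothetical sketch of a formatter that wraps another formatter."""

    def __init__(self, old_formatter: logging.Formatter):
        super().__init__()
        self._parent_fmt = old_formatter  # the original formatter to wrap
        self.page: Optional[int] = None   # current page number, if any

    def format(self, record: logging.LogRecord) -> str:
        message = self._parent_fmt.format(record)
        if self.page is None:
            return message
        # Assumed placement: prefix the message. The real class may differ.
        return f"[page {self.page}] {message}"


formatter = PageFormatter(logging.Formatter("%(levelname)s: %(message)s"))
record = logging.LogRecord(
    "demo", logging.INFO, "demo.py", 1, "parsed block", None, None
)

without_page = formatter.format(record)
formatter.page = 3
with_page = formatter.format(record)
```

Delegating to the wrapped formatter keeps whatever layout the handler was already configured with, so only the page context is added.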

freeports_analysis.formats.algorithms.deserialize_exec(i_batch_page: int, n_pages: int, text_blocks_batch: List[List[TextBlock]], deserialize_funcs: List[Callable[[TextBlock], Investment | Dict[str, Promise | Any]]]) → List[List[Investment | Dict[str, Promise | Any]]]

Execute deserialization functions to convert TextBlocks to financial data objects.

Parameters:
  • i_batch_page (int) – Starting page index for this batch

  • n_pages (int) – Total number of pages in the document

  • text_blocks_batch (List[List[TextBlock]]) – Batch of TextBlock lists to process

  • deserialize_funcs (List[Callable[[TextBlock], Union[Investment, PromisesResolutionContext]]]) – List of deserialization functions

Returns:

Nested list of financial data objects or promise contexts, one inner list per input TextBlock list

Return type:

List[List[Union[Investment, PromisesResolutionContext]]]

freeports_analysis.formats.algorithms.get_pipelines(format_name: str, allow_partial_pipelines: bool = False) → Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]

Get processing pipelines for a specific format.

Combines structured, semi-structured, and unstructured pipelines for the given format.

Parameters:
  • format_name (str) – Name of the format to get pipelines for

  • allow_partial_pipelines (bool) – Whether to allow pipelines with missing components

Returns:

Dictionary mapping pipeline names to (pdf_filters, text_extract, deserialize) tuples

Return type:

Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]

Raises:

ValueError – If required pipeline components are missing and allow_partial_pipelines is False

Notes

Each pipeline consists of three components:
  • pdf_filters: Functions that extract relevant blocks from PDF XML

  • text_extract: Functions that convert PDF blocks to text blocks with company matching

  • deserialize: Functions that convert text blocks to structured financial data

The function combines pipelines from structured, semi-structured, and unstructured processing approaches to provide comprehensive format support.
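The combination and validation behaviour described here can be sketched as a dictionary merge followed by a completeness check. This is a minimal illustration under stated assumptions: `combine_pipelines` is a hypothetical helper, and treating an empty function list as a "missing component" is a guess about how partial pipelines are represented.

```python
from typing import Callable, Dict, List, Tuple

Pipeline = Tuple[List[Callable], List[Callable], List[Callable]]
STAGES = ("pdf_filters", "text_extract", "deserialize")


def combine_pipelines(
    sources: List[Dict[str, Pipeline]],
    allow_partial_pipelines: bool = False,
) -> Dict[str, Pipeline]:
    """Merge pipeline dictionaries, rejecting incomplete ones unless allowed."""
    combined: Dict[str, Pipeline] = {}
    for source in sources:
        for name, pipeline in source.items():
            # Assumption: a stage with no functions counts as missing.
            missing = [s for s, funcs in zip(STAGES, pipeline) if not funcs]
            if missing and not allow_partial_pipelines:
                raise ValueError(
                    f"pipeline {name!r} is missing: {', '.join(missing)}"
                )
            combined[name] = pipeline
    return combined


# Placeholder callables stand in for real stage functions.
structured = {"table": ([str.strip], [str.split], [len])}
unstructured = {"free_text": ([str.strip], [], [len])}

partial = combine_pipelines(
    [structured, unstructured], allow_partial_pipelines=True
)
```

Calling `combine_pipelines([structured, unstructured])` without the flag raises `ValueError` in this sketch, because `free_text` has no text extraction functions.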

freeports_analysis.formats.algorithms.pdf_filter_exec(i_batch_page: int, n_pages: int, batch_pages: List[str], pdf_filter_funcs: List[Callable[[str], List[PdfBlock]]]) → List[List[PdfBlock]]

Execute PDF filtering functions to extract relevant blocks from PDF XML.

Parameters:
  • i_batch_page (int) – Starting page index for this batch

  • n_pages (int) – Total number of pages in the document

  • batch_pages (List[str]) – List of XML page strings to process

  • pdf_filter_funcs (List[Callable[[str], List[PdfBlock]]]) – List of functions that extract PdfBlocks from XML

Returns:

List of PdfBlock lists, one per page

Return type:

List[List[PdfBlock]]
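The batch-oriented signature can be illustrated with a small runner that yields one result list per page. The sketch makes two assumptions that the documentation does not confirm: `i_batch_page` is a zero-based offset of the first page in the batch, and every filter function runs on every page with the results concatenated. `run_filters` and `keep_nonempty` are hypothetical names, and plain strings stand in for `PdfBlock` objects.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("pdf_filter_sketch")


def run_filters(
    i_batch_page: int,
    n_pages: int,
    batch_pages: List[str],
    filter_funcs: List[Callable[[str], List[str]]],
) -> List[List[str]]:
    """Hypothetical batch runner: returns one list of blocks per page."""
    results: List[List[str]] = []
    # Assumption: i_batch_page is the zero-based index of the first page.
    for page_number, page_xml in enumerate(batch_pages, start=i_batch_page):
        logger.debug("filtering page %d of %d", page_number + 1, n_pages)
        blocks: List[str] = []
        # Assumption: all filters run on each page; results are concatenated.
        for func in filter_funcs:
            blocks.extend(func(page_xml))
        results.append(blocks)
    return results


def keep_nonempty(page_xml: str) -> List[str]:
    """Toy filter: keep the page as a single block unless it is blank."""
    return [page_xml] if page_xml.strip() else []


out = run_filters(4, 10, ["<page>a</page>", "  "], [keep_nonempty])
```

The result keeps an empty list for the blank page, so the output stays aligned with the input batch.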

freeports_analysis.formats.algorithms.text_extract_exec(i_batch_page: int, n_pages: int, pdf_blocks_batch: List[List[PdfBlock]], targets: List[str], text_extract_funcs: List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) → List[List[TextBlock]]

Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.

Parameters:
  • i_batch_page (int) – Starting page index for this batch

  • n_pages (int) – Total number of pages in the document

  • pdf_blocks_batch (List[List[PdfBlock]]) – Batch of PdfBlock lists to process

  • targets (List[str]) – Target companies for matching

  • text_extract_funcs (List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) – List of text extraction functions

Returns:

List of TextBlock lists, one per page

Return type:

List[List[TextBlock]]

Modules

commons

Common utilities and data structures for algorithm pipeline management.

semistructured

Semi-structured algorithm pipeline management.

structured

Structured algorithm pipeline management.

unstructured

Unstructured algorithm pipeline management.