freeports_analysis.formats.algorithms

Core algorithms module for PDF document processing pipelines.

This module provides the main execution functions for the three-stage processing pipeline:

1. PDF filtering - extract relevant blocks from PDF XML
2. Text extraction - convert PDF blocks to text blocks with company matching
3. Deserialization - convert text blocks to structured financial data

The module also handles pipeline composition and execution coordination.
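The three stages above can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in: plain strings and dicts replace PdfBlock, TextBlock and Investment, and the `demo_*` functions are illustrative only, not the functions this module ships.

```python
from typing import Dict, List


def demo_pdf_filter(xml_page: str) -> List[str]:
    """Stage 1 stand-in: keep the non-empty lines of a page as 'blocks'."""
    return [line.strip() for line in xml_page.splitlines() if line.strip()]


def demo_text_extract(blocks: List[str], targets: List[str]) -> List[str]:
    """Stage 2 stand-in: keep blocks that mention a target company."""
    return [b for b in blocks if any(t.lower() in b.lower() for t in targets)]


def demo_deserialize(text_block: str) -> Dict[str, object]:
    """Stage 3 stand-in: turn 'Company; amount' into a structured record."""
    company, amount = text_block.split(";")
    return {"company": company.strip(), "amount": float(amount)}


def run_demo_pipeline(pages: List[str], targets: List[str]) -> List[List[dict]]:
    """Chain the three stages page by page, returning one list per page."""
    results = []
    for page in pages:
        blocks = demo_pdf_filter(page)
        text_blocks = demo_text_extract(blocks, targets)
        results.append([demo_deserialize(tb) for tb in text_blocks])
    return results


demo_pages = ["Acme Corp; 1200.5\nOther Ltd; 10\n", "\nnothing here\n"]
demo_result = run_demo_pipeline(demo_pages, ["Acme"])
```

Each stage consumes the output of the previous one, which is why the real execution functions are typed as List[str] to List[List[PdfBlock]], then to List[List[TextBlock]], then to the financial data objects.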

Functions

`deserialize_exec`(i_batch_page, n_pages, ...)	Execute deserialization functions to convert TextBlocks to financial data objects.
`get_pipelines`(format_name[, ...])	Get processing pipelines for a specific format.
`pdf_filter_exec`(i_batch_page, n_pages, ...)	Execute PDF filtering functions to extract relevant blocks from PDF XML.
`text_extract_exec`(i_batch_page, n_pages, ...)	Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.

Classes

`LogFormatterWithPage`(old_formatter)	Log formatter that adds page number context to log messages.

class freeports_analysis.formats.algorithms.LogFormatterWithPage(old_formatter: Formatter)

Log formatter that adds page number context to log messages.

This formatter wraps an existing formatter and inserts page number information into formatted log records.

_parent_fmt

The original formatter to wrap

Type: log.Formatter

page

Current page number for context

Type: Optional[int]

format(record: LogRecord) → str

Format a log record with page number context.

Parameters: record (log.LogRecord) – The log record to format
Returns: Formatted log message with page number inserted
Return type: str
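The wrapping behaviour described above can be sketched with the standard `logging` module. `PageAwareFormatter` is a hypothetical analogue, not the class documented here; in particular, placing the page number as a `[page N]` prefix is an assumption, since the source only says the page number is inserted into the formatted message.

```python
import logging
from typing import Optional


class PageAwareFormatter(logging.Formatter):
    """Hypothetical analogue: delegate to a wrapped formatter, add page context."""

    def __init__(self, parent_fmt: logging.Formatter, page: Optional[int] = None):
        super().__init__()
        self._parent_fmt = parent_fmt  # the original formatter being wrapped
        self.page = page               # current page number, or None

    def format(self, record: logging.LogRecord) -> str:
        # Let the wrapped formatter do the real formatting work.
        message = self._parent_fmt.format(record)
        if self.page is None:
            return message
        # Assumed placement: prefix the message with the page number.
        return f"[page {self.page}] {message}"


base_formatter = logging.Formatter("%(levelname)s: %(message)s")
page_formatter = PageAwareFormatter(base_formatter)
demo_record = logging.LogRecord(
    "demo", logging.INFO, "demo.py", 1, "block parsed", None, None
)

page_formatter.page = 3
with_page = page_formatter.format(demo_record)

page_formatter.page = None
without_page = page_formatter.format(demo_record)
```

Keeping the page as a mutable attribute lets a caller update it while iterating over pages without reinstalling the formatter on the handler.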

freeports_analysis.formats.algorithms.deserialize_exec(i_batch_page: int, n_pages: int, text_blocks_batch: List[List[TextBlock]], deserialize_funcs: List[Callable[[TextBlock], Investment | Dict[str, Promise | Any]]]) → List[List[Investment | Dict[str, Promise | Any]]]

Execute deserialization functions to convert TextBlocks to financial data objects.

Parameters:

i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
text_blocks_batch (List[List[TextBlock]]) – Batch of TextBlock lists to process
deserialize_funcs (List[Callable[[TextBlock], Union[Investment, PromisesResolutionContext]]]) – List of deserialization functions

Returns:

List of financial data objects or promise contexts

Return type:

List[List[Union[Investment, PromisesResolutionContext]]]

freeports_analysis.formats.algorithms.get_pipelines(format_name: str, allow_partial_pipelines: bool = False) → Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]

Get processing pipelines for a specific format.

Combines structured, semi-structured, and unstructured pipelines for the given format.

Parameters:

format_name (str) – Name of the format to get pipelines for
allow_partial_pipelines (bool) – Whether to allow pipelines with missing components

Returns:

Dictionary mapping pipeline names to (pdf_filters, text_extract, deserialize) tuples

Return type:

Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]

Raises:

ValueError – If required pipeline components are missing and allow_partial_pipelines is False

Notes

Each pipeline consists of three components:

- pdf_filters: Functions that extract relevant blocks from PDF XML
- text_extract: Functions that convert PDF blocks to text blocks with company matching
- deserialize: Functions that convert text blocks to structured financial data

The function combines pipelines from structured, semi-structured, and unstructured processing approaches to provide comprehensive format support.
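The shape of the returned dictionary and the role of allow_partial_pipelines can be illustrated with a small sketch. `combine_pipelines` is a hypothetical helper, and treating an empty function list as a "missing component" is an assumption; the source does not say how missing components are detected.

```python
from typing import Callable, Dict, List, Tuple

# (pdf_filters, text_extract, deserialize), as in the documented return type.
Pipeline = Tuple[List[Callable], List[Callable], List[Callable]]


def combine_pipelines(
    sources: List[Dict[str, Pipeline]], allow_partial: bool = False
) -> Dict[str, Pipeline]:
    """Merge several name -> pipeline mappings into one dictionary."""
    combined: Dict[str, Pipeline] = {}
    for source in sources:
        for name, pipeline in source.items():
            pdf_filters, text_extract, deserialize = pipeline
            complete = bool(pdf_filters and text_extract and deserialize)
            # Assumption: an empty list counts as a missing component.
            if not complete and not allow_partial:
                raise ValueError(f"pipeline {name!r} is missing a component")
            combined[name] = pipeline
    return combined


# Placeholder callables stand in for real stage functions.
structured_demo = {"table": ([str.strip], [str.upper], [int])}
unstructured_demo = {"free_text": ([str.strip], [], [])}

combined_demo = combine_pipelines(
    [structured_demo, unstructured_demo], allow_partial=True
)
```

With `allow_partial=False` the same call raises ValueError because the `free_text` entry has empty text_extract and deserialize lists.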

freeports_analysis.formats.algorithms.pdf_filter_exec(i_batch_page: int, n_pages: int, batch_pages: List[str], pdf_filter_funcs: List[Callable[[str], List[PdfBlock]]]) → List[List[PdfBlock]]

Execute PDF filtering functions to extract relevant blocks from PDF XML.

Parameters:

i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
batch_pages (List[str]) – List of XML page strings to process
pdf_filter_funcs (List[Callable[[str], List[PdfBlock]]]) – List of functions that extract PdfBlocks from XML

Returns:

List of PdfBlock lists, one per page

Return type:

List[List[PdfBlock]]
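A rough sketch of how a batch-level executor might use i_batch_page and n_pages follows. It rests on several assumptions the source does not confirm: that the function list is paired one-to-one with the pages in the batch, that i_batch_page is zero-based, and that token strings can stand in for PdfBlock objects. `run_filters_on_batch` and `demo_filter` are hypothetical names.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("demo.pdf_filter")


def run_filters_on_batch(
    i_batch_page: int,
    n_pages: int,
    batch_pages: List[str],
    filter_funcs: List[Callable[[str], List[str]]],
) -> List[List[str]]:
    """Apply one filter function per page and return one block list per page."""
    if len(batch_pages) != len(filter_funcs):
        raise ValueError("expected one filter function per page in the batch")
    results: List[List[str]] = []
    for offset, (page, func) in enumerate(zip(batch_pages, filter_funcs)):
        # Absolute page position, assuming a zero-based batch start index.
        page_number = i_batch_page + offset
        if page_number >= n_pages:
            raise IndexError(f"page {page_number} is outside 0..{n_pages - 1}")
        logger.debug("filtering page %d of %d", page_number + 1, n_pages)
        results.append(func(page))
    return results


def demo_filter(xml_page: str) -> List[str]:
    """Stand-in filter: whitespace-separated tokens play the role of blocks."""
    return xml_page.split()


batch_result = run_filters_on_batch(
    2, 10, ["alpha beta", "gamma"], [demo_filter, demo_filter]
)
```

The result has exactly one inner list per input page, matching the documented return shape.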

freeports_analysis.formats.algorithms.text_extract_exec(i_batch_page: int, n_pages: int, pdf_blocks_batch: List[List[PdfBlock]], targets: List[str], text_extract_funcs: List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) → List[List[TextBlock]]

Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.

Parameters:

i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
pdf_blocks_batch (List[List[PdfBlock]]) – Batch of PdfBlock lists to process
targets (List[str]) – Target companies for matching
text_extract_funcs (List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) – List of text extraction functions

Returns:

List of TextBlock lists, one per page

Return type:

List[List[TextBlock]]

Modules

`commons`	Common utilities and data structures for algorithm pipeline management.
`semistructured`	Semi-structured algorithm pipeline management.
`structured`	Structured algorithm pipeline management.
`unstructured`	Unstructured algorithm pipeline management.