freeports_analysis.formats.algorithms
Core algorithms module for PDF document processing pipelines.
This module provides the main execution functions for the three-stage processing pipeline:
1. PDF filtering - extract relevant blocks from PDF XML
2. Text extraction - convert PDF blocks to text blocks with company matching
3. Deserialization - convert text blocks to structured financial data
The module also handles pipeline composition and execution coordination.
Functions
deserialize_exec | Execute deserialization functions to convert TextBlocks to financial data objects.
get_pipelines | Get processing pipelines for a specific format.
pdf_filter_exec | Execute PDF filtering functions to extract relevant blocks from PDF XML.
text_extract_exec | Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.
Classes
LogFormatterWithPage | Log formatter that adds page number context to log messages.
- class freeports_analysis.formats.algorithms.LogFormatterWithPage(old_formatter: Formatter)
Log formatter that adds page number context to log messages.
This formatter wraps an existing formatter and inserts page number information into formatted log records.
- _parent_fmt
The original formatter to wrap
- Type:
log.Formatter
- page
Current page number for context
- Type:
Optional[int]
- format(record: LogRecord) str
Format a log record with page number context.
- Parameters:
record (log.LogRecord) – The log record to format
- Returns:
Formatted log message with page number inserted
- Return type:
str
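To illustrate the wrapping approach described for this class, here is a minimal sketch built on the standard `logging` module. The class name `PageFormatter` and the `[page N]` prefix are assumptions made for illustration; where and how `LogFormatterWithPage` actually inserts the page number is not documented here.

```python
import logging
from typing import Optional


class PageFormatter(logging.Formatter):
    """Hypothetical stand-in: wraps a formatter and adds the page number."""

    def __init__(self, old_formatter: logging.Formatter):
        super().__init__()
        self._parent_fmt = old_formatter  # the original formatter to wrap
        self.page: Optional[int] = None   # current page, set by the caller

    def format(self, record: logging.LogRecord) -> str:
        # Delegate the real formatting to the wrapped formatter.
        message = self._parent_fmt.format(record)
        if self.page is None:
            return message
        # Assumed layout: prefix the already formatted message.
        return f"[page {self.page}] {message}"


base = logging.Formatter("%(levelname)s: %(message)s")
formatter = PageFormatter(base)

record = logging.LogRecord(
    name="demo",
    level=logging.INFO,
    pathname="demo.py",
    lineno=1,
    msg="found %d blocks",
    args=(3,),
    exc_info=None,
)

formatter.page = 7
formatted = formatter.format(record)
print(formatted)
```

Because the wrapper only post-processes the parent's output, it can be installed on an existing handler without changing that handler's configured format string.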
- freeports_analysis.formats.algorithms.deserialize_exec(i_batch_page: int, n_pages: int, text_blocks_batch: List[List[TextBlock]], deserialize_funcs: List[Callable[[TextBlock], Investment | Dict[str, Promise | Any]]]) List[List[Investment | Dict[str, Promise | Any]]]
Execute deserialization functions to convert TextBlocks to financial data objects.
- Parameters:
i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
text_blocks_batch (List[List[TextBlock]]) – Batch of TextBlock lists to process
deserialize_funcs (List[Callable[[TextBlock], Union[Investment, PromisesResolutionContext]]]) – List of deserialization functions
- Returns:
List of financial data objects or promise contexts
- Return type:
List[List[Union[Investment, PromisesResolutionContext]]]
- freeports_analysis.formats.algorithms.get_pipelines(format_name: str, allow_partial_pipelines: bool = False) Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]
Get processing pipelines for a specific format.
Combines structured, semi-structured, and unstructured pipelines for the given format.
- Parameters:
format_name (str) – Name of the format to get pipelines for
allow_partial_pipelines (bool) – Whether to allow pipelines with missing components. Defaults to False
- Returns:
Dictionary mapping pipeline names to (pdf_filters, text_extract, deserialize) tuples
- Return type:
Dict[str, Tuple[List[Callable], List[Callable], List[Callable]]]
- Raises:
ValueError – If required pipeline components are missing and allow_partial_pipelines is False
Notes
Each pipeline consists of three components:
- pdf_filters: Functions that extract relevant blocks from PDF XML
- text_extract: Functions that convert PDF blocks to text blocks with company matching
- deserialize: Functions that convert text blocks to structured financial data
The function combines pipelines from structured, semi-structured, and unstructured processing approaches to provide comprehensive format support.
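The shape of the returned dictionary and the effect of `allow_partial_pipelines` can be sketched as follows. This is not the library's implementation: the helper `combine_pipelines`, the pipeline names, and the reading of "missing component" as "empty stage list" are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# (pdf_filters, text_extract, deserialize), as in the documented return type.
Pipeline = Tuple[List[Callable], List[Callable], List[Callable]]


def combine_pipelines(
    sources: List[Dict[str, Pipeline]],
    allow_partial_pipelines: bool = False,
) -> Dict[str, Pipeline]:
    """Hypothetical helper: merge pipeline dictionaries from several sources."""
    combined: Dict[str, Pipeline] = {}
    for source in sources:
        for name, pipeline in source.items():
            # Assumption: a stage counts as missing when its list is empty.
            complete = all(len(stage) > 0 for stage in pipeline)
            if not complete and not allow_partial_pipelines:
                raise ValueError(f"pipeline {name!r} is missing a stage")
            combined[name] = pipeline
    return combined


# Placeholder callables stand in for real stage functions.
structured = {"table": ([str.strip], [str.split], [len])}
unstructured = {"free_text": ([str.strip], [], [])}

pipelines = combine_pipelines(
    [structured, unstructured], allow_partial_pipelines=True
)
print(sorted(pipelines))
```

With the default `allow_partial_pipelines=False`, the same call raises `ValueError` in this sketch because `free_text` has empty stages.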
- freeports_analysis.formats.algorithms.pdf_filter_exec(i_batch_page: int, n_pages: int, batch_pages: List[str], pdf_filter_funcs: List[Callable[[str], List[PdfBlock]]]) List[List[PdfBlock]]
Execute PDF filtering functions to extract relevant blocks from PDF XML.
- Parameters:
i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
batch_pages (List[str]) – List of XML page strings to process
pdf_filter_funcs (List[Callable[[str], List[PdfBlock]]]) – List of functions that extract PdfBlocks from XML
- Returns:
List of PdfBlock lists, one per page
- Return type:
List[List[PdfBlock]]
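A batch executor with this signature could look like the sketch below. It assumes that every filter is applied to every page and that the results are concatenated per page, and that `i_batch_page` is a 0-based index used only for log context; none of this is confirmed by the documentation. `PdfBlock` is replaced by a hypothetical dataclass, and `run_filters` and `bold_lines` are illustrative names.

```python
import logging
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

logger = logging.getLogger("pdf_filter_sketch")


@dataclass
class PdfBlock:
    """Hypothetical stand-in for the library's PdfBlock type."""
    content: str


def run_filters(
    i_batch_page: int,
    n_pages: int,
    batch_pages: List[str],
    pdf_filter_funcs: List[Callable[[str], List[PdfBlock]]],
) -> List[List[PdfBlock]]:
    """Apply every filter to every page (assumed behaviour)."""
    results: List[List[PdfBlock]] = []
    for offset, xml_page in enumerate(batch_pages):
        # Assumption: i_batch_page is 0-based; used here only for logging.
        page_number = i_batch_page + offset + 1
        logger.debug("filtering page %d of %d", page_number, n_pages)
        blocks: List[PdfBlock] = []
        for func in pdf_filter_funcs:
            blocks.extend(func(xml_page))
        results.append(blocks)
    return results


def bold_lines(xml_page: str) -> List[PdfBlock]:
    """Hypothetical filter: keep the text of <line> elements marked bold."""
    root = ET.fromstring(xml_page)
    return [
        PdfBlock(el.text or "")
        for el in root.iter("line")
        if el.get("bold") == "1"
    ]


pages = [
    '<page><line bold="1">ACME Corp</line><line>footer</line></page>',
    "<page><line>nothing here</line></page>",
]
filtered = run_filters(0, 2, pages, [bold_lines])
print(filtered)
```

The result keeps one inner list per input page, including an empty list for pages where no filter matched, which mirrors the documented "one per page" return value.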
- freeports_analysis.formats.algorithms.text_extract_exec(i_batch_page: int, n_pages: int, pdf_blocks_batch: List[List[PdfBlock]], targets: List[str], text_extract_funcs: List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) List[List[TextBlock]]
Execute text extraction functions to convert PdfBlocks to TextBlocks with company matching.
- Parameters:
i_batch_page (int) – Starting page index for this batch
n_pages (int) – Total number of pages in the document
pdf_blocks_batch (List[List[PdfBlock]]) – Batch of PdfBlock lists to process
targets (List[str]) – Target companies for matching
text_extract_funcs (List[Callable[[List[PdfBlock], Any], List[TextBlock]]]) – List of text extraction functions
- Returns:
List of TextBlock lists, one per page
- Return type:
List[List[TextBlock]]
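The "company matching" step can be illustrated with a small extraction function of the documented callable shape. The sketch assumes that the second argument of each extraction function receives the `targets` list and that matching is a case-insensitive substring test; both are assumptions, and `PdfBlock`, `TextBlock`, `match_company`, and `extract_matching` are hypothetical stand-ins rather than the library's own definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdfBlock:
    """Hypothetical stand-in for the library's PdfBlock type."""
    content: str


@dataclass
class TextBlock:
    """Hypothetical stand-in for the library's TextBlock type."""
    text: str
    company: str


def match_company(text: str, targets: List[str]) -> Optional[str]:
    """Return the first target found in the text, ignoring case."""
    lowered = text.lower()
    for target in targets:
        if target.lower() in lowered:
            return target
    return None


def extract_matching(pdf_blocks: List[PdfBlock], targets: List[str]) -> List[TextBlock]:
    """Hypothetical extractor: keep only blocks that mention a target company."""
    text_blocks: List[TextBlock] = []
    for block in pdf_blocks:
        company = match_company(block.content, targets)
        if company is not None:
            text_blocks.append(TextBlock(block.content, company))
    return text_blocks


pdf_blocks_batch = [
    [PdfBlock("ACME Corp 5% bond"), PdfBlock("Total")],
    [PdfBlock("Globex shares")],
]
targets = ["Acme Corp", "Globex"]

# One list of TextBlocks per page, as in the documented return value.
text_blocks_batch = [extract_matching(page, targets) for page in pdf_blocks_batch]
print(text_blocks_batch)
```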
Modules
Common utilities and data structures for algorithm pipeline management.
Semi-structured algorithm pipeline management.
Structured algorithm pipeline management.
Unstructured algorithm pipeline management.