freeports_analysis.formats.utils.text_extract
Module for text block processing and extraction in document analysis.
This module provides functionality for:
- Defining text block types through enumerations
- Matching text against targets using various matching strategies
- Extracting text blocks from PDF documents based on target matches
- Supporting different matching methods (exact, fuzzy, prefix-based)
Key components:
- Decorators for text block type definition (one_txt_blk, EquityBondTextBlockType)
- Standard text extraction functionality through the standard_text_extraction decorator
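The three matching methods named above (exact, fuzzy, prefix-based) can be illustrated with a minimal sketch. The function names, the normalization step, and the 0.85 threshold are all assumptions made for illustration; they are not taken from this module or its matching submodule.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def match_exact(text: str, target: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(text) == normalize(target)


def match_prefix(text: str, target: str) -> bool:
    """True when the normalized text starts with the normalized target."""
    return normalize(text).startswith(normalize(target))


def match_fuzzy(text: str, target: str, threshold: float = 0.85) -> bool:
    """True when the similarity ratio reaches the (hypothetical) threshold."""
    ratio = SequenceMatcher(None, normalize(text), normalize(target)).ratio()
    return ratio >= threshold
```

A prefix match is useful when a row continues after the company name (for example with a coupon or maturity), while a fuzzy match tolerates small extraction errors at the cost of possible false positives.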
Functions
standard_text_extraction | Decorator for defining standard text extraction logic from PDF blocks based on target matches.
standard_text_extraction_loop | Decorator for standard text extraction loop.
Classes
EquityBondTextBlockType | Enum representing two types of text blocks in document processing.
PdfBlocksTable | Represents a table structure of PDF blocks organized by row and column.
- class freeports_analysis.formats.utils.text_extract.EquityBondTextBlockType(*values)
Enum representing two types of text blocks in document processing.
- BOND_TARGET
Text block containing target Bond row.
- Type:
enum
- EQUITY_TARGET
Text block containing target Equity row.
- Type:
enum
- class freeports_analysis.formats.utils.text_extract.PdfBlocksTable(pdf_blocks)
Represents a table structure of PDF blocks organized by row and column.
This class provides a tabular view of PDF blocks based on their row and column metadata, enabling efficient access and manipulation of blocks in a grid-like structure. It transforms a flat list of PDF blocks into a 2D table structure for easier navigation and manipulation of tabular data extracted from PDF documents.
- Parameters:
pdf_blocks (List[PdfBlock]) – A list of PDF blocks that should have ‘table-row’ and ‘table-col’ metadata indicating their position in the table structure.
- _table_indexes
Index mapping from table coordinates to block indices
- Type:
List[List[List[int]]]
- _table
Table structure containing PDF blocks organized by row and column
- Type:
List[List[List[PdfBlock]]]
Notes
- The table structure allows for sparse tables (empty cells).
- Multiple blocks can occupy the same cell (represented as lists).
- Row and column indices start from 0.
- The shape property provides the table dimensions.
Examples
>>> # Assuming blocks have table-row and table-col metadata
>>> table = PdfBlocksTable(pdf_blocks)
>>> print(f"Table shape: {table.shape}")  # e.g. 5 rows, 3 columns
Table shape: (5, 3)
>>>
>>> # Access a specific cell
>>> cell_content = table[2, 1]  # Row 2, Column 1
>>>
>>> # Iterate through all blocks
>>> for block in table:
...     process_block(block)
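To make the flat-list-to-grid transformation concrete, here is a minimal sketch of how such a structure could be built. It is not the class's actual implementation: Block is a hypothetical stand-in for PdfBlock, and build_table is an illustrative helper that returns both the nested block lists and the matching index lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for PdfBlock: text plus positional metadata."""
    content: str
    metadata: Dict[str, int] = field(default_factory=dict)


def build_table(
    blocks: List[Block],
) -> Tuple[List[List[List[Block]]], List[List[List[int]]]]:
    """Group a flat block list into table[row][col] -> list of blocks."""
    if not blocks:
        return [], []
    n_rows = max(b.metadata["table-row"] for b in blocks) + 1
    n_cols = max(b.metadata["table-col"] for b in blocks) + 1
    # Every cell starts as an empty list, so sparse tables are allowed.
    table = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    indexes = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    for position, block in enumerate(blocks):
        row, col = block.metadata["table-row"], block.metadata["table-col"]
        table[row][col].append(block)       # several blocks may share a cell
        indexes[row][col].append(position)  # remember the flat-list index
    return table, indexes


blocks = [
    Block("Acme Corp", {"table-row": 0, "table-col": 0}),
    Block("25,300.00", {"table-row": 0, "table-col": 2}),
    Block("Globex", {"table-row": 1, "table-col": 0}),
    Block("Ltd", {"table-row": 1, "table-col": 0}),
]
table, indexes = build_table(blocks)
shape = (len(table), len(table[0]))
```

In this sketch the cell at row 0, column 1 stays empty, and the cell at row 1, column 0 holds two blocks, which mirrors the sparse and multi-block cases described in the Notes.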
- merge(j, i)
Merge two blocks by combining their content.
- Parameters:
j (int) – Index of first block to merge
i (int) – Index of second block to merge
Notes
The content of both blocks is concatenated and stored in the block with the lower index. The higher-indexed block is removed.
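The merge rule can be sketched on a plain list of strings. This is an illustration only: merge_contents is a hypothetical helper, and both the concatenation order (lower index first) and the single-space separator are assumptions not stated in the documentation.

```python
from typing import List


def merge_contents(contents: List[str], j: int, i: int, sep: str = " ") -> None:
    """Merge entries j and i in place, keeping the result at the lower index."""
    if j == i:
        raise ValueError("cannot merge a block with itself")
    low, high = sorted((j, i))
    # Assumed order: lower-indexed content first, then the higher-indexed one.
    contents[low] = contents[low] + sep + contents[high]
    # The higher-indexed entry is removed, shifting later entries down by one.
    del contents[high]


cells = ["Acme", "Corp", "25,300.00"]
merge_contents(cells, 1, 0)
```

Because the higher-indexed entry is deleted, any index that pointed past it must be decremented by one afterwards.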
- pop(j)
Remove a block from the table by index.
- Parameters:
j (int) – Index of the block to remove
Notes
Updates the table structure and adjusts row numbers for blocks that come after the removed row.
- property shape
Table dimensions.
- Returns:
(number of rows, number of columns)
- Return type:
Tuple[int, int]
- freeports_analysis.formats.utils.text_extract.standard_text_extraction(market_value_pos: int, nominal_quantity_pos: int | None = None, perc_net_assets_pos: int | None = None, acquisition_currency_pos: int | None = None, acquisition_cost_pos: int | None = None, geometrical_indexes=True, merge_prev=False)
Decorator for defining standard text extraction logic from PDF blocks based on target matches.
- Parameters:
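market_value_pos (int) – Relative position for market value metadata
nominal_quantity_pos (Optional[int], optional) – Relative position for nominal quantity metadata, by default None
perc_net_assets_pos (Optional[int], optional) – Relative position for percentage of net assets metadata, by default None
acquisition_currency_pos (Optional[Currency], optional) – Either relative position for currency metadata or Currency enum value, by default None
acquisition_cost_pos (Optional[int], optional) – Relative position for acquisition cost metadata, by default None
geometrical_indexes (bool, optional) – Whether to use (row, column) coordinates instead of linear indices, by default True
merge_prev (bool, optional) – Whether to merge with previous block instead of next block, by default False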
- Returns:
A wrapped text extraction function that processes PDF blocks and returns matched TextBlock objects
- Return type:
callable
Notes
The decorated function may optionally supply additional metadata. The extraction process:
1. Normalizes and matches text against targets using the specified match_func
2. Extracts metadata from surrounding blocks based on extract_positions
3. Creates TextBlock objects for successful matches
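The idea of reading metadata at positions relative to a matched block can be sketched with a small decorator factory. This is not the library's implementation: extract_by_offsets, the flat list of cell strings, the exact-equality match, and the dictionary results are all simplifying assumptions made for illustration.

```python
from functools import wraps
from typing import Callable, Dict, List, Optional


def extract_by_offsets(market_value_pos: int,
                       nominal_quantity_pos: Optional[int] = None):
    """Hypothetical decorator factory: read metadata at relative offsets."""
    offsets = {
        "market value": market_value_pos,
        "nominal quantity": nominal_quantity_pos,
    }

    def decorator(extra: Callable[[List[str], int], Dict[str, str]]):
        @wraps(extra)
        def wrapper(cells: List[str], targets: List[str]) -> List[Dict[str, str]]:
            results = []
            for i, cell in enumerate(cells):
                if cell not in targets:  # exact match stands in for match_func
                    continue
                metadata = {"company": cell}
                for name, offset in offsets.items():
                    # Skip unset offsets and positions outside the list.
                    if offset is not None and 0 <= i + offset < len(cells):
                        metadata[name] = cells[i + offset]
                # The decorated function contributes any additional metadata.
                metadata.update(extra(cells, i) or {})
                results.append(metadata)
            return results

        return wrapper

    return decorator


@extract_by_offsets(market_value_pos=2, nominal_quantity_pos=1)
def extract(cells, i):
    return {"currency": "EUR"}


rows = ["Acme Corp", "1,000", "25,300.00", "Globex", "500", "7,100.00"]
found = extract(rows, ["Acme Corp"])
```

Here the offsets are counted from the matched cell, so a market value offset of 2 means "two cells after the company name".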
- freeports_analysis.formats.utils.text_extract.standard_text_extraction_loop(geometrical_indexes=True, merge_prev=False)
Decorator for standard text extraction loop.
This decorator wraps the provided function in the usual loop, giving the decorated text_extraction function a simplified, higher-level context. Specifically, it expects the metadata of each PdfBlock to include table-col, an indicator of the column in which the block is graphically located in the main table of the PDF page (the data is assumed to be tabular in some way).
- Parameters:
geometrical_indexes (bool, optional) – Whether to use (row, column) coordinates instead of linear indices, by default True
merge_prev (bool, optional) – Whether to merge with previous block instead of next block, by default False
- Returns:
Decorator that wraps text extraction functions with standard processing logic
- Return type:
Callable
Notes
The loop performs the following steps:
- Takes each block and concatenates its content with that of the subsequent block if they are in the same column.
- Uses match_func to check whether any of the targets provided to the extraction function matches the content of the block.
- If one does, overwrites the list of PdfBlock to persist the concatenation of the block with its subsequent block.
- Adds company metadata with the match.
- Creates a TextBlock, adding the metadata provided by the wrapped function.
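The steps above can be sketched as a plain loop. This is a simplified illustration under stated assumptions: Block is a hypothetical stand-in for PdfBlock, results are plain dictionaries rather than TextBlock objects, the separator is a single space, and only the default merge-with-next behaviour is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for PdfBlock."""
    content: str
    metadata: Dict[str, int] = field(default_factory=dict)


def extraction_loop(blocks: List[Block], targets: List[str],
                    match_func: Callable[[str, str], bool]) -> List[Dict[str, object]]:
    """Match each block (joined with its successor if same column) to targets."""
    matches = []
    i = 0
    while i < len(blocks):
        block = blocks[i]
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        same_col = (nxt is not None and
                    nxt.metadata.get("table-col") == block.metadata.get("table-col"))
        # Step 1: concatenate with the subsequent block when columns agree.
        candidate = f"{block.content} {nxt.content}" if same_col else block.content
        # Step 2: look for the first target accepted by match_func.
        target = next((t for t in targets if match_func(candidate, t)), None)
        if target is not None:
            if same_col:
                # Step 3: persist the concatenation in the block list.
                block.content = candidate
                del blocks[i + 1]
            # Steps 4 and 5: record the company alongside the matched text.
            matches.append({"company": target, "content": candidate, "index": i})
        i += 1
    return matches


pdf_blocks = [
    Block("Acme", {"table-col": 0}),
    Block("Corp", {"table-col": 0}),
    Block("25,300.00", {"table-col": 1}),
]
result = extraction_loop(pdf_blocks, ["Acme Corp"], lambda text, t: text == t)
```

Persisting the concatenation only on a successful match leaves unmatched blocks untouched, so later iterations still see the original layout.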
Modules
Target matching algorithms for company name extraction.