freeports_analysis.formats.utils.text_extract

Module for text block processing and extraction in document analysis.

This module provides functionality for:

  • Defining text block types through enumerations

  • Matching text against targets using various matching strategies

  • Extracting text blocks from PDF documents based on target matches

  • Supporting different matching methods (exact, fuzzy, prefix-based)

Key components:

  • Decorators for text block type definition (one_txt_blk, EquityBondTextBlockType)

  • Standard text extraction functionality through the standard_text_extraction decorator
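To make the three matching methods named above concrete, here is a minimal sketch using only the Python standard library. It is not the module's own implementation (that lives in the match submodule); the function names, the normalization step, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(text: str, target: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(text) == normalize(target)


def prefix_match(text: str, target: str) -> bool:
    """True when the normalized text starts with the normalized target."""
    return normalize(text).startswith(normalize(target))


def fuzzy_match(text: str, target: str, threshold: float = 0.85) -> bool:
    """True when the similarity ratio reaches the (hypothetical) threshold."""
    ratio = SequenceMatcher(None, normalize(text), normalize(target)).ratio()
    return ratio >= threshold


is_exact = exact_match("ACME  Corp", "acme corp")
is_prefix = prefix_match("ACME Corp 4.5% 2030", "ACME Corp")
is_fuzzy = fuzzy_match("ACME Corporaton", "ACME Corporation")
```

Prefix matching is useful when a row appends coupon or maturity details to the issuer name, while fuzzy matching tolerates small spelling differences at the cost of possible false positives.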

Functions

standard_text_extraction(market_value_pos[, ...])

Decorator for defining standard text extraction logic from PDF blocks based on target matches.

standard_text_extraction_loop([...])

Decorator for standard text extraction loop.

Classes

EquityBondTextBlockType(*values)

Enum representing the two types of text blocks in document processing.

PdfBlocksTable(pdf_blocks)

Represents a table structure of PDF blocks organized by row and column.

class freeports_analysis.formats.utils.text_extract.EquityBondTextBlockType(*values)

Enum representing the two types of text blocks in document processing.

BOND_TARGET

Text block containing target Bond row.

Type:

enum

EQUITY_TARGET

Text block containing target Equity row.

Type:

enum

class freeports_analysis.formats.utils.text_extract.PdfBlocksTable(pdf_blocks)

Represents a table structure of PDF blocks organized by row and column.

This class provides a tabular view of PDF blocks based on their row and column metadata, enabling efficient access and manipulation of blocks in a grid-like structure. It transforms a flat list of PDF blocks into a 2D table structure for easier navigation and manipulation of tabular data extracted from PDF documents.

Parameters:

pdf_blocks (List[PdfBlock]) – A list of PDF blocks that should have ‘table-row’ and ‘table-col’ metadata indicating their position in the table structure.

_blks

Original list of PDF blocks

Type:

List[PdfBlock]

_table_indexes

Index mapping from table coordinates to block indices

Type:

List[List[List[int]]]

_table

Table structure containing PDF blocks organized by row and column

Type:

List[List[List[PdfBlock]]]

Notes

  • The table structure allows for sparse tables (empty cells)

  • Multiple blocks can occupy the same cell (represented as lists)

  • Row and column indices start from 0

  • The shape property provides table dimensions
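The points above can be sketched with a small stand-alone example showing how a flat list of blocks might be indexed into a grid where each cell is a list. The Block dataclass and build_table_indexes function are hypothetical stand-ins, not the actual PdfBlock class or the internals of PdfBlocksTable; only the 'table-row' and 'table-col' metadata keys come from the documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    """Hypothetical stand-in for PdfBlock."""
    content: str
    metadata: Dict[str, int] = field(default_factory=dict)


def build_table_indexes(blocks: List[Block]) -> List[List[List[int]]]:
    """Map each (row, col) cell to the indices of the blocks it contains."""
    if not blocks:
        return []
    n_rows = max(b.metadata["table-row"] for b in blocks) + 1
    n_cols = max(b.metadata["table-col"] for b in blocks) + 1
    # Every cell starts as an empty list, so sparse tables are representable.
    indexes: List[List[List[int]]] = [
        [[] for _ in range(n_cols)] for _ in range(n_rows)
    ]
    for idx, blk in enumerate(blocks):
        row = blk.metadata["table-row"]
        col = blk.metadata["table-col"]
        indexes[row][col].append(idx)
    return indexes


blocks = [
    Block("ACME Corp", {"table-row": 0, "table-col": 0}),
    Block("1,000", {"table-row": 0, "table-col": 1}),
    Block("Globex", {"table-row": 1, "table-col": 0}),
    Block("Ltd", {"table-row": 1, "table-col": 0}),
]
indexes = build_table_indexes(blocks)
shape = (len(indexes), len(indexes[0]))
```

Here the cell at row 1, column 0 holds two block indices, and the cell at row 1, column 1 is empty, which illustrates both the multi-block and the sparse cases.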

Examples

>>> # Assuming blocks have table-row and table-col metadata
>>> table = PdfBlocksTable(pdf_blocks)
>>> print(f"Table shape: {table.shape}")  # (rows, columns)
Table shape: (5, 3)
>>>
>>> # Access a specific cell
>>> cell_content = table[2, 1]  # Row 2, Column 1
>>>
>>> # Iterate through all blocks
>>> for block in table:
...     process_block(block)

merge(j, i)

Merge two blocks by combining their content.

Parameters:
  • j (int) – Index of first block to merge

  • i (int) – Index of second block to merge

Notes

The content of both blocks is concatenated and stored in the block with the lower index. The higher-indexed block is removed.
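The merge semantics described in the note can be illustrated on a plain list of strings. This is a sketch under assumptions, not the method's actual code: the helper name merge_contents, the space separator, and the choice to place the lower-indexed content first are all hypothetical.

```python
from typing import List


def merge_contents(contents: List[str], j: int, i: int, sep: str = " ") -> List[str]:
    """Return a copy where items j and i are joined at the lower index."""
    if j == i:
        raise ValueError("cannot merge a block with itself")
    lo, hi = sorted((j, i))
    merged = list(contents)  # work on a copy, leave the input untouched
    merged[lo] = merged[lo] + sep + merged[hi]
    del merged[hi]  # the higher-indexed entry is removed
    return merged


result = merge_contents(["Deutsche", "Bank AG", "1,000"], 1, 0)
```

Because the indices are sorted first, the argument order does not affect which entry survives: the combined content always ends up at the lower index.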

pop(j)

Remove a block from the table by index.

Parameters:

j (int) – Index of the block to remove

Notes

Updates the table structure and adjusts row numbers for blocks that come after the removed row.

property shape

Table dimensions.

Returns:

(number of rows, number of columns)

Return type:

Tuple[int, int]

freeports_analysis.formats.utils.text_extract.standard_text_extraction(market_value_pos: int, nominal_quantity_pos: int | None = None, perc_net_assets_pos: int | None = None, acquisition_currency_pos: int | None = None, acquisition_cost_pos: int | None = None, geometrical_indexes=True, merge_prev=False)

Decorator for defining standard text extraction logic from PDF blocks based on target matches.

Parameters:
  • market_value_pos (int) – Relative position for market value metadata

  • nominal_quantity_pos (Optional[int], optional) – Relative position for nominal quantity metadata, by default None

  • perc_net_assets_pos (Optional[int], optional) – Relative position for percentage of net assets metadata, by default None

  • acquisition_currency_pos (Optional[int | Currency], optional) – Either relative position for currency metadata or Currency enum value, by default None

  • acquisition_cost_pos (Optional[int], optional) – Relative position for acquisition cost metadata, by default None

Returns:

A wrapped text extraction function that processes PDF blocks and returns matched TextBlock objects

Return type:

callable

Notes

The decorated function may optionally be defined to supply additional metadata. The extraction process:

  1. Normalizes and matches text against targets using the specified match_func

  2. Extracts metadata from surrounding blocks based on extract_positions

  3. Creates TextBlock objects for successful matches
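The idea of relative positions can be sketched as follows. This example assumes that a position is an offset from the matched block within a linear sequence of block contents; the library may instead interpret positions as (row, column) offsets when geometrical_indexes is True. The function extract_relative and the sample row are hypothetical.

```python
from typing import Dict, List, Optional


def extract_relative(
    contents: List[str],
    match_idx: int,
    positions: Dict[str, Optional[int]],
) -> Dict[str, str]:
    """Collect metadata from blocks at given offsets from the matched block."""
    metadata: Dict[str, str] = {}
    for name, offset in positions.items():
        if offset is None:
            continue  # this field was not requested
        target = match_idx + offset
        if 0 <= target < len(contents):  # ignore offsets outside the row
            metadata[name] = contents[target]
    return metadata


row = ["ACME Corp", "1,000", "EUR", "25,300.00", "1.25"]
meta = extract_relative(
    row,
    match_idx=0,
    positions={
        "nominal_quantity": 1,
        "market_value": 3,
        "perc_net_assets": 4,
        "acquisition_cost": None,
    },
)
```

Fields whose position is None are simply left out of the result, mirroring the optional parameters in the signature.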

freeports_analysis.formats.utils.text_extract.standard_text_extraction_loop(geometrical_indexes=True, merge_prev=False)

Decorator for standard text extraction loop.

This decorator wraps the provided function in the usual extraction loop, giving the decorated text_extraction function a simplified, higher-level context. Specifically, it expects the metadata of each PdfBlock to contain a table-col entry indicating which column the block occupies graphically in the main table of the PDF page (it assumes that the data is tabular in some way).

Parameters:
  • geometrical_indexes (bool, optional) – Whether to use (row, column) coordinates instead of linear indices, by default True

  • merge_prev (bool, optional) – Whether to merge with previous block instead of next block, by default False

Returns:

Decorator that wraps text extraction functions with standard processing logic

Return type:

Callable

Notes

The loop performs the following steps:

  • Takes each block and concatenates its content with that of the subsequent block if they are in the same column.

  • Uses match_func to check whether any of the targets provided to the extraction function matches the content of the block.

  • If one does, overwrites the list of PdfBlock objects to persist the concatenation of the block with its subsequent block.

  • Adds company metadata for the match.

  • Creates a TextBlock, adding the metadata provided by the wrapped function.
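The steps above can be sketched as a simplified stand-alone loop. This is an illustration under assumptions rather than the decorator's real code: blocks are modelled as (content, table-col) tuples, the result is a plain dict instead of a TextBlock, and the sketch assumes the concatenated text is what gets matched whenever the next block shares the column. The names extraction_loop and exact_match are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Block = Tuple[str, int]  # (content, table-col): hypothetical stand-in for PdfBlock
MatchFunc = Callable[[str, List[str]], Optional[str]]


def extraction_loop(
    blocks: List[Block], targets: List[str], match_func: MatchFunc
) -> List[Dict[str, object]]:
    """Match each block (joined with its same-column successor) against targets."""
    blocks = list(blocks)  # local copy that merges are persisted into
    results: List[Dict[str, object]] = []
    i = 0
    while i < len(blocks):
        content, col = blocks[i]
        same_col_next = i + 1 < len(blocks) and blocks[i + 1][1] == col
        candidate = content + " " + blocks[i + 1][0] if same_col_next else content
        company = match_func(candidate, targets)
        if company is not None:
            if same_col_next:
                # Persist the concatenation by replacing both blocks with one.
                blocks[i : i + 2] = [(candidate, col)]
            results.append({"content": candidate, "metadata": {"company": company}})
        i += 1
    return results


def exact_match(text: str, targets: List[str]) -> Optional[str]:
    """Return the matching target, or None when there is no exact match."""
    return text if text in targets else None


blocks = [("Deutsche", 0), ("Bank AG", 0), ("1,000", 1), ("ACME Corp", 0), ("500", 1)]
found = extraction_loop(blocks, ["Deutsche Bank AG", "ACME Corp"], exact_match)
companies = [item["metadata"]["company"] for item in found]
```

The concatenation step is what lets a company name split across two stacked blocks in the same column still be recognized as a single target.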

Modules

match

Target matching algorithms for company name extraction.