freeports_analysis.formats.utils.text_extract.match

Target matching algorithms for company name extraction.

This module provides functions for matching text against target companies using various matching strategies including exact matches, regex patterns, and symbol-based matching.

Functions

dataframe_to_match(target_companies)

Prepare target company data for matching.

match_company(text, target_companies)

Match text against target companies using multiple matching strategies.

normalize_string(string)

Normalize a string by making it lowercase and removing accents.

freeports_analysis.formats.utils.text_extract.match.dataframe_to_match(target_companies: DataFrame) → Tuple[List[Tuple], Dict]

Prepare target company data for matching.

Parameters:

target_companies (pd.DataFrame) – DataFrame containing company matching data

Returns:

Tuple containing:

  • matching_data: List of tuples with company matching information

  • regexs_table: Dictionary mapping company indices to compiled regex patterns

Return type:

Tuple[List[Tuple], Dict]

Notes

The returned data structure is optimized for efficient matching:

  • Companies are sorted by name length (longest first) for exact matching

  • Regex patterns are pre-compiled for performance

  • Symbol patterns are compiled with word boundary anchors
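The preparation steps listed in these notes can be sketched as follows. This is a minimal illustration, not the module's implementation: the row layout (identifier, name, regex source, symbol), the function name prepare_sketch, and the shape of the returned dictionary are all assumptions, and plain tuples stand in for the DataFrame.

```python
import re
from typing import Dict, List, Optional, Pattern, Tuple

# Hypothetical row layout: (identifier, name, regex source or None, symbol or None)
Row = Tuple[str, str, Optional[str], Optional[str]]


def prepare_sketch(rows: List[Row]):
    """Illustrative stand-in for the preparation step (not the real function)."""
    # Longest names first, so a longer name is tried before a shorter one it contains.
    matching_data = sorted(rows, key=lambda row: len(row[1]), reverse=True)

    # Compile every pattern once, up front, instead of on each match attempt.
    regexs_table: Dict[str, Dict[str, Pattern[str]]] = {}
    for ident, _name, regex, symbol in rows:
        compiled: Dict[str, Pattern[str]] = {}
        if regex:
            compiled["regex"] = re.compile(regex, re.IGNORECASE)
        if symbol:
            # Word boundaries keep a ticker from matching inside a longer word.
            compiled["symbol"] = re.compile(r"\b" + re.escape(symbol) + r"\b")
        regexs_table[ident] = compiled
    return matching_data, regexs_table
```

With word boundary anchors, a hypothetical symbol such as ACM would match in "buy ACM now" but not inside "ACME".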

freeports_analysis.formats.utils.text_extract.match.match_company(text: str, target_companies: Tuple[List[Tuple], Dict]) → str | None

Match text against target companies using multiple matching strategies.

This function tries several matching strategies in order of specificity, balancing accuracy against performance. It is designed to handle real-world variations in how company names appear in financial documents.

Parameters:
  • text (str) – Text to match against company names. This is typically extracted from PDF documents and may contain formatting artifacts.

  • target_companies (Tuple[List[Tuple], Dict]) – Prepared target company data from dataframe_to_match. The tuple contains:

      - List[Tuple]: Company data sorted by name length (longest first)

      - Dict: Pre-compiled regex patterns for each company

Returns:

Company identifier if a match is found, None otherwise. The identifier corresponds to the index in the original target dataframe.

Return type:

Optional[str]

Raises:

ValueError – If multiple companies match the text ambiguously, indicating the text could refer to more than one company in the target list.

Notes

The matching process uses multiple strategies in order of specificity:

  1. Exact company name matches: Fastest, most specific

  2. BUD (Business Unit Designator) matches: With regex validation

  3. Regex pattern matches: Flexible pattern-based matching

  4. Stock symbol matches: For ticker symbol identification

This multi-stage approach ensures:

  • High accuracy through exact matches when possible

  • Good performance by trying faster strategies first

  • Flexibility through regex and symbol matching

  • Ambiguity detection to prevent incorrect matches
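The staged approach described in these notes can be sketched roughly as below. It is an illustration under stated assumptions, not the module's code: the entry layout and the name match_sketch are hypothetical, the BUD stage is omitted because its rules are not described here, and the choice to check for ambiguity only in the regex and symbol stages is a guess.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical entry layout:
# (identifier, lowercase name, compiled regex or None, compiled symbol pattern or None)
Entry = Tuple[str, str, Optional[re.Pattern], Optional[re.Pattern]]


def match_sketch(text: str, entries: List[Entry]) -> Optional[str]:
    """Illustrative staged matcher (not the real match_company)."""
    lowered = text.lower()

    # Stage 1: exact name. Entries are assumed sorted longest name first,
    # so the first hit is the most specific one.
    for ident, name, _regex, _symbol in entries:
        if name in lowered:
            return ident

    # Later stages (regex, then symbol): collect every hit so that
    # text matching more than one company is reported, not guessed.
    for position in (2, 3):
        hits = {
            entry[0]
            for entry in entries
            if entry[position] is not None and entry[position].search(text)
        }
        if len(hits) > 1:
            raise ValueError(f"ambiguous match: {sorted(hits)}")
        if hits:
            return hits.pop()
    return None
```

Cheaper and more specific checks run first, and the slower pattern-based stages only run when nothing earlier matched.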

Examples

>>> # Assuming target_companies is prepared data
>>> match = match_company("Microsoft Corporation", target_companies)
>>> print(match)  # company identifier
MSFT
>>> # With ambiguous text
>>> match = match_company("ABC Inc", target_companies)  # raises ValueError if multiple companies match

freeports_analysis.formats.utils.text_extract.match.normalize_string(string: str) → str

Normalize a string by making it lowercase and removing accents.

Parameters:

string (str) – Original string to normalize

Returns:

Normalized string with accents removed and whitespace collapsed

Return type:

str

Notes

This function performs the following transformations:

  • Converts to lowercase

  • Removes diacritical marks (accents)

  • Replaces separator characters with spaces

  • Removes punctuation characters

  • Collapses multiple whitespace characters into single spaces

  • Strips leading and trailing whitespace
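The transformations listed above can be approximated with the standard library, as in the sketch below. It is not the module's implementation: the name normalize_sketch is hypothetical, and it assumes that "separator" and "punctuation" mean the Unicode Z* and P* general categories, which may differ from what normalize_string actually does.

```python
import re
import unicodedata


def normalize_sketch(string: str) -> str:
    """Illustrative approximation of the listed transformations."""
    # Lowercase, then decompose so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", string.lower())
    chars = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # drop diacritical marks
        if category.startswith("P"):
            continue  # drop punctuation (assumed: Unicode P* categories)
        if category.startswith("Z"):
            chars.append(" ")  # separators become spaces (assumed: Z* categories)
        else:
            chars.append(ch)
    # Collapse whitespace runs into single spaces and strip both ends.
    return re.sub(r"\s+", " ", "".join(chars)).strip()
```

Under these assumptions a hyphen counts as punctuation and is removed rather than replaced by a space, so hyphenated names are joined.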