freeports_analysis.formats.utils.text_extract.match
Target matching algorithms for company name extraction.
This module provides functions for matching text against target companies using various matching strategies including exact matches, regex patterns, and symbol-based matching.
Functions
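| dataframe_to_match(target_companies) | Prepare target company data for matching. |
| match_company(text, target_companies) | Match text against target companies using multiple matching strategies. |
| normalize_string(string) | Normalize a string by making it lowercase and removing accents. |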
- freeports_analysis.formats.utils.text_extract.match.dataframe_to_match(target_companies: DataFrame) → Tuple[List[Tuple], Dict]
Prepare target company data for matching.
- Parameters:
target_companies (pd.DataFrame) – DataFrame containing company matching data
- Returns:
Tuple containing:
  - matching_data: List of tuples with company matching information
  - regexs_table: Dictionary mapping company indices to compiled regex patterns
- Return type:
Tuple[List[Tuple], Dict]
Notes
The returned data structure is optimized for efficient matching:
  - Companies are sorted by name length (longest first) for exact matching
  - Regex patterns are pre-compiled for performance
  - Symbol patterns are compiled with word boundary anchors
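The three properties listed in the notes can be illustrated with a minimal sketch. It is not the library's implementation: the input is a plain dictionary rather than a DataFrame so that the example runs without pandas, and the field names (`name`, `regex`, `symbol`), the function `prepare_sketch`, and the layout of the returned structures are all assumptions made for illustration.

```python
import re

# Hypothetical records standing in for DataFrame rows, keyed by company index.
# The field names "name", "regex" and "symbol" are assumptions of this sketch.
COMPANIES = {
    "ACME": {"name": "acme corp", "regex": r"acme\s+(corp|corporation)", "symbol": "ACM"},
    "ACMEH": {"name": "acme corp holdings", "regex": None, "symbol": "ACMH"},
}


def prepare_sketch(companies):
    """Build (matching_data, regexs_table) in the spirit of dataframe_to_match."""
    # Longest names first, so "acme corp holdings" is tried before "acme corp".
    matching_data = sorted(
        ((cid, rec["name"]) for cid, rec in companies.items()),
        key=lambda item: len(item[1]),
        reverse=True,
    )
    regexs_table = {}
    for cid, rec in companies.items():
        regexs_table[cid] = {
            # Compile once here instead of on every match attempt.
            "regex": re.compile(rec["regex"]) if rec["regex"] else None,
            # \b anchors keep "ACM" from matching inside "ACMH".
            "symbol": re.compile(r"\b" + re.escape(rec["symbol"]) + r"\b"),
        }
    return matching_data, regexs_table


matching_data, regexs_table = prepare_sketch(COMPANIES)
```

Sorting by length matters when one company name is a prefix of another: without it, the shorter name could claim text that belongs to the longer one.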
- freeports_analysis.formats.utils.text_extract.match.match_company(text: str, target_companies: Tuple[List[Tuple], Dict]) → str | None
Match text against target companies using multiple matching strategies.
This function implements a multi-stage matching algorithm that tries matching strategies in order of specificity, from the most specific to the most permissive, to balance accuracy and performance. It is designed to handle the variations in how company names are written in financial documents.
- Parameters:
text (str) – Text to match against company names. This is typically extracted from PDF documents and may contain formatting artifacts.
target_companies (Tuple[List[Tuple], Dict]) – Prepared target company data from dataframe_to_match. The tuple contains:
  - List[Tuple]: Company data sorted by name length (longest first)
  - Dict: Pre-compiled regex patterns for each company
- Returns:
Company identifier if a match is found, None otherwise. The identifier corresponds to the index in the original target dataframe.
- Return type:
Optional[str]
- Raises:
ValueError – If multiple companies match the text ambiguously, indicating the text could refer to more than one company in the target list.
Notes
The matching process uses multiple strategies in order of specificity:
  1. Exact company name matches: Fastest, most specific
  2. BUD (Business Unit Designator) matches: With regex validation
  3. Regex pattern matches: Flexible pattern-based matching
  4. Stock symbol matches: For ticker symbol identification
This multi-stage approach ensures:
  - High accuracy through exact matches when possible
  - Good performance by trying faster strategies first
  - Flexibility through regex and symbol matching
  - Ambiguity detection to prevent incorrect matches
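The staged fallback and the ambiguity check can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual body of match_company: the BUD stage is omitted because its rules are not described here, "exact match" is assumed to mean substring containment on lowercased text, and the function name `match_sketch` and its three-argument signature are hypothetical.

```python
import re
from typing import Dict, List, Optional, Pattern, Tuple


def match_sketch(
    text: str,
    names: List[Tuple[str, str]],      # (identifier, lowercase name), longest first
    patterns: Dict[str, Pattern[str]],  # identifier -> compiled regex
    symbols: Dict[str, Pattern[str]],   # identifier -> compiled \bSYMBOL\b regex
) -> Optional[str]:
    """Try cheap, specific strategies first; fall back to looser ones."""
    lowered = text.lower()

    # Stage 1: exact name containment. Because names are sorted longest
    # first, the most specific name wins and the search can stop early.
    for cid, name in names:
        if name in lowered:
            return cid

    # Later stages: regex patterns, then ticker symbols. Each stage collects
    # every hit so that competing matches can be reported instead of guessed.
    for table in (patterns, symbols):
        hits = [cid for cid, pattern in table.items() if pattern.search(text)]
        if len(hits) > 1:
            raise ValueError(f"Ambiguous match for {text!r}: {sorted(hits)}")
        if hits:
            return hits[0]

    return None


# Hypothetical target data for demonstration only.
NAMES = [("ACMEH", "acme corp holdings"), ("ACME", "acme corp")]
PATTERNS = {"GLOBEX": re.compile(r"globex\s+(inc|ltd)", re.IGNORECASE)}
SYMBOLS = {"ACME": re.compile(r"\bACM\b"), "INI": re.compile(r"\bINI\b")}
```

Raising on ambiguity, rather than returning the first hit, forces the caller to resolve the conflict explicitly, which matches the documented ValueError behaviour.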
Examples
>>> # Assuming target_companies is prepared data
>>> match = match_company("Microsoft Corporation", target_companies)
>>> print(match)  # Company identifier
MSFT

>>> # With ambiguous text
>>> match = match_company("ABC Inc", target_companies)
>>> # Raises ValueError if multiple companies match
- freeports_analysis.formats.utils.text_extract.match.normalize_string(string: str) → str
Normalize a string by making it lowercase and removing accents.
- Parameters:
string (str) – Original string to normalize
- Returns:
Normalized string: lowercase, with accents and punctuation removed and whitespace collapsed
- Return type:
str
Notes
This function performs the following transformations:
  - Converts to lowercase
  - Removes diacritical marks (accents)
  - Replaces separator characters with spaces
  - Removes punctuation characters
  - Collapses multiple whitespace characters into single spaces
  - Strips leading and trailing whitespace
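The transformations listed above can be approximated with the standard library alone. The sketch below is one possible reading, not the library's code: it assumes that "separator" and "punctuation" correspond to the Unicode general categories Z* and P*, and the name `normalize_sketch` is hypothetical.

```python
import unicodedata


def normalize_sketch(string: str) -> str:
    """Illustrative re-implementation of the documented steps."""
    # NFKD splits accented characters into a base character plus a
    # combining mark, so the mark can be dropped on its own.
    decomposed = unicodedata.normalize("NFKD", string.lower())
    chars = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            continue  # combining marks (the accents)
        if category.startswith("Z"):
            chars.append(" ")  # any Unicode separator becomes a plain space
        elif category.startswith("P"):
            continue  # punctuation is removed outright
        else:
            chars.append(ch)
    # split() without arguments collapses whitespace runs and strips both ends.
    return " ".join("".join(chars).split())


print(normalize_sketch("  Société   Générale, S.A. "))  # societe generale sa
```

Under this reading a hyphen counts as punctuation and is removed rather than replaced by a space; the real function may treat it differently.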