freeports_analysis.main

This module contains the main function used to extract information from PDF files and save the results as CSV files. The file can also be launched directly as a script, with options supplied through a configuration file or environment variables, to mimic command line behavior. The main function in this file is kept separate from the command line entry point: the two differ in how configuration parsing is handled.

Example

`python main.py`

Functions

`batch_job_confs`(job_config)	Create a list of configurations by reading a batch file with job contextual options.
`main`(main_config)	Main function for PDF processing and data extraction.
`pipeline_batch`(batch_pages, i_page_batch, ...)	Apply the pipeline of actions to extract financial data from PDF pages.

Exceptions

`NoPDFormatDetected`	Exception raised when the script cannot detect a PDF format to decode the report.

exception freeports_analysis.main.NoPDFormatDetected

Exception raised when the script cannot detect a PDF format to decode the report.

This exception is raised when no explicit format is specified and the program cannot automatically determine the appropriate format for decoding the PDF.
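
The condition described above can be illustrated with a minimal sketch. The class below is a local stand-in for `freeports_analysis.main.NoPDFormatDetected`, and `resolve_format` is a hypothetical helper that is not part of the package; the real detection logic is not documented here.

```python
from typing import Optional


class NoPDFormatDetected(Exception):
    """Local stand-in for freeports_analysis.main.NoPDFormatDetected."""


def resolve_format(explicit_format: Optional[str], detected_format: Optional[str]) -> str:
    """Hypothetical helper: prefer the explicit format, fall back to detection."""
    if explicit_format is not None:
        return explicit_format
    if detected_format is not None:
        return detected_format
    # Neither an explicit nor an automatically detected format is available.
    raise NoPDFormatDetected("no explicit format given and none could be detected")
```

A caller would typically catch the exception and ask the user to specify the format explicitly.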

freeports_analysis.main.batch_job_confs(job_config: Dict[str, Any]) → List[Dict[str, Any]]

Create a list of configurations by reading a batch file with job contextual options.

Parameters:

job_config (Dict[str, Any]) – Base configuration to be overwritten with batch file options

Returns:

List of configurations, one for each row in the batch file

Return type:

List[Dict[str, Any]]

Raises:

FileNotFoundError – If the batch file does not exist
csv.Error – If the batch file has invalid CSV format

Notes

The batch file should be a CSV file with columns corresponding to configuration keys that can override the base configuration.
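
As an illustration of the override behavior, the sketch below builds one configuration per CSV row with the standard `csv` module. It is not the package's implementation: `batch_confs_sketch`, the configuration keys, and the rule that an empty cell keeps the base value are all assumptions, and the CSV is passed as text rather than read from a file so the example stays self-contained.

```python
import csv
import io
from typing import Any, Dict, List


def batch_confs_sketch(base_config: Dict[str, Any], batch_csv_text: str) -> List[Dict[str, Any]]:
    """Return one configuration per CSV row, each overriding the base configuration."""
    confs = []
    for row in csv.DictReader(io.StringIO(batch_csv_text)):
        conf = dict(base_config)  # copy so rows do not affect each other
        # Assumption: an empty cell means "keep the base value".
        conf.update({key: value for key, value in row.items() if value not in ("", None)})
        confs.append(conf)
    return confs


base = {"format": None, "out_csv": "out.csv"}  # hypothetical keys
batch = "format,out_csv\nFMT_A,a.csv\nFMT_B,\n"
confs = batch_confs_sketch(base, batch)
```

Here the first row overrides both keys, while the second row overrides only `format` and inherits `out_csv` from the base configuration.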

freeports_analysis.main.main(main_config: Dict[str, Any]) → None

Main function for PDF processing and data extraction.

Expects configuration to be already provided (via command line arguments, environment variables, or configuration files).

Parameters:

main_config (Dict[str, Any]) – Configuration dictionary containing all processing parameters

Raises:

NoPDFormatDetected – If no explicit format is provided and the program cannot automatically determine the appropriate format for decoding the PDF
FileNotFoundError – If required input files or directories are not found
ValueError – If configuration contains invalid values

Notes

This function orchestrates the complete PDF processing workflow:

1. Configuration validation and setup
2. Log file initialization
3. Batch or single job processing
4. Parallel execution with multiprocessing
5. Output file generation
6. Result transformation and writing

freeports_analysis.main.pipeline_batch(batch_pages: List[str], i_page_batch: int, n_pages: int, targets: DataFrame, format_name: str) → List[Investment | Dict[str, Promise | Any]]

Apply the pipeline of actions to extract financial data from PDF pages.

Parameters:

batch_pages (List[str]) – List of XML strings representing PDF pages to process
i_page_batch (int) – Starting page number of this batch (1-based index)
n_pages (int) – Total number of pages in the document
targets (pd.DataFrame) – Table containing information of relevant companies to extract from the report
format_name (str) – Name of the format containing format-specific parsing functions

Returns:

List of extracted financial data objects or promise resolution contexts

Return type:

List[Union[Investment, PromisesResolutionContext]]
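
To make the relationship between `batch_pages`, `i_page_batch`, and `n_pages` concrete, the sketch below splits a list of page strings into batches and pairs each batch with its 1-based starting page number. `iter_page_batches` and `batch_size` are hypothetical names; how the package actually forms its batches is not documented here.

```python
from typing import Iterator, List, Tuple


def iter_page_batches(pages: List[str], batch_size: int) -> Iterator[Tuple[List[str], int]]:
    """Yield (batch_pages, i_page_batch) pairs, where i_page_batch is 1-based."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pages), batch_size):
        # start is a 0-based offset, so the first page of the batch is start + 1.
        yield pages[start:start + batch_size], start + 1


pages = [f"<page n='{i}'/>" for i in range(1, 8)]  # placeholder XML strings
batches = list(iter_page_batches(pages, batch_size=3))
n_pages = len(pages)
```

Each `(batch_pages, i_page_batch)` pair, together with `n_pages`, matches the shape of the first three arguments that `pipeline_batch` expects.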