freeports_analysis.formats

Core data structures and exceptions for PDF document processing.

This module defines the fundamental data structures (PdfBlock, TextBlock) and exception classes used throughout the document processing pipeline.

Classes

PdfBlock(type_block, metadata, xml_ele)

Represents a PDF content block with data to be extracted or relevant for filtering.

TextBlock(type_block, metadata, pdf_block)

Represents a processed text block derived from a PdfBlock.

Exceptions

ExpectedPdfBlockNotFound

Raised when a required PdfBlock is not found during processing.

ExpectedTextBlockNotFound

Raised when a required TextBlock is not found during processing.

ExtractionFieldFail

Raised when the algorithm is unable to parse a field.

LineParseFail

Raised when the algorithm is unable to parse a line.

PageParseFail

Raised when the algorithm is unable to parse a page.

exception freeports_analysis.formats.ExpectedPdfBlockNotFound

Raised when a required PdfBlock is not found during processing.

exception freeports_analysis.formats.ExpectedTextBlockNotFound

Raised when a required TextBlock is not found during processing.

exception freeports_analysis.formats.ExtractionFieldFail

Raised when the algorithm is unable to parse a field.

exception freeports_analysis.formats.LineParseFail

Raised when the algorithm is unable to parse a line.

exception freeports_analysis.formats.PageParseFail

Raised when the algorithm is unable to parse a page.
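
The parse-failure exceptions describe failures at increasing granularity (field, line, page). The sketch below illustrates one way such exceptions might be layered. It is not taken from the library: the exception classes are local stand-ins with the documented names (deriving directly from Exception, which is an assumption), and parse_line and parse_page are hypothetical helpers.

```python
class LineParseFail(Exception):
    """Local stand-in for freeports_analysis.formats.LineParseFail."""


class PageParseFail(Exception):
    """Local stand-in for freeports_analysis.formats.PageParseFail."""


def parse_line(line: str) -> tuple[str, float]:
    """Hypothetical line parser: expects the form 'name;amount'."""
    try:
        name, amount = line.split(";")
        return name.strip(), float(amount)
    except ValueError as exc:
        # Wrong number of fields or a non-numeric amount both end up here.
        raise LineParseFail(f"cannot parse line: {line!r}") from exc


def parse_page(lines: list[str]) -> list[tuple[str, float]]:
    """Hypothetical page parser: skips bad lines, fails if none parse."""
    rows = []
    for line in lines:
        try:
            rows.append(parse_line(line))
        except LineParseFail:
            continue  # tolerate a single unparseable line
    if not rows:
        raise PageParseFail("no line on the page could be parsed")
    return rows


rows = parse_page(["Bond A; 10.5", "not a data line"])
```

Whether the real pipeline skips bad lines or aborts on the first failure is not stated in this reference; the skip-and-continue policy here is only one possible design.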

class freeports_analysis.formats.PdfBlock(type_block: Enum, metadata: dict, xml_ele: Element | List[Element])

Represents a PDF content block with data to be extracted or relevant for filtering.

type_block

The type of the PDF block

Type:

Enum

metadata

Additional metadata associated with the block

Type:

Optional[dict]

content

The textual content extracted from the block

Type:

Optional[str]
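
As a rough illustration of the documented shape of PdfBlock, the sketch below defines a stand-in class with the same constructor arguments and attributes. Several details are assumptions: the real Element type may come from a different XML library (xml.etree.ElementTree is used here), the way content is derived from xml_ele is not documented (the sketch simply joins the element text), and BlockType is a hypothetical enum.

```python
from enum import Enum, auto
from typing import List, Optional, Union
from xml.etree.ElementTree import Element, fromstring


class BlockType(Enum):
    """Hypothetical block types; real formats define their own enums."""
    HEADER = auto()
    ROW = auto()


class PdfBlockSketch:
    """Stand-in mirroring the documented PdfBlock attributes."""

    def __init__(self, type_block: Enum, metadata: Optional[dict],
                 xml_ele: Union[Element, List[Element]]):
        self.type_block = type_block
        self.metadata = metadata
        elements = xml_ele if isinstance(xml_ele, list) else [xml_ele]
        # Assumption: content is the concatenated text of the element(s).
        text = " ".join("".join(e.itertext()).strip() for e in elements)
        self.content: Optional[str] = text or None


blocks = [
    PdfBlockSketch(BlockType.HEADER, {"page": 3},
                   fromstring("<line>Holdings</line>")),
    PdfBlockSketch(BlockType.ROW, {"page": 3},
                   fromstring("<line><span>ACME Corp</span> <span>1,000</span></line>")),
]

# Filtering on type_block keeps only the blocks relevant for extraction.
rows = [b for b in blocks if b.type_block is BlockType.ROW]
```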

class freeports_analysis.formats.TextBlock(type_block: Enum, metadata: dict, pdf_block: PdfBlock)

Represents a processed text block derived from a PdfBlock.

type_block

Type of the text block

Type:

Enum

metadata

Additional metadata associated with the block

Type:

dict

content

Textual content of the block

Type:

str

pdf_block

Original PdfBlock this text was derived from

Type:

PdfBlock
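
Because a TextBlock keeps a reference to the PdfBlock it was derived from, extracted text can be traced back to its source, for example to report the page an entry came from. The sketch below uses stand-in dataclasses with the documented attribute names. The enums, the metadata keys, and the rule that content is the stripped text of the source block are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PdfKind(Enum):
    """Hypothetical PDF block type."""
    TABLE_ROW = auto()


class TextKind(Enum):
    """Hypothetical text block type."""
    HOLDING = auto()


@dataclass
class PdfBlockSketch:
    type_block: Enum
    metadata: Optional[dict]
    content: Optional[str]


@dataclass
class TextBlockSketch:
    type_block: Enum
    metadata: dict
    pdf_block: PdfBlockSketch

    @property
    def content(self) -> str:
        # Assumption: the text content is the source content, stripped.
        return (self.pdf_block.content or "").strip()


source = PdfBlockSketch(PdfKind.TABLE_ROW, {"page": 2}, "  ACME Corp 1,000 ")
text = TextBlockSketch(TextKind.HOLDING, {"currency": "EUR"}, source)

# Trace the text block back to the page of the block it came from.
origin_page = text.pdf_block.metadata["page"]
```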

Modules

algorithms

Core algorithms module for PDF document processing pipelines.

data

Data management for PDF format definitions and URL mappings.

utils

General-purpose utilities shared by all formats, usable when writing pdf_filter, text_extract, or deserialize functions.