freeports_analysis.formats

Core data structures and exceptions for PDF document processing.

This module defines the fundamental data structures (PdfBlock, TextBlock) and exception classes used throughout the document processing pipeline.

Classes

`PdfBlock`(type_block, metadata, xml_ele)	Represents a PDF content block with data to be extracted or relevant for filtering.
`TextBlock`(type_block, metadata, pdf_block)	Represents a processed text block derived from a PdfBlock.

Exceptions

`ExpectedPdfBlockNotFound`	Raised when a required PdfBlock is not found during processing.
`ExpectedTextBlockNotFound`	Raised when a required TextBlock is not found during processing.
`ExtractionFieldFail`	Raised when the algorithm is unable to parse a field.
`LineParseFail`	Raised when the algorithm is unable to parse a line.
`PageParseFail`	Raised when the algorithm is unable to parse a page.

exception freeports_analysis.formats.ExpectedPdfBlockNotFound: Raised when a required PdfBlock is not found during processing.

exception freeports_analysis.formats.ExpectedTextBlockNotFound: Raised when a required TextBlock is not found during processing.

exception freeports_analysis.formats.ExtractionFieldFail: Raised when the algorithm is unable to parse a field.

exception freeports_analysis.formats.LineParseFail: Raised when the algorithm is unable to parse a line.

exception freeports_analysis.formats.PageParseFail: Raised when the algorithm is unable to parse a page.
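The three parse exceptions describe failures at increasing granularity (field, line, page). The sketch below shows one way a caller might layer them. It is only an illustration: the exception classes are local stand-ins that mirror the documented names (their real base classes are not stated here, so plain `Exception` is an assumption), and `parse_line` and `parse_page` are hypothetical helpers, not part of `freeports_analysis`.

```python
# Stand-ins mirroring the documented exception names. In real code these
# would be imported from freeports_analysis.formats; subclassing Exception
# directly is an assumption made for this sketch.
class ExtractionFieldFail(Exception):
    """Stand-in: a single field could not be parsed."""


class LineParseFail(Exception):
    """Stand-in: a whole line could not be parsed."""


class PageParseFail(Exception):
    """Stand-in: a whole page could not be parsed."""


def parse_line(line: str) -> dict:
    """Hypothetical line parser expecting the form 'name;amount'."""
    parts = line.split(";")
    if len(parts) != 2:
        raise LineParseFail(f"expected 2 fields, got {len(parts)}: {line!r}")
    name, raw_amount = parts
    try:
        amount = float(raw_amount)
    except ValueError as exc:
        raise ExtractionFieldFail(f"bad amount: {raw_amount!r}") from exc
    return {"name": name, "amount": amount}


def parse_page(lines: list) -> list:
    """Hypothetical page parser: skip bad lines, fail if nothing parses."""
    rows = []
    for line in lines:
        try:
            rows.append(parse_line(line))
        except (LineParseFail, ExtractionFieldFail):
            continue  # tolerate individual bad lines
    if not rows:
        raise PageParseFail("no line on the page could be parsed")
    return rows


rows = parse_page(["ACME;1.5", "malformed", "Beta;n/a"])
```

Skipping individual bad lines and escalating to a page-level failure only when nothing parses is a design choice of this sketch; the library's own policy may differ.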

class freeports_analysis.formats.PdfBlock(type_block: Enum, metadata: dict, xml_ele: Element | List[Element])

Represents a PDF content block with data to be extracted or relevant for filtering.

type_block

The type of the PDF block

metadata

Additional metadata associated with the block

content

The textual content extracted from the block
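As a rough illustration of the documented shape, the following stand-in class takes the same three constructor arguments and exposes `type_block`, `metadata`, and `content`. It is a sketch, not the library's implementation: `PdfBlockSketch` and `BlockType` are hypothetical names, the standard-library `xml.etree.ElementTree` is used in place of whatever `Element` type the package actually relies on, and deriving `content` by joining the elements' text is an assumption.

```python
import xml.etree.ElementTree as ET
from enum import Enum, auto


class BlockType(Enum):
    """Hypothetical block types; real formats define their own enums."""
    TITLE = auto()
    TABLE_ROW = auto()


class PdfBlockSketch:
    """Stand-in mirroring PdfBlock(type_block, metadata, xml_ele)."""

    def __init__(self, type_block, metadata, xml_ele):
        self.type_block = type_block
        self.metadata = metadata
        # Accept either a single element or a list of elements.
        elements = xml_ele if isinstance(xml_ele, list) else [xml_ele]
        # Assumption: content is the text of all elements, space-joined.
        self.content = " ".join(
            "".join(ele.itertext()).strip() for ele in elements
        )


title_xml = ET.fromstring("<line><span>Total</span> <span>assets</span></line>")
title = PdfBlockSketch(BlockType.TITLE, {"page": 1}, title_xml)

row_xml = [ET.fromstring("<span>ACME</span>"), ET.fromstring("<span>1,000</span>")]
row = PdfBlockSketch(BlockType.TABLE_ROW, {"page": 1}, row_xml)
```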

class freeports_analysis.formats.TextBlock(type_block: Enum, metadata: dict, pdf_block: PdfBlock)

Represents a processed text block derived from a PdfBlock.

type_block

Type of the text block

metadata

Additional metadata associated with the block

content

Textual content of the block

pdf_block

Original PdfBlock this text was derived from
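The `pdf_block` attribute keeps a reference back to the source block, which lets later stages trace extracted text to where it came from in the PDF. The sketch below illustrates that relationship with stand-in classes. `PdfBlockStub`, `TextBlockSketch`, and both enums are hypothetical, and deriving `content` by normalizing whitespace in the source block's content is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PdfBlockType(Enum):
    """Hypothetical PDF block types."""
    TABLE_ROW = auto()


class TextBlockType(Enum):
    """Hypothetical text block types."""
    HOLDING = auto()


@dataclass
class PdfBlockStub:
    """Minimal stand-in for a PdfBlock that already has its content."""
    type_block: Enum
    metadata: dict
    content: str


class TextBlockSketch:
    """Stand-in mirroring TextBlock(type_block, metadata, pdf_block)."""

    def __init__(self, type_block, metadata, pdf_block):
        self.type_block = type_block
        self.metadata = metadata
        self.pdf_block = pdf_block  # back-reference to the source block
        # Assumption: content is the source content with whitespace collapsed.
        self.content = " ".join(pdf_block.content.split())


source = PdfBlockStub(PdfBlockType.TABLE_ROW, {"page": 3}, "  ACME   Corp  1,000 ")
text = TextBlockSketch(TextBlockType.HOLDING, {"currency": "EUR"}, source)

# The originating page is still reachable through the back-reference.
origin_page = text.pdf_block.metadata["page"]
```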

Modules

`algorithms`	Core algorithms module for PDF document processing pipelines.
`data`	Data management for PDF format definitions and URL mappings.
`utils`	General-purpose utilities, common to all formats, that can be used to create pdf_filter, text_extract, or deserialize functions.