freeports_analysis.formats
Core data structures and exceptions for PDF document processing.
This module defines the fundamental data structures (PdfBlock, TextBlock) and exception classes used throughout the document processing pipeline.
Classes
PdfBlock | Represents a PDF content block with data to be extracted or relevant for filtering.
TextBlock | Represents a processed text block derived from a PdfBlock.
Exceptions
ExpectedPdfBlockNotFound | Raised when a required PdfBlock is not found during processing.
ExpectedTextBlockNotFound | Raised when a required TextBlock is not found during processing.
ExtractionFieldFail | Raised when the algorithm is unable to parse a field.
LineParseFail | Raised when the algorithm is unable to parse a line.
PageParseFail | Raised when the algorithm is unable to parse a page.
- exception freeports_analysis.formats.ExpectedPdfBlockNotFound
Raised when a required PdfBlock is not found during processing.
- exception freeports_analysis.formats.ExpectedTextBlockNotFound
Raised when a required TextBlock is not found during processing.
- exception freeports_analysis.formats.ExtractionFieldFail
Raised when the algorithm is unable to parse a field.
- exception freeports_analysis.formats.LineParseFail
Raised when the algorithm is unable to parse a line.
- exception freeports_analysis.formats.PageParseFail
Raised when the algorithm is unable to parse a page.
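The sketch below illustrates one way a caller might use these exceptions to separate lines that parse from lines that do not. It is not taken from the package: the exception classes are local stand-ins that reuse the documented names and are assumed to derive from Exception, and find_block, parse_line, and parse_lines are hypothetical helpers with an invented "name;amount" line format.

```python
# Local stand-ins so the sketch runs without the package installed.
# Assumption: the real exceptions derive from Exception.
class ExpectedPdfBlockNotFound(Exception):
    """Stand-in for freeports_analysis.formats.ExpectedPdfBlockNotFound."""


class LineParseFail(Exception):
    """Stand-in for freeports_analysis.formats.LineParseFail."""


def find_block(blocks, wanted_type):
    """Hypothetical helper: return the first block of the wanted type."""
    for block in blocks:
        if block["type"] == wanted_type:
            return block
    raise ExpectedPdfBlockNotFound(f"no block of type {wanted_type!r}")


def parse_line(line):
    """Hypothetical parser for an invented 'name;amount' line format."""
    try:
        name, amount = line.split(";")
        return name.strip(), float(amount)
    except ValueError as exc:
        # Wrong number of fields or a non-numeric amount.
        raise LineParseFail(f"cannot parse line {line!r}") from exc


def parse_lines(lines):
    """Collect parsed lines and keep the ones that failed for inspection."""
    parsed, failed = [], []
    for line in lines:
        try:
            parsed.append(parse_line(line))
        except LineParseFail:
            failed.append(line)
    return parsed, failed


parsed, failed = parse_lines(["ACME; 10.5", "garbage"])
```

Catching the narrow exception type lets a caller skip a single bad line while still letting unrelated errors propagate.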
- class freeports_analysis.formats.PdfBlock(type_block: Enum, metadata: dict, xml_ele: Element | List[Element])
Represents a PDF content block with data to be extracted or relevant for filtering.
- type_block
The type of the PDF block.
Type: Enum
- metadata
Additional metadata associated with the block.
Type: Optional[dict]
- content
The textual content extracted from the block.
Type: Optional[str]
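To make the shape of a PdfBlock concrete, here is a minimal stand-in that mirrors the documented constructor signature and attributes. It is a sketch under assumptions, not the package's implementation: PdfBlockSketch and BlockType are hypothetical names, the standard-library xml.etree.ElementTree is used in place of whatever Element type the package expects, and deriving content by joining the text of the element(s) is a guess at the behaviour.

```python
import xml.etree.ElementTree as ET
from enum import Enum, auto
from typing import List, Optional, Union


class BlockType(Enum):
    """Hypothetical block types; real formats define their own enums."""
    TITLE = auto()
    TABLE_ROW = auto()


class PdfBlockSketch:
    """Stand-in mirroring the documented PdfBlock attributes."""

    def __init__(
        self,
        type_block: Enum,
        metadata: Optional[dict],
        xml_ele: Union[ET.Element, List[ET.Element]],
    ):
        self.type_block = type_block
        self.metadata = metadata
        # Accept either a single element or a list of elements.
        elements = xml_ele if isinstance(xml_ele, list) else [xml_ele]
        # Assumption: content is the concatenated text of the element(s),
        # or None when there is no text at all.
        text = " ".join("".join(e.itertext()).strip() for e in elements)
        self.content: Optional[str] = text or None


title = PdfBlockSketch(
    BlockType.TITLE,
    {"page": 1},
    ET.fromstring("<text>Portfolio <b>holdings</b></text>"),
)
row = PdfBlockSketch(
    BlockType.TABLE_ROW,
    {"page": 1},
    [ET.fromstring("<text>ACME Corp</text>"), ET.fromstring("<text>1,000</text>")],
)
```

Here title.content is the text of a single element and row.content joins the text of two elements with a space.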
- class freeports_analysis.formats.TextBlock(type_block: Enum, metadata: dict, pdf_block: PdfBlock)
Represents a processed text block derived from a PdfBlock.
- type_block
Type of the text block.
Type: Enum
- metadata
Additional metadata associated with the block.
Type: dict
- content
Textual content of the block.
Type: str
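The relationship between the two classes can be sketched as follows: a TextBlock is built from a type, a metadata dict, and the PdfBlock it derives from. Everything here is illustrative. PdfBlockSketch, TextBlockSketch, PdfType, and TextType are hypothetical stand-ins, and copying the content from the source block (mapping a missing value to an empty string so that content is always a str) is an assumption about the real behaviour.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class PdfType(Enum):
    """Hypothetical PDF block types."""
    TABLE_ROW = auto()


class TextType(Enum):
    """Hypothetical text block types."""
    COMPANY_NAME = auto()


@dataclass
class PdfBlockSketch:
    """Stand-in exposing only the documented PdfBlock attributes."""
    type_block: Enum
    metadata: Optional[dict]
    content: Optional[str]


class TextBlockSketch:
    """Stand-in mirroring the documented TextBlock signature."""

    def __init__(self, type_block: Enum, metadata: dict, pdf_block: PdfBlockSketch):
        self.type_block = type_block
        self.metadata = metadata
        # Assumption: the text content is taken from the source PDF block;
        # None becomes "" because the documented type is str, not Optional.
        self.content: str = pdf_block.content or ""


pdf_block = PdfBlockSketch(PdfType.TABLE_ROW, {"page": 3}, "ACME Corp")
text_block = TextBlockSketch(TextType.COMPANY_NAME, {"page": 3}, pdf_block)
```

The text block carries its own type and metadata, so a later stage can interpret the same PDF content differently from the filtering stage.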
Modules
Core algorithms module for PDF document processing pipelines.
Data management for PDF format definitions and URL mappings.
General-purpose utilities shared by all formats, for use when writing pdf_filter, text_extract, or deserialize functions.