Parse PDF files
Bases: Bolt
__init__(input, output, state, **kwargs)
The ParsePdf class is designed to process PDF files and classify them as either text-based or image-based.
It takes an input folder containing PDF files as an argument and iterates through each file.
For each PDF, it samples a few pages to determine the type of content it primarily contains.
If the PDF is text-based, the class extracts the text from each page and saves it as a JSON file.
If the PDF is image-based, it converts each page to a PNG image and saves them in a designated output folder.
Args:
input (BatchInput): An instance of the BatchInput class for reading the data.
output (BatchOutput): An instance of the BatchOutput class for saving the data.
state (State): An instance of the State class for maintaining the state.
**kwargs: Additional keyword arguments.
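The sample-and-classify step described above can be sketched as follows. This is a minimal illustration, not the ParsePdf implementation: the helper names `sample_page_indices` and `classify_pages`, the `min_chars` threshold, the majority rule, and the choice to treat an empty sample as image-based are all assumptions. The text of the sampled pages is assumed to have been extracted already by whatever PDF library is in use.

```python
import random
from typing import List, Sequence


def sample_page_indices(num_pages: int, k: int = 3, seed: int = 0) -> List[int]:
    """Pick up to k distinct page indices to inspect (hypothetical helper)."""
    if num_pages <= 0:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return sorted(rng.sample(range(num_pages), min(k, num_pages)))


def classify_pages(sampled_texts: Sequence[str], min_chars: int = 50) -> str:
    """Return "text" if most sampled pages carry extractable text, else "image".

    min_chars is an assumed threshold, not a value taken from geniusrise.
    """
    if not sampled_texts:
        return "image"  # assumption: nothing extractable means image-based
    text_pages = sum(1 for t in sampled_texts if len(t.strip()) >= min_chars)
    # Strict majority of sampled pages must contain enough text.
    return "text" if text_pages * 2 > len(sampled_texts) else "image"
```

Sampling only a few pages keeps classification cheap for long documents, at the cost of occasionally misjudging PDFs that mix scanned and digital pages.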
Using geniusrise to invoke via command line
genius ParsePdf rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process
Using geniusrise to invoke via YAML file
process(input_folder=None)
📖 Process PDF files in the given input folder and classify them as text-based or image-based.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_folder | str | The folder containing PDF files to process. | None |
This method iterates through each PDF file in the specified folder, reads a sample of pages, and determines whether the PDF is text-based or image-based. It then delegates further processing to _process_text_pdf or _process_image_pdf based on this determination.
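The iterate-classify-delegate flow described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual method body: `process_folder`, the `classify` callable, and the `handlers` mapping are hypothetical names, and the handlers stand in for _process_text_pdf and _process_image_pdf.

```python
from pathlib import Path
from typing import Callable, Dict


def process_folder(
    input_folder: str,
    classify: Callable[[Path], str],
    handlers: Dict[str, Callable[[Path], None]],
) -> Dict[str, str]:
    """Classify every PDF in input_folder and pass it to the matching handler.

    classify is assumed to return "text" or "image"; handlers maps each of
    those labels to a function that processes one PDF path.
    """
    results: Dict[str, str] = {}
    # Only files ending in .pdf are considered; other files are ignored.
    for pdf_path in sorted(Path(input_folder).glob("*.pdf")):
        kind = classify(pdf_path)
        handlers[kind](pdf_path)  # delegate to the text or image handler
        results[pdf_path.name] = kind
    return results
```

Passing the classifier and handlers in as callables keeps the loop independent of any particular PDF library, which also makes it easy to exercise with stub functions.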