Parse PDF files

Bases: Bolt

__init__(input, output, state, **kwargs)

The ParsePdf class is designed to process PDF files and classify them as either text-based or image-based. It takes an input folder containing PDF files as an argument and iterates through each file. For each PDF, it samples a few pages to determine the type of content it primarily contains. If the PDF is text-based, the class extracts the text from each page and saves it as a JSON file. If the PDF is image-based, it converts each page to a PNG image and saves them in a designated output folder.

Args:
    input (BatchInput): An instance of the BatchInput class for reading the data.
    output (BatchOutput): An instance of the BatchOutput class for saving the data.
    state (State): An instance of the State class for maintaining the state.
    **kwargs: Additional keyword arguments.
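
The sampling step described above can be sketched in plain Python. This is a minimal illustration, not the ParsePdf implementation: the function name `classify_pdf`, the `sample_size` and `min_chars` defaults, and the majority rule are all assumptions made for the example. It takes the text already extracted from each page (an empty string for a scanned page), so it needs no PDF library.

```python
import random
from typing import Optional, Sequence


def classify_pdf(
    page_texts: Sequence[str],
    sample_size: int = 3,       # assumed number of pages to sample
    min_chars: int = 20,        # assumed threshold for "this page has real text"
    seed: Optional[int] = None,
) -> str:
    """Return "text" or "image" based on a random sample of pages.

    Hypothetical helper: the thresholds and the majority vote are
    illustrative choices, not values taken from ParsePdf.
    """
    if not page_texts:
        return "image"  # nothing extractable, so treat the file as image-based

    rng = random.Random(seed)
    k = min(sample_size, len(page_texts))
    sampled = rng.sample(list(page_texts), k)

    # Count sampled pages that carry a meaningful amount of extractable text.
    text_pages = sum(1 for text in sampled if len(text.strip()) >= min_chars)

    # Majority vote over the sampled pages.
    return "text" if text_pages * 2 > k else "image"
```

Sampling keeps the check cheap for long documents; the trade-off is that a PDF mixing scanned and digital pages can be classified either way depending on which pages are drawn.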

Using geniusrise to invoke via command line

genius ParsePdf rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process

Using geniusrise to invoke via YAML file

version: "1"
bolts:
    parse_pdfs:
        name: "ParsePdf"
        method: "process"
        input:
            type: "batch"
            args:
                bucket: "my_bucket"
                s3_folder: "s3/input"
        output:
            type: "batch"
            args:
                bucket: "my_bucket"
                s3_folder: "s3/output"

process(input_folder=None)

📖 Process PDF files in the given input folder and classify them as text-based or image-based.

Parameters:

Name          Type  Description                                  Default
input_folder  str   The folder containing PDF files to process.  None

This method iterates through each PDF file in the specified folder, reads a sample of pages, and determines whether the PDF is text-based or image-based. It then delegates further processing to _process_text_pdf or _process_image_pdf based on this determination.
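
The iterate, classify, and delegate flow can be sketched as follows. This is a hedged illustration only: `dispatch_pdfs`, the `classify` callable, and the `handlers` mapping are hypothetical names introduced here, standing in for the sampling logic and for `_process_text_pdf` / `_process_image_pdf`, whose real signatures are not shown in this page.

```python
from pathlib import Path
from typing import Callable, Dict, List


def dispatch_pdfs(
    input_folder: str,
    classify: Callable[[Path], str],
    handlers: Dict[str, Callable[[Path], None]],
) -> Dict[str, List[str]]:
    """Route every *.pdf in input_folder to the handler for its detected kind.

    Hypothetical sketch: `classify` is expected to return a key of `handlers`
    (for example "text" or "image"); the handlers are stand-ins for
    _process_text_pdf and _process_image_pdf.
    """
    routed: Dict[str, List[str]] = {kind: [] for kind in handlers}

    # Sort for a deterministic processing order; non-PDF files are ignored.
    for pdf_path in sorted(Path(input_folder).glob("*.pdf")):
        kind = classify(pdf_path)
        handlers[kind](pdf_path)
        routed[kind].append(pdf_path.name)

    return routed
```

Passing the classifier and handlers in as callables keeps the routing logic testable without real PDF files; in the actual class these would presumably be methods on the bolt itself.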