Parse Djvu files

Bases: Bolt

__init__(input, output, state, **kwargs)

The ParseDjvu class is designed to process DJVU files and classify them as either text-based or image-based. It takes an input folder containing DJVU files as an argument and iterates through each file. For each DJVU, it samples a few pages to determine the type of content it primarily contains. If the DJVU is text-based, the class extracts the text from each page and saves it as a JSON file. If the DJVU is image-based, it converts each page to a PNG image and saves them in a designated output folder.
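The classification step described above can be illustrated with a minimal sketch. It assumes the text layer of each sampled page has already been extracted into a string. The function name `classify_sampled_pages` and both thresholds are hypothetical and chosen only for illustration; the criterion ParseDjvu actually applies may differ.

```python
from typing import Sequence


def classify_sampled_pages(
    page_texts: Sequence[str],
    min_chars: int = 50,        # hypothetical: characters needed to count a page as "has text"
    min_fraction: float = 0.5,  # hypothetical: share of sampled pages that must have text
) -> str:
    """Return "text" if enough sampled pages carry a text layer, else "image"."""
    if not page_texts:
        # Nothing could be sampled, so fall back to the image path.
        return "image"
    # Count the sampled pages whose stripped text reaches the minimum length.
    pages_with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    if pages_with_text / len(page_texts) >= min_fraction:
        return "text"
    return "image"
```

Under these assumed thresholds, a sample in which two of three pages contain substantial text is classified as text-based, while a sample of blank or near-blank pages is classified as image-based.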

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | An instance of the BatchInput class for reading the data. | required |
| output | BatchOutput | An instance of the BatchOutput class for saving the data. | required |
| state | State | An instance of the State class for maintaining the state. | required |
| **kwargs | | Additional keyword arguments. | {} |

Using geniusrise to invoke via command line

genius ParseDjvu rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process

process(input_folder=None)

📖 Process DJVU files in the given input folder and classify them as text-based or image-based.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_folder | str | The folder containing DJVU files to process. | None |

This method iterates through each DJVU file in the specified folder, reads a sample of pages, and determines whether the DJVU is text-based or image-based. It then delegates further processing to _process_text_djvu or _process_image_djvu based on this determination.
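The iterate-and-delegate flow can be sketched as follows. This is not the library's implementation: `dispatch_djvu_files` and the `classify` callback are hypothetical stand-ins, the scan is assumed to be non-recursive, and the sketch only records which branch each file would take instead of calling `_process_text_djvu` or `_process_image_djvu`.

```python
from pathlib import Path
from typing import Callable, Dict, List


def dispatch_djvu_files(
    input_folder: str,
    classify: Callable[[Path], str],  # hypothetical: returns "text" or "image" for one file
) -> Dict[str, List[str]]:
    """Group the DJVU files in a folder by the branch they would be routed to."""
    routed: Dict[str, List[str]] = {"text": [], "image": []}
    # Assumption: only the top level of the folder is scanned.
    for path in sorted(Path(input_folder).iterdir()):
        if not path.is_file() or path.suffix.lower() != ".djvu":
            continue  # skip directories and non-DJVU files
        # Any result other than "text" is treated as image-based.
        kind = "text" if classify(path) == "text" else "image"
        routed[kind].append(path.name)
    return routed
```

In a real pipeline the two lists would be replaced by calls to the text and image handlers; returning the grouping here simply makes the routing decision easy to inspect.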