Parse Djvu files

Bases: Bolt

`__init__(input, output, state, **kwargs)`

The ParseDjvu class is designed to process DJVU files and classify them as either text-based or image-based. It takes an input folder containing DJVU files as an argument and iterates through each file. For each DJVU, it samples a few pages to determine the type of content it primarily contains. If the DJVU is text-based, the class extracts the text from each page and saves it as a JSON file. If the DJVU is image-based, it converts each page to a PNG image and saves them in a designated output folder.

Parameters:

Name	Type	Description	Default
`input`	`BatchInput`	An instance of the BatchInput class for reading the data.	required
`output`	`BatchOutput`	An instance of the BatchOutput class for saving the data.	required
`state`	`State`	An instance of the State class for maintaining the state.	required
`**kwargs`		Additional keyword arguments.	`{}`
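
As a rough illustration of the text-based branch described above, the sketch below writes the text of each page to a single JSON file. The function name `save_pages_as_json`, the JSON layout, and the output file naming are assumptions made for this example; the source does not specify how ParseDjvu structures its output, and the text extraction step itself is left out.

```python
import json
import tempfile
from pathlib import Path


def save_pages_as_json(page_texts, djvu_name, output_folder):
    """Write per-page text to one JSON file named after the source file.

    Hypothetical helper: the layout below is an assumption, not the
    format ParseDjvu is documented to produce.
    """
    output_path = Path(output_folder) / f"{Path(djvu_name).stem}.json"
    payload = {
        "source": djvu_name,
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(page_texts, start=1)
        ],
    }
    output_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return output_path


if __name__ == "__main__":
    # Write two already-extracted pages into a throwaway folder.
    with tempfile.TemporaryDirectory() as folder:
        path = save_pages_as_json(["First page.", "Second page."], "book.djvu", folder)
        print(path.read_text(encoding="utf-8"))
```

Keeping one JSON file per document with 1-based page numbers is only one possible layout; a file per page would work equally well for this sketch.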

Using geniusrise to invoke via command line

genius ParseDjvu rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process

`process(input_folder=None)`

📖 Process DJVU files in the given input folder and classify them as text-based or image-based.

Parameters:

Name	Type	Description	Default
`input_folder`	`str`	The folder containing DJVU files to process.	`None`

This method iterates through each DJVU file in the specified folder, reads a sample of pages, and determines whether the DJVU is text-based or image-based. It then delegates further processing to `_process_text_djvu` or `_process_image_djvu` based on this determination.
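
The sample-then-classify step can be sketched as below. This is a minimal illustration, not the class's actual logic: it assumes the text of each page has already been extracted into a list of strings, samples the first few pages, and uses a majority vote with a character threshold. The names `classify_djvu` and `choose_handler` and the parameters `sample_size` and `min_chars` are hypothetical; only the two handler names come from the description above.

```python
def classify_djvu(page_texts, sample_size=3, min_chars=20):
    """Return "text" or "image" based on a sample of pages.

    Assumed heuristic: a sampled page counts as textual when it holds at
    least `min_chars` non-whitespace-trimmed characters, and the file is
    text-based when a strict majority of sampled pages are textual.
    """
    sample = page_texts[:sample_size]
    if not sample:
        # No pages to inspect, so there is no text to extract.
        return "image"
    textual_pages = sum(1 for text in sample if len(text.strip()) >= min_chars)
    return "text" if textual_pages * 2 > len(sample) else "image"


def choose_handler(page_texts):
    """Map the classification to the handler name mentioned in the docs."""
    kind = classify_djvu(page_texts)
    return "_process_text_djvu" if kind == "text" else "_process_image_djvu"


if __name__ == "__main__":
    scanned = ["", "  ", ""]
    born_digital = ["Chapter 1. It was a bright cold day in April.", "", "The clocks were striking thirteen."]
    print(choose_handler(scanned))
    print(choose_handler(born_digital))
```

Sampling only a few pages keeps the check cheap for large files, at the cost of misclassifying documents whose first pages are not representative.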
