Parse Djvu files
Bases: Bolt
__init__(input, output, state, **kwargs)
  The ParseDjvu class is designed to process DJVU files and classify them as either text-based or image-based.
It takes an input folder containing DJVU files as an argument and iterates through each file.
For each DJVU, it samples a few pages to determine the type of content it primarily contains.
If the DJVU is text-based, the class extracts the text from each page and saves it as a JSON file.
If the DJVU is image-based, the class converts each page to a PNG image and saves the images in a designated output folder.
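The sampling step described above can be illustrated with a minimal sketch. It is not the ParseDjvu implementation: the helper name `classify_pages`, the `sample_size` and `min_chars` parameters, and the majority-vote rule are all assumptions made for illustration. The sketch takes the already-extracted text of each page (an empty string meaning no text layer) and decides the document type from a small random sample.

```python
import random
from typing import Sequence


def classify_pages(page_texts: Sequence[str], sample_size: int = 3,
                   min_chars: int = 20, seed: int = 0) -> str:
    """Return "text" or "image" from a sample of pages (hypothetical helper)."""
    if not page_texts:
        # Assumption: a document with no pages has nothing to extract.
        return "image"
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    k = min(sample_size, len(page_texts))
    sample = rng.sample(list(page_texts), k)
    # A sampled page counts as text if its text layer is long enough.
    text_pages = sum(1 for text in sample if len(text.strip()) >= min_chars)
    # Assumed rule: the document is text-based if most sampled pages are.
    return "text" if text_pages > k / 2 else "image"


if __name__ == "__main__":
    print(classify_pages(["A full paragraph of extracted text."] * 5))
    print(classify_pages([""] * 5))
```

Sampling only a few pages keeps classification cheap for long documents, at the cost of occasionally misjudging files that mix scanned and text pages.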
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| input | BatchInput | An instance of the BatchInput class for reading the data. | required | 
| output | BatchOutput | An instance of the BatchOutput class for saving the data. | required | 
| state | State | An instance of the State class for maintaining the state. | required | 
| **kwargs | | Additional keyword arguments. | {} |
Using geniusrise to invoke via command line
process(input_folder=None)
  📖 Process DJVU files in the given input folder and classify them as text-based or image-based.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| input_folder | str | The folder containing DJVU files to process. | None | 
This method iterates through each DJVU file in the specified folder, reads a sample of pages,
and determines whether the DJVU is text-based or image-based. It then delegates further processing
to _process_text_djvu or _process_image_djvu based on this determination.
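The control flow of this method can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `process_folder`, the `classify` callable, and the `handlers` mapping are hypothetical names, with the two handlers standing in for _process_text_djvu and _process_image_djvu.

```python
from pathlib import Path
from typing import Callable, Dict, List


def process_folder(input_folder: str,
                   classify: Callable[[Path], str],
                   handlers: Dict[str, Callable[[Path], None]]) -> List[str]:
    """Classify each .djvu file in a folder and dispatch it (hypothetical sketch)."""
    results = []
    for path in sorted(Path(input_folder).glob("*.djvu")):
        kind = classify(path)   # expected to return "text" or "image"
        handlers[kind](path)    # stand-in for the two _process_* methods
        results.append(f"{path.name}:{kind}")
    return results


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as folder:
        (Path(folder) / "book.djvu").write_bytes(b"")
        report = process_folder(
            folder,
            classify=lambda p: "image",  # placeholder classifier
            handlers={"text": lambda p: None, "image": lambda p: None},
        )
        print(report)
```

Passing the classifier and handlers as arguments keeps the dispatch logic testable without a real DJVU decoder.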