Parse Epub files

Bases: Bolt

__init__(input, output, state, **kwargs)

The ParseEpub class processes EPUB files and classifies each one as either text-based or image-based. Given an input folder of EPUB files, it iterates over the files and samples a few items from each to determine which kind of content predominates. For a text-based EPUB, it extracts the text from each item and saves it as a JSON file. For an image-based EPUB, it saves the images in a designated output folder.
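
The two output paths described above can be sketched with the standard library alone, since an EPUB is a ZIP archive of markup and asset files. This is a minimal illustration under stated assumptions, not the ParseEpub implementation: the function names, the extension sets, and the JSON layout (one object mapping item name to extracted text) are all hypothetical.

```python
import json
import zipfile
from html.parser import HTMLParser
from pathlib import Path, PurePosixPath

# Assumed extension sets; the real class may identify item types differently.
MARKUP_EXTS = {".xhtml", ".html", ".htm"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


class TextCollector(HTMLParser):
    """Collect the character data of a document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def item_text(markup):
    """Return the text of one markup item with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def save_text_epub(epub_path, output_dir):
    """Write <stem>.json mapping each markup item to its text (hypothetical layout)."""
    epub_path, output_dir = Path(epub_path), Path(output_dir)
    with zipfile.ZipFile(epub_path) as archive:
        texts = {
            name: item_text(archive.read(name).decode("utf-8", errors="replace"))
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in MARKUP_EXTS
        }
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{epub_path.stem}.json"
    target.write_text(json.dumps(texts, ensure_ascii=False, indent=2), encoding="utf-8")
    return target


def save_image_epub(epub_path, output_dir):
    """Copy every image item into output_dir/<stem>/ and return the written paths."""
    epub_path = Path(epub_path)
    target_dir = Path(output_dir) / epub_path.stem
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if PurePosixPath(name).suffix.lower() in IMAGE_EXTS:
                # Keep only the base name so archive paths cannot escape target_dir.
                target = target_dir / PurePosixPath(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Flattening image paths to their base names keeps the sketch short, but two images with the same file name in different archive folders would overwrite each other; a fuller version would need to disambiguate them.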

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | BatchInput | An instance of the BatchInput class for reading the data. | required |
| output | BatchOutput | An instance of the BatchOutput class for saving the data. | required |
| state | State | An instance of the State class for maintaining the state. | required |
| **kwargs | | Additional keyword arguments. | {} |

Using geniusrise to invoke via command line

genius ParseEpub rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process

process(input_folder=None)

📖 Process EPUB files in the given input folder and classify them as text-based or image-based.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_folder | str | The folder containing EPUB files to process. | None |

This method iterates through each EPUB file in the specified folder, reads a sample of items, and determines whether the EPUB is text-based or image-based. It then delegates further processing to _process_text_epub or _process_image_epub based on this determination.
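
The sample-then-delegate flow can be illustrated as follows. This is a sketch under assumptions rather than the method's actual logic: the heuristic (average visible text length across a few sampled markup items), the `sample_size` and `min_chars` defaults, and the names `classify_epub` and `process_folder` are hypothetical, and the two callbacks merely stand in for `_process_text_epub` and `_process_image_epub`.

```python
import random
import re
import zipfile
from pathlib import Path

# Crude tag stripper; good enough for a length estimate, not for real extraction.
TAG = re.compile(r"<[^>]+>")


def classify_epub(source, sample_size=3, min_chars=200, seed=0):
    """Label an EPUB "text" or "image" from a random sample of its markup items.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        pages = [
            name
            for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]
        if not pages:
            return "image"  # no markup at all, only binary assets
        sample = random.Random(seed).sample(pages, min(sample_size, len(pages)))
        lengths = []
        for name in sample:
            markup = archive.read(name).decode("utf-8", errors="replace")
            # Drop tags, collapse whitespace, and measure what is left.
            lengths.append(len(" ".join(TAG.sub(" ", markup).split())))
    # Pages that only wrap an <img> carry almost no text of their own.
    return "text" if sum(lengths) / len(lengths) >= min_chars else "image"


def process_folder(input_folder, on_text, on_image):
    """Classify every .epub in the folder and hand it to the matching callback."""
    for path in sorted(Path(input_folder).glob("*.epub")):
        handler = on_text if classify_epub(path) == "text" else on_image
        handler(path)
```

Sampling only a few items keeps classification cheap for large books, at the cost of occasionally mislabeling mixed content such as an illustrated book whose sampled pages happen to be plates.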