Parse PostScript files

Bases: Bolt

__init__(input, output, state, **kwargs)

The ParsePostScript class processes PostScript files and classifies each one as text-based or image-based. It takes an input folder of PostScript files and iterates through them. Each file is converted to PDF, and a few pages are sampled to determine which kind of content predominates. For a text-based file, the class extracts the text from each page and saves it as a JSON file. For an image-based file, it converts each page to a PNG image and saves the images in a designated output folder.
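The sampling-and-classification step described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the function names `sample_page_indices` and `classify_sample`, the sample size of 3, the 20-character threshold, and the majority rule are all assumptions chosen for the example. It takes the already-extracted text of the sampled pages as input, so it does not depend on any particular PDF library.

```python
import random
from typing import List, Optional, Sequence


def sample_page_indices(num_pages: int, sample_size: int = 3,
                        seed: Optional[int] = None) -> List[int]:
    """Pick up to `sample_size` distinct page indices, in ascending order."""
    if num_pages <= 0:
        return []
    rng = random.Random(seed)
    k = min(sample_size, num_pages)
    return sorted(rng.sample(range(num_pages), k))


def classify_sample(page_texts: Sequence[str], min_chars: int = 20) -> str:
    """Return "text" if most sampled pages carry extractable text, else "image".

    `min_chars` is a hypothetical threshold: a page counts as textual when it
    has at least that many non-whitespace characters.
    """
    if not page_texts:
        return "image"
    textual = sum(
        1 for text in page_texts if len("".join(text.split())) >= min_chars
    )
    # Strict majority of sampled pages must look textual.
    return "text" if textual * 2 > len(page_texts) else "image"
```

Sampling only a few pages keeps classification cheap for long documents, at the cost of occasionally misjudging a file whose content type varies from page to page.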

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | An instance of the BatchInput class for reading the data. | required |
| output | BatchOutput | An instance of the BatchOutput class for saving the data. | required |
| state | State | An instance of the State class for maintaining the state. | required |
| **kwargs | | Additional keyword arguments. | {} |

Using geniusrise to invoke via command line

genius ParsePostScript rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    process

process(input_folder=None)

📖 Process PostScript files in the given input folder and classify them as text-based or image-based.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_folder | str | The folder containing PostScript files to process. | None |

This method iterates through each PostScript file in the specified folder, converts it to PDF, reads a sample of pages, and determines whether the PostScript is text-based or image-based. It then delegates further processing to _process_text_ps or _process_image_ps based on this determination.
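The iterate-classify-delegate flow described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the method's real body: `process_folder`, the `classify` callable, and the `handlers` mapping are hypothetical names, and the sketch assumes files carry a `.ps` extension and are found non-recursively. The PDF conversion and page sampling are left to the injected `classify` callable, so the example runs without Ghostscript or a PDF library.

```python
from pathlib import Path
from typing import Callable, Dict, List


def process_folder(
    input_folder: str,
    classify: Callable[[Path], str],
    handlers: Dict[str, Callable[[Path], None]],
) -> List[str]:
    """Classify every .ps file in `input_folder` and call the matching handler.

    `classify` returns a label such as "text" or "image"; `handlers` maps each
    label to the function that finishes processing (standing in for
    _process_text_ps and _process_image_ps). Returns the labels in the order
    the files were processed.
    """
    labels = []
    for ps_path in sorted(Path(input_folder).glob("*.ps")):
        label = classify(ps_path)
        handlers[label](ps_path)  # delegate based on the classification
        labels.append(label)
    return labels
```

Passing the classifier and handlers in as callables keeps the loop itself trivial to test, for example with stub handlers that only record which files they received.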