Parse PostScript files
Bases: Bolt
__init__(input, output, state, **kwargs)
The ParsePostScript class processes PostScript files and classifies them as either text-based or image-based.
It takes an input folder containing PostScript files as an argument and iterates through each file.
It converts each file to PDF and samples a few pages to determine whether the content is primarily text or images.
If the PostScript is text-based, the class extracts the text from each page and saves it as a JSON file.
If the PostScript is image-based, it converts each page to a PNG image and saves them in a designated output folder.
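The sample-then-classify step described above can be sketched as follows. This is a minimal illustration and not the class's actual implementation: the helper names `sample_page_indices` and `is_text_based`, the sample size of 3, the 50-character threshold, and the majority rule are all assumptions made for the example. It assumes the text of each sampled page has already been extracted from the converted PDF by some other tool.

```python
import random
from typing import List, Sequence


def sample_page_indices(num_pages: int, sample_size: int = 3, seed: int = 0) -> List[int]:
    """Pick up to `sample_size` distinct page indices, in ascending order.

    Hypothetical helper: the sample size and the fixed seed are assumptions.
    """
    if num_pages <= 0:
        return []
    rng = random.Random(seed)
    k = min(sample_size, num_pages)
    return sorted(rng.sample(range(num_pages), k))


def is_text_based(sampled_page_texts: Sequence[str], min_chars: int = 50) -> bool:
    """Return True if more than half of the sampled pages contain at least
    `min_chars` non-whitespace characters.

    Hypothetical heuristic: the threshold and the majority rule are assumptions.
    """
    if not sampled_page_texts:
        return False
    text_pages = sum(
        1 for text in sampled_page_texts
        if len("".join(text.split())) >= min_chars  # count non-whitespace characters only
    )
    return text_pages * 2 > len(sampled_page_texts)
```

Sampling keeps the check cheap for long documents. A scanned document typically yields little or no extractable text per page, so a simple character-count threshold is often enough to separate the two cases.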
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | An instance of the BatchInput class for reading the data. | required |
output | BatchOutput | An instance of the BatchOutput class for saving the data. | required |
state | State | An instance of the State class for maintaining the state. | required |
**kwargs | | Additional keyword arguments. | {} |
Using geniusrise to invoke via command line
process(input_folder=None)
📖 Process PostScript files in the given input folder and classify them as text-based or image-based.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_folder | str | The folder containing PostScript files to process. | None |
This method iterates through each PostScript file in the specified folder, converts it to PDF, reads a sample of pages, and determines whether the PostScript is text-based or image-based.
It then delegates further processing to _process_text_ps or _process_image_ps based on this determination.
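The iterate-classify-delegate flow described above can be sketched as follows. This is an illustration under stated assumptions, not the method's real body: `find_postscript_files` and `process_folder` are hypothetical names, the `.ps` extension filter is an assumption, and the classifier and the two handlers are passed in as plain callables standing in for the PDF conversion and sampling step and for `_process_text_ps` and `_process_image_ps`.

```python
from pathlib import Path
from typing import Callable, Dict, List


def find_postscript_files(input_folder: str) -> List[Path]:
    """Return the .ps files directly inside `input_folder`, sorted by name.

    Assumption: files are recognised by a case-insensitive `.ps` suffix.
    """
    return sorted(
        p for p in Path(input_folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".ps"
    )


def process_folder(
    input_folder: str,
    classify: Callable[[Path], bool],       # True means "text-based" (hypothetical stand-in)
    handle_text: Callable[[Path], None],    # stand-in for _process_text_ps
    handle_image: Callable[[Path], None],   # stand-in for _process_image_ps
) -> Dict[str, str]:
    """Classify each PostScript file and delegate to the matching handler.

    Returns a mapping of file name to "text" or "image".
    """
    results: Dict[str, str] = {}
    for ps_path in find_postscript_files(input_folder):
        if classify(ps_path):
            handle_text(ps_path)
            results[ps_path.name] = "text"
        else:
            handle_image(ps_path)
            results[ps_path.name] = "image"
    return results
```

Passing the classifier and handlers in as arguments keeps the dispatch logic testable without a PostScript toolchain installed. In the real class these steps are methods on the bolt itself.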