OCR API using pix2struct

Bases: Bolt

__init__(input, output, state, **kwargs)

The Pix2StructImageOCRAPI class performs OCR on images using Google's Pix2Struct model. It exposes a single-image OCR endpoint at /api/v1/ocr. The endpoint accepts a POST request whose JSON payload carries a base64-encoded image under the key image_base64, and it returns a JSON response with the recognized text under the key ocr_text.
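The request and response format described above can be exercised with a small client. The following is a minimal sketch using only the Python standard library. The host and port (localhost:3000, taken from the invocation examples on this page) and the helper names build_ocr_request and run_ocr are assumptions for illustration, not part of geniusrise.

```python
import base64
import json
import urllib.request


def build_ocr_request(
    image_bytes: bytes, base_url: str = "http://localhost:3000"
) -> urllib.request.Request:
    """Build the POST request for /api/v1/ocr (hypothetical helper).

    base_url is an assumption; point it at wherever the API is listening.
    """
    # The API expects the image as a base64 string under "image_base64".
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ocr",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ocr(image_path: str, base_url: str = "http://localhost:3000") -> str:
    """Send an image file to the endpoint and return the OCR text (hypothetical helper)."""
    with open(image_path, "rb") as f:
        request = build_ocr_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        # The OCR result is returned under the "ocr_text" key.
        return json.loads(response.read())["ocr_text"]
```

With the API running, run_ocr("page.png") would return the recognized text as a string. Building the request is kept separate from sending it so the payload can be inspected without a live server.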

Parameters:

Name         Type          Description                                  Default
input        BatchInput    Instance of BatchInput for reading data.     required
output       BatchOutput   Instance of BatchOutput for saving data.     required
state        State         Instance of State for maintaining state.     required
model_name   str           The name of the Pix2Struct model to use.     "google/pix2struct-large"
**kwargs                   Additional keyword arguments.                {}

Command Line Invocation with geniusrise

genius Pix2StructImageOCRAPI rise \
    batch \
        --bucket my_bucket \
        --s3_folder s3/input \
    batch \
        --bucket my_bucket \
        --s3_folder s3/output \
    none \
    listen \
        --args endpoint="*" port=3000 cors_domain="*" use_cuda=True

YAML Configuration with geniusrise

version: "1"
bolts:
    ocr_processing:
        name: "Pix2StructImageOCRAPI"
        method: "listen"
        args:
            endpoint: "*"
            port: 3000
            cors_domain: "*"
            use_cuda: true
        input:
            type: "batch"
            args:
                bucket: "my_bucket"
                s3_folder: "s3/input"
        output:
            type: "batch"
            args:
                bucket: "my_bucket"
                s3_folder: "s3/output"