Visual Question Answering

Bases: VisionAPI

VisualQAAPI extends VisionAPI to provide an interface for visual question answering (VQA) tasks. It answers questions about an image using deep learning models trained for VQA: each request contains an image and a question about that image, the API runs inference with the loaded model, and the predicted answer is returned.

Methods

answer_question(self): Receives an image and a question, returns the answer based on visual content.

Example CLI Usage:

genius VisualQAAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="llava-hf/bakLlava-v1-hf" \
            model_class="LlavaForConditionalGeneration" \
            processor_class="AutoProcessor" \
            device_map="cuda:0" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            max_memory=None \
            torchscript=False \
            compile=False \
            flash_attention=False \
            better_transformers=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

__init__(input, output, state, **kwargs)

Initializes the VisualQAAPI with configurations for input, output, state management, and any model-specific parameters for visual question answering tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input` | `BatchInput` | Configuration for the input data. | required |
| `output` | `BatchOutput` | Configuration for the output data. | required |
| `state` | `State` | State management for the API. | required |
| `**kwargs` | | Additional keyword arguments for extended functionality. | `{}` |

answer_question()

Endpoint for receiving an image together with a question and returning an answer based on the image's visual content. It parses a request containing a base64-encoded image and a question string, then uses the loaded model to predict the answer to the question.

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | A dictionary containing the original question and the predicted answer. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the required fields `image_base64` and `question` are not provided in the request. |
| `Exception` | If an error occurs during image processing or inference. |
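The request body can also be built programmatically before sending it to the endpoint. A minimal Python sketch, using only the standard library (the helper name `build_vqa_payload` is illustrative, not part of the API):

```python
import base64
import json


def build_vqa_payload(image_bytes: bytes, question: str) -> str:
    # Serialize the JSON body expected by /api/v1/answer_question:
    # the image as a base64 string plus the question text.
    return json.dumps({
        "image_base64": base64.b64encode(image_bytes).decode("utf-8"),
        "question": question,
    })


# For a real request, read the image from disk first:
# with open("cat.jpg", "rb") as f:
#     body = build_vqa_payload(f.read(), "How many cats are there?")
body = build_vqa_payload(b"\x89PNG...", "What is the color of the sky?")
```

The resulting string can be passed directly as the `-d` argument of the curl requests shown below.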

Example CURL Request:

curl -X POST localhost:3000/api/v1/answer_question \
    -H "Content-Type: application/json" \
    -d '{"image_base64": "<base64-encoded-image>", "question": "What is the color of the sky in the image?"}'

or

(base64 -w 0 test_images_segment_finetune/image1.jpg | awk '{print "{\"image_base64\": \""$0"\", \"question\": \"how many cats are there?\"}"}' > /tmp/image_payload.json)
curl -X POST http://localhost:3000/api/v1/answer_question \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/image_payload.json | jq
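The same authenticated request can be assembled with Python's standard library. A sketch assuming the endpoint, port, and `user:password` credentials from the CLI example above; the payload placeholder is illustrative:

```python
import base64
import urllib.request

# Endpoint and credentials mirror the CLI args (port=3000, username="user", password="password").
url = "http://localhost:3000/api/v1/answer_question"
credentials = base64.b64encode(b"user:password").decode("ascii")

req = urllib.request.Request(
    url,
    data=b'{"image_base64": "<base64-encoded-image>", "question": "What is the color of the sky?"}',
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {credentials}",  # HTTP Basic auth, same as curl's -u flag
    },
    method="POST",
)
# Once the server is listening, urllib.request.urlopen(req) sends the request
# and returns a response whose body is the JSON described under Returns.
```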