Visual Question Answering

Bases: VisionAPI

VisualQAAPI extends VisionAPI with an HTTP interface for visual question answering (VQA). A request carries an image and a question about that image; the API runs inference with the loaded VQA model and returns the predicted answer.

Methods

answer_question(self): Receives an image and a question and returns an answer based on the image's visual content.

Example CLI Usage:

genius VisualQAAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="llava-hf/bakLlava-v1-hf" \
            model_class="LlavaForConditionalGeneration" \
            processor_class="AutoProcessor" \
            device_map="cuda:0" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            max_memory=None \
            torchscript=False \
            compile=False \
            flash_attention=False \
            better_transformers=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

`__init__(input, output, state, **kwargs)`

Initializes the VisualQAAPI with configurations for input, output, state management, and any model-specific parameters for visual question answering tasks.

Parameters:

Name	Type	Description	Default
`input`	`BatchInput`	Configuration for the input data.	required
`output`	`BatchOutput`	Configuration for the output data.	required
`state`	`State`	State management for the API.	required
`**kwargs`		Additional keyword arguments for extended functionality.	`{}`

`answer_question()`

Endpoint that receives an image and a question and returns an answer based on the image's visual content. The request body is JSON with a base64-encoded image (`image_base64`) and a question string (`question`); the loaded model predicts the answer to the question.

Returns:

Type	Description
`Dict[str, Any]`	A dictionary containing the original question and the predicted answer.

Raises:

Type	Description
`ValueError`	If either required field, `image_base64` or `question`, is missing from the request.
`Exception`	If an error occurs during image processing or inference.
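
The field check described in the table above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the helper name `validate_vqa_request` and the exact error messages are hypothetical, and only the field names `image_base64` and `question` come from this page.

```python
import base64
import binascii
from typing import Any, Dict

# Field names as documented for answer_question().
REQUIRED_FIELDS = ("image_base64", "question")


def validate_vqa_request(data: Dict[str, Any]) -> bytes:
    """Hypothetical helper: check required fields and decode the image.

    Raises ValueError if a field is missing or the image is not valid base64.
    Returns the raw image bytes on success.
    """
    missing = [name for name in REQUIRED_FIELDS if not data.get(name)]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return base64.b64decode(data["image_base64"], validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("image_base64 is not valid base64") from exc
```

Decoding the image during validation means a malformed payload fails before any model inference is attempted.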

Example CURL Request:

curl -X POST localhost:3000/api/v1/answer_question \
    -H "Content-Type: application/json" \
    -d '{"image_base64": "<base64-encoded-image>", "question": "What is the color of the sky in the image?"}'

or

(base64 -w 0 test_images_segment_finetune/image1.jpg | awk '{print "{\"image_base64\": \""$0"\", \"question\": \"how many cats are there?\"}"}' > /tmp/image_payload.json)
curl -X POST http://localhost:3000/api/v1/answer_question \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/image_payload.json | jq
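
The same request can also be issued from Python. The sketch below uses only the standard library and assumes the server accepts HTTP Basic authentication (which is what `curl -u` sends by default) with the `user`/`password` credentials from the CLI example. The helper names `build_payload`, `build_request`, and `ask` are illustrative and not part of the API.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

# URL taken from the curl examples above.
DEFAULT_URL = "http://localhost:3000/api/v1/answer_question"


def build_payload(image_bytes: bytes, question: str) -> Dict[str, str]:
    """Build the JSON body with the two documented fields."""
    return {
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "question": question,
    }


def build_request(url: str, payload: Dict[str, str],
                  username: str, password: str) -> urllib.request.Request:
    """Create a POST request with JSON body and a Basic auth header (assumed scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def ask(image_path: str, question: str, url: str = DEFAULT_URL,
        username: str = "user", password: str = "password") -> Dict[str, Any]:
    """Send an image file and a question to the endpoint; return the parsed JSON reply."""
    with open(image_path, "rb") as fh:
        payload = build_payload(fh.read(), question)
    request = build_request(url, payload, username, password)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running server; the image path is the one used in the curl example.
    print(ask("test_images_segment_finetune/image1.jpg", "how many cats are there?"))
```

Building the payload in a separate function from sending it lets the request body be inspected or tested without a running server.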
