
Question Answering

Bases: TextBulk

QABulk is a class designed for managing bulk question-answering tasks using Hugging Face models. It is capable of handling both traditional text-based QA and table-based QA (using TAPAS and TAPEX models), providing a versatile solution for automated question answering at scale.

Parameters:

input (BatchInput): Configuration and data inputs for batch processing. Required.
output (BatchOutput): Configuration for output data handling. Required.
state (State): State management for the bulk QA task. Required.
**kwargs: Arbitrary keyword arguments for extended functionality. Default: {}.

Example CLI Usage:

# For traditional text-based QA:
genius QABulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/qa-traditional \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/qa-traditional \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id distilbert-base-uncased-distilled-squad-lol \
    answer_questions \
        --args \
            model_name="distilbert-base-uncased-distilled-squad" \
            model_class="AutoModelForQuestionAnswering" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False

# For table-based QA using TAPAS:
genius QABulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/qa-table \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/qa-table \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id google/tapas-base-finetuned-wtq-lol \
    answer_questions \
        --args \
            model_name="google/tapas-base-finetuned-wtq" \
            model_class="AutoModelForTableQuestionAnswering" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False

# For table-based QA using TAPEX:
genius QABulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/qa-table \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/qa-table \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id microsoft/tapex-large-finetuned-wtq-lol \
    answer_questions \
        --args \
            model_name="microsoft/tapex-large-finetuned-wtq" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False

__init__(input, output, state, **kwargs)

Initializes the QABulk class with configurations for input, output, and state.

Parameters:

input (BatchInput): Configuration for the input data. Required.
output (BatchOutput): Configuration for the output data. Required.
state (State): State management for the QA task. Required.
**kwargs (Any): Additional keyword arguments for extended functionality. Default: {}.

answer_questions(model_name, model_class='AutoModelForQuestionAnswering', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)

Perform bulk question-answering using the specified model and tokenizer. This method can handle various types of QA models including traditional, TAPAS, and TAPEX.
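As an illustration of this dispatch (a sketch, not the library's actual implementation — the helper name and return values here are hypothetical), selecting the QA mode from the model name could look like:

```python
def select_qa_mode(model_name: str) -> str:
    """Hypothetical helper: pick a QA mode from the model name.

    For illustration only; QABulk's real dispatch logic may differ.
    """
    name = model_name.lower()
    if "tapas" in name:
        return "table-tapas"  # TAPAS-style table QA (AutoModelForTableQuestionAnswering)
    if "tapex" in name:
        return "table-tapex"  # TAPEX-style table QA (AutoModelForSeq2SeqLM)
    return "text"             # traditional extractive text QA


print(select_qa_mode("google/tapas-base-finetuned-wtq"))          # table-tapas
print(select_qa_mode("microsoft/tapex-large-finetuned-wtq"))      # table-tapex
print(select_qa_mode("distilbert-base-uncased-distilled-squad"))  # text
```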

Parameters:

model_name (str): Name or path of the question-answering model. Required.
model_class (str): Class name of the model (e.g., "AutoModelForQuestionAnswering"). Default: "AutoModelForQuestionAnswering".
tokenizer_class (str): Class name of the tokenizer (e.g., "AutoTokenizer"). Default: "AutoTokenizer".
use_cuda (bool): Whether to use CUDA for model inference. Default: False.
precision (str): Precision for model computation. Default: "float16".
quantization (int): Level of quantization for optimizing model size and speed. Default: 0.
device_map (str | Dict | None): Specific device(s) to use for computation. Default: "auto".
max_memory (Dict): Maximum memory configuration for devices. Default: {0: "24GB"}.
torchscript (bool): Whether to use a TorchScript-optimized version of the pre-trained language model. Default: False.
compile (bool): Whether to compile the model before inference. Default: False.
awq_enabled (bool): Whether to enable AWQ optimization. Default: False.
flash_attention (bool): Whether to use flash attention optimization. Default: False.
batch_size (int): Number of questions to process simultaneously. Default: 32.
notification_email (Optional[str]): Email address to notify when the task completes. Default: None.
**kwargs (Any): Arbitrary keyword arguments for model and generation configurations. Default: {}.
Processing

The method processes the data in batches, utilizing the appropriate model based on the model name and generating answers for the questions provided in the dataset.
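The batching step can be sketched in isolation (assuming an in-memory list of context/question pairs; the real method reads examples from the input folder and runs model inference on each batch):

```python
def chunk(pairs, batch_size=32):
    """Yield successive batches of (context, question) pairs."""
    for i in range(0, len(pairs), batch_size):
        yield pairs[i : i + batch_size]


# 70 toy examples split with the default batch_size of 32.
data = [("context %d" % i, "question %d" % i) for i in range(70)]
batches = list(chunk(data, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```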

load_dataset(dataset_path, max_length=512, **kwargs)

Load a dataset from a directory.

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"context": "The context content", "question": "The question"}

CSV

Should contain 'context' and 'question' columns.

context,question
"The context content","The question"

Parquet

Should contain 'context' and 'question' columns.

JSON

An array of dictionaries with 'context' and 'question' keys.

[{"context": "The context content", "question": "The question"}]

XML

Each 'record' element should contain 'context' and 'question' elements.

<record>
    <context>The context content</context>
    <question>The question</question>
</record>
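For example, records in the XML layout above can be parsed with the standard library (a sketch; `load_dataset`'s actual parsing may differ — a `<records>` root element is assumed here to make the document well-formed):

```python
import xml.etree.ElementTree as ET

xml_data = """<records>
  <record>
    <context>The context content</context>
    <question>The question</question>
  </record>
</records>"""

root = ET.fromstring(xml_data)
examples = [
    {"context": r.findtext("context"), "question": r.findtext("question")}
    for r in root.findall("record")
]
print(examples)  # [{'context': 'The context content', 'question': 'The question'}]
```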

YAML

Each document should be a dictionary with 'context' and 'question' keys.

- context: "The context content"
  question: "The question"

TSV

Should contain 'context' and 'question' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'context' and 'question' columns.

SQLite (.db)

Should contain a table with 'context' and 'question' columns.

Feather

Should contain 'context' and 'question' columns.
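To illustrate two of the layouts above, a small input dataset in JSONL and CSV form can be produced with the standard library (the file names here are arbitrary examples, not names the loader requires):

```python
import csv
import json
import os
import tempfile

examples = [
    {"context": "The context content", "question": "The question"},
    {"context": "Another context", "question": "Another question"},
]

dataset_dir = tempfile.mkdtemp()

# JSONL: one JSON object per line.
with open(os.path.join(dataset_dir, "data.jsonl"), "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# CSV: 'context' and 'question' columns.
with open(os.path.join(dataset_dir, "data.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["context", "question"])
    writer.writeheader()
    writer.writerows(examples)

print(sorted(os.listdir(dataset_dir)))  # ['data.csv', 'data.jsonl']
```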

Parameters:

dataset_path (str): The path to the dataset directory. Required.
max_length (int): The maximum length of the sequences. Default: 512.
**kwargs: Additional loading options. Keys such as pad_on_right (whether to pad on the right), doc_stride (the document stride), and evaluate_squadv2 (whether to evaluate using SQuAD v2 metrics) may be supplied here.

Returns:

Dataset (Optional[Dataset]): The loaded dataset.