Question Answering¶
Bases: TextBulk
QABulk is a class designed for managing bulk question-answering tasks using Hugging Face models. It is capable of handling both traditional text-based QA and table-based QA (using TAPAS and TAPEX models), providing a versatile solution for automated question answering at scale.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration and data inputs for batch processing. | required |
| output | BatchOutput | Configurations for output data handling. | required |
| state | State | State management for the bulk QA task. | required |
| **kwargs |  | Arbitrary keyword arguments for extended functionality. | {} |
Example CLI Usage:
# For traditional text-based QA:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-traditional \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-traditional \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id distilbert-base-uncased-distilled-squad-lol \
answer_questions \
--args \
model_name="distilbert-base-uncased-distilled-squad" \
model_class="AutoModelForQuestionAnswering" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="bfloat16" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
# For table-based QA using TAPAS:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-table \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-table \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id google/tapas-base-finetuned-wtq-lol \
answer_questions \
--args \
model_name="google/tapas-base-finetuned-wtq" \
model_class="AutoModelForTableQuestionAnswering" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="float" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
# For table-based QA using TAPEX:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-table \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-table \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id microsoft/tapex-large-finetuned-wtq-lol \
answer_questions \
--args \
model_name="microsoft/tapex-large-finetuned-wtq" \
model_class="AutoModelForSeq2SeqLM" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="float" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
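The checkpoints in these examples map onto standard Hugging Face pipelines: an extractive question-answering model for text, plus TAPAS (table question answering) or TAPEX (a seq2seq model) for tables. The sketch below uses plain transformers rather than QABulk itself and is only meant to illustrate the two modes; it is not the internal implementation.

```python
# Illustrative sketch of the two QA modes, using plain Hugging Face pipelines.
# The model names match the CLI examples above; the questions, contexts, and
# table contents are purely illustrative.
import pandas as pd
from transformers import pipeline

# Traditional extractive QA: the answer is a span of the context.
text_qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)
print(text_qa(
    question="Where is the Eiffel Tower located?",
    context="The Eiffel Tower is located in Paris, France.",
))

# Table-based QA with TAPAS: the context is a table of strings.
table_qa = pipeline(
    "table-question-answering",
    model="google/tapas-base-finetuned-wtq",
)
table = pd.DataFrame(
    {"City": ["Paris", "Berlin"], "Population": ["2161000", "3645000"]}
)
print(table_qa(table=table, query="What is the population of Paris?"))
```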
__init__(input, output, state, **kwargs)¶
Initializes the QABulk class with configurations for input, output, and state.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration for the input data. | required |
| output | BatchOutput | Configuration for the output data. | required |
| state | State | State management for the QA task. | required |
| **kwargs | Any | Additional keyword arguments for extended functionality. | {} |
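A programmatic equivalent of the CLI setup might look like the sketch below. The import paths and the BatchInput/BatchOutput constructor arguments are assumptions based on typical geniusrise usage, not verified API; adjust them to your installed versions of geniusrise and geniusrise-text.

```python
# Hedged sketch: programmatic construction of QABulk.
# Import paths and constructor arguments are assumptions, not verified.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import QABulk

qa_bulk = QABulk(
    input=BatchInput("./data/input", "geniusrise-test", "input/qa-traditional"),
    output=BatchOutput("./data/output", "geniusrise-test", "output/qa-traditional"),
    state=InMemoryState(),
)
```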
answer_questions(model_name, model_class='AutoModelForQuestionAnswering', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)¶
Perform bulk question-answering using the specified model and tokenizer. This method can handle various types of QA models including traditional, TAPAS, and TAPEX.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | Name or path of the question-answering model. | required |
| model_class | str | Class name of the model (e.g., "AutoModelForQuestionAnswering"). | 'AutoModelForQuestionAnswering' |
| tokenizer_class | str | Class name of the tokenizer (e.g., "AutoTokenizer"). | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference. Defaults to False. | False |
| precision | str | Precision for model computation. Defaults to "float16". | 'float16' |
| quantization | int | Level of quantization for optimizing model size and speed. Defaults to 0. | 0 |
| device_map | str \| Dict \| None | Specific device to use for computation. Defaults to "auto". | 'auto' |
| max_memory | Dict | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
| compile | bool | Whether to compile the model before inference. Defaults to False. | False |
| awq_enabled | bool | Whether to enable AWQ optimization. Defaults to False. | False |
| flash_attention | bool | Whether to use flash attention optimization. Defaults to False. | False |
| batch_size | int | Number of questions to process simultaneously. Defaults to 32. | 32 |
| **kwargs | Any | Arbitrary keyword arguments for model and generation configurations. | {} |
Processing
The method processes the data in batches, utilizing the appropriate model based on the model name and generating answers for the questions provided in the dataset.
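Continuing the instantiation sketch above, a direct call mirroring the --args of the first CLI example might look like the following; the parameter values are taken from that example, and the call itself is a sketch rather than verified output.

```python
# Hedged sketch: mirrors the --args of the traditional text-QA CLI example.
qa_bulk.answer_questions(
    model_name="distilbert-base-uncased-distilled-squad",
    model_class="AutoModelForQuestionAnswering",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    quantization=0,
    device_map="cuda:0",
    max_memory=None,
    torchscript=False,
    batch_size=32,
)
```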
load_dataset(dataset_path, max_length=512, **kwargs)¶
Load a dataset from a directory.
Supported Data Formats and Structures:¶
JSONL¶
Each line is a JSON object representing an example (a sample appears after this list of formats).
CSV¶
Should contain 'context' and 'question' columns.
Parquet¶
Should contain 'context' and 'question' columns.
JSON¶
An array of dictionaries with 'context' and 'question' keys.
XML¶
Each 'record' element should contain 'context' and 'question' elements.
YAML¶
Each document should be a dictionary with 'context' and 'question' keys.
TSV¶
Should contain 'context' and 'question' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'context' and 'question' columns.
SQLite (.db)¶
Should contain a table with 'context' and 'question' columns.
Feather¶
Should contain 'context' and 'question' columns.
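For example, a JSONL input file for the text-QA case contains one JSON object per line with 'context' and 'question' keys; the records below are purely illustrative:

```json
{"context": "The Eiffel Tower is located in Paris, France.", "question": "Where is the Eiffel Tower located?"}
{"context": "Python was created by Guido van Rossum.", "question": "Who created the Python language?"}
```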
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| pad_on_right | bool | Whether to pad on the right. | required |
| max_length | int | The maximum length of the sequences. | 512 |
| doc_stride | int | The document stride. | required |
| evaluate_squadv2 | bool | Whether to evaluate using SQuAD v2 metrics. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| Dataset | Optional[Dataset] | The loaded dataset. |
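As a hedged usage sketch, continuing the QABulk instance from the earlier examples and pointing at a directory containing files in any of the formats listed above:

```python
# Hedged sketch: load a directory of QA records (e.g., the JSONL example above).
dataset = qa_bulk.load_dataset("./data/input", max_length=512)
if dataset is not None:
    print(dataset[0])  # expected keys: 'context' and 'question'
```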