Question Answering¶
Bases: TextBulk
QABulk is a class designed for managing bulk question-answering tasks using Hugging Face models. It is capable of handling both traditional text-based QA and table-based QA (using TAPAS and TAPEX models), providing a versatile solution for automated question answering at scale.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration and data inputs for batch processing. | required |
| output | BatchOutput | Configurations for output data handling. | required |
| state | State | State management for the bulk QA task. | required |
| **kwargs |  | Arbitrary keyword arguments for extended functionality. | {} |
Example CLI Usage:
# For traditional text-based QA:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-traditional \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-traditional \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id distilbert-base-uncased-distilled-squad-lol \
answer_questions \
--args \
model_name="distilbert-base-uncased-distilled-squad" \
model_class="AutoModelForQuestionAnswering" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="bfloat16" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
# For table-based QA using TAPAS:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-table \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-table \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id google/tapas-base-finetuned-wtq-lol \
answer_questions \
--args \
model_name="google/tapas-base-finetuned-wtq" \
model_class="AutoModelForTableQuestionAnswering" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="float" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
# For table-based QA using TAPEX:
genius QABulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/qa-table \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/qa-table \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id microsoft/tapex-large-finetuned-wtq-lol \
answer_questions \
--args \
model_name="microsoft/tapex-large-finetuned-wtq" \
model_class="AutoModelForSeq2SeqLM" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="float" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False
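The checkpoints in these examples map onto standard Hugging Face pipelines: an extractive question-answering model for text, plus TAPAS (table question answering) or TAPEX (a seq2seq model) for tables. The sketch below uses plain transformers rather than QABulk itself and is only meant to illustrate the two modes; it is not the internal implementation.

```python
# Illustrative sketch of the two QA modes, using plain Hugging Face pipelines.
# The model names match the CLI examples above; the questions, contexts, and
# table contents are purely illustrative.
import pandas as pd
from transformers import pipeline

# Traditional extractive QA: the answer is a span of the context.
text_qa = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)
print(text_qa(
    question="Where is the Eiffel Tower located?",
    context="The Eiffel Tower is located in Paris, France.",
))

# Table-based QA with TAPAS: the context is a table of strings.
table_qa = pipeline(
    "table-question-answering",
    model="google/tapas-base-finetuned-wtq",
)
table = pd.DataFrame(
    {"City": ["Paris", "Berlin"], "Population": ["2161000", "3645000"]}
)
print(table_qa(table=table, query="What is the population of Paris?"))
```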
__init__(input, output, state, **kwargs)¶
Initializes the QABulk class with configurations for input, output, and state.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration for the input data. | required |
| output | BatchOutput | Configuration for the output data. | required |
| state | State | State management for the QA task. | required |
| **kwargs | Any | Additional keyword arguments for extended functionality. | {} |
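A programmatic equivalent of the CLI setup might look like the sketch below. The import paths and the BatchInput/BatchOutput constructor arguments are assumptions based on typical geniusrise usage, not verified API; adjust them to your installed versions of geniusrise and geniusrise-text.

```python
# Hedged sketch: programmatic construction of QABulk.
# Import paths and constructor arguments are assumptions, not verified.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import QABulk

qa_bulk = QABulk(
    input=BatchInput("./data/input", "geniusrise-test", "input/qa-traditional"),
    output=BatchOutput("./data/output", "geniusrise-test", "output/qa-traditional"),
    state=InMemoryState(),
)
```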
answer_questions(model_name, model_class='AutoModelForQuestionAnswering', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)¶
Perform bulk question-answering using the specified model and tokenizer. This method can handle various types of QA models including traditional, TAPAS, and TAPEX.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | Name or path of the question-answering model. | required |
| model_class | str | Class name of the model (e.g., "AutoModelForQuestionAnswering"). | 'AutoModelForQuestionAnswering' |
| tokenizer_class | str | Class name of the tokenizer (e.g., "AutoTokenizer"). | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference. Defaults to False. | False |
| precision | str | Precision for model computation. Defaults to "float16". | 'float16' |
| quantization | int | Level of quantization for optimizing model size and speed. Defaults to 0. | 0 |
| device_map | str \| Dict \| None | Specific device to use for computation. Defaults to "auto". | 'auto' |
| max_memory | Dict | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
| compile | bool | Whether to compile the model before inference. Defaults to False. | False |
| awq_enabled | bool | Whether to enable AWQ optimization. Defaults to False. | False |
| flash_attention | bool | Whether to use flash attention optimization. Defaults to False. | False |
| batch_size | int | Number of questions to process simultaneously. Defaults to 32. | 32 |
| **kwargs | Any | Arbitrary keyword arguments for model and generation configurations. | {} |
Processing
The method processes the data in batches, utilizing the appropriate model based on the model name and generating answers for the questions provided in the dataset.
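Continuing the instantiation sketch above, a direct call mirroring the --args of the first CLI example might look like the following; the parameter values are taken from that example, and the call itself is a sketch rather than verified output.

```python
# Hedged sketch: mirrors the --args of the traditional text-QA CLI example.
qa_bulk.answer_questions(
    model_name="distilbert-base-uncased-distilled-squad",
    model_class="AutoModelForQuestionAnswering",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    quantization=0,
    device_map="cuda:0",
    max_memory=None,
    torchscript=False,
    batch_size=32,
)
```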
load_dataset(dataset_path, max_length=512, **kwargs)¶
Load a dataset from a directory.
Supported Data Formats and Structures:¶
JSONL¶
Each line is a JSON object representing an example (a sample appears after this list of formats).
CSV¶
Should contain 'context' and 'question' columns.
Parquet¶
Should contain 'context' and 'question' columns.
JSON¶
An array of dictionaries with 'context' and 'question' keys.
XML¶
Each 'record' element should contain 'context' and 'question' elements.
YAML¶
Each document should be a dictionary with 'context' and 'question' keys.
TSV¶
Should contain 'context' and 'question' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'context' and 'question' columns.
SQLite (.db)¶
Should contain a table with 'context' and 'question' columns.
Feather¶
Should contain 'context' and 'question' columns.
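For example, a JSONL input file for the text-QA case contains one JSON object per line with 'context' and 'question' keys; the records below are purely illustrative:

```json
{"context": "The Eiffel Tower is located in Paris, France.", "question": "Where is the Eiffel Tower located?"}
{"context": "Python was created by Guido van Rossum.", "question": "Who created the Python language?"}
```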
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| pad_on_right | bool | Whether to pad on the right. | required |
| max_length | int | The maximum length of the sequences. | 512 |
| doc_stride | int | The document stride. | required |
| evaluate_squadv2 | bool | Whether to evaluate using SQuAD v2 metrics. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| Dataset | Optional[Dataset] | The loaded dataset. |
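As a hedged usage sketch, continuing the QABulk instance from the earlier examples and pointing at a directory containing files in any of the formats listed above:

```python
# Hedged sketch: load a directory of QA records (e.g., the JSONL example above).
dataset = qa_bulk.load_dataset("./data/input", max_length=512)
if dataset is not None:
    print(dataset[0])  # expected keys: 'context' and 'question'
```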