# Question Answering

Bases: `TextFineTuner`

A bolt for fine-tuning Hugging Face models on question answering tasks.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | The batch input data. | required |
| `output` | `OutputConfig` | The output data. | required |
| `state` | `State` | The state manager. | required |
**CLI Usage:**

```bash
genius QAFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id microsoft/tapex-large-finetuned-wtq-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8
```
## `__init__(input, output, state, **kwargs)`
## `compute_metrics(eval_pred)`

Compute the accuracy of the model's predictions.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `eval_pred` | `tuple` | A tuple containing two elements: `predictions` (`np.ndarray`), the model's predictions, and `label_ids` (`np.ndarray`), the true labels. | required |
**Returns:**

| Name | Type | Description |
|---|---|---|
| `dict` | `Optional[Dict[str, float]]` | A dictionary mapping metric names to computed values. |
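The docstring above only says the metric is accuracy, so as a rough, hypothetical sketch (not the bolt's actual implementation), a function with this signature could be written as:

```python
from typing import Dict, Optional, Tuple

import numpy as np


def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray]) -> Optional[Dict[str, float]]:
    """Hypothetical sketch: accuracy from (predictions, label_ids)."""
    predictions, label_ids = eval_pred
    # Reduce the prediction scores to ids and compare them with the labels.
    predicted_ids = np.argmax(predictions, axis=-1)
    accuracy = float((predicted_ids == label_ids).mean())
    return {"accuracy": accuracy}
```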
## `load_dataset(dataset_path, pad_on_right=True, max_length=None, doc_stride=None, evaluate_squadv2=False, **kwargs)`

Load a dataset from a directory.
### Supported Data Formats and Structures

#### JSONL

Each line is a JSON object representing an example.

```
{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}
```
#### CSV

Should contain 'context', 'question', and 'answers' columns.

```
context,question,answers
"The context content","The question","{'answer_start': [int], 'text': [str]}"
```
#### Parquet

Should contain 'context', 'question', and 'answers' columns.
#### JSON

An array of dictionaries with 'context', 'question', and 'answers' keys.

```
[{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}]
```
#### XML

Each 'record' element should contain 'context', 'question', and 'answers' child elements.

```
<record>
  <context>The context content</context>
  <question>The question</question>
  <answers answer_start="int" text="str"></answers>
</record>
```
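A record in this shape can be read with the standard library; note that the attribute-based 'answers' element has to be unpacked into the list-valued schema used by the other formats. The parsing below is an illustration, not the bolt's own code:

```python
import xml.etree.ElementTree as ET

xml_text = """
<record>
  <context>The context content</context>
  <question>The question</question>
  <answers answer_start="4" text="context content"></answers>
</record>
"""

record = ET.fromstring(xml_text.strip())
answers = record.find("answers")
example = {
    "context": record.findtext("context"),
    "question": record.findtext("question"),
    # XML attributes are strings, so answer_start is converted to int here.
    "answers": {
        "answer_start": [int(answers.get("answer_start"))],
        "text": [answers.get("text")],
    },
}
print(example)
```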
#### YAML

Each document should be a dictionary with 'context', 'question', and 'answers' keys.
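The source does not show a YAML sample; following the pattern of the other formats, one document would presumably look like this (assumed layout, not taken from the source):

```
---
context: The context content
question: The question
answers:
  answer_start: [int]
  text: [str]
```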
#### TSV

Should contain 'context', 'question', and 'answers' columns separated by tabs.
#### Excel (.xls, .xlsx)

Should contain 'context', 'question', and 'answers' columns.
#### SQLite (.db)

Should contain a table with 'context', 'question', and 'answers' columns.
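As an illustration, a compatible database could be created with the standard library; the file and table names below are guesses, and only the column names come from the description above:

```python
import sqlite3

conn = sqlite3.connect("dataset.db")  # hypothetical file inside the input folder
conn.execute(
    # 'dataset' is a hypothetical table name; columns follow the docs above.
    "CREATE TABLE IF NOT EXISTS dataset (context TEXT, question TEXT, answers TEXT)"
)
conn.execute(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    (
        "The context content",
        "The question",
        '{"answer_start": [4], "text": ["context content"]}',
    ),
)
conn.commit()
conn.close()
```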
#### Feather

Should contain 'context', 'question', and 'answers' columns.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `pad_on_right` | `bool` | Whether to pad on the right. | `True` |
| `max_length` | `int` | The maximum length of the sequences. | `None` |
| `doc_stride` | `int` | The document stride. | `None` |
| `evaluate_squadv2` | `bool` | Whether to evaluate using SQuAD v2 metrics. | `False` |
**Returns:**

| Name | Type | Description |
|---|---|---|
| `Dataset` | `Optional[Dataset]` | The loaded dataset. |
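The `Dataset` return type here appears to be a Hugging Face `datasets.Dataset`; assuming so, records in the schema above map onto it roughly like this (a sketch built with the `datasets` library directly, not via the bolt's loader):

```python
from datasets import Dataset

# A tiny Dataset in the same schema the loader expects; the record is illustrative.
records = {
    "context": ["The context content"],
    "question": ["The question"],
    "answers": [{"answer_start": [4], "text": ["context content"]}],
}
dataset = Dataset.from_dict(records)
print(dataset[0]["question"])
```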
## `prepare_train_features(examples, cls_token_id=None)`

Tokenize our examples with truncation and padding, but keep the overflows using a stride.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `Dict[str, Union[str, List[str]]]` | The examples to be tokenized. | required |
**Returns:**

| Type | Description |
|---|---|
| `Optional[Dict[str, Union[List[int], List[List[int]]]]]` | The tokenized examples. |
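The truncation-with-stride behaviour described here follows the standard Hugging Face question-answering tokenization pattern; a minimal sketch of that pattern, with an arbitrary tokenizer and placeholder hyperparameters rather than the bolt's configured ones, looks like this:

```python
from transformers import AutoTokenizer

# Placeholder model; any fast tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

examples = {
    "question": ["The question"],
    "context": ["The context content " * 200],  # long context to force overflow
}

tokenized = tokenizer(
    examples["question"],
    examples["context"],
    truncation="only_second",        # truncate the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive overflow chunks
    return_overflowing_tokens=True,  # keep the overflowing chunks as extra features
    return_offsets_mapping=True,
    padding="max_length",
)

# One long example can yield several features (overflow chunks).
print(len(tokenized["input_ids"]))
```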