# Question Answering

Bases: `TextFineTuner`

A bolt for fine-tuning Hugging Face models on question answering tasks.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | The batch input data. | required |
| `output` | `OutputConfig` | The output data. | required |
| `state` | `State` | The state manager. | required |
**CLI Usage:**

```bash
genius QAFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id microsoft/tapex-large-finetuned-wtq-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8
```
## `__init__(input, output, state, **kwargs)`
## `compute_metrics(eval_pred)`

Compute the accuracy of the model's predictions.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `eval_pred` | `tuple` | A tuple containing two elements: `predictions` (`np.ndarray`), the model's predictions, and `label_ids` (`np.ndarray`), the true labels. | required |
**Returns:**

| Name | Type | Description |
|---|---|---|
| `dict` | `Optional[Dict[str, float]]` | A dictionary mapping metric names to computed values. |
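The docstring above only says the metric is accuracy, so as a rough, hypothetical sketch (not the bolt's actual implementation), a function with this signature could be written as:

```python
from typing import Dict, Optional, Tuple

import numpy as np


def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray]) -> Optional[Dict[str, float]]:
    """Hypothetical sketch: accuracy from (predictions, label_ids)."""
    predictions, label_ids = eval_pred
    # Reduce the prediction scores to ids and compare them with the labels.
    predicted_ids = np.argmax(predictions, axis=-1)
    accuracy = float((predicted_ids == label_ids).mean())
    return {"accuracy": accuracy}
```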
## `load_dataset(dataset_path, pad_on_right=True, max_length=None, doc_stride=None, evaluate_squadv2=False, **kwargs)`

Load a dataset from a directory.
### Supported Data Formats and Structures

#### JSONL

Each line is a JSON object representing an example.

```
{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}
```
#### CSV

Should contain 'context', 'question', and 'answers' columns.

```
context,question,answers
"The context content","The question","{'answer_start': [int], 'text': [str]}"
```
#### Parquet

Should contain 'context', 'question', and 'answers' columns.
#### JSON

An array of dictionaries with 'context', 'question', and 'answers' keys.

```
[{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}]
```
#### XML

Each 'record' element should contain 'context', 'question', and 'answers' child elements.

```
<record>
  <context>The context content</context>
  <question>The question</question>
  <answers answer_start="int" text="str"></answers>
</record>
```
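A record in this shape can be read with the standard library; note that the attribute-based 'answers' element has to be unpacked into the list-valued schema used by the other formats. The parsing below is an illustration, not the bolt's own code:

```python
import xml.etree.ElementTree as ET

xml_text = """
<record>
  <context>The context content</context>
  <question>The question</question>
  <answers answer_start="4" text="context content"></answers>
</record>
"""

record = ET.fromstring(xml_text.strip())
answers = record.find("answers")
example = {
    "context": record.findtext("context"),
    "question": record.findtext("question"),
    # XML attributes are strings, so answer_start is converted to int here.
    "answers": {
        "answer_start": [int(answers.get("answer_start"))],
        "text": [answers.get("text")],
    },
}
print(example)
```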
#### YAML

Each document should be a dictionary with 'context', 'question', and 'answers' keys.
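The source does not show a YAML sample; following the pattern of the other formats, one document would presumably look like this (assumed layout, not taken from the source):

```
---
context: The context content
question: The question
answers:
  answer_start: [int]
  text: [str]
```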
#### TSV

Should contain 'context', 'question', and 'answers' columns separated by tabs.
#### Excel (.xls, .xlsx)

Should contain 'context', 'question', and 'answers' columns.
#### SQLite (.db)

Should contain a table with 'context', 'question', and 'answers' columns.
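As an illustration, a compatible database could be created with the standard library; the file and table names below are guesses, and only the column names come from the description above:

```python
import sqlite3

conn = sqlite3.connect("dataset.db")  # hypothetical file inside the input folder
conn.execute(
    # 'dataset' is a hypothetical table name; columns follow the docs above.
    "CREATE TABLE IF NOT EXISTS dataset (context TEXT, question TEXT, answers TEXT)"
)
conn.execute(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    (
        "The context content",
        "The question",
        '{"answer_start": [4], "text": ["context content"]}',
    ),
)
conn.commit()
conn.close()
```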
#### Feather

Should contain 'context', 'question', and 'answers' columns.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `pad_on_right` | `bool` | Whether to pad on the right. | `True` |
| `max_length` | `int` | The maximum length of the sequences. | `None` |
| `doc_stride` | `int` | The document stride. | `None` |
| `evaluate_squadv2` | `bool` | Whether to evaluate using SQuAD v2 metrics. | `False` |
**Returns:**

| Name | Type | Description |
|---|---|---|
| `Dataset` | `Optional[Dataset]` | The loaded dataset. |
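The `Dataset` return type here appears to be a Hugging Face `datasets.Dataset`; assuming so, records in the schema above map onto it roughly like this (a sketch built with the `datasets` library directly, not via the bolt's loader):

```python
from datasets import Dataset

# A tiny Dataset in the same schema the loader expects; the record is illustrative.
records = {
    "context": ["The context content"],
    "question": ["The question"],
    "answers": [{"answer_start": [4], "text": ["context content"]}],
}
dataset = Dataset.from_dict(records)
print(dataset[0]["question"])
```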
## `prepare_train_features(examples, cls_token_id=None)`

Tokenize our examples with truncation and padding, but keep the overflows using a stride.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `Dict[str, Union[str, List[str]]]` | The examples to be tokenized. | required |
**Returns:**

| Type | Description |
|---|---|
| `Optional[Dict[str, Union[List[int], List[List[int]]]]]` | The tokenized examples. |
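The truncation-with-stride behaviour described here follows the standard Hugging Face question-answering tokenization pattern; a minimal sketch of that pattern, with an arbitrary tokenizer and placeholder hyperparameters rather than the bolt's configured ones, looks like this:

```python
from transformers import AutoTokenizer

# Placeholder model; any fast tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

examples = {
    "question": ["The question"],
    "context": ["The context content " * 200],  # long context to force overflow
}

tokenized = tokenizer(
    examples["question"],
    examples["context"],
    truncation="only_second",        # truncate the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive overflow chunks
    return_overflowing_tokens=True,  # keep the overflowing chunks as extra features
    return_offsets_mapping=True,
    padding="max_length",
)

# One long example can yield several features (overflow chunks).
print(len(tokenized["input_ids"]))
```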