Question Answering Fine Tuner
Bases: OpenAIFineTuner
A bolt for fine-tuning OpenAI models on question answering tasks.
CLI Usage:
genius HuggingFaceCommonsenseReasoningFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder train \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder model \
    fine_tune \
        --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8
YAML Configuration:
version: "1"
bolts:
    my_fine_tuner:
        name: "HuggingFaceCommonsenseReasoningFineTuner"
        method: "fine_tune"
        args:
            model_name: "my_model"
            tokenizer_name: "my_tokenizer"
            num_train_epochs: 3
            per_device_train_batch_size: 8
            data_max_length: 512
        input:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_dataset"
        output:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_model"
        deploy:
            type: k8s
            args:
                kind: deployment
                name: my_fine_tuner
                context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                namespace: geniusrise
                image: geniusrise/geniusrise
                kube_config_path: ~/.kube/config
Supported Data Formats
- JSONL
- CSV
- Parquet
- JSON
- XML
- YAML
- TSV
- Excel (.xls, .xlsx)
- SQLite (.db)
- Feather
load_dataset(dataset_path, **kwargs)
Load a dataset from a directory.
Supported Data Formats and Structures:
JSONL
Each line is a JSON object representing an example.
{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}
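The JSONL layout above can be produced with the standard library alone. The following is a minimal sketch, assuming the `answers` dictionary follows the SQuAD-style `answer_start` and `text` keys used in the CSV example; the record contents, the file name `train.jsonl`, and the helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of the library.

```python
import json
import os
import tempfile

# Hypothetical example records; the values are made up for illustration.
examples = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        # answer_start is the character offset of the answer inside the context.
        "answers": {"answer_start": [0], "text": ["Paris"]},
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write the file into a temporary directory that could serve as a dataset folder.
dataset_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(dataset_dir, "train.jsonl")
write_jsonl(examples, jsonl_path)
loaded = read_jsonl(jsonl_path)
```

Reading the file back right after writing it is a cheap way to confirm that every line is a valid, self-contained JSON object before uploading the folder.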
CSV
Should contain 'context', 'question', and 'answers' columns.
context,question,answers
"The context content","The question","{'answer_start': [int], 'text': [str]}"
Parquet
Should contain 'context', 'question', and 'answers' columns.
JSON
An array of dictionaries with 'context', 'question', and 'answers' keys.
[{"context": "The context content", "question": "The question", "answers": {"answer_start": [int], "text": [str]}}]
XML
Each 'record' element should contain 'context', 'question', and 'answers' child elements.
<record>
    <context>The context content</context>
    <question>The question</question>
    <answers answer_start="int" text="str"></answers>
</record>
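To make the XML structure concrete, here is a minimal sketch that turns 'record' elements into dictionaries with the same three keys as the other formats. It assumes that the records sit under a single root element (named `records` here, a hypothetical choice), that the answer string is stored in an attribute named `text`, and that `answer_start` holds an integer offset; the library's own parser may differ.

```python
import xml.etree.ElementTree as ET

# Made-up XML input; the <records> wrapper is an assumption of this sketch.
xml_text = """
<records>
    <record>
        <context>Paris is the capital of France.</context>
        <question>What is the capital of France?</question>
        <answers answer_start="0" text="Paris"></answers>
    </record>
</records>
"""


def parse_records(text):
    """Convert each <record> element into a question answering example."""
    root = ET.fromstring(text.strip())
    parsed = []
    for record in root.iter("record"):
        answers = record.find("answers")
        parsed.append(
            {
                "context": record.findtext("context"),
                "question": record.findtext("question"),
                "answers": {
                    # Attributes are strings, so the offset is cast to int.
                    "answer_start": [int(answers.get("answer_start"))],
                    "text": [answers.get("text")],
                },
            }
        )
    return parsed


xml_examples = parse_records(xml_text)
```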
YAML
Each document should be a dictionary with 'context', 'question', and 'answers' keys.
- context: "The context content"
  question: "The question"
  answers:
    answer_start: [int]
    text: [str]
TSV
Should contain 'context', 'question', and 'answers' columns separated by tabs.
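In delimited files such as TSV and CSV, the 'answers' column arrives as a string and has to be decoded into a dictionary before use. The sketch below shows one way to do that with the standard library, assuming the column holds a Python-style dictionary literal as in the CSV example; the sample row and the helper name `parse_delimited` are hypothetical.

```python
import ast
import csv
import io

# A made-up TSV snippet with the three expected columns.
tsv_text = (
    "context\tquestion\tanswers\n"
    "Paris is the capital of France.\tWhat is the capital of France?\t"
    "{'answer_start': [0], 'text': ['Paris']}\n"
)


def parse_delimited(text, delimiter="\t"):
    """Read delimited rows and decode the 'answers' column into a dict."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        # ast.literal_eval only evaluates Python literals, so it is safer than eval.
        row["answers"] = ast.literal_eval(row["answers"])
        rows.append(row)
    return rows


rows = parse_delimited(tsv_text)
```

Passing `delimiter=","` applies the same decoding to a CSV file, provided the 'answers' field is quoted so that its internal commas are not treated as column separators.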
Excel (.xls, .xlsx)
Should contain 'context', 'question', and 'answers' columns.
SQLite (.db)
Should contain a table with 'context', 'question', and 'answers' columns.
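As an illustration of the SQLite layout, the following sketch builds a table with the three expected columns and reads it back. The table name `qa` and the choice to store 'answers' as a JSON string are assumptions made for this example, since the text above does not specify either; an in-memory database is used so the snippet has no side effects.

```python
import json
import sqlite3

# Use ":memory:" for the sketch; a real dataset would use a path ending in .db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qa (context TEXT, question TEXT, answers TEXT)")

# Made-up example row; 'answers' is serialized to JSON text (an assumption).
conn.execute(
    "INSERT INTO qa (context, question, answers) VALUES (?, ?, ?)",
    (
        "Paris is the capital of France.",
        "What is the capital of France?",
        json.dumps({"answer_start": [0], "text": ["Paris"]}),
    ),
)
conn.commit()

# Read the rows back and decode the 'answers' column into a dictionary.
db_examples = [
    {"context": context, "question": question, "answers": json.loads(answers)}
    for context, question, answers in conn.execute(
        "SELECT context, question, answers FROM qa"
    )
]
conn.close()
```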
Feather
Should contain 'context', 'question', and 'answers' columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory. | required |
pad_on_right | bool | Whether to pad on the right. | required |
max_length | int | The maximum length of the sequences. | required |
doc_stride | int | The document stride. | required |
evaluate_squadv2 | bool | Whether to evaluate using SQuAD v2 metrics. | required |
Returns:
Name | Type | Description |
---|---|---|
Dataset | Union[Dataset, DatasetDict, Optional[Dataset]] | The loaded dataset. |
prepare_fine_tuning_data(data, data_type)
Prepare the given data for fine-tuning.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Union[Dataset, DatasetDict, Optional[Dataset]] | The dataset to prepare. | required |
data_type | str | Either 'train' or 'eval' to specify the type of data. | required |
Raises:
Type | Description |
---|---|
ValueError | If data_type is not 'train' or 'eval'. |
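The preparation step can be pictured as a validation of `data_type` followed by a conversion of each example into a training record. The sketch below is only an illustration under stated assumptions, not the library's implementation: the prompt/completion layout, the first-answer choice, and the helper names `to_prompt_completion` and `prepare_examples` are all hypothetical.

```python
def to_prompt_completion(example):
    """Turn one QA example into a prompt/completion pair (hypothetical layout)."""
    prompt = (
        f"Context: {example['context']}\n"
        f"Question: {example['question']}\n"
        "Answer:"
    )
    # Use the first listed answer as the target text (an assumption).
    completion = " " + example["answers"]["text"][0]
    return {"prompt": prompt, "completion": completion}


def prepare_examples(examples, data_type):
    """Validate data_type, then convert every example."""
    if data_type not in ("train", "eval"):
        raise ValueError(f"data_type must be 'train' or 'eval', got {data_type!r}")
    return [to_prompt_completion(example) for example in examples]


# Made-up sample in the documented shape.
sample = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": {"answer_start": [0], "text": ["Paris"]},
    }
]
prepared = prepare_examples(sample, "train")
```

Validating `data_type` before touching the data means a typo such as 'test' fails immediately instead of after a long conversion pass.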