# Language Model Fine Tuner

Bases: `OpenAIFineTuner`

A bolt for fine-tuning OpenAI models on language modeling tasks. This bolt uses the OpenAI API to fine-tune a pre-trained model for language modeling.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input` | `BatchInput` | The batch input data. | required |
`output` | `BatchOutput` | The output data. | required |
`state` | `State` | The state manager. | required |
`**kwargs` | | Additional keyword arguments. | required |
CLI Usage:

```bash
genius HuggingFaceCommonsenseReasoningFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder train \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder model \
    fine_tune \
        --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8
```
YAML Configuration:

```yaml
version: "1"
bolts:
    my_fine_tuner:
        name: "HuggingFaceCommonsenseReasoningFineTuner"
        method: "fine_tune"
        args:
            model_name: "my_model"
            tokenizer_name: "my_tokenizer"
            num_train_epochs: 3
            per_device_train_batch_size: 8
            data_max_length: 512
        input:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_dataset"
        output:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_model"
        deploy:
            type: k8s
            args:
                kind: deployment
                name: my_fine_tuner
                context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                namespace: geniusrise
                image: geniusrise/geniusrise
                kube_config_path: ~/.kube/config
```
## Supported Data Formats
- JSONL
- CSV
- Parquet
- JSON
- XML
- YAML
- TSV
- Excel (.xls, .xlsx)
- SQLite (.db)
- Feather
## `load_dataset(dataset_path, **kwargs)`

Load a language modeling dataset from a directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset_path` | `str` | The path to the dataset directory. | required |
Returns:

Name | Type | Description |
---|---|---|
Dataset | `Union[Dataset, DatasetDict, Optional[Dataset]]` | The loaded dataset. |
Raises:

Type | Description |
---|---|
`Exception` | If there was an error loading the dataset. |
### Supported Data Formats and Structures

#### Dataset files saved by the Hugging Face datasets library
The directory should contain `dataset_info.json` and other related files.

#### JSONL
Each line is a JSON object representing an example.

#### CSV
Should contain a 'text' column.

#### Parquet
Should contain a 'text' column.

#### JSON
An array of dictionaries, each with a 'text' key.

#### XML
Each 'record' element should contain a 'text' child element.

#### YAML
Each document should be a dictionary with a 'text' key.

#### TSV
Should contain a 'text' column, separated by tabs.

#### Excel (.xls, .xlsx)
Should contain a 'text' column.

#### SQLite (.db)
Should contain a table with a 'text' column.

#### Feather
Should contain a 'text' column.
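To make the JSONL structure above concrete, the following self-contained Python sketch writes a small JSONL dataset file (one JSON object per line, each with a 'text' key) and reads it back. The file name and the plain-dict loader are illustrative only; they are not part of the bolt's API.

```python
import json
import tempfile
from pathlib import Path

# A tiny JSONL dataset: one JSON object per line, each with a "text" key,
# mirroring the structure documented for JSONL inputs above.
examples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Language models predict the next token in a sequence."},
]

dataset_dir = Path(tempfile.mkdtemp())
jsonl_path = dataset_dir / "train.jsonl"
with jsonl_path.open("w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Minimal loader sketch: parse each non-empty line back into a dict.
with jsonl_path.open() as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))        # 2
print(loaded[0]["text"])  # The quick brown fox jumps over the lazy dog.
```

The same 'text'-keyed records translate directly to the other tabular formats (a 'text' column in CSV/TSV/Parquet, a 'text' key per document in YAML, and so on).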
## `prepare_fine_tuning_data(data, data_type)`

Prepare the loaded dataset for fine-tuning.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`data` | `Union[Dataset, DatasetDict, Optional[Dataset]]` | The dataset to prepare for fine-tuning. | required |
`data_type` | `str` | The type of data being prepared (e.g. "train" or "eval"). | required |
Returns:

Name | Type | Description |
---|---|---|
Dataset | `None` | The prepared dataset. |
Raises:

Type | Description |
---|---|
`Exception` | If there was an error preparing the dataset. |
### Supported Data Formats and Structures

The supported data formats and structures are the same as for `load_dataset` above.
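The `masked` and `max_length` options mentioned earlier correspond to standard masked-language-modeling preparation: truncate each example to a maximum length, then replace a random subset of tokens with a mask token. The sketch below is not the bolt's implementation; it uses naive whitespace tokenization and an illustrative `[MASK]` token purely to show the idea, where real fine-tuning would use the model's own tokenizer.

```python
import random

MASK_TOKEN = "[MASK]"  # illustrative mask token, not tied to any real tokenizer


def prepare_example(text: str, masked: bool = True, max_length: int = 512,
                    mask_prob: float = 0.15, seed: int = 0) -> list:
    """Truncate a whitespace-tokenized example and optionally mask tokens.

    A toy stand-in for tokenizer-based preparation: truncation enforces
    max_length, and each kept token is independently replaced by MASK_TOKEN
    with probability mask_prob when masked=True.
    """
    tokens = text.split()[:max_length]
    if not masked:
        return tokens
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


tokens = prepare_example("language models learn from large text corpora", max_length=4)
print(tokens)  # at most 4 tokens, some possibly replaced by [MASK]
```

With `masked=False` the function degenerates to plain truncation, which matches causal (non-masked) language modeling preparation, where the raw token sequence itself is the training target.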