Instruction Tuning Fine Tuner

Bases: OpenAIFineTuner

A bolt that uses the OpenAI API to fine-tune a pre-trained model on instruction-following tasks.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input | BatchInput | The batch input data. | required |
| output | BatchOutput | The output data. | required |
| state | State | The state manager. | required |

CLI Usage:

    genius OpenAIInstructionFineTuner rise \
        batch \
            --input_s3_bucket geniusrise-test \
            --input_s3_folder train \
        batch \
            --output_s3_bucket geniusrise-test \
            --output_s3_folder model \
        fine_tune \
            --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8

YAML Configuration:

    version: "1"
    bolts:
        my_fine_tuner:
            name: "OpenAIInstructionFineTuner"
            method: "fine_tune"
            args:
                model_name: "my_model"
                tokenizer_name: "my_tokenizer"
                num_train_epochs: 3
                per_device_train_batch_size: 8
                data_max_length: 512
            input:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_dataset"
            output:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_model"
            deploy:
                type: k8s
                args:
                    kind: deployment
                    name: my_fine_tuner
                    context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                    namespace: geniusrise
                    image: geniusrise/geniusrise
                    kube_config_path: ~/.kube/config

Supported Data Formats:

  • JSONL
  • CSV
  • Parquet
  • JSON
  • XML
  • YAML
  • TSV
  • Excel (.xls, .xlsx)
  • SQLite (.db)
  • Feather

load_dataset(dataset_path, **kwargs)

Load an instruction-following dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_path | str | The path to the dataset directory. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Name | Type | Description |
|------|------|-------------|
| Dataset | Union[Dataset, DatasetDict, Optional[Dataset]] | The loaded dataset. |

Raises:

| Type | Description |
|------|-------------|
| Exception | If there was an error loading the dataset. |
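
As a rough illustration of what loading from a directory can involve, the sketch below picks a parser by file extension and re-raises any read or parse failure with the file name attached. It uses only the Python standard library and covers JSONL, JSON, CSV and TSV. The function name `load_records` is hypothetical, and this is not the bolt's actual implementation.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def load_records(dataset_path: str) -> List[Dict]:
    """Hypothetical helper: read every recognized file in a directory into a list of dicts."""
    records: List[Dict] = []
    for path in sorted(Path(dataset_path).iterdir()):
        suffix = path.suffix.lower()
        try:
            if suffix == ".jsonl":
                # One JSON object per non-empty line.
                with path.open(encoding="utf-8") as f:
                    records.extend(json.loads(line) for line in f if line.strip())
            elif suffix == ".json":
                # A single JSON array of objects.
                with path.open(encoding="utf-8") as f:
                    records.extend(json.load(f))
            elif suffix in (".csv", ".tsv"):
                delimiter = "\t" if suffix == ".tsv" else ","
                with path.open(newline="", encoding="utf-8") as f:
                    records.extend(csv.DictReader(f, delimiter=delimiter))
            # Files with other extensions are skipped in this sketch.
        except (OSError, ValueError, csv.Error) as exc:
            # RuntimeError is a subclass of Exception, matching the documented contract.
            raise RuntimeError(f"Error loading {path.name}: {exc}") from exc
    return records
```

Calling `load_records("path/to/dataset")` on a directory that holds a malformed JSON file raises an error naming that file rather than returning a partial result.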

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing an example.

CSV

Should contain 'instruction' and 'output' columns.

Parquet

Should contain 'instruction' and 'output' columns.

JSON

An array of dictionaries with 'instruction' and 'output' keys.

XML

Each 'record' element should contain 'instruction' and 'output' child elements.

YAML

Each document should be a dictionary with 'instruction' and 'output' keys.

TSV

Should contain 'instruction' and 'output' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'instruction' and 'output' columns.

SQLite (.db)

Should contain a table with 'instruction' and 'output' columns.

Feather

Should contain 'instruction' and 'output' columns.
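
To make the expected structure concrete, the following sketch writes the same two examples as JSONL and as CSV, then reads them back and checks that every record carries the 'instruction' and 'output' fields described above. The file names, the sample texts and the `missing_fields` helper are illustrative assumptions, and the two files duplicate each other only to show both shapes.

```python
import csv
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("instruction", "output")

examples = [
    {"instruction": "Translate 'hello' to French.", "output": "bonjour"},
    {"instruction": "Add 2 and 3.", "output": "5"},
]


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


dataset_dir = Path(tempfile.mkdtemp())

# JSONL: one JSON object per line.
jsonl_path = dataset_dir / "train.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# CSV: a header row with 'instruction' and 'output' columns.
csv_path = dataset_dir / "train.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS)
    writer.writeheader()
    writer.writerows(examples)

# Read both files back and confirm each record is complete.
with jsonl_path.open(encoding="utf-8") as f:
    jsonl_records = [json.loads(line) for line in f]
with csv_path.open(newline="", encoding="utf-8") as f:
    csv_records = list(csv.DictReader(f))

assert all(not missing_fields(r) for r in jsonl_records + csv_records)
```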

prepare_fine_tuning_data(data, data_type)

Prepare the given data for fine-tuning.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| data | Union[Dataset, DatasetDict, Optional[Dataset]] | The dataset to prepare. | required |
| data_type | str | Either 'train' or 'eval' to specify the type of data. | required |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If data_type is not 'train' or 'eval'. |
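
This page does not describe the file layout the method produces. As a hedged sketch of the general idea, the code below validates `data_type` and writes instruction/output records as JSONL with `prompt` and `completion` keys, the layout used by OpenAI's legacy completion fine-tuning format. The function name `prepare_records`, the output file naming and the choice of prompt/completion keys are assumptions for illustration, not the bolt's confirmed behavior.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def prepare_records(records: Iterable[Dict[str, str]], data_type: str, out_dir: str) -> Path:
    """Hypothetical helper: write records to <out_dir>/<data_type>.jsonl."""
    if data_type not in ("train", "eval"):
        raise ValueError("data_type must be either 'train' or 'eval'")

    out_path = Path(out_dir) / f"{data_type}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            # Assumed mapping: instruction -> prompt, output -> completion.
            row = {"prompt": record["instruction"], "completion": record["output"]}
            f.write(json.dumps(row) + "\n")
    return out_path


out_file = prepare_records(
    [{"instruction": "Add 2 and 3.", "output": "5"}],
    data_type="train",
    out_dir=tempfile.mkdtemp(),
)
print(out_file.read_text(encoding="utf-8"))
```

Passing any `data_type` other than 'train' or 'eval' raises `ValueError` before any file is written, mirroring the contract documented above.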