Instruction Tuning Fine Tuner¶

Bases: OpenAIFineTuner

A bolt for fine-tuning OpenAI models on instruction following tasks.

This bolt uses the OpenAI API to fine-tune a pre-trained model for instruction following tasks.

Parameters:

Name	Type	Description	Default
`input`	`BatchInput`	The batch input data.	required
`output`	`BatchOutput`	The output data.	required
`state`	`State`	The state manager.	required

CLI Usage:

    genius HuggingFaceCommonsenseReasoningFineTuner rise \
        batch \
            --input_s3_bucket geniusrise-test \
            --input_s3_folder train \
        batch \
            --output_s3_bucket geniusrise-test \
            --output_s3_folder model \
        fine_tune \
            --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8

YAML Configuration:

    version: "1"
    bolts:
        my_fine_tuner:
            name: "HuggingFaceCommonsenseReasoningFineTuner"
            method: "fine_tune"
            args:
                model_name: "my_model"
                tokenizer_name: "my_tokenizer"
                num_train_epochs: 3
                per_device_train_batch_size: 8
                data_max_length: 512
            input:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_dataset"
            output:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_model"
            deploy:
                type: k8s
                args:
                    kind: deployment
                    name: my_fine_tuner
                    context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                    namespace: geniusrise
                    image: geniusrise/geniusrise
                    kube_config_path: ~/.kube/config

Supported Data Formats

JSONL
CSV
Parquet
JSON
XML
YAML
TSV
Excel (.xls, .xlsx)
SQLite (.db)
Feather

`load_dataset(dataset_path, **kwargs)` ¶

Load an instruction following dataset from a directory.

Parameters:

Name	Type	Description	Default
`dataset_path`	`str`	The path to the dataset directory.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Returns:

Name	Type	Description
`Dataset`	`Union[Dataset, DatasetDict, Optional[Dataset]]`	The loaded dataset.

Raises:

Type	Description
`Exception`	If there was an error loading the dataset.

Supported Data Formats and Structures:¶

Hugging Face Dataset¶

Dataset files saved by the Hugging Face datasets library.

JSONL¶

Each line is a JSON object representing an example.

CSV¶

Should contain 'instruction' and 'output' columns.

Parquet¶

Should contain 'instruction' and 'output' columns.

JSON¶

An array of dictionaries with 'instruction' and 'output' keys.

XML¶

Each 'record' element should contain 'instruction' and 'output' child elements.

YAML¶

Each document should be a dictionary with 'instruction' and 'output' keys.

TSV¶

Should contain 'instruction' and 'output' columns separated by tabs.

Excel (.xls, .xlsx)¶

Should contain 'instruction' and 'output' columns.

SQLite (.db)¶

Should contain a table with 'instruction' and 'output' columns.

Feather¶

Should contain 'instruction' and 'output' columns.

`prepare_fine_tuning_data(data, data_type)` ¶

Prepare the given data for fine-tuning.

Parameters:

Name	Type	Description	Default
`data`	`Union[Dataset, DatasetDict, Optional[Dataset]]`	The dataset to prepare.	required
`data_type`	`str`	Either 'train' or 'eval' to specify the type of data.	required

Raises:

Type	Description
`ValueError`	If data_type is not 'train' or 'eval'.

Instruction Tuning Fine Tuner¶

load_dataset(dataset_path, **kwargs) ¶

Supported Data Formats and Structures:¶

Hugging Face Dataset¶

JSONL¶

CSV¶

Parquet¶

JSON¶

XML¶

YAML¶

TSV¶

Excel (.xls, .xlsx)¶

SQLite (.db)¶

Feather¶

prepare_fine_tuning_data(data, data_type) ¶

`load_dataset(dataset_path, **kwargs)` ¶

`prepare_fine_tuning_data(data, data_type)` ¶