# Language Model Fine Tuner

Bases: `OpenAIFineTuner`

A bolt for fine-tuning OpenAI models on language modeling tasks. This bolt uses the OpenAI API to fine-tune a pre-trained model for language modeling.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input` | `BatchInput` | The batch input data. | required |
`output` | `BatchOutput` | The output data. | required |
`state` | `State` | The state manager. | required |
`**kwargs` | | Additional keyword arguments. | required |
CLI Usage:

```bash
genius HuggingFaceCommonsenseReasoningFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder train \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder model \
    fine_tune \
        --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8
```
YAML Configuration:

```yaml
version: "1"
bolts:
    my_fine_tuner:
        name: "HuggingFaceCommonsenseReasoningFineTuner"
        method: "fine_tune"
        args:
            model_name: "my_model"
            tokenizer_name: "my_tokenizer"
            num_train_epochs: 3
            per_device_train_batch_size: 8
            data_max_length: 512
        input:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_dataset"
        output:
            type: "batch"
            args:
                bucket: "my_bucket"
                folder: "my_model"
        deploy:
            type: k8s
            args:
                kind: deployment
                name: my_fine_tuner
                context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                namespace: geniusrise
                image: geniusrise/geniusrise
                kube_config_path: ~/.kube/config
```
## Supported Data Formats
- JSONL
- CSV
- Parquet
- JSON
- XML
- YAML
- TSV
- Excel (.xls, .xlsx)
- SQLite (.db)
- Feather
## `load_dataset(dataset_path, **kwargs)`

Load a language modeling dataset from a directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`dataset_path` | `str` | The path to the dataset directory. | required |
Returns:

Name | Type | Description |
---|---|---|
Dataset | `Union[Dataset, DatasetDict, Optional[Dataset]]` | The loaded dataset. |
Raises:

Type | Description |
---|---|
`Exception` | If there was an error loading the dataset. |
### Supported Data Formats and Structures

#### Dataset files saved by the Hugging Face datasets library
The directory should contain `dataset_info.json` and other related files.

#### JSONL
Each line is a JSON object representing an example.

#### CSV
Should contain a 'text' column.

#### Parquet
Should contain a 'text' column.

#### JSON
An array of dictionaries, each with a 'text' key.

#### XML
Each 'record' element should contain a 'text' child element.

#### YAML
Each document should be a dictionary with a 'text' key.

#### TSV
Should contain a 'text' column, separated by tabs.

#### Excel (.xls, .xlsx)
Should contain a 'text' column.

#### SQLite (.db)
Should contain a table with a 'text' column.

#### Feather
Should contain a 'text' column.
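To make the JSONL structure above concrete, the following self-contained Python sketch writes a small JSONL dataset file (one JSON object per line, each with a 'text' key) and reads it back. The file name and the plain-dict loader are illustrative only; they are not part of the bolt's API.

```python
import json
import tempfile
from pathlib import Path

# A tiny JSONL dataset: one JSON object per line, each with a "text" key,
# mirroring the structure documented for JSONL inputs above.
examples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Language models predict the next token in a sequence."},
]

dataset_dir = Path(tempfile.mkdtemp())
jsonl_path = dataset_dir / "train.jsonl"
with jsonl_path.open("w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Minimal loader sketch: parse each non-empty line back into a dict.
with jsonl_path.open() as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))        # 2
print(loaded[0]["text"])  # The quick brown fox jumps over the lazy dog.
```

The same 'text'-keyed records translate directly to the other tabular formats (a 'text' column in CSV/TSV/Parquet, a 'text' key per document in YAML, and so on).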
## `prepare_fine_tuning_data(data, data_type)`

Prepare the loaded dataset for fine-tuning.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`data` | `Union[Dataset, DatasetDict, Optional[Dataset]]` | The dataset to prepare for fine-tuning. | required |
`data_type` | `str` | The type of data being prepared (e.g. "train" or "eval"). | required |
Returns:

Name | Type | Description |
---|---|---|
Dataset | `None` | The prepared dataset. |
Raises:

Type | Description |
---|---|
`Exception` | If there was an error preparing the dataset. |
### Supported Data Formats and Structures

The supported data formats and structures are the same as for `load_dataset` above.
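The `masked` and `max_length` options mentioned earlier correspond to standard masked-language-modeling preparation: truncate each example to a maximum length, then replace a random subset of tokens with a mask token. The sketch below is not the bolt's implementation; it uses naive whitespace tokenization and an illustrative `[MASK]` token purely to show the idea, where real fine-tuning would use the model's own tokenizer.

```python
import random

MASK_TOKEN = "[MASK]"  # illustrative mask token, not tied to any real tokenizer


def prepare_example(text: str, masked: bool = True, max_length: int = 512,
                    mask_prob: float = 0.15, seed: int = 0) -> list:
    """Truncate a whitespace-tokenized example and optionally mask tokens.

    A toy stand-in for tokenizer-based preparation: truncation enforces
    max_length, and each kept token is independently replaced by MASK_TOKEN
    with probability mask_prob when masked=True.
    """
    tokens = text.split()[:max_length]
    if not masked:
        return tokens
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    return [MASK_TOKEN if rng.random() < mask_prob else tok for tok in tokens]


tokens = prepare_example("language models learn from large text corpora", max_length=4)
print(tokens)  # at most 4 tokens, some possibly replaced by [MASK]
```

With `masked=False` the function degenerates to plain truncation, which matches causal (non-masked) language modeling preparation, where the raw token sequence itself is the training target.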