Instruction Tuning

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on instruction tuning tasks.

This class inherits from TextFineTuner and specializes in fine-tuning models for instruction-based tasks. It provides additional methods for loading and preparing datasets in various formats, as well as computing custom metrics.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input | BatchInput | The batch input data. | required |
| output | OutputConfig | The output data. | required |
| state | State | The state manager. | required |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| max_length | int | The maximum length for tokenization. |

CLI Usage:

    genius InstructionFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        --id mistralai/Mistral-7B-Instruct-v0.1-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8 \
                data_max_length=512

compute_metrics(eval_pred)

Compute evaluation metrics for the model's predictions.

This method takes the model's predictions and ground truth labels, converts them to text, and then computes the BLEU score for evaluation.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| eval_pred | EvalPrediction | A named tuple containing `predictions` (the logits predicted by the model, of shape (batch_size, sequence_length, num_classes)) and `label_ids` (the ground truth labels, of shape (batch_size, sequence_length)). | required |

Returns:

| Type | Description |
|------|-------------|
| Optional[Dict[str, float]] | A dictionary containing the BLEU score. Returns None if an exception occurs. |

Raises:

| Type | Description |
|------|-------------|
| Exception | If the tokenizer is not initialized. |
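
The sketch below illustrates how such a BLEU-based metric computation can be written, assuming a Hugging Face tokenizer has already been initialized and the `evaluate` library is available. The function name, the tokenizer id, and the handling of masked label positions are illustrative assumptions, not the library's exact implementation.

    # Hypothetical sketch of a BLEU-based compute_metrics; not the library's exact code.
    from typing import Dict, Optional

    import numpy as np
    import evaluate
    from transformers import AutoTokenizer, EvalPrediction

    tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")  # assumed tokenizer id
    bleu = evaluate.load("bleu")

    def compute_metrics_sketch(eval_pred: EvalPrediction) -> Optional[Dict[str, float]]:
        predictions, label_ids = eval_pred.predictions, eval_pred.label_ids
        # Turn logits of shape (batch, seq_len, num_classes) into token ids.
        pred_ids = np.argmax(predictions, axis=-1)
        # Replace the -100 used to mask ignored label positions before decoding.
        label_ids = np.where(label_ids == -100, tokenizer.pad_token_id, label_ids)
        pred_text = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
        label_text = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
        try:
            result = bleu.compute(
                predictions=pred_text,
                references=[[text] for text in label_text],
            )
            return {"bleu": result["bleu"]}
        except Exception:
            return None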

load_dataset(dataset_path, max_length=512, **kwargs)

Load an instruction tuning dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_path | str | The path to the dataset directory. | required |
| max_length | int | The maximum length for tokenization. | 512 |

Returns:

| Name | Type | Description |
|------|------|-------------|
| Dataset | Union[Dataset, Dict] | The loaded dataset. |

Raises:

| Type | Description |
|------|-------------|
| Exception | If there was an error loading the dataset. |

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"instruction": "The instruction", "output": "The output"}

CSV

Should contain 'instruction' and 'output' columns.

instruction,output
"The instruction","The output"

Parquet

Should contain 'instruction' and 'output' columns.

JSON

An array of dictionaries with 'instruction' and 'output' keys.

[{"instruction": "The instruction", "output": "The output"}]

XML

Each 'record' element should contain 'instruction' and 'output' child elements.

<record>
    <instruction>The instruction</instruction>
    <output>The output</output>
</record>

YAML

Each document should be a dictionary with 'instruction' and 'output' keys.

- instruction: "The instruction"
  output: "The output"

TSV

Should contain 'instruction' and 'output' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'instruction' and 'output' columns.

SQLite (.db)

Should contain a table with 'instruction' and 'output' columns.

Feather

Should contain 'instruction' and 'output' columns.
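
As a minimal end-to-end sketch of the formats above, the snippet below writes a small JSONL dataset to a directory and loads it. The directory layout and the `tuner` variable (an already-constructed instruction fine-tuner instance) are illustrative assumptions.

    import json
    import os

    os.makedirs("./input/dataset", exist_ok=True)
    examples = [
        {"instruction": "Summarize the text.", "output": "A short summary."},
        {"instruction": "Translate 'hello' to French.", "output": "Bonjour."},
    ]
    with open("./input/dataset/train.jsonl", "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")

    # Assuming `tuner` exposes the load_dataset method documented above.
    dataset = tuner.load_dataset("./input/dataset", max_length=512)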

prepare_train_features(examples)

Tokenize the examples and prepare the features for training.

Parameters:

Name Type Description Default
examples dict

A dictionary of examples.

required

Returns:

Name Type Description
dict Dict

The processed features.
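
A rough sketch of what such feature preparation typically looks like for instruction tuning is shown below: each instruction is joined with its expected output, tokenized to a fixed length, and the input ids are mirrored as labels for causal language modeling. The prompt template, tokenizer id, and label handling here are assumptions for illustration, not the library's exact behavior.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")  # assumed tokenizer id
    max_length = 512

    def prepare_train_features_sketch(examples: dict) -> dict:
        # Join each instruction with its expected output into one training text.
        texts = [
            f"{instruction}\n{output}"
            for instruction, output in zip(examples["instruction"], examples["output"])
        ]
        features = tokenizer(
            texts,
            max_length=max_length,
            truncation=True,
            padding="max_length",
        )
        # For causal language modeling, labels usually mirror the input ids.
        features["labels"] = features["input_ids"].copy()
        return features

    # Typically applied over a datasets.Dataset via .map(prepare_train_features_sketch, batched=True).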