Instruction Tuning¶
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on instruction tuning tasks.
This class inherits from TextFineTuner and specializes in fine-tuning models for instruction-based tasks. It provides additional methods for loading and preparing datasets in various formats, as well as for computing custom metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | The batch input data. | required |
| `output` | `OutputConfig` | The output data. | required |
| `state` | `State` | The state manager. | required |
Attributes:

| Name | Type | Description |
|---|---|---|
| `max_length` | `int` | The maximum length for tokenization. |
CLI Usage:

```bash
genius InstructionFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8 \
            data_max_length=512
```
compute_metrics(eval_pred) ¶
Compute evaluation metrics for the model's predictions.
This method takes the model's predictions and ground truth labels, converts them to text, and then computes the BLEU score for evaluation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `eval_pred` | `EvalPrediction` | A named tuple containing the model's predictions and labels. | required |
Returns:

| Type | Description |
|---|---|
| `Optional[Dict[str, float]]` | A dictionary containing the BLEU score, or `None` if an exception occurs. |
Raises:

| Type | Description |
|---|---|
| `Exception` | If the tokenizer is not initialized. |
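The decode-then-score flow can be sketched in pure Python. The small BLEU implementation below (geometric mean of smoothed n-gram precisions with a brevity penalty) and the function names are illustrative stand-ins, not the library's actual code, which operates on decoded tokenizer output:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence BLEU: geometric mean of modified n-gram precisions
    (add-one smoothing) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order doesn't zero the score
        precisions.append((overlap + 1) / (total + 1))
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def compute_metrics_sketch(decoded_preds, decoded_labels):
    """Mirror of compute_metrics once token ids are decoded to text."""
    scores = [bleu(p.split(), l.split())
              for p, l in zip(decoded_preds, decoded_labels)]
    return {"bleu": sum(scores) / len(scores)}
```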
load_dataset(dataset_path, max_length=512, **kwargs) ¶
Load an instruction tuning dataset from a directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `max_length` | `int` | The maximum length for tokenization. | `512` |
Returns:

| Name | Type | Description |
|---|---|---|
| `Dataset` | `Union[Dataset, Dict]` | The loaded dataset. |
Raises:

| Type | Description |
|---|---|
| `Exception` | If there was an error loading the dataset. |
Supported Data Formats and Structures:¶
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'instruction' and 'output' columns.
Parquet¶
Should contain 'instruction' and 'output' columns.
JSON¶
An array of dictionaries with 'instruction' and 'output' keys.
XML¶
Each 'record' element should contain 'instruction' and 'output' child elements.
YAML¶
Each document should be a dictionary with 'instruction' and 'output' keys.
TSV¶
Should contain 'instruction' and 'output' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'instruction' and 'output' columns.
SQLite (.db)¶
Should contain a table with 'instruction' and 'output' columns.
Feather¶
Should contain 'instruction' and 'output' columns.
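For illustration, a minimal loader covering two of the formats above (JSONL and JSON) might look like the sketch below. This is not the library's implementation; only the `instruction`/`output` record shape from the format list is assumed:

```python
import json
from pathlib import Path

def load_instruction_examples(dataset_path):
    """Collect {'instruction': ..., 'output': ...} records from every
    .jsonl and .json file in a directory (sketch of two formats only)."""
    examples = []
    for path in sorted(Path(dataset_path).iterdir()):
        if path.suffix == ".jsonl":
            # JSONL: each non-empty line is one JSON object
            with path.open() as f:
                examples.extend(json.loads(line) for line in f if line.strip())
        elif path.suffix == ".json":
            # JSON: a single array of dictionaries
            with path.open() as f:
                examples.extend(json.load(f))
    return examples
```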
prepare_train_features(examples) ¶
Tokenize the examples and prepare the features for training.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `dict` | A dictionary of examples. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `Dict` | The processed features. |
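The shape of the feature preparation can be sketched with a toy whitespace tokenizer. The real method uses the model's Hugging Face tokenizer and its padding/truncation settings; the function and field names below are illustrative:

```python
def prepare_train_features_sketch(examples, max_length=512):
    """Concatenate instruction and output, 'tokenize' by whitespace,
    truncate/pad to max_length, and mirror the input ids as labels
    (causal-LM style). A toy vocabulary stands in for a real tokenizer."""
    vocab = {}
    def encode(text):
        # assign each new token the next integer id (0 reserved for padding)
        return [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]

    input_ids, labels = [], []
    for instr, out in zip(examples["instruction"], examples["output"]):
        ids = encode(instr + " " + out)[:max_length]
        ids += [0] * (max_length - len(ids))   # right-pad to max_length
        input_ids.append(ids)
        labels.append(list(ids))               # labels mirror the inputs
    return {"input_ids": input_ids, "labels": labels}
```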