Language Model
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on language modeling tasks.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The batch input data. | required |
output | OutputConfig | The output data. | required |
state | State | The state manager. | required |
CLI Usage:

```bash
genius LanguageModelFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8 \
            data_max_length=512
```
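The same run can also be sketched from Python. The sketch below is an assumption: the class, method, and argument names are taken from this page and the CLI example above, but the import paths and the BatchInput / OutputConfig / State constructor signatures are illustrative guesses and may differ in your geniusrise version.

```python
# Sketch only: import paths and the BatchInput / OutputConfig / State constructor
# arguments are assumptions; the fine_tune keyword arguments mirror the CLI --args.
from geniusrise.core import BatchInput, OutputConfig, State   # assumed module path
from geniusrise_text import LanguageModelFineTuner            # assumed module path

bolt = LanguageModelFineTuner(
    input=BatchInput("input/lm", "geniusrise-test", "input/lm"),       # assumed signature
    output=OutputConfig("output/lm", "geniusrise-test", "output/lm"),  # assumed signature
    state=State(),                                                     # assumed signature
)

bolt.fine_tune(
    model_name="my_model",
    tokenizer_name="my_tokenizer",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    data_max_length=512,
)
```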
compute_metrics(eval_pred)

Compute evaluation metrics for the model's predictions.
This method takes the model's predictions and ground truth labels, converts them to text, and then computes the BLEU score for evaluation.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
eval_pred | EvalPrediction | A named tuple containing the model predictions and label IDs. | required |
Returns:

Type | Description |
---|---|
Optional[Dict[str, float]] | A dictionary containing the BLEU score, or None if an exception occurs. |
Raises:

Type | Description |
---|---|
Exception | If the tokenizer is not initialized. |
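To make the decode-then-score flow concrete, here is a minimal sketch of what such a metric function typically looks like. It assumes a Hugging Face tokenizer is available and uses the `evaluate` library's `bleu` metric as a stand-in; this page does not say which BLEU implementation the bolt uses, so treat the metric call as an assumption.

```python
# Minimal sketch of a BLEU-based compute_metrics (not necessarily this bolt's code).
import numpy as np
import evaluate
from transformers import AutoTokenizer, EvalPrediction

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
bleu = evaluate.load("bleu")                       # assumed metric backend

def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred.predictions, eval_pred.label_ids
    # Logits -> token ids, then token ids -> text.
    pred_ids = np.argmax(predictions, axis=-1)
    # Replace the -100 ignore index in the labels before decoding.
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    labels = np.where(labels != -100, labels, pad_id)
    pred_texts = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=pred_texts, references=[[t] for t in label_texts])
    return {"bleu": result["bleu"]}
```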
data_collator(examples)

Customize the data collator.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
examples |  | The examples to collate. | required |
Returns:

Type | Description |
---|---|
dict | The collated data. |
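For language modeling, the collation step usually pads a batch of tokenized examples and builds the `labels` tensor. The snippet below is a sketch of that common pattern using transformers' `DataCollatorForLanguageModeling`; it illustrates the technique rather than this bolt's exact internals.

```python
# Sketch of a typical language-modeling collator (not necessarily this bolt's):
# mlm=False yields causal-LM labels, mlm=True yields masked-LM labels.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token by default

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# `examples` is a list of tokenized features, e.g. from prepare_train_features.
examples = [tokenizer("hello world"), tokenizer("language modeling example")]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```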
load_dataset(dataset_path, masked=False, max_length=512, **kwargs)

Load a language modeling dataset from a directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory. | required |
masked | bool | Whether to use masked language modeling. Defaults to False. | False |
max_length | int | The maximum length for tokenization. Defaults to 512. | 512 |
Returns:

Type | Description |
---|---|
Dataset | The loaded dataset. |
Raises:

Type | Description |
---|---|
Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:

- Hugging Face datasets directory: files saved by the Hugging Face datasets library; the directory should contain 'dataset_info.json' and other related files.
- JSONL: each line is a JSON object representing an example.
- CSV: should contain a 'text' column.
- Parquet: should contain a 'text' column.
- JSON: an array of dictionaries with a 'text' key.
- XML: each 'record' element should contain a 'text' child element.
- YAML: each document should be a dictionary with a 'text' key.
- TSV: should contain a 'text' column, separated by tabs.
- Excel (.xls, .xlsx): should contain a 'text' column.
- SQLite (.db): should contain a table with a 'text' column.
- Feather: should contain a 'text' column.
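As an illustration of the simplest of these layouts, a JSONL file with a 'text' field and the corresponding load call might look like the sketch below; the directory path and file name are hypothetical, and `bolt` refers to the Python sketch near the top of this page.

```python
# Hypothetical layout: /data/lm/train.jsonl, one JSON object per line, e.g.
#   {"text": "The quick brown fox jumps over the lazy dog."}
#   {"text": "Language models are trained on plain text."}

dataset = bolt.load_dataset(
    dataset_path="/data/lm",   # directory containing the file above (hypothetical path)
    masked=False,              # causal language modeling
    max_length=512,            # maximum tokenized length
)
```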
prepare_train_features(examples)

Tokenize the examples and prepare the features for training.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
examples | dict | A dictionary of examples. | required |
Returns:

Type | Description |
---|---|
dict | The processed features. |
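For orientation, preparing causal-LM features usually comes down to tokenizing the raw text and copying the input ids into the labels. The sketch below shows that common pattern; it is an assumption about the general technique, not this class's exact implementation.

```python
# Common causal-LM feature preparation (a sketch, not this class's exact code):
# tokenize the "text" column and let labels mirror input_ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token by default

def prepare_train_features(examples: dict, max_length: int = 512) -> dict:
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # Causal LMs shift the labels internally, so labels are a copy of the input ids.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Typically applied with datasets.Dataset.map(prepare_train_features, batched=True).
```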