Language Model
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on language modeling tasks.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The batch input data. | required |
output | OutputConfig | The output data. | required |
state | State | The state manager. | required |
CLI Usage:

```bash
genius LanguageModelFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8 \
            data_max_length=512
```
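The same run can also be sketched from Python. The sketch below is an assumption: the class, method, and argument names are taken from this page and the CLI example above, but the import paths and the BatchInput / OutputConfig / State constructor signatures are illustrative guesses and may differ in your geniusrise version.

```python
# Sketch only: import paths and the BatchInput / OutputConfig / State constructor
# arguments are assumptions; the fine_tune keyword arguments mirror the CLI --args.
from geniusrise.core import BatchInput, OutputConfig, State   # assumed module path
from geniusrise_text import LanguageModelFineTuner            # assumed module path

bolt = LanguageModelFineTuner(
    input=BatchInput("input/lm", "geniusrise-test", "input/lm"),       # assumed signature
    output=OutputConfig("output/lm", "geniusrise-test", "output/lm"),  # assumed signature
    state=State(),                                                     # assumed signature
)

bolt.fine_tune(
    model_name="my_model",
    tokenizer_name="my_tokenizer",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    data_max_length=512,
)
```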
compute_metrics(eval_pred)

Compute evaluation metrics for the model's predictions.
This method takes the model's predictions and ground truth labels, converts them to text, and then computes the BLEU score for evaluation.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
eval_pred | EvalPrediction | A named tuple containing the model predictions and label IDs. | required |
Returns:

Type | Description |
---|---|
Optional[Dict[str, float]] | A dictionary containing the BLEU score, or None if an exception occurs. |
Raises:

Type | Description |
---|---|
Exception | If the tokenizer is not initialized. |
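To make the decode-then-score flow concrete, here is a minimal sketch of what such a metric function typically looks like. It assumes a Hugging Face tokenizer is available and uses the `evaluate` library's `bleu` metric as a stand-in; this page does not say which BLEU implementation the bolt uses, so treat the metric call as an assumption.

```python
# Minimal sketch of a BLEU-based compute_metrics (not necessarily this bolt's code).
import numpy as np
import evaluate
from transformers import AutoTokenizer, EvalPrediction

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
bleu = evaluate.load("bleu")                       # assumed metric backend

def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred.predictions, eval_pred.label_ids
    # Logits -> token ids, then token ids -> text.
    pred_ids = np.argmax(predictions, axis=-1)
    # Replace the -100 ignore index in the labels before decoding.
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    labels = np.where(labels != -100, labels, pad_id)
    pred_texts = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=pred_texts, references=[[t] for t in label_texts])
    return {"bleu": result["bleu"]}
```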
data_collator(examples)

Customize the data collator.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
examples |  | The examples to collate. | required |
Returns:

Type | Description |
---|---|
dict | The collated data. |
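For language modeling, the collation step usually pads a batch of tokenized examples and builds the `labels` tensor. The snippet below is a sketch of that common pattern using transformers' `DataCollatorForLanguageModeling`; it illustrates the technique rather than this bolt's exact internals.

```python
# Sketch of a typical language-modeling collator (not necessarily this bolt's):
# mlm=False yields causal-LM labels, mlm=True yields masked-LM labels.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token by default

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# `examples` is a list of tokenized features, e.g. from prepare_train_features.
examples = [tokenizer("hello world"), tokenizer("language modeling example")]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```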
load_dataset(dataset_path, masked=False, max_length=512, **kwargs)

Load a language modeling dataset from a directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory. | required |
masked | bool | Whether to use masked language modeling. Defaults to False. | False |
max_length | int | The maximum length for tokenization. Defaults to 512. | 512 |
Returns:

Type | Description |
---|---|
Dataset | The loaded dataset. |
Raises:

Type | Description |
---|---|
Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:

- Hugging Face datasets directory: files saved by the Hugging Face datasets library; the directory should contain 'dataset_info.json' and other related files.
- JSONL: each line is a JSON object representing an example.
- CSV: should contain a 'text' column.
- Parquet: should contain a 'text' column.
- JSON: an array of dictionaries with a 'text' key.
- XML: each 'record' element should contain a 'text' child element.
- YAML: each document should be a dictionary with a 'text' key.
- TSV: should contain a 'text' column, separated by tabs.
- Excel (.xls, .xlsx): should contain a 'text' column.
- SQLite (.db): should contain a table with a 'text' column.
- Feather: should contain a 'text' column.
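As an illustration of the simplest of these layouts, a JSONL file with a 'text' field and the corresponding load call might look like the sketch below; the directory path and file name are hypothetical, and `bolt` refers to the Python sketch near the top of this page.

```python
# Hypothetical layout: /data/lm/train.jsonl, one JSON object per line, e.g.
#   {"text": "The quick brown fox jumps over the lazy dog."}
#   {"text": "Language models are trained on plain text."}

dataset = bolt.load_dataset(
    dataset_path="/data/lm",   # directory containing the file above (hypothetical path)
    masked=False,              # causal language modeling
    max_length=512,            # maximum tokenized length
)
```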
prepare_train_features(examples)

Tokenize the examples and prepare the features for training.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
examples | dict | A dictionary of examples. | required |
Returns:

Type | Description |
---|---|
dict | The processed features. |
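For orientation, preparing causal-LM features usually comes down to tokenizing the raw text and copying the input ids into the labels. The sketch below shows that common pattern; it is an assumption about the general technique, not this class's exact implementation.

```python
# Common causal-LM feature preparation (a sketch, not this class's exact code):
# tokenize the "text" column and let labels mirror input_ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer
tokenizer.pad_token = tokenizer.eos_token           # gpt2 has no pad token by default

def prepare_train_features(examples: dict, max_length: int = 512) -> dict:
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # Causal LMs shift the labels internally, so labels are a copy of the input ids.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Typically applied with datasets.Dataset.map(prepare_train_features, batched=True).
```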