Base Fine Tuner

Bases: Bolt

A bolt for fine-tuning Hugging Face models.

This bolt uses the Hugging Face Transformers library to fine-tune a pre-trained model. It uses the Trainer class from the Transformers library to handle the training.
__init__(input, output, state, **kwargs)

Initialize the bolt.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The batch input data. | required |
output | BatchOutput | The output data. | required |
state | State | The state manager. | required |
evaluate | bool | Whether to evaluate the model. Defaults to False. | required |
**kwargs | | Additional keyword arguments. | {} |
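A minimal construction sketch follows. `MyFineTuner` is a hypothetical concrete subclass, and `batch_input`, `batch_output`, and `state` are assumed to be already-constructed `BatchInput`, `BatchOutput`, and `State` instances (their constructors are not documented in this section).

```python
# MyFineTuner is a hypothetical subclass of the base fine tuner, used only
# for illustration.
bolt = MyFineTuner(
    input=batch_input,    # a BatchInput instance pointing at the training data
    output=batch_output,  # a BatchOutput instance for the trained artifacts
    state=state,          # a State instance for run bookkeeping
    evaluate=True,        # also evaluate the model (defaults to False)
)
```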
compute_metrics(eval_pred)

Compute metrics for evaluation. This base implementation performs a simple classification evaluation; tasks should ideally override it.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
eval_pred | EvalPrediction | The evaluation predictions. | required |

Returns:

Name | Type | Description |
---|---|---|
dict | Optional[Dict[str, float]] | The computed metrics. |
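As a sketch of how a task might override this method, the subclass below computes accuracy and weighted F1 with scikit-learn. `EvalPrediction` is the standard Transformers container exposing `predictions` and `label_ids`; the base class name `BaseFineTuner` and the choice of metrics are assumptions.

```python
from typing import Dict, Optional

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import EvalPrediction


class MyClassificationFineTuner(BaseFineTuner):  # base class name assumed
    def compute_metrics(self, eval_pred: EvalPrediction) -> Optional[Dict[str, float]]:
        # eval_pred.predictions holds the logits, eval_pred.label_ids the gold labels.
        predictions = np.argmax(eval_pred.predictions, axis=-1)
        return {
            "accuracy": accuracy_score(eval_pred.label_ids, predictions),
            "f1": f1_score(eval_pred.label_ids, predictions, average="weighted"),
        }
```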
fine_tune(model_name, tokenizer_name, num_train_epochs, per_device_batch_size, model_class='AutoModel', tokenizer_class='AutoTokenizer', device_map='auto', precision='bfloat16', quantization=None, lora_config=None, use_accelerate=False, use_trl=False, accelerate_no_split_module_classes=[], compile=False, evaluate=False, save_steps=500, save_total_limit=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, map_data=None, use_huggingface_dataset=False, huggingface_dataset='', hf_repo_id=None, hf_commit_message=None, hf_token=None, hf_private=True, hf_create_pr=False, notification_email='', learning_rate=1e-05, **kwargs)

Fine-tunes a pre-trained Hugging Face model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_name | str | The name of the pre-trained model. | required |
tokenizer_name | str | The name of the pre-trained tokenizer. | required |
num_train_epochs | int | The total number of training epochs to perform. | required |
per_device_batch_size | int | The batch size per device during training. | required |
model_class | str | The model class to use. Defaults to "AutoModel". | 'AutoModel' |
tokenizer_class | str | The tokenizer class to use. Defaults to "AutoTokenizer". | 'AutoTokenizer' |
device_map | str \| dict | The device map for distributed training. Defaults to "auto". | 'auto' |
precision | str | The precision to use for training. Defaults to "bfloat16". | 'bfloat16' |
quantization | int | The quantization level to use for training. Defaults to None. | None |
lora_config | dict | Configuration for PEFT LoRA optimization. Defaults to None. | None |
use_accelerate | bool | Whether to use accelerate for distributed training. Defaults to False. | False |
use_trl | bool | Whether to use TRL for training. Defaults to False. | False |
accelerate_no_split_module_classes | List[str] | The module classes to not split during distributed training. Defaults to []. | [] |
evaluate | bool | Whether to evaluate the model after training. Defaults to False. | False |
compile | bool | Whether to compile the model before fine-tuning. Defaults to False. | False |
save_steps | int | Number of steps between checkpoints. Defaults to 500. | 500 |
save_total_limit | Optional[int] | Maximum number of checkpoints to keep. Older checkpoints are deleted. Defaults to None. | None |
load_best_model_at_end | bool | Whether to load the best model (according to evaluation) at the end of training. Defaults to False. | False |
metric_for_best_model | Optional[str] | The metric to use to compare models. Defaults to None. | None |
greater_is_better | Optional[bool] | Whether a larger value of the metric indicates a better model. Defaults to None. | None |
use_huggingface_dataset | bool | Whether to load a dataset from the Hugging Face Hub. Defaults to False. | False |
huggingface_dataset | str | The Hugging Face dataset to use. Defaults to ''. | '' |
map_data | Callable | A function to map data before training. Defaults to None. | None |
hf_repo_id | str | The Hugging Face repo ID. Defaults to None. | None |
hf_commit_message | str | The Hugging Face commit message. Defaults to None. | None |
hf_token | str | The Hugging Face token. Defaults to None. | None |
hf_private | bool | Whether to make the repo private. Defaults to True. | True |
hf_create_pr | bool | Whether to create a pull request. Defaults to False. | False |
notification_email | str | Email address to notify when the job is complete. Defaults to ''. | '' |
learning_rate | float | Learning rate for backpropagation. Defaults to 1e-05. | 1e-05 |
**kwargs | | Additional keyword arguments to pass to the model. | {} |

Returns:

Type | Description |
---|---|
None | |
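A hedged usage sketch: the call below fine-tunes a sequence-classification model with periodic checkpointing and best-model selection at the end of training. The model name, metric, and hyperparameter values are placeholders, and `bolt` is a concrete subclass instance constructed as in the earlier sketch.

```python
bolt.fine_tune(
    model_name="bert-base-uncased",
    tokenizer_name="bert-base-uncased",
    num_train_epochs=3,
    per_device_batch_size=8,
    model_class="AutoModelForSequenceClassification",
    tokenizer_class="AutoTokenizer",
    precision="bfloat16",
    evaluate=True,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    learning_rate=2e-5,
)
```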
load_dataset(dataset_path, **kwargs)

abstractmethod

Load a dataset from a file.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset file. | required |
split | str | The split to load. Defaults to None. | required |
**kwargs | | Additional keyword arguments. | {} |

Returns:

Type | Description |
---|---|
Union[Dataset, DatasetDict, None] | The loaded dataset. |

Raises:

Type | Description |
---|---|
NotImplementedError | This method should be overridden by subclasses. |
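A sketch of one possible override for JSONL classification data. The file layout, the `text` column name, and the `self.tokenizer` attribute (assumed to be populated by load_models) are assumptions about the surrounding subclass, not part of the base API.

```python
from typing import Union

from datasets import Dataset, DatasetDict
from datasets import load_dataset as hf_load_dataset


class MyFineTuner(BaseFineTuner):  # base class name assumed
    def load_dataset(self, dataset_path: str, **kwargs) -> Union[Dataset, DatasetDict, None]:
        # Load every JSONL file under dataset_path into a single split.
        dataset = hf_load_dataset(
            "json", data_files=f"{dataset_path}/*.jsonl", split="train", **kwargs
        )

        # Tokenize the text column; self.tokenizer is assumed to be set by load_models().
        def tokenize(batch):
            return self.tokenizer(batch["text"], truncation=True, padding="max_length")

        return dataset.map(tokenize, batched=True)
```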
load_models(model_name, tokenizer_name, model_class='AutoModel', tokenizer_class='AutoTokenizer', device_map='auto', precision='bfloat16', quantization=None, lora_config=None, use_accelerate=False, accelerate_no_split_module_classes=[], **kwargs)

Load the model and tokenizer.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_name | str | The name of the model to be loaded. | required |
tokenizer_name | str | The name of the tokenizer to be loaded. Defaults to None. | required |
model_class | str | The class of the model. Defaults to "AutoModel". | 'AutoModel' |
tokenizer_class | str | The class of the tokenizer. Defaults to "AutoTokenizer". | 'AutoTokenizer' |
device_map | str \| dict | The device map for model placement. Defaults to "auto". | 'auto' |
precision | str | The precision to be used. Choose from 'float32', 'float16', 'bfloat16'. Defaults to "bfloat16". | 'bfloat16' |
quantization | Optional[int] | The quantization to be used. Defaults to None. | None |
lora_config | Optional[dict] | The LoRA configuration to be used. Defaults to None. | None |
use_accelerate | bool | Whether to use accelerate. Defaults to False. | False |
accelerate_no_split_module_classes | List[str] | The list of no split module classes to be used. Defaults to []. | [] |
**kwargs | | Additional keyword arguments. | {} |

Raises:

Type | Description |
---|---|
ValueError | If an unsupported precision is chosen. |

Returns:

Type | Description |
---|---|
None | |
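An illustrative direct call, for example to load the model and tokenizer ahead of training and inspect them. The attribute names used afterwards (`bolt.model`, `bolt.tokenizer`) are assumptions about where the base class stores the loaded objects.

```python
bolt.load_models(
    model_name="bert-base-uncased",
    tokenizer_name="bert-base-uncased",
    model_class="AutoModelForSequenceClassification",
    tokenizer_class="AutoTokenizer",
    device_map="auto",
    precision="float16",
)

# Assumed attribute names; verify against your installation.
print(bolt.model.config)
print(bolt.tokenizer)
```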
preprocess_data(**kwargs)

Load and preprocess the dataset.
upload_to_hf_hub(hf_repo_id=None, hf_commit_message=None, hf_token=None, hf_private=None, hf_create_pr=None)

Upload the model and tokenizer to Hugging Face Hub.
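An illustrative call after fine-tuning completes; the repo ID and commit message are placeholders, and the token is read from the environment rather than hard-coded.

```python
import os

bolt.upload_to_hf_hub(
    hf_repo_id="my-org/my-finetuned-model",  # placeholder repository ID
    hf_commit_message="Add fine-tuned model",
    hf_token=os.environ.get("HF_TOKEN"),
    hf_private=True,
    hf_create_pr=False,
)
```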