# Natural Language Inference

Bases: `TextFineTuner`

A bolt for fine-tuning Hugging Face models for natural language inference (NLI) tasks.

This class extends the `TextFineTuner` and specializes in fine-tuning models for natural language inference. It provides additional functionality for loading and preprocessing NLI datasets in various formats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | The batch input data. | required |
| `output` | `OutputConfig` | The output data. | required |
| `state` | `State` | The state manager. | required |
CLI Usage:

```shell
genius NLIFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
        --id MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8
```
### data_collator(examples)

Customize the data collator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `Dict` | The examples to collate. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `Dict` | The collated data. |
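The exact collation logic is delegated to the fine-tuner, but conceptually a collator batches variable-length examples together, padding shorter sequences so all rows share one length. A minimal stand-in sketch (illustrative only, not the bolt's actual implementation):

```python
def pad_batch(examples, pad_id=0):
    """Toy collator: pad variable-length token id lists to the length
    of the longest sequence in the batch, and build the matching
    attention masks. Illustrative only; the real bolt delegates to a
    Hugging Face data collator."""
    max_len = max(len(ids) for ids in examples["input_ids"])
    input_ids, attention_mask = [], []
    for ids in examples["input_ids"]:
        padding = [pad_id] * (max_len - len(ids))
        input_ids.append(ids + padding)
        attention_mask.append([1] * len(ids) + [0] * len(padding))
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": examples["labels"],
    }

batch = pad_batch({"input_ids": [[101, 7592, 102], [101, 102]], "labels": [0, 1]})
```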
### load_dataset(dataset_path, **kwargs)

Load a natural language inference dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Dataset` | `Union[Dataset, DatasetDict, None]` | The loaded dataset. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If there was an error loading the dataset. |
## Supported Data Formats and Structures

### Hugging Face Dataset

Dataset files saved by the Hugging Face `datasets` library.

### JSONL

Each line is a JSON object representing an example.
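For example, one JSON object per line (field values are illustrative; the integer label convention depends on the model being fine-tuned):

```json
{"premise": "A man is playing a guitar.", "hypothesis": "A person makes music.", "label": 0}
{"premise": "A man is playing a guitar.", "hypothesis": "The room is silent.", "label": 2}
```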
### CSV

Should contain 'premise', 'hypothesis', and 'label' columns.

### Parquet

Should contain 'premise', 'hypothesis', and 'label' columns.

### JSON

An array of dictionaries with 'premise', 'hypothesis', and 'label' keys.

### XML

Each 'record' element should contain 'premise', 'hypothesis', and 'label' child elements.
```xml
<record>
  <premise>The premise text</premise>
  <hypothesis>The hypothesis text</hypothesis>
  <label>0</label>
</record>
```
### YAML

Each document should be a dictionary with 'premise', 'hypothesis', and 'label' keys.
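For example, as a multi-document YAML file (values and label convention are illustrative):

```yaml
premise: A man is playing a guitar.
hypothesis: A person makes music.
label: 0
---
premise: A man is playing a guitar.
hypothesis: The room is silent.
label: 2
```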
### TSV

Should contain 'premise', 'hypothesis', and 'label' columns separated by tabs.

### Excel (.xls, .xlsx)

Should contain 'premise', 'hypothesis', and 'label' columns.

### SQLite (.db)

Should contain a table with 'premise', 'hypothesis', and 'label' columns.
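A minimal sketch of preparing such a database with Python's standard-library `sqlite3` (the table name `nli_data` and the sample rows are assumptions for illustration; check the bolt's configuration for the expected table name):

```python
import sqlite3

# Build a database with the expected column layout.
# The table name "nli_data" is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nli_data (premise TEXT, hypothesis TEXT, label INTEGER)"
)
conn.executemany(
    "INSERT INTO nli_data (premise, hypothesis, label) VALUES (?, ?, ?)",
    [
        ("A man is playing a guitar.", "A person makes music.", 0),
        ("A man is playing a guitar.", "The room is silent.", 2),
    ],
)
rows = conn.execute("SELECT premise, hypothesis, label FROM nli_data").fetchall()
```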
### Feather

Should contain 'premise', 'hypothesis', and 'label' columns.
### prepare_train_features(examples)

Tokenize the examples and prepare the features for training.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `dict` | A dictionary of examples. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `dict` | `Dict` | The processed features. |
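For NLI, premise and hypothesis are typically tokenized together as a sentence pair. A toy sketch of the idea, using a stand-in whitespace tokenizer rather than the real Hugging Face tokenizer (all names and token conventions here are illustrative assumptions):

```python
def toy_prepare_train_features(examples, vocab):
    """Illustrative only: join each (premise, hypothesis) pair with
    separator tokens and map words to ids, mimicking how a real
    tokenizer encodes sentence pairs for NLI."""
    features = {"input_ids": [], "labels": examples["label"]}
    for premise, hypothesis in zip(examples["premise"], examples["hypothesis"]):
        tokens = (
            ["[CLS]"] + premise.split() + ["[SEP]"] + hypothesis.split() + ["[SEP]"]
        )
        # Assign each new token the next free id.
        features["input_ids"].append([vocab.setdefault(t, len(vocab)) for t in tokens])
    return features

vocab = {}
features = toy_prepare_train_features(
    {"premise": ["it rains"], "hypothesis": ["ground is wet"], "label": [0]},
    vocab,
)
```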