Natural Language Inference

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models for natural language inference (NLI) tasks.

This class extends TextFineTuner and specializes in fine-tuning models for natural language inference. It adds functionality for loading and preprocessing NLI datasets (premise, hypothesis, label) in various formats.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input | BatchInput | The batch input data. | required |
| output | OutputConfig | The output data. | required |
| state | State | The state manager. | required |

CLI Usage:

    genius NLIFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        --id MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8

data_collator(examples)

Customize the data collator.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| examples | Dict | The examples to collate. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| dict | Dict | The collated data. |

load_dataset(dataset_path, **kwargs)

Load a natural language inference dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_path | str | The path to the dataset directory. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Name | Type | Description |
|------|------|-------------|
| Dataset | Union[Dataset, DatasetDict, None] | The loaded dataset. |

Raises:

| Type | Description |
|------|-------------|
| Exception | If there was an error loading the dataset. |

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing one example. The label is an integer class ID (0, 1, or 2).

{"premise": "The premise text", "hypothesis": "The hypothesis text", "label": 0}
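A file in this layout can be read line by line with the standard library. The following is a minimal sketch, not the loader this class uses; `parse_jsonl` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import json
from typing import Dict, Iterable, List

# Keys every example is expected to carry, per the format described above.
REQUIRED_KEYS = ("premise", "hypothesis", "label")


def parse_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines into example dicts, skipping blank lines."""
    examples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        examples.append(record)
    return examples


# In practice `lines` would be an open file handle; a list keeps the sketch self-contained.
sample_lines = [
    '{"premise": "The premise text", "hypothesis": "The hypothesis text", "label": 0}',
    "",
    '{"premise": "Another premise", "hypothesis": "Another hypothesis", "label": 2}',
]
examples = parse_jsonl(sample_lines)
```

Validating the keys up front means a malformed line fails with its line number instead of surfacing later during tokenization.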

CSV

Should contain 'premise', 'hypothesis', and 'label' columns.

premise,hypothesis,label
"The premise text","The hypothesis text",0
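As an illustration of the column layout, the sketch below reads such a file with the standard `csv` module and converts the label to an integer. The helper name `parse_delimited` is hypothetical, and this is not necessarily how the class itself loads CSV files.

```python
import csv
import io
from typing import Dict, List


def parse_delimited(text: str, delimiter: str = ",") -> List[Dict]:
    """Read 'premise', 'hypothesis', and 'label' columns from delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        rows.append(
            {
                "premise": row["premise"],
                "hypothesis": row["hypothesis"],
                # csv yields strings, so the label is cast explicitly.
                "label": int(row["label"]),
            }
        )
    return rows


csv_text = 'premise,hypothesis,label\n"The premise text","The hypothesis text",0\n'
rows = parse_delimited(csv_text)
```

Passing `delimiter="\t"` to the same helper handles tab-separated input with the same three columns.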

Parquet

Should contain 'premise', 'hypothesis', and 'label' columns.

JSON

An array of dictionaries with 'premise', 'hypothesis', and 'label' keys.

[{"premise": "The premise text", "hypothesis": "The hypothesis text", "label": 0}]

XML

Each 'record' element should contain 'premise', 'hypothesis', and 'label' child elements.

<record>
    <premise>The premise text</premise>
    <hypothesis>The hypothesis text</hypothesis>
    <label>0</label>
</record>
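The record structure above can be walked with `xml.etree.ElementTree`. This is a minimal sketch under the assumption that the records sit under a single root element (called `data` here, a name not given in the source); `parse_records` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_records(xml_text: str) -> List[Dict]:
    """Collect premise/hypothesis/label from every 'record' element."""
    root = ET.fromstring(xml_text)
    examples = []
    for record in root.iter("record"):
        label_text = record.findtext("label")
        if label_text is None:
            raise ValueError("record is missing a 'label' child element")
        examples.append(
            {
                "premise": record.findtext("premise", default=""),
                "hypothesis": record.findtext("hypothesis", default=""),
                "label": int(label_text),
            }
        )
    return examples


# The <data> wrapper is an assumption made so the sample is well-formed XML.
xml_text = """
<data>
    <record>
        <premise>The premise text</premise>
        <hypothesis>The hypothesis text</hypothesis>
        <label>0</label>
    </record>
</data>
"""
records = parse_records(xml_text)
```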

YAML

The file should contain a list of dictionaries, each with 'premise', 'hypothesis', and 'label' keys.

- premise: "The premise text"
  hypothesis: "The hypothesis text"
  label: 0

TSV

Should contain 'premise', 'hypothesis', and 'label' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'premise', 'hypothesis', and 'label' columns.

SQLite (.db)

Should contain a table with 'premise', 'hypothesis', and 'label' columns.
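To make the expected table shape concrete, the sketch below builds an in-memory database and reads the three columns back with the standard `sqlite3` module. The table name `examples` and the helper `read_nli_table` are assumptions for illustration; the source does not say which table name the loader looks for.

```python
import sqlite3
from typing import Dict, List


def read_nli_table(conn: sqlite3.Connection, table: str) -> List[Dict]:
    """Read premise/hypothesis/label rows from the given table."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    cursor = conn.execute(f'SELECT premise, hypothesis, label FROM "{table}"')
    return [
        {"premise": premise, "hypothesis": hypothesis, "label": label}
        for premise, hypothesis, label in cursor
    ]


# In-memory database standing in for a .db file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (premise TEXT, hypothesis TEXT, label INTEGER)")
conn.execute(
    "INSERT INTO examples VALUES (?, ?, ?)",
    ("The premise text", "The hypothesis text", 0),
)
db_rows = read_nli_table(conn, "examples")
conn.close()
```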

Feather

Should contain 'premise', 'hypothesis', and 'label' columns.

prepare_train_features(examples)

Tokenize the examples and prepare the features for training.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| examples | dict | A dictionary of examples. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| dict | Dict | The processed features. |
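The source does not spell out how the features are built. A common pattern for NLI is to encode each premise and hypothesis together as a sentence pair and carry the labels along, which the sketch below illustrates. `prepare_pair_features` and `toy_tokenizer` are hypothetical names; the toy tokenizer is only a stand-in so the example runs without downloading a model, and it mimics the call shape of a Hugging Face tokenizer (`tokenizer(text, text_pair, truncation=..., padding=..., max_length=...)`).

```python
from typing import Callable, Dict, List


def prepare_pair_features(
    examples: Dict[str, List], tokenizer: Callable, max_length: int = 128
) -> Dict[str, List]:
    """Tokenize premise/hypothesis pairs and attach the labels."""
    features = tokenizer(
        examples["premise"],
        examples["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    features["labels"] = list(examples["label"])
    return features


def toy_tokenizer(premises, hypotheses, truncation=True, padding="max_length", max_length=128):
    """Stand-in tokenizer (NOT a real model tokenizer): one ID per whitespace token."""
    batch = []
    for premise, hypothesis in zip(premises, hypotheses):
        # Use each token's character length as a fake ID, then truncate and pad with 0.
        ids = [len(token) for token in f"{premise} {hypothesis}".split()][:max_length]
        ids += [0] * (max_length - len(ids))
        batch.append(ids)
    return {"input_ids": batch}


batch = {
    "premise": ["The premise text"],
    "hypothesis": ["The hypothesis text"],
    "label": [0],
}
features = prepare_pair_features(batch, toy_tokenizer, max_length=8)
```

With a real model, the object returned by `AutoTokenizer.from_pretrained(...)` from the transformers library could be passed in place of `toy_tokenizer`, since it accepts the same keyword arguments.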