Skip to content

Classification

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models for text classification tasks.

This class extends the TextFineTuner and specializes in fine-tuning models for text classification. It provides additional functionalities for loading and preprocessing text classification datasets in various formats.

Parameters:

Name Type Description Default
input BatchInput

The batch input data.

required
output OutputConfig

The output data.

required
state State

The state manager.

required

CLI Usage:

genius TextClassificationFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id cardiffnlp/twitter-roberta-base-hate-multiclass-latest-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8 \
                data_max_length=512

compute_metrics(eval_pred)

Compute metrics for evaluation. This class implements a simple classification evaluation, tasks should ideally override this.

Parameters:

Name Type Description Default
eval_pred EvalPrediction

The evaluation predictions.

required

Returns:

Name Type Description
dict Union[Optional[Dict[str, float]], Dict[str, float]]

The computed metrics.

load_dataset(dataset_path, max_length=512, **kwargs)

Load a classification dataset from a directory.

Parameters:

Name Type Description Default
dataset_path str

The path to the dataset directory.

required
max_length int

The maximum length for tokenization. Defaults to 512.

512

Returns:

Name Type Description
Dataset Optional[Dataset]

The loaded dataset.

Raises:

Type Description
Exception

If there was an error loading the dataset.

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"text": "The text content", "label": "The label"}

CSV

Should contain 'text' and 'label' columns.

text,label
"The text content","The label"

Parquet

Should contain 'text' and 'label' columns.

JSON

An array of dictionaries with 'text' and 'label' keys.

[{"text": "The text content", "label": "The label"}]

XML

Each 'record' element should contain 'text' and 'label' child elements.

<record>
    <text>The text content</text>
    <label>The label</label>
</record>

YAML

Each document should be a dictionary with 'text' and 'label' keys.

- text: "The text content"
label: "The label"

TSV

Should contain 'text' and 'label' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'text' and 'label' columns.

SQLite (.db)

Should contain a table with 'text' and 'label' columns.

Feather

Should contain 'text' and 'label' columns.