Classification

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on text classification tasks. This class extends the TextFineTuner and specializes in fine-tuning models for text classification. It provides additional functionality for loading and preprocessing text classification datasets in various formats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | The batch input data. | required |
| output | OutputConfig | The output data. | required |
| state | State | The state manager. | required |
CLI Usage:

```bash
genius TextClassificationFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id cardiffnlp/twitter-roberta-base-hate-multiclass-latest-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8 \
            data_max_length=512
```
compute_metrics(eval_pred)

Compute metrics for evaluation. This class implements a simple classification evaluation; task-specific subclasses should ideally override this method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| eval_pred | EvalPrediction | The evaluation predictions. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| dict | Union[Optional[Dict[str, float]], Dict[str, float]] | The computed metrics. |
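For reference, here is a minimal sketch of what a simple classification `compute_metrics` typically computes from a `transformers` `EvalPrediction` object, using scikit-learn; it is illustrative only and may differ from this bolt's actual implementation:

```python
# Illustrative sketch only -- not the exact implementation used by this bolt.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import EvalPrediction


def compute_metrics(eval_pred: EvalPrediction) -> dict:
    # eval_pred.predictions holds raw logits; eval_pred.label_ids holds gold labels.
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```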
load_dataset(dataset_path, max_length=512, **kwargs)
Load a classification dataset from a directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| max_length | int | The maximum length for tokenization. Defaults to 512. | 512 |
Returns:

| Name | Type | Description |
|---|---|---|
| Dataset | Optional[Dataset] | The loaded dataset. |

Raises:

| Type | Description |
|---|---|
| Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

CSV

Should contain 'text' and 'label' columns.

Parquet

Should contain 'text' and 'label' columns.

JSON

An array of dictionaries with 'text' and 'label' keys.

XML

Each 'record' element should contain 'text' and 'label' child elements.

YAML

Each document should be a dictionary with 'text' and 'label' keys.

TSV

Should contain 'text' and 'label' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'text' and 'label' columns.

SQLite (.db)

Should contain a table with 'text' and 'label' columns.

Feather

Should contain 'text' and 'label' columns.
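As a quick illustration, the snippet below writes a tiny classification dataset in two of the supported formats (JSONL and CSV) with the expected 'text' and 'label' fields. The file names and example rows are arbitrary; the resulting folder is what you would point --input_folder (or dataset_path) at.

```python
# Hypothetical helper for producing sample input files; names and rows are made up.
import csv
import json
from pathlib import Path

input_dir = Path("./input")
input_dir.mkdir(parents=True, exist_ok=True)

examples = [
    {"text": "I absolutely loved this movie!", "label": "positive"},
    {"text": "The service was slow and the food was cold.", "label": "negative"},
]

# JSONL: one JSON object per line, each with 'text' and 'label' keys.
with open(input_dir / "train.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# CSV: 'text' and 'label' columns.
with open(input_dir / "train.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(examples)
```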