Skip to content

Summarization

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on summarization tasks.

Parameters:

Name Type Description Default
input BatchInput

The batch input data.

required
output OutputConfig

The output data.

required
state State

The state manager.

required

CLI Usage:

    genius SummarizationFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8

compute_metrics(pred)

Compute ROUGE metrics.

Parameters:

Name Type Description Default
pred EvalPrediction

The predicted results.

required

Returns:

Name Type Description
dict Dict[str, float]

A dictionary with ROUGE-1, ROUGE-2, and ROUGE-L scores.

data_collator(examples)

Customize the data collator.

Parameters:

Name Type Description Default
examples List[Dict[str, Union[str, List[int]]]]

The examples to collate.

required

Returns:

Name Type Description
dict Dict[str, Union[List[int], List[List[int]]]]

The collated data.

load_dataset(dataset_path, **kwargs)

Load a dataset from a directory.

Parameters:

Name Type Description Default
dataset_path str

The path to the dataset directory.

required
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
Optional[DatasetDict]

Dataset | DatasetDict: The loaded dataset.

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"text": "The text content", "summary": "The summary"}

CSV

Should contain 'text' and 'summary' columns.

text,summary
"The text content","The summary"

Parquet

Should contain 'text' and 'summary' columns.

JSON

An array of dictionaries with 'text' and 'summary' keys.

[{"text": "The text content", "summary": "The summary"}]

XML

Each 'record' element should contain 'text' and 'summary' child elements.

<record>
    <text>The text content</text>
    <summary>The summary</summary>
</record>

YAML

Each document should be a dictionary with 'text' and 'summary' keys.

- text: "The text content"
  summary: "The summary"

TSV

Should contain 'text' and 'summary' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'text' and 'summary' columns.

SQLite (.db)

Should contain a table with 'text' and 'summary' columns.

Feather

Should contain 'text' and 'summary' columns.

prepare_train_features(examples)

Tokenize the examples and prepare the features for training.

Parameters:

Name Type Description Default
examples dict

A dictionary of examples.

required

Returns:

Name Type Description
dict Optional[Dict[str, List[int]]]

The processed features.