Summarization¶

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on summarization tasks.

Parameters:

Name	Type	Description	Default
`input`	`BatchInput`	The batch input data.	required
`output`	`OutputConfig`	The output data configuration.	required
`state`	`State`	The state manager.	required

CLI Usage:

    genius SummarizationFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8

`compute_metrics(pred)` ¶

Compute ROUGE metrics.

Parameters:

Name	Type	Description	Default
`pred`	`EvalPrediction`	The predicted results.	required

Returns:

Name	Type	Description
`dict`	`Dict[str, float]`	A dictionary with ROUGE-1, ROUGE-2, and ROUGE-L scores.
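To make the returned scores concrete, the sketch below computes a simplified ROUGE-N F1 from n-gram overlap. It is an illustration only: the bolt's actual ROUGE implementation is not shown on this page, the helper name `rouge_n` is hypothetical, tokenization is plain lowercased whitespace splitting, and ROUGE-L (longest common subsequence) is not covered.

```python
from collections import Counter
from typing import Dict


def rouge_n(prediction: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N on lowercased whitespace tokens (hypothetical helper)."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        # Texts shorter than n tokens yield an empty Counter.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts, ref_counts = ngrams(prediction), ngrams(reference)
    # Clipped overlap: an n-gram counts only as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
scores = {
    "rouge1": rouge_n(prediction, reference, n=1)["f1"],
    "rouge2": rouge_n(prediction, reference, n=2)["f1"],
}
```

Here five of six unigrams and three of five bigrams overlap, so `rouge1` is about 0.83 and `rouge2` is 0.6.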

`data_collator(examples)` ¶

Custom data collator that combines the given examples into a single batch.

Parameters:

Name	Type	Description	Default
`examples`	`List[Dict[str, Union[str, List[int]]]]`	The examples to collate.	required

Returns:

Name	Type	Description
`dict`	`Dict[str, Union[List[int], List[List[int]]]]`	The collated data.
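As a rough picture of what collation involves, the sketch below pads every field to the longest sequence in the batch. It is not the bolt's implementation: the function name `pad_batch`, the field names `input_ids` and `labels`, and the padding values are assumptions. The value -100 follows the common convention of marking label positions that the loss should ignore.

```python
from typing import Dict, List


def pad_batch(
    examples: List[Dict[str, List[int]]],
    pad_id: int = 0,
    label_pad_id: int = -100,
) -> Dict[str, List[List[int]]]:
    """Pad each field to the longest sequence in the batch (hypothetical helper)."""
    batch: Dict[str, List[List[int]]] = {}
    for key in examples[0]:
        longest = max(len(example[key]) for example in examples)
        # Labels get a separate fill value so padded positions can be ignored by the loss.
        fill = label_pad_id if key == "labels" else pad_id
        batch[key] = [
            example[key] + [fill] * (longest - len(example[key]))
            for example in examples
        ]
    return batch


batch = pad_batch([
    {"input_ids": [1, 2, 3], "labels": [7]},
    {"input_ids": [4], "labels": [8, 9]},
])
```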

`load_dataset(dataset_path, **kwargs)` ¶

Load a dataset from a directory.

Parameters:

Name	Type	Description	Default
`dataset_path`	`str`	The path to the dataset directory.	required
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Returns:

Type	Description
`Optional[DatasetDict]`	The loaded dataset, as a `Dataset` or `DatasetDict`.

Supported Data Formats and Structures:¶

JSONL¶

Each line is a JSON object representing an example.

{"text": "The text content", "summary": "The summary"}
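A file in this shape can be checked before training with a few lines of standard-library Python. This is a minimal sketch, not the loader used by the bolt; the function name `parse_jsonl` is hypothetical.

```python
import json
from typing import Dict, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse JSONL lines and check for the 'text' and 'summary' keys."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"text", "summary"} - record.keys()
        if missing:
            raise ValueError(f"line {number} is missing keys: {sorted(missing)}")
        records.append(record)
    return records


records = parse_jsonl(['{"text": "The text content", "summary": "The summary"}'])
```

In practice the `lines` argument can be an open file handle, since iterating over a text file yields one line at a time.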

CSV¶

Should contain 'text' and 'summary' columns.

text,summary
"The text content","The summary"

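The CSV layout above can be read with the standard `csv` module. The sketch below is illustrative and independent of the bolt's own loader; `parse_csv` is a hypothetical name.

```python
import csv
import io
from typing import Dict, List, TextIO


def parse_csv(handle: TextIO) -> List[Dict[str, str]]:
    """Read rows from a CSV file that has 'text' and 'summary' columns."""
    reader = csv.DictReader(handle)
    missing = {"text", "summary"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Keep only the two columns of interest; extra columns are ignored.
    return [{"text": row["text"], "summary": row["summary"]} for row in reader]


sample = io.StringIO('text,summary\n"The text content","The summary"\n')
rows = parse_csv(sample)
```

The same function should work for the TSV variant if `csv.DictReader` is given `delimiter="\t"`.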
Parquet¶

Should contain 'text' and 'summary' columns.

JSON¶

An array of dictionaries with 'text' and 'summary' keys.

[{"text": "The text content", "summary": "The summary"}]

XML¶

Each 'record' element should contain 'text' and 'summary' child elements.

<record>
    <text>The text content</text>
    <summary>The summary</summary>
</record>

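Records in this shape can be extracted with `xml.etree.ElementTree`. The sketch below assumes that multiple records are wrapped in a single root element, here hypothetically named `records`, since a well-formed XML file needs one root. The page does not specify the wrapper, and `parse_records` is not part of the bolt.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_records(xml_text: str) -> List[Dict[str, str]]:
    """Collect 'text' and 'summary' from every 'record' element."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also visits the root, so a lone <record> document works too.
    for record in root.iter("record"):
        rows.append({
            "text": record.findtext("text", default=""),
            "summary": record.findtext("summary", default=""),
        })
    return rows


document = """
<records>
    <record>
        <text>The text content</text>
        <summary>The summary</summary>
    </record>
</records>
"""
rows = parse_records(document)
```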
YAML¶

Each entry should be a dictionary with 'text' and 'summary' keys.

- text: "The text content"
  summary: "The summary"

TSV¶

Should contain 'text' and 'summary' columns separated by tabs.

Excel (.xls, .xlsx)¶

Should contain 'text' and 'summary' columns.

SQLite (.db)¶

Should contain a table with 'text' and 'summary' columns.
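A quick way to build and inspect such a database is the standard `sqlite3` module. The table name `examples` below is an assumption, as this page does not name the table, and `read_examples` is a hypothetical helper rather than the bolt's loader.

```python
import sqlite3
from typing import Dict, List


def read_examples(connection: sqlite3.Connection, table: str) -> List[Dict[str, str]]:
    """Read the 'text' and 'summary' columns from the given table."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    cursor = connection.execute(f'SELECT text, summary FROM "{table}"')
    return [{"text": text, "summary": summary} for text, summary in cursor]


# An in-memory database stands in for a .db file on disk.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE examples (text TEXT, summary TEXT)")
connection.execute(
    "INSERT INTO examples VALUES (?, ?)",
    ("The text content", "The summary"),
)
rows = read_examples(connection, "examples")
```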

Feather¶

Should contain 'text' and 'summary' columns.

`prepare_train_features(examples)` ¶

Tokenize the examples and prepare the features for training.

Parameters:

Name	Type	Description	Default
`examples`	`dict`	A dictionary of examples.	required

Returns:

Name	Type	Description
`dict`	`Optional[Dict[str, List[int]]]`	The processed features.
