Summarization
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on summarization tasks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The batch input data. | required |
output | OutputConfig | The output data. | required |
state | State | The state manager. | required |
CLI Usage:
    genius SummarizationFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8
compute_metrics(pred)
Compute ROUGE metrics.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pred | EvalPrediction | The predicted results. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | Dict[str, float] | A dictionary with ROUGE-1, ROUGE-2, and ROUGE-L scores. |
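To make the returned scores concrete, here is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 over whitespace tokens. It is not the implementation used by this class, which may rely on a dedicated metrics library with stemming and other normalization. The function names and the dictionary keys `rouge1`, `rouge2`, and `rougeL` are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, pred_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction: str, reference: str, n: int) -> float:
    """ROUGE-N F1 based on clipped n-gram overlap."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips counts
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence."""
    pred, ref = prediction.split(), reference.split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


def rouge_scores(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Average each score over all prediction/reference pairs."""
    pairs = list(zip(predictions, references))
    count = max(len(pairs), 1)  # avoid division by zero on empty input
    return {
        "rouge1": sum(rouge_n(p, r, 1) for p, r in pairs) / count,
        "rouge2": sum(rouge_n(p, r, 2) for p, r in pairs) / count,
        "rougeL": sum(rouge_l(p, r) for p, r in pairs) / count,
    }


scores = rouge_scores(["the cat sat on the mat"], ["the cat lay on the mat"])
print(scores)
```

In a real `compute_metrics`, the token ids in `pred` would first be decoded back to text before being scored this way.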
data_collator(examples)
Customize the data collator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
examples | List[Dict[str, Union[str, List[int]]]] | The examples to collate. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | Dict[str, Union[List[int], List[List[int]]]] | The collated data. |
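The reference does not say what the customization consists of. As an illustration only, the sketch below shows the padding step that a sequence-to-sequence collator commonly performs: inputs are padded to the longest sequence in the batch, and labels are padded with -100 so that padded positions are ignored by the loss. The function name `collate`, the field names, and the default `pad_token_id` are assumptions, not part of this class.

```python
from typing import Dict, List


def collate(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int = 0,      # assumed padding id; real tokenizers define their own
    label_pad_id: int = -100,   # index ignored by PyTorch's cross-entropy loss by default
) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest input and label length in the batch."""
    max_input = max(len(e["input_ids"]) for e in examples)
    max_label = max(len(e["labels"]) for e in examples)

    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for e in examples:
        ids, labels = e["input_ids"], e["labels"]
        pad = max_input - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        batch["labels"].append(labels + [label_pad_id] * (max_label - len(labels)))
    return batch


batch = collate([
    {"input_ids": [5, 6, 7], "labels": [9]},
    {"input_ids": [8], "labels": [3, 4]},
])
print(batch)
```

A real collator would usually convert these lists to tensors before returning them to the trainer.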
load_dataset(dataset_path, **kwargs)
Load a dataset from a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory. | required |
**kwargs | Any | Additional keyword arguments. | {} |
Returns:
Type | Description |
---|---|
Optional[DatasetDict] | The loaded dataset, as a Dataset or DatasetDict. |
Supported Data Formats and Structures:
JSONL
Each line is a JSON object representing an example.
CSV
Should contain 'text' and 'summary' columns.
Parquet
Should contain 'text' and 'summary' columns.
JSON
An array of dictionaries with 'text' and 'summary' keys.
XML
Each 'record' element should contain 'text' and 'summary' child elements.
YAML
Each document should be a dictionary with 'text' and 'summary' keys.
TSV
Should contain 'text' and 'summary' columns separated by tabs.
Excel (.xls, .xlsx)
Should contain 'text' and 'summary' columns.
SQLite (.db)
Should contain a table with 'text' and 'summary' columns.
Feather
Should contain 'text' and 'summary' columns.
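As a rough illustration of two of the layouts above, the following sketch writes the same records as JSONL and as CSV using only the standard library, then reads them back. The file names `train.jsonl` and `train.csv` and the helper functions are hypothetical, and the sketch assumes that JSONL objects use the same 'text' and 'summary' keys as the other formats.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List

records: List[Dict[str, str]] = [
    {"text": "The storm closed roads across the county on Monday.", "summary": "Storm closes roads."},
    {"text": "The council approved, after debate, a new park budget.", "summary": "Park budget approved."},
]


def write_jsonl(path: Path, rows: List[Dict[str, str]]) -> None:
    """One JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_csv(path: Path, rows: List[Dict[str, str]]) -> None:
    """Header row with 'text' and 'summary' columns, then one row per example."""
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "summary"])
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path: Path) -> List[Dict[str, str]]:
    with path.open(encoding="utf-8", newline="") as f:
        return [dict(row) for row in csv.DictReader(f)]


# Hypothetical dataset directory that could then be passed as dataset_path
dataset_dir = Path(tempfile.mkdtemp())
write_jsonl(dataset_dir / "train.jsonl", records)
write_csv(dataset_dir / "train.csv", records)
print(sorted(p.name for p in dataset_dir.iterdir()))
```

The CSV writer quotes fields that contain commas, so summaries and source texts with punctuation survive the round trip.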
prepare_train_features(examples)
Tokenize the examples and prepare the features for training.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
examples | dict | A dictionary of examples. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | Optional[Dict[str, List[int]]] | The processed features. |
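The shape of this transformation can be sketched without a real model. The example below takes a batch in dictionary-of-lists form, tokenizes 'text' as the model input and 'summary' as the labels, and truncates each to a maximum length. The whitespace tokenizer is a stand-in for a Hugging Face tokenizer, and the names `toy_tokenize`, `prepare_features`, `max_input_length`, and `max_target_length` are assumptions made for illustration.

```python
from typing import Dict, List


def toy_tokenize(text: str, vocab: Dict[str, int], max_length: int) -> List[int]:
    """Stand-in tokenizer: map lowercase whitespace tokens to ids, then truncate.

    Ids start at 1 so that 0 stays free for padding.
    """
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]
    return ids[:max_length]


def prepare_features(
    examples: Dict[str, List[str]],
    vocab: Dict[str, int],
    max_input_length: int = 8,
    max_target_length: int = 4,
) -> Dict[str, List[List[int]]]:
    """Turn a batch of 'text'/'summary' strings into model features."""
    features: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for text, summary in zip(examples["text"], examples["summary"]):
        input_ids = toy_tokenize(text, vocab, max_input_length)
        features["input_ids"].append(input_ids)
        features["attention_mask"].append([1] * len(input_ids))
        # The tokenized summary becomes the training target
        features["labels"].append(toy_tokenize(summary, vocab, max_target_length))
    return features


vocab: Dict[str, int] = {}
batch = {
    "text": ["the storm closed roads across the county", "the council approved a budget"],
    "summary": ["storm closes roads", "budget approved"],
}
features = prepare_features(batch, vocab)
print(features["input_ids"])
print(features["labels"])
```

With a real tokenizer, the same structure applies: inputs and targets are tokenized separately, each with its own length limit, and the target ids are stored under `labels`.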