Summarization¶
Bases: TextBulk
SummarizationBulk is a class for managing bulk text summarization tasks using Hugging Face models. It is designed to handle large-scale summarization tasks efficiently and effectively, utilizing state-of-the-art machine learning models to provide high-quality summaries.
The class provides methods to load datasets, configure summarization models, and execute bulk summarization tasks.
Example CLI Usage:
```bash
genius SummarizationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/summz \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/summz \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/bart-large-cnn-lol \
    summarize \
        --args \
            model_name="facebook/bart-large-cnn" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generation_bos_token_id=0 \
            generation_decoder_start_token_id=2 \
            generation_early_stopping=true \
            generation_eos_token_id=2 \
            generation_forced_bos_token_id=0 \
            generation_forced_eos_token_id=2 \
            generation_length_penalty=2.0 \
            generation_max_length=142 \
            generation_min_length=56 \
            generation_no_repeat_ngram_size=3 \
            generation_num_beams=4 \
            generation_pad_token_id=1 \
            generation_do_sample=false
```
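The `generation_*`-prefixed arguments in the example above are forwarded to the model's generation configuration rather than to model loading. A minimal sketch of how such prefixed kwargs might be separated (an assumption about the internals; `split_generation_args` is a hypothetical helper, not part of the library):

```python
# Hypothetical helper: split "generation_"-prefixed kwargs (destined for
# model.generate()) from the remaining model-loading kwargs. The actual
# geniusrise internals may differ; this only illustrates the naming convention.
from typing import Any, Dict, Tuple

def split_generation_args(args: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (model_args, generation_args) with the prefix stripped."""
    generation_args = {
        key[len("generation_"):]: value
        for key, value in args.items()
        if key.startswith("generation_")
    }
    model_args = {
        key: value
        for key, value in args.items()
        if not key.startswith("generation_")
    }
    return model_args, generation_args

model_args, generation_args = split_generation_args({
    "model_name": "facebook/bart-large-cnn",
    "generation_num_beams": 4,
    "generation_max_length": 142,
})
# generation_args is {"num_beams": 4, "max_length": 142}
```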
__init__(input, output, state, **kwargs)¶

Initializes the SummarizationBulk class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | The input data configuration. | required |
| output | BatchOutput | The output data configuration. | required |
| state | State | The state configuration. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
load_dataset(dataset_path, max_length=512, **kwargs)¶

Load a dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| max_length | int | Maximum length of the tokens (default 512). | 512 |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Type | Description |
|---|---|
| Optional[Dataset] | Dataset \| DatasetDict: The loaded dataset. |
Supported Data Formats and Structures:¶
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain a 'text' column.
Parquet¶
Should contain a 'text' column.
JSON¶
An array of dictionaries with a 'text' key.
XML¶
Each 'record' element should contain a 'text' child element.
YAML¶
Each document should be a dictionary with a 'text' key.
TSV¶
Should contain a 'text' column.
Excel (.xls, .xlsx)¶
Should contain a 'text' column.
SQLite (.db)¶
Should contain a table with a 'text' column.
Feather¶
Should contain a 'text' column.
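For instance, a JSONL dataset matching the format described above can be prepared with the standard library alone (the file name and contents here are illustrative):

```python
# Sketch: write a JSONL dataset file of the shape load_dataset() expects.
# Each line is one JSON object with a "text" key.
import json
import os
import tempfile

documents = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Transformers are a family of sequence-to-sequence models."},
]

dataset_dir = tempfile.mkdtemp()
dataset_path = os.path.join(dataset_dir, "data.jsonl")
with open(dataset_path, "w") as handle:
    for record in documents:
        handle.write(json.dumps(record) + "\n")
```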
summarize(model_name, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, max_length=512, notification_email=None, **kwargs)¶

Perform bulk summarization using the specified model and tokenizer. This method handles the entire summarization process, including loading the model, processing input data, generating summaries, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | Name or path of the summarization model. | required |
| model_class | str | Class name of the model (default "AutoModelForSeq2SeqLM"). | 'AutoModelForSeq2SeqLM' |
| tokenizer_class | str | Class name of the tokenizer (default "AutoTokenizer"). | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference (default False). | False |
| precision | str | Precision for model computation (default "float16"). | 'float16' |
| quantization | int | Level of quantization for optimizing model size and speed (default 0). | 0 |
| device_map | str \| Dict \| None | Specific device to use for computation (default "auto"). | 'auto' |
| max_memory | Dict | Maximum memory configuration for devices. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model (default False). | False |
| compile | bool | Whether to compile the model before inference (default False). | False |
| awq_enabled | bool | Whether to enable AWQ optimization (default False). | False |
| flash_attention | bool | Whether to use flash attention optimization (default False). | False |
| batch_size | int | Number of texts to summarize simultaneously (default 32). | 32 |
| max_length | int | Maximum length of the summary to be generated (default 512). | 512 |
| notification_email | str \| None | Email address to notify upon completion (default None). | None |
| **kwargs | Any | Arbitrary keyword arguments for model and generation configurations. | {} |
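The batch_size parameter implies that inputs are processed in fixed-size chunks, each chunk being summarized in one forward pass. A minimal sketch of that batching behavior, with `summarize_batch` as a stand-in for the actual model call (not the library's real internals):

```python
# Sketch of bulk batching: split the input texts into chunks of batch_size,
# summarize each chunk, and collect the results in order.
from typing import Callable, Iterator, List

def batched(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive chunks of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

def bulk_summarize(
    texts: List[str],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Apply summarize_batch to each chunk and concatenate the summaries."""
    summaries: List[str] = []
    for batch in batched(texts, batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries
```

With five inputs and batch_size=2, for example, the model call runs three times (chunks of 2, 2, and 1) while the output order matches the input order.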