
Summarization

Bases: TextBulk

SummarizationBulk is a class for managing bulk text summarization tasks using Hugging Face models. It is designed to handle large-scale summarization workloads efficiently, using state-of-the-art sequence-to-sequence models to produce high-quality summaries.

The class provides methods to load datasets, configure summarization models, and execute bulk summarization tasks.

Example CLI Usage:

genius SummarizationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/summz \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/summz \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/bart-large-cnn-lol \
    summarize \
        --args \
            model_name="facebook/bart-large-cnn" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generation_bos_token_id=0 \
            generation_decoder_start_token_id=2 \
            generation_early_stopping=true \
            generation_eos_token_id=2 \
            generation_forced_bos_token_id=0 \
            generation_forced_eos_token_id=2 \
            generation_length_penalty=2.0 \
            generation_max_length=142 \
            generation_min_length=56 \
            generation_no_repeat_ngram_size=3 \
            generation_num_beams=4 \
            generation_pad_token_id=1 \
            generation_do_sample=false
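
Before running the command above, the input S3 folder must contain documents in one of the formats accepted by load_dataset (documented below). A minimal sketch, assuming JSONL input; the file name and record contents are illustrative:

```python
import json

# Each line is one JSON object with a "text" key, as expected by
# load_dataset. Upload the resulting file to the input S3 folder.
records = [
    {"text": "First long article to summarize."},
    {"text": "Second long article to summarize."},
]

with open("batch-0.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```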

__init__(input, output, state, **kwargs)

Initializes the SummarizationBulk class.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input` | `BatchInput` | The input data configuration. | required |
| `output` | `BatchOutput` | The output data configuration. | required |
| `state` | `State` | The state configuration. | required |
| `**kwargs` | | Additional keyword arguments. | `{}` |

load_dataset(dataset_path, max_length=512, **kwargs)

Load a dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `max_length` | `int` | Maximum token length (default 512). | `512` |
| `**kwargs` | | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Optional[Dataset]` | `Dataset \| DatasetDict`: the loaded dataset. |

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"text": "The text content"}

CSV

Should contain a 'text' column.

text
"The text content"

Parquet

Should contain a 'text' column.

JSON

An array of dictionaries with a 'text' key.

[{"text": "The text content"}]

XML

Each 'record' element should contain a 'text' child element.

<record>
    <text>The text content</text>
</record>

YAML

Each document should be a dictionary with a 'text' key.

- text: "The text content"

TSV

Should contain a 'text' column.

Excel (.xls, .xlsx)

Should contain a 'text' column.

SQLite (.db)

Should contain a table with a 'text' column.

Feather

Should contain a 'text' column.
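
To illustrate two of the layouts above, here is a minimal stdlib-only sketch (file names are hypothetical) that writes a CSV file with a 'text' column and an XML file of 'record' elements:

```python
import csv
import xml.etree.ElementTree as ET

# CSV: a single 'text' column, one row per example.
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerow({"text": "The text content"})

# XML: each <record> element carries a <text> child element.
root = ET.Element("records")
record = ET.SubElement(root, "record")
ET.SubElement(record, "text").text = "The text content"
ET.ElementTree(root).write("data.xml")
```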

summarize(model_name, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, max_length=512, notification_email=None, **kwargs)

Perform bulk summarization using the specified model and tokenizer. This method handles the entire summarization process, including loading the model, processing input data, generating summaries, and saving the results.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_name` | `str` | Name or path of the summarization model. | required |
| `model_class` | `str` | Class name of the model (default "AutoModelForSeq2SeqLM"). | `'AutoModelForSeq2SeqLM'` |
| `tokenizer_class` | `str` | Class name of the tokenizer (default "AutoTokenizer"). | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference (default False). | `False` |
| `precision` | `str` | Precision for model computation (default "float16"). | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed (default 0). | `0` |
| `device_map` | `str \| Dict \| None` | Specific device(s) to use for computation (default "auto"). | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model (default False). | `False` |
| `compile` | `bool` | Whether to compile the model before use (default False). | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization (default False). | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization (default False). | `False` |
| `batch_size` | `int` | Number of documents to process simultaneously (default 32). | `32` |
| `max_length` | `int` | Maximum length of the generated summary in tokens (default 512). | `512` |
| `notification_email` | `Optional[str]` | Email address to notify upon completion (default None). | `None` |
| `**kwargs` | `Any` | Arbitrary keyword arguments for model and generation configurations. | `{}` |
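
The CLI example above passes options prefixed with `generation_` alongside model-loading options. As a sketch of that convention (the prefix-stripping shown here is an assumption about how such kwargs are separated, not the library's confirmed implementation):

```python
def split_generation_args(kwargs: dict) -> tuple[dict, dict]:
    """Separate generation_* options (prefix stripped) from the rest."""
    prefix = "generation_"
    generation = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
    other = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return generation, other

gen, rest = split_generation_args(
    {"generation_num_beams": 4, "generation_max_length": 142, "batch_size": 32}
)
# gen is {"num_beams": 4, "max_length": 142}; rest is {"batch_size": 32}
```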