
Summarization

Bases: TextBulk

SummarizationBulk is a class for managing bulk text summarization tasks using Hugging Face models. It is designed to handle large-scale summarization workloads efficiently, using state-of-the-art sequence-to-sequence models to produce high-quality summaries.

The class provides methods to load datasets, configure summarization models, and execute bulk summarization tasks.

Example CLI Usage:

genius SummarizationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/summz \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/summz \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/bart-large-cnn-lol \
    summarize \
        --args \
            model_name="facebook/bart-large-cnn" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generation_bos_token_id=0 \
            generation_decoder_start_token_id=2 \
            generation_early_stopping=true \
            generation_eos_token_id=2 \
            generation_forced_bos_token_id=0 \
            generation_forced_eos_token_id=2 \
            generation_length_penalty=2.0 \
            generation_max_length=142 \
            generation_min_length=56 \
            generation_no_repeat_ngram_size=3 \
            generation_num_beams=4 \
            generation_pad_token_id=1 \
            generation_do_sample=false
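
Before running the command above, the input S3 folder must contain documents in one of the formats accepted by load_dataset (documented below). A minimal sketch, assuming JSONL input; the file name and record contents are illustrative:

```python
import json

# Each line is one JSON object with a "text" key, as expected by
# load_dataset. Upload the resulting file to the input S3 folder.
records = [
    {"text": "First long article to summarize."},
    {"text": "Second long article to summarize."},
]

with open("batch-0.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```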

__init__(input, output, state, **kwargs)

Initializes the SummarizationBulk class.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `input` | `BatchInput` | The input data configuration. | required |
| `output` | `BatchOutput` | The output data configuration. | required |
| `state` | `State` | The state configuration. | required |
| `**kwargs` | | Additional keyword arguments. | `{}` |

load_dataset(dataset_path, max_length=512, **kwargs)

Load a dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `max_length` | `int` | Maximum token length (default 512). | `512` |
| `**kwargs` | | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Optional[Dataset]` | `Dataset \| DatasetDict`: the loaded dataset. |

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example.

{"text": "The text content"}

CSV

Should contain a 'text' column.

text
"The text content"

Parquet

Should contain a 'text' column.

JSON

An array of dictionaries with a 'text' key.

[{"text": "The text content"}]

XML

Each 'record' element should contain a 'text' child element.

<record>
    <text>The text content</text>
</record>

YAML

Each document should be a dictionary with a 'text' key.

- text: "The text content"

TSV

Should contain a 'text' column.

Excel (.xls, .xlsx)

Should contain a 'text' column.

SQLite (.db)

Should contain a table with a 'text' column.

Feather

Should contain a 'text' column.
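
To illustrate two of the layouts above, here is a minimal stdlib-only sketch (file names are hypothetical) that writes a CSV file with a 'text' column and an XML file of 'record' elements:

```python
import csv
import xml.etree.ElementTree as ET

# CSV: a single 'text' column, one row per example.
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text"])
    writer.writeheader()
    writer.writerow({"text": "The text content"})

# XML: each <record> element carries a <text> child element.
root = ET.Element("records")
record = ET.SubElement(root, "record")
ET.SubElement(record, "text").text = "The text content"
ET.ElementTree(root).write("data.xml")
```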

summarize(model_name, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, max_length=512, notification_email=None, **kwargs)

Perform bulk summarization using the specified model and tokenizer. This method handles the entire summarization process, including loading the model, processing input data, generating summaries, and saving the results.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_name` | `str` | Name or path of the summarization model. | required |
| `model_class` | `str` | Class name of the model (default "AutoModelForSeq2SeqLM"). | `'AutoModelForSeq2SeqLM'` |
| `tokenizer_class` | `str` | Class name of the tokenizer (default "AutoTokenizer"). | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference (default False). | `False` |
| `precision` | `str` | Precision for model computation (default "float16"). | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed (default 0). | `0` |
| `device_map` | `str \| Dict \| None` | Specific device(s) to use for computation (default "auto"). | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model (default False). | `False` |
| `compile` | `bool` | Whether to compile the model before use (default False). | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization (default False). | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization (default False). | `False` |
| `batch_size` | `int` | Number of documents to process simultaneously (default 32). | `32` |
| `max_length` | `int` | Maximum length of the generated summary in tokens (default 512). | `512` |
| `notification_email` | `Optional[str]` | Email address to notify upon completion (default None). | `None` |
| `**kwargs` | `Any` | Arbitrary keyword arguments for model and generation configurations. | `{}` |
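
The CLI example above passes options prefixed with `generation_` alongside model-loading options. As a sketch of that convention (the prefix-stripping shown here is an assumption about how such kwargs are separated, not the library's confirmed implementation):

```python
def split_generation_args(kwargs: dict) -> tuple[dict, dict]:
    """Separate generation_* options (prefix stripped) from the rest."""
    prefix = "generation_"
    generation = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
    other = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return generation, other

gen, rest = split_generation_args(
    {"generation_num_beams": 4, "generation_max_length": 142, "batch_size": 32}
)
# gen is {"num_beams": 4, "max_length": 142}; rest is {"batch_size": 32}
```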