Translation

Bases: TextBulk

TranslationBulk is a class for managing bulk translations using Hugging Face models. It is designed to handle large-scale translation tasks efficiently, applying state-of-the-art sequence-to-sequence models to produce high-quality translations across a wide range of language pairs.

This class provides methods for loading datasets, configuring translation models, and executing bulk translation tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input` | `BatchInput` | Configuration and data inputs for batch processing. | required |
| `output` | `BatchOutput` | Configuration for output data handling. | required |
| `state` | `State` | State management for translation tasks. | required |
| `**kwargs` | | Arbitrary keyword arguments for extended functionality. | `{}` |

Example CLI Usage for Bulk Translation Task:

genius TranslationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/trans \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/trans \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/mbart-large-50-many-to-many-mmt-lol \
    translate \
        --args \
            model_name="facebook/mbart-large-50-many-to-many-mmt" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            origin="hi_IN" \
            target="en_XX" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generate_decoder_start_token_id=2 \
            generate_early_stopping=True \
            generate_eos_token_id=2 \
            generate_forced_eos_token_id=2 \
            generate_max_length=200 \
            generate_num_beams=5 \
            generate_pad_token_id=1

load_dataset(dataset_path, max_length=512, origin='en', target='hi', **kwargs)

Load a dataset from a directory.

Supported Data Formats and Structures for Translation Tasks:

Note: All examples below assume the source language is "en"; consult the specific model's documentation for the language codes it expects.

JSONL

Each line is a JSON object representing an example.

{
    "translation": {
        "en": "English text"
    }
}
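A file in this layout can be written and read back with nothing but the standard library (the file name and sentences here are illustrative):

```python
import json

# Illustrative records in the expected schema: one JSON object per
# line, each holding a "translation" dict keyed by the source
# language code ("en" here).
records = [
    {"translation": {"en": "Hello, world"}},
    {"translation": {"en": "How are you?"}},
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line, as a JSONL loader would.
with open("input.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```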

CSV

Should contain an 'en' column.

en
"English text"

Parquet

Should contain an 'en' column.

JSON

An array of dictionaries with an 'en' key.

[
    {
        "en": "English text"
    }
]

XML

Each 'record' element should contain an 'en' child element.

<record>
    <en>English text</en>
</record>
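Since a well-formed XML file needs a single top-level element, the 'record' elements are assumed to sit under some wrapper; a sketch of parsing this layout with the standard library's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# A minimal document with one <record> per example; the <root>
# wrapper element is an assumption, not part of the documented schema.
xml_data = """
<root>
    <record>
        <en>English text</en>
    </record>
    <record>
        <en>Another sentence</en>
    </record>
</root>
"""

root = ET.fromstring(xml_data)
# Collect the source-language text from each record's <en> child.
texts = [record.findtext("en") for record in root.iter("record")]
```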

YAML

Each document should be a dictionary with an 'en' key.

- en: "English text"

TSV

Should contain an 'en' column, with columns separated by tabs.

Excel (.xls, .xlsx)

Should contain an 'en' column.

SQLite (.db)

Should contain a table with an 'en' column.
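A minimal sketch of preparing such a database with the standard library's sqlite3 module (the table name `translations` is an assumption; any table with an 'en' column fits the description above):

```python
import sqlite3

# Create an in-memory database holding a table with an 'en' column;
# swap ":memory:" for a .db file path to produce a real dataset file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translations (en TEXT)")
conn.executemany(
    "INSERT INTO translations (en) VALUES (?)",
    [("English text",), ("Another sentence",)],
)
conn.commit()

# Read the column back, as a loader scanning the table would.
rows = [row[0] for row in conn.execute("SELECT en FROM translations")]
conn.close()
```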

Feather

Should contain an 'en' column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_path` | `str` | The path to the directory containing the dataset files. | required |
| `max_length` | `int` | The maximum length for tokenization. | `512` |
| `origin` | `str` | The origin (source) language. | `'en'` |
| `target` | `str` | The target language. | `'hi'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DatasetDict` | `Optional[Dataset]` | The loaded dataset. |
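As a sketch, a directory suitable for load_dataset can be assembled from any of the formats above using only the standard library (the TranslationBulk instance itself and its construction are omitted; the commented call shows the intended usage):

```python
import json
import os
import tempfile

# Build a dataset directory containing one JSONL file in the schema
# described above; load_dataset expects a directory path, not a file.
dataset_dir = tempfile.mkdtemp()
path = os.path.join(dataset_dir, "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"translation": {"en": "Hello"}}) + "\n")

# With a configured TranslationBulk instance (construction omitted
# here), the directory would then be loaded with, e.g.:
# dataset = translation_bulk.load_dataset(dataset_dir, origin="en", target="hi")

files = os.listdir(dataset_dir)
```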

translate(model_name, origin, target, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)

Perform bulk translation using the specified model and tokenizer. This method handles the entire translation process including loading the model, processing input data, generating translations, and saving the results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `str` | Name or path of the translation model. | required |
| `origin` | `str` | Source language ISO code. | required |
| `target` | `str` | Target language ISO code. | required |
| `max_length` | `int` | Maximum length of the tokens. | `512` |
| `model_class` | `str` | Class name of the model. | `'AutoModelForSeq2SeqLM'` |
| `tokenizer_class` | `str` | Class name of the tokenizer. | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference. | `False` |
| `precision` | `str` | Precision for model computation. | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed. | `0` |
| `device_map` | `str \| Dict \| None` | Specific device to use for computation. | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model. | `False` |
| `compile` | `bool` | Whether to compile the model before inference. | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization. | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization. | `False` |
| `batch_size` | `int` | Number of translations to process simultaneously. | `32` |
| `notification_email` | `Optional[str]` | Email address to notify when the task completes. | `None` |

**kwargs (`Any`): Arbitrary keyword arguments for model and generation configurations.