Translation

Bases: TextBulk

TranslationBulk is a class for managing bulk translations using Hugging Face models. It is designed to handle large-scale translation tasks efficiently, applying state-of-the-art sequence-to-sequence models to produce high-quality translations across a wide range of language pairs.

This class provides methods for loading datasets, configuring translation models, and executing bulk translation tasks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input` | `BatchInput` | Configuration and data inputs for batch processing. | required |
| `output` | `BatchOutput` | Configuration for output data handling. | required |
| `state` | `State` | State management for translation tasks. | required |
| `**kwargs` | | Arbitrary keyword arguments for extended functionality. | `{}` |

Example CLI Usage for Bulk Translation Task:

genius TranslationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/trans \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/trans \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/mbart-large-50-many-to-many-mmt-lol \
    translate \
        --args \
            model_name="facebook/mbart-large-50-many-to-many-mmt" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            origin="hi_IN" \
            target="en_XX" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generate_decoder_start_token_id=2 \
            generate_early_stopping=True \
            generate_eos_token_id=2 \
            generate_forced_eos_token_id=2 \
            generate_max_length=200 \
            generate_num_beams=5 \
            generate_pad_token_id=1

load_dataset(dataset_path, max_length=512, origin='en', target='hi', **kwargs)

Load a dataset from a directory.

Supported Data Formats and Structures for Translation Tasks:

Note: All examples below assume the source language is "en"; consult the specific model's documentation for the language codes it expects.

JSONL

Each line is a JSON object representing an example.

{
    "translation": {
        "en": "English text"
    }
}
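A file in this layout can be written and read back with nothing but the standard library (the file name and sentences here are illustrative):

```python
import json

# Illustrative records in the expected schema: one JSON object per
# line, each holding a "translation" dict keyed by the source
# language code ("en" here).
records = [
    {"translation": {"en": "Hello, world"}},
    {"translation": {"en": "How are you?"}},
]

with open("input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back line by line, as a JSONL loader would.
with open("input.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```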

CSV

Should contain an 'en' column.

en
"English text"

Parquet

Should contain an 'en' column.

JSON

An array of dictionaries with an 'en' key.

[
    {
        "en": "English text"
    }
]

XML

Each 'record' element should contain an 'en' child element.

<record>
    <en>English text</en>
</record>
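Since a well-formed XML file needs a single top-level element, the 'record' elements are assumed to sit under some wrapper; a sketch of parsing this layout with the standard library's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# A minimal document with one <record> per example; the <root>
# wrapper element is an assumption, not part of the documented schema.
xml_data = """
<root>
    <record>
        <en>English text</en>
    </record>
    <record>
        <en>Another sentence</en>
    </record>
</root>
"""

root = ET.fromstring(xml_data)
# Collect the source-language text from each record's <en> child.
texts = [record.findtext("en") for record in root.iter("record")]
```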

YAML

Each document should be a dictionary with an 'en' key.

- en: "English text"

TSV

Should contain an 'en' column, with columns separated by tabs.

Excel (.xls, .xlsx)

Should contain an 'en' column.

SQLite (.db)

Should contain a table with an 'en' column.
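A minimal sketch of preparing such a database with the standard library's sqlite3 module (the table name `translations` is an assumption; any table with an 'en' column fits the description above):

```python
import sqlite3

# Create an in-memory database holding a table with an 'en' column;
# swap ":memory:" for a .db file path to produce a real dataset file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translations (en TEXT)")
conn.executemany(
    "INSERT INTO translations (en) VALUES (?)",
    [("English text",), ("Another sentence",)],
)
conn.commit()

# Read the column back, as a loader scanning the table would.
rows = [row[0] for row in conn.execute("SELECT en FROM translations")]
conn.close()
```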

Feather

Should contain an 'en' column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_path` | `str` | The path to the directory containing the dataset files. | required |
| `max_length` | `int` | The maximum length for tokenization. | `512` |
| `origin` | `str` | The origin (source) language. | `'en'` |
| `target` | `str` | The target language. | `'hi'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DatasetDict` | `Optional[Dataset]` | The loaded dataset. |
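As a sketch, a directory suitable for load_dataset can be assembled from any of the formats above using only the standard library (the TranslationBulk instance itself and its construction are omitted; the commented call shows the intended usage):

```python
import json
import os
import tempfile

# Build a dataset directory containing one JSONL file in the schema
# described above; load_dataset expects a directory path, not a file.
dataset_dir = tempfile.mkdtemp()
path = os.path.join(dataset_dir, "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"translation": {"en": "Hello"}}) + "\n")

# With a configured TranslationBulk instance (construction omitted
# here), the directory would then be loaded with, e.g.:
# dataset = translation_bulk.load_dataset(dataset_dir, origin="en", target="hi")

files = os.listdir(dataset_dir)
```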

translate(model_name, origin, target, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)

Perform bulk translation using the specified model and tokenizer. This method handles the entire translation process including loading the model, processing input data, generating translations, and saving the results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `str` | Name or path of the translation model. | required |
| `origin` | `str` | Source language ISO code. | required |
| `target` | `str` | Target language ISO code. | required |
| `max_length` | `int` | Maximum length of the tokens. | `512` |
| `model_class` | `str` | Class name of the model. | `'AutoModelForSeq2SeqLM'` |
| `tokenizer_class` | `str` | Class name of the tokenizer. | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference. | `False` |
| `precision` | `str` | Precision for model computation. | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed. | `0` |
| `device_map` | `str \| Dict \| None` | Specific device to use for computation. | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model. | `False` |
| `compile` | `bool` | Whether to compile the model before inference. | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization. | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization. | `False` |
| `batch_size` | `int` | Number of translations to process simultaneously. | `32` |
| `notification_email` | `Optional[str]` | Email address to notify when the task completes. | `None` |

**kwargs (`Any`): Arbitrary keyword arguments for model and generation configurations.