Translation¶
Bases: TextBulk
TranslationBulk is a class for managing bulk translations using Hugging Face models. It is designed to handle large-scale translation tasks efficiently, using state-of-the-art machine learning models to provide high-quality translations across many language pairs.
This class provides methods for loading datasets, configuring translation models, and executing bulk translation tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | Configuration and data inputs for batch processing. | required |
| `output` | `BatchOutput` | Configuration for output data handling. | required |
| `state` | `State` | State management for translation tasks. | required |
| `**kwargs` | | Arbitrary keyword arguments for extended functionality. | `{}` |
Example CLI Usage for Bulk Translation Task:

```bash
genius TranslationBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/trans \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/trans \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id facebook/mbart-large-50-many-to-many-mmt-lol \
    translate \
        --args \
            model_name="facebook/mbart-large-50-many-to-many-mmt" \
            model_class="AutoModelForSeq2SeqLM" \
            tokenizer_class="AutoTokenizer" \
            origin="hi_IN" \
            target="en_XX" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            generate_decoder_start_token_id=2 \
            generate_early_stopping=True \
            generate_eos_token_id=2 \
            generate_forced_eos_token_id=2 \
            generate_max_length=200 \
            generate_num_beams=5 \
            generate_pad_token_id=1
```
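The `generate_*` arguments in the example above configure text generation, while the remaining arguments configure model loading. As a sketch of that prefix convention, the hypothetical helper below (not the library's internals) splits prefixed kwargs out and strips the prefix:

```python
def split_generation_kwargs(kwargs: dict) -> tuple:
    """Separate 'generate_'-prefixed kwargs from the rest, stripping the prefix."""
    prefix = "generate_"
    generation = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
    other = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return generation, other

# Hypothetical mixed kwargs, as they might arrive from the CLI.
gen, rest = split_generation_kwargs(
    {"generate_max_length": 200, "generate_num_beams": 5, "precision": "float"}
)
print(gen)   # {'max_length': 200, 'num_beams': 5}
print(rest)  # {'precision': 'float'}
```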
load_dataset(dataset_path, max_length=512, origin='en', target='hi', **kwargs) ¶
Load a dataset from a directory.
Supported Data Formats and Structures for Translation Tasks:¶
Note: All examples assume the source language is "en"; refer to the specific model's documentation for the correct language codes.
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'en' column.
Parquet¶
Should contain 'en' column.
JSON¶
An array of dictionaries with 'en' key.
XML¶
Each 'record' element should contain 'en' child elements.
YAML¶
Each document should be a dictionary with 'en' key.
TSV¶
Should contain 'en' column separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'en' column.
SQLite (.db)¶
Should contain a table with 'en' column.
Feather¶
Should contain 'en' column.
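As a quick illustration of the JSONL layout described above, the snippet below writes a small file in which each line is a JSON object keyed by the source language, then reads it back. The directory, filename, and sample sentences are hypothetical:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records: each line holds the source text under
# the source-language key ("en" here, matching the examples above).
records = [
    {"en": "Hello, world."},
    {"en": "How are you?"},
]

dataset_dir = Path(tempfile.mkdtemp())
with open(dataset_dir / "data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back yields one dict per line, each with the 'en' key.
lines = (dataset_dir / "data.jsonl").read_text().splitlines()
loaded = [json.loads(line) for line in lines]
print(loaded[0]["en"])  # Hello, world.
```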
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the directory containing the dataset files. | required |
| `max_length` | `int` | The maximum length for tokenization. | `512` |
| `origin` | `str` | The origin language. | `'en'` |
| `target` | `str` | The target language. | `'hi'` |
| `**kwargs` | | Additional keyword arguments. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetDict` | `Optional[Dataset]` | The loaded dataset. |
translate(model_name, origin, target, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs) ¶
Perform bulk translation using the specified model and tokenizer. This method handles the entire translation process including loading the model, processing input data, generating translations, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Name or path of the translation model. | required |
| `origin` | `str` | Source language ISO code. | required |
| `target` | `str` | Target language ISO code. | required |
| `max_length` | `int` | Maximum length of the tokens. | `512` |
| `model_class` | `str` | Class name of the model. | `'AutoModelForSeq2SeqLM'` |
| `tokenizer_class` | `str` | Class name of the tokenizer. | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference. | `False` |
| `precision` | `str` | Precision for model computation. | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed. | `0` |
| `device_map` | `str \| Dict \| None` | Specific device(s) to use for computation. | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model. | `False` |
| `compile` | `bool` | Whether to compile the model before use. | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization. | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization. | `False` |
| `batch_size` | `int` | Number of translations to process simultaneously. | `32` |
| `**kwargs` | `Any` | Arbitrary keyword arguments for model and generation configurations. | `{}` |
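The `batch_size` parameter controls how many inputs are translated per forward pass. A minimal, self-contained sketch of that chunking step (the helper name and sentences are hypothetical, not the library's internals):

```python
from typing import Iterator, List

def chunk_texts(texts: List[str], batch_size: int = 32) -> Iterator[List[str]]:
    """Yield successive batches of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Hypothetical input: 70 sentences split into batches of 32, 32, and 6,
# each of which would be tokenized and translated together.
sentences = [f"sentence {i}" for i in range(70)]
batches = list(chunk_texts(sentences, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```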