Translation¶

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on translation tasks.

Args:
    input (BatchInput): The batch input data.
    output (OutputConfig): The output data.
    state (State): The state manager.
    **kwargs: Arbitrary keyword arguments for extended functionality.

CLI Usage:

    genius TranslationFineTuner rise \
        batch \
            --input_s3_bucket geniusrise-test \
            --input_s3_folder input/trans \
        batch \
            --output_s3_bucket geniusrise-test \
            --output_s3_folder output/trans \
        postgres \
            --postgres_host 127.0.0.1 \
            --postgres_port 5432 \
            --postgres_user postgres \
            --postgres_password postgres \
            --postgres_database geniusrise \
            --postgres_table state \
        --id facebook/mbart-large-50-many-to-many-mmt-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8 \
                data_max_length=512

`data_collator(examples)` ¶

Customize the data collator.

Parameters:

Name	Type	Description	Default
`examples`		The examples to collate.	required

Returns:

Name	Type	Description
`dict`		The collated data.
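The source does not say what this collator does internally. As a rough illustration of what collation usually means for sequence-to-sequence batches, the sketch below pads every example to the longest one in the batch. The helper `pad_batch`, its parameters, and the field names `input_ids`, `attention_mask`, and `labels` are assumptions for illustration; `-100` is the value commonly used with Hugging Face models to exclude padded label positions from the loss.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad variable-length examples to a common length (hypothetical helper)."""
    max_inputs = max(len(ex["input_ids"]) for ex in examples)
    max_labels = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_inputs - len(ex["input_ids"])
        # Pad the inputs and mark the padded positions with 0 in the mask.
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Pad the labels with a value that the loss is expected to ignore.
        n_label_pad = max_labels - len(ex["labels"])
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_label_pad)
    return batch


batch = pad_batch([
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
])
```

In a real training run the lists would normally be converted to tensors before being handed to the model.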

`load_dataset(dataset_path, max_length=512, origin='en', target='fr', **kwargs)` ¶

Load a dataset from a directory.

Supported Data Formats and Structures for Translation Tasks:¶

JSONL¶

Each line is a JSON object representing an example.

{
    "translation": {
        "en": "English text",
        "fr": "French text"
    }
}
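
A minimal sketch of producing a file in this layout with Python's standard library. The helper `write_jsonl`, the temporary directory, and the file name `train.jsonl` are illustrative assumptions, not part of the library.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, pairs, origin="en", target="fr"):
    """Write (source, target) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            record = {"translation": {origin: src, target: tgt}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical dataset directory and file name, for illustration only.
dataset_dir = Path(tempfile.mkdtemp())
jsonl_path = dataset_dir / "train.jsonl"
write_jsonl(jsonl_path, [("Hello", "Bonjour"), ("Thank you", "Merci")])
```

The resulting directory is the kind of path that would be passed as `dataset_path`.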

CSV¶

Should contain 'en' and 'fr' columns.

en,fr
"English text","French text"
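
Sentences often contain commas, so it is safer to let a CSV writer handle the quoting than to build lines by hand. The sketch below uses Python's `csv` module; the helper `pairs_to_csv` is a hypothetical name introduced for illustration.

```python
import csv
import io


def pairs_to_csv(pairs, origin="en", target="fr"):
    """Serialize (source, target) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[origin, target], lineterminator="\n")
    writer.writeheader()
    for src, tgt in pairs:
        # Fields containing commas or quotes are quoted automatically.
        writer.writerow({origin: src, target: tgt})
    return buffer.getvalue()


csv_text = pairs_to_csv([("Hello, world", "Bonjour, le monde")])
```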

Parquet¶

Should contain 'en' and 'fr' columns.

JSON¶

An array of dictionaries with 'en' and 'fr' keys.

[
    {
        "en": "English text",
        "fr": "French text"
    }
]

XML¶

Each 'record' element should contain 'en' and 'fr' child elements.

<record>
    <en>English text</en>
    <fr>French text</fr>
</record>
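
A sketch of generating such records with `xml.etree.ElementTree`. The enclosing root element name `records` and the helper `pairs_to_xml` are assumptions; the source only specifies the shape of each `record` element.

```python
import xml.etree.ElementTree as ET


def pairs_to_xml(pairs, origin="en", target="fr"):
    """Build an XML string with one <record> per translation pair."""
    root = ET.Element("records")  # assumed root name, not given by the source
    for src, tgt in pairs:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, origin).text = src
        ET.SubElement(record, target).text = tgt
    return ET.tostring(root, encoding="unicode")


xml_text = pairs_to_xml([("English text", "French text")])
```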

YAML¶

Each document should be a dictionary with 'en' and 'fr' keys.

- en: "English text"
  fr: "French text"

TSV¶

Should contain 'en' and 'fr' columns separated by tabs.

Excel (.xls, .xlsx)¶

Should contain 'en' and 'fr' columns.

SQLite (.db)¶

Should contain a table with 'en' and 'fr' columns.
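
A sketch of creating such a database with Python's `sqlite3` module. The table name `translations` and the file name `train.db` are assumptions; the source does not say which table name the loader expects.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical database location, for illustration only.
db_path = Path(tempfile.mkdtemp()) / "train.db"

conn = sqlite3.connect(str(db_path))
with conn:  # commits on success, rolls back on error
    conn.execute("CREATE TABLE translations (en TEXT NOT NULL, fr TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO translations (en, fr) VALUES (?, ?)",
        [("Hello", "Bonjour"), ("Thank you", "Merci")],
    )
conn.close()
```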

Feather¶

Should contain 'en' and 'fr' columns.

Parameters:

Name	Type	Description	Default
`dataset_path`	`str`	The path to the directory containing the dataset files.	required
`max_length`	`int`	The maximum sequence length, in tokens, used during tokenization.	`512`
`origin`	`str`	The language to translate from.	`'en'`
`target`	`str`	The language to translate into.	`'fr'`
`**kwargs`	`Any`	Additional keyword arguments.	`{}`

Returns:

Name	Type	Description
`DatasetDict`	`Optional[DatasetDict]`	The loaded dataset.

`prepare_train_features(examples)` ¶

Tokenize the examples and prepare the features for training.

Parameters:

Name	Type	Description	Default
`examples`	`dict`	A dictionary of examples.	required

Returns:

Name	Type	Description
`dict`		The processed features.
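
The source does not show the feature layout, so the following is only a toy sketch of the general idea: tokenize the source text into `input_ids`, tokenize the target text into `labels`, and truncate both to `max_length`. The whitespace tokenizer `toy_tokenize`, the function `prepare_features`, and the assumed input shape (a `translation` list of language-keyed dictionaries) are all stand-ins; a real run would use the model's Hugging Face tokenizer instead.

```python
def toy_tokenize(text, vocab, max_length):
    """Stand-in tokenizer: split on whitespace, assign ids on first sight."""
    ids = [vocab.setdefault(token, len(vocab)) for token in text.split()]
    return ids[:max_length]  # truncate to the configured maximum length


def prepare_features(examples, origin="en", target="fr", max_length=512):
    """Turn translation pairs into input ids and label ids (illustrative only)."""
    vocab = {}
    features = {"input_ids": [], "labels": []}
    for pair in examples["translation"]:
        features["input_ids"].append(toy_tokenize(pair[origin], vocab, max_length))
        features["labels"].append(toy_tokenize(pair[target], vocab, max_length))
    return features


features = prepare_features(
    {"translation": [{"en": "good morning to you", "fr": "bonjour à vous"}]},
    max_length=2,
)
```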
