Translation¶
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on translation tasks.
Args:
    input (BatchInput): The batch input data.
    output (OutputConfig): The output configuration.
    state (State): The state manager.
    **kwargs: Arbitrary keyword arguments for extended functionality.
CLI Usage:

```bash
genius TranslationFineTuner rise \
    batch \
    --input_s3_bucket geniusrise-test \
    --input_s3_folder input/trans \
    batch \
    --output_s3_bucket geniusrise-test \
    --output_s3_folder output/trans \
    postgres \
    --postgres_host 127.0.0.1 \
    --postgres_port 5432 \
    --postgres_user postgres \
    --postgres_password postgres \
    --postgres_database geniusrise \
    --postgres_table state \
    --id facebook/mbart-large-50-many-to-many-mmt-lol \
    fine_tune \
    --args \
        model_name=my_model \
        tokenizer_name=my_tokenizer \
        num_train_epochs=3 \
        per_device_train_batch_size=8 \
        data_max_length=512
```
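The same bolt can also be driven from Python. The sketch below mirrors the CLI invocation above; the import paths and constructor signatures are assumptions based on the usual geniusrise bolt pattern, so treat it as illustrative rather than authoritative:

```python
# Hypothetical programmatic equivalent of the CLI call above.
# Import paths and constructor arguments are assumptions based on the
# standard geniusrise bolt pattern; verify against the library source.
from geniusrise import BatchInput, BatchOutput, InMemoryState

bolt = TranslationFineTuner(
    input=BatchInput("./input", "geniusrise-test", "input/trans"),
    output=BatchOutput("./output", "geniusrise-test", "output/trans"),
    state=InMemoryState(),
)

bolt.fine_tune(
    model_name="my_model",
    tokenizer_name="my_tokenizer",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    data_max_length=512,
)
```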
data_collator(examples)¶

Customize the data collator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| examples | | The examples to collate. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | The collated data. |
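For reference, a translation collator typically pads inputs and labels to a common batch length, masking label padding with -100 so it is ignored by the loss. This is a generic sketch of that behavior, not the bolt's actual implementation:

```python
import torch

def collate(examples, pad_token_id=0, label_pad_id=-100):
    # Pad input_ids/attention_mask and labels to the longest sequence
    # in the batch; -100 marks label positions the loss should ignore.
    max_in = max(len(e["input_ids"]) for e in examples)
    max_lab = max(len(e["labels"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in, n_lab = len(e["input_ids"]), len(e["labels"])
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(e["labels"] + [label_pad_id] * (max_lab - n_lab))
    return {k: torch.tensor(v) for k, v in batch.items()}
```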
load_dataset(dataset_path, max_length=512, origin='en', target='fr', **kwargs)¶

Load a dataset from a directory.
Supported Data Formats and Structures for Translation Tasks¶

- JSONL: Each line is a JSON object representing an example (see the sample below).
- CSV: Should contain 'en' and 'fr' columns.
- Parquet: Should contain 'en' and 'fr' columns.
- JSON: An array of dictionaries with 'en' and 'fr' keys.
- XML: Each 'record' element should contain 'en' and 'fr' child elements.
- YAML: Each document should be a dictionary with 'en' and 'fr' keys.
- TSV: Should contain 'en' and 'fr' columns separated by tabs.
- Excel (.xls, .xlsx): Should contain 'en' and 'fr' columns.
- SQLite (.db): Should contain a table with 'en' and 'fr' columns.
- Feather: Should contain 'en' and 'fr' columns.
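For example, a JSONL file using the default 'en' and 'fr' keys would look like this (the sentences themselves are illustrative):

```jsonl
{"en": "Hello, how are you?", "fr": "Bonjour, comment allez-vous ?"}
{"en": "The weather is nice today.", "fr": "Il fait beau aujourd'hui."}
```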
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset_path | str | The path to the directory containing the dataset files. | required |
| max_length | int | The maximum length for tokenization. Defaults to 512. | 512 |
| origin | str | The origin language. Defaults to 'en'. | 'en' |
| target | str | The target language. Defaults to 'fr'. | 'fr' |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| DatasetDict | Optional[DatasetDict] | The loaded dataset. |
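A usage sketch, assuming a bolt instance like the one above and a local directory of files in any of the supported formats (the English-to-German column names here are hypothetical):

```python
# Illustrative call only; the method is invoked with its documented
# signature, but the directory layout and language pair are made up.
dataset = bolt.load_dataset(
    dataset_path="./input/trans",
    max_length=256,
    origin="en",
    target="de",
)

if dataset is not None:  # the return type is Optional[DatasetDict]
    print(dataset)
```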
prepare_train_features(examples)¶
Tokenize the examples and prepare the features for training.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| examples | dict | A dictionary of examples. | required |

Returns:

| Type | Description |
| --- | --- |
| dict | The processed features. |
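The heart of this step is seq2seq tokenization: the origin text becomes the model inputs and the tokenized target text becomes the labels. Below is a generic sketch with a Hugging Face tokenizer, not the bolt's exact code (note that mBART-style tokenizers additionally require src_lang/tgt_lang to be set):

```python
from transformers import AutoTokenizer

# Any seq2seq tokenizer works for this illustration.
tokenizer = AutoTokenizer.from_pretrained("t5-small")

def prepare_features(examples, origin="en", target="fr", max_length=512):
    # Tokenize the source sentences as model inputs, and the target
    # sentences as labels (text_target applies target-side tokenization).
    model_inputs = tokenizer(examples[origin], max_length=max_length, truncation=True)
    labels = tokenizer(text_target=examples[target], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```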