
Named Entity Recognition Fine Tuner

Bases: OpenAIFineTuner

A bolt for fine-tuning OpenAI models on named entity recognition tasks.

This bolt extends the OpenAIFineTuner to handle the specifics of named entity recognition tasks.

Parameters:

| Name     | Type          | Description           | Default  |
|----------|---------------|-----------------------|----------|
| `input`  | `BatchInput`  | The batch input data. | required |
| `output` | `BatchOutput` | The output data.      | required |
| `state`  | `State`       | The state manager.    | required |

CLI Usage:

    genius NamedEntityRecognitionFineTuner rise \
        batch \
            --input_s3_bucket geniusrise-test \
            --input_s3_folder train \
        batch \
            --output_s3_bucket geniusrise-test \
            --output_s3_folder model \
        fine_tune \
            --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8

YAML Configuration:

    version: "1"
    bolts:
        my_fine_tuner:
            name: "NamedEntityRecognitionFineTuner"
            method: "fine_tune"
            args:
                model_name: "my_model"
                tokenizer_name: "my_tokenizer"
                num_train_epochs: 3
                per_device_train_batch_size: 8
                data_max_length: 512
            input:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_dataset"
            output:
                type: "batch"
                args:
                    bucket: "my_bucket"
                    folder: "my_model"
            deploy:
                type: k8s
                args:
                    kind: deployment
                    name: my_fine_tuner
                    context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
                    namespace: geniusrise
                    image: geniusrise/geniusrise
                    kube_config_path: ~/.kube/config

Supported Data Formats
  • JSONL
  • CSV
  • Parquet
  • JSON
  • XML
  • YAML
  • TSV
  • Excel (.xls, .xlsx)
  • SQLite (.db)
  • Feather
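
With this many supported formats, a loader typically dispatches on the file extension. A minimal sketch of such a dispatch (a hypothetical helper, not the bolt's actual code):

```python
import os

# Hypothetical extension-to-format dispatch; the real bolt resolves
# formats internally when reading the dataset directory.
def detect_format(path: str) -> str:
    ext = os.path.splitext(path)[1].lower()
    formats = {
        ".jsonl": "jsonl", ".csv": "csv", ".parquet": "parquet",
        ".json": "json", ".xml": "xml", ".yaml": "yaml", ".tsv": "tsv",
        ".xls": "excel", ".xlsx": "excel", ".db": "sqlite",
        ".feather": "feather",
    }
    if ext not in formats:
        raise ValueError(f"Unsupported data format: {ext}")
    return formats[ext]
```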

load_dataset(dataset_path, **kwargs)

Load a named entity recognition dataset from a directory.

Parameters:

| Name           | Type  | Description                        | Default  |
|----------------|-------|------------------------------------|----------|
| `dataset_path` | `str` | The path to the dataset directory. | required |

Returns:

| Name          | Type                                | Description         |
|---------------|-------------------------------------|---------------------|
| `DatasetDict` | `Union[Dataset, DatasetDict, None]` | The loaded dataset. |

Raises:

| Type        | Description                                |
|-------------|--------------------------------------------|
| `Exception` | If there was an error loading the dataset. |

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing an example.

{"tokens": ["token1", "token2", ...], "ner_tags": [0, 1, ...]}
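
For instance, a small JSONL dataset in this layout can be written and read back with the standard library alone (the file name, token values, and tag ids are illustrative):

```python
import json

# Two illustrative NER examples in the JSONL layout shown above.
examples = [
    {"tokens": ["John", "lives", "in", "Paris"], "ner_tags": [1, 0, 0, 2]},
    {"tokens": ["Acme", "hired", "Mary"], "ner_tags": [3, 0, 1]},
]

# One JSON object per line.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back line by line.
with open("train.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```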

CSV

Should contain 'tokens' and 'ner_tags' columns.

tokens,ner_tags
"['token1', 'token2', ...]", "[0, 1, ...]"
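
Because the list-valued cells are stored as quoted strings, a CSV reader has to parse them back into Python lists; `ast.literal_eval` is one safe way to do that (the sample row is illustrative):

```python
import ast
import csv
import io

# A CSV payload in the layout shown above; the outer double quotes are
# CSV quoting, the cell contents are Python-list literals.
raw = "tokens,ner_tags\n\"['John', 'lives', 'in', 'Paris']\",\"[1, 0, 0, 2]\"\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    rows.append({
        # literal_eval parses the stringified list without executing code.
        "tokens": ast.literal_eval(row["tokens"]),
        "ner_tags": ast.literal_eval(row["ner_tags"]),
    })
```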

Parquet

Should contain 'tokens' and 'ner_tags' columns.

JSON

An array of dictionaries with 'tokens' and 'ner_tags' keys.

[{"tokens": ["token1", "token2", ...], "ner_tags": [0, 1, ...]}]

XML

Each 'record' element should contain 'tokens' and 'ner_tags' child elements.

<record>
    <tokens>token1 token2 ...</tokens>
    <ner_tags>0 1 ...</ner_tags>
</record>
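
Since tokens and tags are whitespace-separated inside the child elements, `split()` recovers the parallel lists. A minimal parsing sketch with the standard library (the sample document and root element name are illustrative):

```python
import xml.etree.ElementTree as ET

# A sample document with one record in the layout shown above.
xml_doc = """
<records>
  <record>
    <tokens>John lives in Paris</tokens>
    <ner_tags>1 0 0 2</ner_tags>
  </record>
</records>
"""

examples = []
for record in ET.fromstring(xml_doc).findall("record"):
    # split() on whitespace recovers the parallel token/tag lists.
    tokens = record.find("tokens").text.split()
    tags = [int(t) for t in record.find("ner_tags").text.split()]
    examples.append({"tokens": tokens, "ner_tags": tags})
```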

YAML

Each document should be a dictionary with 'tokens' and 'ner_tags' keys.

- tokens: ["token1", "token2", ...]
  ner_tags: [0, 1, ...]

TSV

Should contain 'tokens' and 'ner_tags' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'tokens' and 'ner_tags' columns.

SQLite (.db)

Should contain a table with 'tokens' and 'ner_tags' columns.
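
SQLite has no list type, so the list-valued columns must be stored as text; JSON-encoding each cell is one convention (the table and column names here are illustrative):

```python
import json
import sqlite3

# In-memory database standing in for a .db dataset file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (tokens TEXT, ner_tags TEXT)")
conn.execute(
    "INSERT INTO dataset VALUES (?, ?)",
    (json.dumps(["John", "lives", "in", "Paris"]), json.dumps([1, 0, 0, 2])),
)

# Decode the JSON-encoded cells back into lists on read.
rows = [
    {"tokens": json.loads(t), "ner_tags": json.loads(g)}
    for t, g in conn.execute("SELECT tokens, ner_tags FROM dataset")
]
conn.close()
```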

Feather

Should contain 'tokens' and 'ner_tags' columns.

prepare_fine_tuning_data(data, data_type)

Prepare the given data for fine-tuning.

Parameters:

| Name        | Type                                             | Description                                           | Default  |
|-------------|--------------------------------------------------|-------------------------------------------------------|----------|
| `data`      | `Union[Dataset, DatasetDict, Optional[Dataset]]` | The dataset to prepare.                               | required |
| `data_type` | `str`                                            | Either 'train' or 'eval' to specify the type of data. | required |

Raises:

| Type         | Description                              |
|--------------|------------------------------------------|
| `ValueError` | If `data_type` is not 'train' or 'eval'. |
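
The prepared examples are ultimately submitted to OpenAI's fine-tuning endpoint. As an illustration only (the bolt's actual formatting is not documented here), one common convention for the legacy fine-tuning API is one prompt/completion pair per example, with a separator suffix on the prompt and a leading space on the completion:

```python
import json

# Hypothetical conversion of one NER example into a prompt/completion
# record; the bolt's real preprocessing may differ.
def to_fine_tune_record(tokens, ner_tags):
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must be the same length")
    return {
        # Separator suffix marks the end of the prompt.
        "prompt": " ".join(tokens) + "\n\n###\n\n",
        # Leading space is a common convention for completions.
        "completion": " " + " ".join(str(t) for t in ner_tags),
    }

record = to_fine_tune_record(["John", "lives", "in", "Paris"], [1, 0, 0, 2])
line = json.dumps(record)  # one JSONL line of training data
```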