Named Entity Recognition

Bases: TextFineTuner

A bolt for fine-tuning Hugging Face models on named entity recognition tasks.

Parameters:

    input (BatchInput): The batch input data. Required.
    output (OutputConfig): The output configuration. Required.
    state (State): The state manager. Required.

CLI Usage:

    genius NamedEntityRecognitionFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        --id dslim/bert-large-NER-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8

data_collator(examples)

Customize the data collator.

Parameters:

    examples (List[Dict[str, torch.Tensor]]): The examples to collate. Required.

Returns:

    Dict[str, torch.Tensor]: The collated data.
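For token classification, a collator's job is to pad the variable-length examples in a batch to a common length. Below is a minimal sketch of that padding logic in plain Python; the pad ID and the `-100` ignore label are conventions borrowed from Hugging Face, and the actual bolt delegates to the library's own collators rather than this hand-rolled version.

```python
# Sketch of token-classification collation: pad every field in a batch of
# examples to the longest sequence. PAD_TOKEN_ID is an assumption; Hugging
# Face uses the tokenizer's pad_token_id, and -100 marks labels the loss
# function should ignore.
from typing import Dict, List

PAD_TOKEN_ID = 0     # assumed pad id for illustration
IGNORE_LABEL = -100  # label id ignored by the loss

def collate(examples: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [PAD_TOKEN_ID] * pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad)
        batch["labels"].append(ex["labels"] + [IGNORE_LABEL] * pad)
    return batch

batch = collate([
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 102], "labels": [0, 0]},
])
```

The attention mask lets the model distinguish real tokens from padding, and masking padded labels with `-100` keeps them out of the loss.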

load_dataset(dataset_path, label_list=[], **kwargs)

Load a named entity recognition dataset from a directory.

Parameters:

    dataset_path (str): The path to the dataset directory. Required.
    label_list (List[str]): The list of labels for named entity recognition. Defaults to [].

Returns:

    DatasetDict (Union[Dataset, DatasetDict, None]): The loaded dataset.

Raises:

    Exception: If there was an error loading the dataset.

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing an example.

{"tokens": ["token1", "token2", ...], "ner_tags": [0, 1, ...]}

CSV

Should contain 'tokens' and 'ner_tags' columns.

tokens,ner_tags
"['token1', 'token2', ...]", "[0, 1, ...]"

Parquet

Should contain 'tokens' and 'ner_tags' columns.

JSON

An array of dictionaries with 'tokens' and 'ner_tags' keys.

[{"tokens": ["token1", "token2", ...], "ner_tags": [0, 1, ...]}]

XML

Each 'record' element should contain 'tokens' and 'ner_tags' child elements.

<record>
    <tokens>token1 token2 ...</tokens>
    <ner_tags>0 1 ...</ner_tags>
</record>

YAML

Each document should be a dictionary with 'tokens' and 'ner_tags' keys.

- tokens: ["token1", "token2", ...]
  ner_tags: [0, 1, ...]

TSV

Should contain 'tokens' and 'ner_tags' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'tokens' and 'ner_tags' columns.

SQLite (.db)

Should contain a table with 'tokens' and 'ner_tags' columns.

Feather

Should contain 'tokens' and 'ner_tags' columns.
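As a concrete example, the JSONL layout described above can be produced with the standard library alone. The file name, directory, and label scheme below are illustrative, not mandated by the loader:

```python
# Write a few NER examples in the JSONL layout described above:
# one JSON object per line with "tokens" and "ner_tags" keys.
# The file name and label ids (e.g. 1 = person, 2 = location) are
# illustrative assumptions.
import json
import os
import tempfile

examples = [
    {"tokens": ["John", "lives", "in", "Berlin"], "ner_tags": [1, 0, 0, 2]},
    {"tokens": ["Hello", "world"], "ner_tags": [0, 0]},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back line by line, as a JSONL loader would.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
```

A directory containing files like this is the kind of `dataset_path` the loader expects for the JSONL format.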

prepare_train_features(examples)

Tokenize the examples and prepare the features for training.

Parameters:

    examples (Dict[str, Union[List[str], List[int]]]): A dictionary of examples. Required.

Returns:

    Dict[str, Union[List[str], List[int]]]: The processed features.
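Preparing features for token classification typically means aligning the word-level `ner_tags` with subword tokens: the first subword of each word keeps the word's label, and continuation subwords are masked with `-100`. The sketch below uses a hypothetical fixed-width splitter in place of a real tokenizer; the actual method would rely on a Hugging Face tokenizer and its `word_ids()` mapping.

```python
# Toy word-to-subword label alignment for token classification.
# toy_subwords is a stand-in splitter (chunks of up to 4 characters);
# real code would use a Hugging Face tokenizer's word_ids() instead.
from typing import Dict, List, Union

IGNORE_LABEL = -100  # label id ignored by the loss

def toy_subwords(word: str) -> List[str]:
    # Hypothetical splitter used only for illustration.
    return [word[i:i + 4] for i in range(0, len(word), 4)]

def prepare_features(example: Dict[str, Union[List[str], List[int]]]) -> Dict[str, List]:
    tokens: List[str] = []
    labels: List[int] = []
    for word, tag in zip(example["tokens"], example["ner_tags"]):
        pieces = toy_subwords(word)
        tokens.extend(pieces)
        # Label only the first subword; mask continuation pieces.
        labels.extend([tag] + [IGNORE_LABEL] * (len(pieces) - 1))
    return {"tokens": tokens, "labels": labels}

feats = prepare_features({"tokens": ["Johannesburg", "is", "big"], "ner_tags": [3, 0, 0]})
```

Masking continuation subwords keeps each word from being counted multiple times in the loss, which is the standard alignment strategy for NER fine-tuning.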