Named Entity Recognition
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on named entity recognition tasks.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| input | BatchInput | The batch input data. | required | 
| output | OutputConfig | The output data. | required | 
| state | State | The state manager. | required | 
CLI Usage:
    genius NamedEntityRecognitionFineTuner rise \
        batch \
            --input_folder ./input \
        batch \
            --output_folder ./output \
        none \
        --id dslim/bert-large-NER-lol \
        fine_tune \
            --args \
                model_name=my_model \
                tokenizer_name=my_tokenizer \
                num_train_epochs=3 \
                per_device_train_batch_size=8
data_collator(examples)
  Customizes how a list of examples is collated into a single batch.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| examples | List[Dict[str, torch.Tensor]] | The examples to collate. | required | 
Returns:
| Type | Description | 
|---|---|
| Dict[str, torch.Tensor] | The collated data. | 
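As a rough illustration of what collation for token classification usually involves, the sketch below pads variable-length examples to a common length using plain Python lists. It is not the bolt's implementation: the helper name `pad_examples`, the pad id `0`, and the label fill value `-100` (the index that PyTorch's cross-entropy loss ignores by default) are assumptions, and the actual method operates on `torch.Tensor` values rather than lists.

```python
from typing import Dict, List


def pad_examples(
    examples: List[Dict[str, List[int]]],
    pad_token_id: int = 0,     # assumed pad id, depends on the tokenizer
    label_pad_id: int = -100,  # assumed "ignore" label for padded positions
) -> Dict[str, List[List[int]]]:
    """Pad every example to the length of the longest one in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return batch


# Two hypothetical tokenized examples of different lengths
batch = pad_examples([
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
])
```

Padding the labels with an ignored index keeps the padded positions from contributing to the loss, which is the usual reason a token-classification collator differs from a plain padding collator.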
load_dataset(dataset_path, label_list=[], **kwargs)
  Load a named entity recognition dataset from a directory.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required | 
| label_list | List[str] | The list of labels for named entity recognition. | [] | 
Returns:
| Type | Description | 
|---|---|
| Union[Dataset, DatasetDict, None] | The loaded dataset. | 
Raises:
| Type | Description | 
|---|---|
| Exception | If there was an error loading the dataset. | 
Supported Data Formats and Structures:
Hugging Face Dataset
Dataset files saved by the Hugging Face datasets library.
JSONL
Each line is a JSON object representing an example.
CSV
Should contain 'tokens' and 'ner_tags' columns.
Parquet
Should contain 'tokens' and 'ner_tags' columns.
JSON
An array of dictionaries with 'tokens' and 'ner_tags' keys.
XML
Each 'record' element should contain 'tokens' and 'ner_tags' child elements.
YAML
Each document should be a dictionary with 'tokens' and 'ner_tags' keys.
TSV
Should contain 'tokens' and 'ner_tags' columns separated by tabs.
Excel (.xls, .xlsx)
Should contain 'tokens' and 'ner_tags' columns.
SQLite (.db)
Should contain a table with 'tokens' and 'ner_tags' columns.
Feather
Should contain 'tokens' and 'ner_tags' columns.
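To make the 'tokens' / 'ner_tags' layout concrete, the sketch below writes two examples in the JSON format (an array of dictionaries) and reads them back. The file name `train.json`, the sentences, and the `label_list` are made up for illustration, and it assumes that `ner_tags` holds integer indices into `label_list`, which this page does not state explicitly.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical label set; ner_tags are assumed to index into it
label_list = ["O", "B-PER", "I-PER", "B-LOC"]

# One dictionary per example, with parallel 'tokens' and 'ner_tags' lists
records = [
    {
        "tokens": ["Ada", "Lovelace", "lived", "in", "London"],
        "ner_tags": [1, 2, 0, 0, 3],
    },
    {"tokens": ["Paris", "is", "busy"], "ner_tags": [3, 0, 0]},
]

# Write the array of dictionaries into a temporary dataset directory
dataset_dir = Path(tempfile.mkdtemp())
data_file = dataset_dir / "train.json"  # hypothetical file name
data_file.write_text(json.dumps(records), encoding="utf-8")

# Read it back and check that every token has exactly one tag
loaded = json.loads(data_file.read_text(encoding="utf-8"))
for rec in loaded:
    assert len(rec["tokens"]) == len(rec["ner_tags"])

# Decode the integer tags into label strings for inspection
decoded = [[label_list[i] for i in rec["ner_tags"]] for rec in loaded]
```

Whatever the container format, the invariant worth checking before training is the same: each example's 'tokens' and 'ner_tags' must have equal length.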
prepare_train_features(examples)
  Tokenize the examples and prepare the features for training.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| examples | Dict[str, Union[List[str], List[int]]] | A dictionary of examples. | required | 
Returns:
| Type | Description | 
|---|---|
| Dict[str, Union[List[str], List[int]]] | The processed features. |
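Preparing features for NER generally means re-aligning word-level tags with subword tokens, since a tokenizer may split one word into several pieces. The sketch below shows one common convention, not necessarily the one this method uses: the first subword of each word keeps the word's label, while special tokens and continuation subwords receive an ignored index. The helper name `align_labels` and the value `-100` are assumptions; `word_ids` mimics the list returned by `BatchEncoding.word_ids()` on a Hugging Face fast tokenizer.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,  # assumed "ignore" label
) -> List[int]:
    """Map word-level labels onto subword tokens."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] / [SEP] have no source word
            aligned.append(ignore_index)
        elif wid != previous:
            # First subword of a word carries the word's label
            aligned.append(word_labels[wid])
        else:
            # Continuation subwords are masked out of the loss
            aligned.append(ignore_index)
        previous = wid
    return aligned


# Hypothetical case: word 0 is split into two subwords, words 1 and 2 are not
word_ids = [None, 0, 0, 1, 2, None]
aligned = align_labels(word_ids, word_labels=[1, 0, 3])
```

An alternative convention labels continuation subwords with the same (or the matching `I-`) tag instead of masking them; which one applies here would need to be checked against the bolt's source.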