Named Entity Recognition¶
Bases: TextFineTuner
A bolt for fine-tuning Hugging Face models on named entity recognition tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | The batch input data. | required |
| `output` | `OutputConfig` | The output data. | required |
| `state` | `State` | The state manager. | required |
CLI Usage:

```bash
genius NamedEntityRecognitionFineTuner rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id dslim/bert-large-NER-lol \
    fine_tune \
        --args \
            model_name=my_model \
            tokenizer_name=my_tokenizer \
            num_train_epochs=3 \
            per_device_train_batch_size=8
```
data_collator(examples)¶

Customize the data collator.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `List[Dict[str, torch.Tensor]]` | The examples to collate. | required |

Returns:

| Type | Description |
|---|---|
| `Dict[str, torch.Tensor]` | The collated data. |
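For token classification, collation typically means padding variable-length `input_ids` and padding `labels` with `-100` so padded positions are ignored by the loss. The snippet below is a minimal sketch of that behaviour using Hugging Face's `DataCollatorForTokenClassification`; it is not the class's actual implementation, and the checkpoint name and feature values are purely illustrative.

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

# Illustrative tokenizer; any fast tokenizer compatible with your model works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Two already-tokenized examples of different lengths (hypothetical ids/labels).
features = [
    {"input_ids": [101, 7592, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 1045, 2293, 2047, 2259, 102], "labels": [-100, 0, 0, 1, 2, -100]},
]

# Pads input_ids/attention_mask to a common length and pads labels with -100.
batch = collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
```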
load_dataset(dataset_path, label_list=[], **kwargs)¶

Load a named entity recognition dataset from a directory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | The path to the dataset directory. | required |
| `label_list` | `List[str]` | The list of labels for named entity recognition. Defaults to `[]`. | `[]` |

Returns:

| Name | Type | Description |
|---|---|---|
| `DatasetDict` | `Union[Dataset, DatasetDict, None]` | The loaded dataset. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If there was an error loading the dataset. |
Supported Data Formats and Structures:¶
Hugging Face Dataset¶
Dataset files saved by the Hugging Face datasets library.
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'tokens' and 'ner_tags' columns.
Parquet¶
Should contain 'tokens' and 'ner_tags' columns.
JSON¶
An array of dictionaries with 'tokens' and 'ner_tags' keys.
XML¶
Each 'record' element should contain 'tokens' and 'ner_tags' child elements.
YAML¶
Each document should be a dictionary with 'tokens' and 'ner_tags' keys.
TSV¶
Should contain 'tokens' and 'ner_tags' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'tokens' and 'ner_tags' columns.
SQLite (.db)¶
Should contain a table with 'tokens' and 'ner_tags' columns.
Feather¶
Should contain 'tokens' and 'ner_tags' columns.
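All of the formats above share the same schema: a `tokens` field with the pre-split words and a `ner_tags` field with one integer label per token. As a rough illustration of that schema, the sketch below writes a tiny JSONL file and reads it back with the `datasets` library; the file name, label scheme (O, B-LOC, I-LOC), and example rows are hypothetical, and a directory containing such files is what `dataset_path` is expected to point at.

```python
import json
from datasets import load_dataset

# Hypothetical rows following the documented 'tokens' / 'ner_tags' schema.
rows = [
    {"tokens": ["I", "love", "New", "York"], "ner_tags": [0, 0, 1, 2]},
    {"tokens": ["Berlin", "is", "quiet"], "ner_tags": [1, 0, 0]},
]

with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Sanity-check that the file parses as a token-classification dataset.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
print(dataset[0]["tokens"], dataset[0]["ner_tags"])
```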
prepare_train_features(examples)¶

Tokenize the examples and prepare the features for training.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `Dict[str, Union[List[str], List[int]]]` | A dictionary of examples. | required |

Returns:

| Type | Description |
|---|---|
| `Dict[str, Union[List[str], List[int]]]` | The processed features. |
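Preparing NER features generally means tokenizing the pre-split words and re-aligning the word-level `ner_tags` to the resulting sub-word tokens, with special tokens masked out via `-100`. The following is a minimal sketch of that standard alignment pattern, not the class's exact implementation; the checkpoint name and the `label_all_tokens` switch are assumptions for illustration.

```python
from transformers import AutoTokenizer

# Illustrative fast tokenizer (word_ids() requires a fast tokenizer).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def prepare_train_features_sketch(examples, label_all_tokens=False):
    """Tokenize pre-split words and align word-level NER labels to sub-word tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    all_labels = []
    for i, ner_tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous_word_id = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                 # special tokens: ignored by the loss
            elif word_id != previous_word_id:
                labels.append(ner_tags[word_id])    # first sub-token carries the word's label
            else:
                # Remaining sub-tokens: either repeat the label or mask them out.
                labels.append(ner_tags[word_id] if label_all_tokens else -100)
            previous_word_id = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

# Usage with a batch shaped like the 'tokens' / 'ner_tags' schema above.
batch = {"tokens": [["New", "York", "is", "big"]], "ner_tags": [[1, 2, 0, 0]]}
features = prepare_train_features_sketch(batch)
print(features["labels"][0])
```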