# Named Entity Recognition
Bases: TextBulk
NamedEntityRecognitionBulk is a class designed for bulk processing of Named Entity Recognition (NER) tasks. It leverages state-of-the-art NER models from Hugging Face's transformers library to identify and classify entities such as person names, locations, and organizations in a large corpus of text.
This class provides functionalities to load large datasets, configure NER models, and perform entity recognition in bulk, making it suitable for processing large volumes of text data efficiently.
Attributes:

| Name | Type | Description |
|---|---|---|
| model | Any | The NER model loaded for entity recognition tasks. |
| tokenizer | Any | The tokenizer used for text pre-processing in alignment with the model. |
Example CLI Usage:

```bash
genius NamedEntityRecognitionBulk rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id dslim/bert-large-NER-lol \
    recognize_entities \
        --args \
            model_name="dslim/bert-large-NER" \
            model_class="AutoModelForTokenClassification" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False
```
## `__init__(input, output, state, **kwargs)`

Initializes the NamedEntityRecognitionBulk class with the specified input, output, and state configurations. Sets up the NER model and tokenizer for bulk entity recognition tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | The input data configuration. | required |
| output | BatchOutput | The output data configuration. | required |
| state | State | The state management for the API. | required |
| `**kwargs` | Any | Additional keyword arguments for extended functionality. | {} |
## `load_dataset(dataset_path, **kwargs)`

Loads a dataset from the specified directory path. The method supports various data formats and structures, ensuring that the dataset is properly formatted for NER tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| `**kwargs` | Any | Additional keyword arguments to handle specific dataset loading scenarios. | {} |
Returns:

| Type | Description |
|---|---|
| Optional[Dataset] | The loaded dataset, or None if an error occurs during loading. |
### Supported Data Formats and Structures

- **Hugging Face Dataset**: Dataset files saved by the Hugging Face datasets library.
- **JSONL**: Each line is a JSON object representing an example.
- **CSV**: Should contain a 'tokens' column.
- **Parquet**: Should contain a 'tokens' column.
- **JSON**: An array of dictionaries with 'tokens' keys.
- **XML**: Each 'record' element should contain 'tokens' child elements.
- **YAML**: Each document should be a dictionary with a 'tokens' key.
- **TSV**: Tab-separated values with a 'tokens' column.
- **Excel (.xls, .xlsx)**: Should contain a 'tokens' column.
- **SQLite (.db)**: Should contain a table with a 'tokens' column.
- **Feather**: Should contain a 'tokens' column.
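As a minimal illustration of the JSONL layout described above, the snippet below writes and reads back a two-example file whose records carry the expected 'tokens' field. The file name and example sentences are arbitrary; only the per-line JSON structure reflects the format notes.

```python
import json
import os
import tempfile

# Two toy examples, each with the 'tokens' field the loader expects
# (per the format notes above); contents are illustrative only.
examples = [
    {"tokens": ["Barack", "Obama", "visited", "Paris", "."]},
    {"tokens": ["Apple", "opened", "an", "office", "in", "Berlin", "."]},
]

dataset_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(dataset_dir, "data.jsonl")

# JSONL: one JSON object per line.
with open(jsonl_path, "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Reading the file back recovers the same records.
with open(jsonl_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["tokens"])  # ['Barack', 'Obama', 'visited', 'Paris', '.']
```

The same 'tokens' column convention applies to the tabular formats (CSV, Parquet, TSV, Excel, SQLite, Feather); only the container changes.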
## `recognize_entities(model_name, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)`

Performs bulk named entity recognition on the loaded dataset. The method processes the text in batches, applying the NER model to recognize entities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | The name or path of the NER model. | required |
| max_length | int | The maximum sequence length for the tokenizer. | 512 |
| model_class | str | The class of the model, defaults to "AutoModelForSeq2SeqLM". | 'AutoModelForSeq2SeqLM' |
| tokenizer_class | str | The class of the tokenizer, defaults to "AutoTokenizer". | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference, defaults to False. | False |
| precision | str | Model computation precision, defaults to "float16". | 'float16' |
| quantization | int | Level of quantization for model size and speed optimization, defaults to 0. | 0 |
| device_map | str \| Dict \| None | Specific device configuration for computation, defaults to "auto". | 'auto' |
| max_memory | Dict | Maximum memory configuration for the devices. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
| compile | bool | Whether to compile the model before inference. Defaults to False. | False |
| awq_enabled | bool | Whether to enable AWQ optimization, defaults to False. | False |
| flash_attention | bool | Whether to use flash attention optimization, defaults to False. | False |
| batch_size | int | Number of documents to process simultaneously, defaults to 32. | 32 |
| `**kwargs` | Any | Arbitrary keyword arguments for additional configuration. | {} |
Returns:

| Name | Type | Description |
|---|---|---|
| None | None | The method processes the dataset and saves the predictions without returning any value. |
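The batch loop described above (processing `batch_size` documents at a time) can be sketched in isolation. `chunk_batches` and `fake_recognize` below are hypothetical stand-ins for the internal loop and the model call, not part of the library's API; a real run would apply the loaded Hugging Face model to each batch instead.

```python
from typing import Any, Iterator, List

def chunk_batches(documents: List[Any], batch_size: int = 32) -> Iterator[List[Any]]:
    """Yield successive batches of at most batch_size documents."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

def fake_recognize(batch: List[List[str]]) -> List[List[str]]:
    """Stand-in for the NER model: tag capitalized tokens as entities."""
    return [
        ["B-ENT" if tok[:1].isupper() else "O" for tok in tokens]
        for tokens in batch
    ]

docs = [["Alice", "met", "Bob"], ["went", "to", "Paris"], ["hello"]]

predictions: List[List[str]] = []
for batch in chunk_batches(docs, batch_size=2):
    predictions.extend(fake_recognize(batch))

print(predictions[0])  # ['B-ENT', 'O', 'B-ENT']
```

Batching keeps peak memory bounded regardless of corpus size, which is why the method streams predictions to the output folder rather than returning them.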