Named Entity Recognition

Bases: TextBulk

NamedEntityRecognitionBulk is a class designed for bulk processing of Named Entity Recognition (NER) tasks. It leverages state-of-the-art NER models from Hugging Face's transformers library to identify and classify entities such as person names, locations, organizations, and other types of entities from a large corpus of text.

This class provides functionalities to load large datasets, configure NER models, and perform entity recognition in bulk, making it suitable for processing large volumes of text data efficiently.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| model | Any | The NER model loaded for entity recognition tasks. |
| tokenizer | Any | The tokenizer used for text pre-processing in alignment with the model. |

Example CLI Usage:

```bash
genius NamedEntityRecognitionBulk rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id dslim/bert-large-NER-lol \
    recognize_entities \
        --args \
            model_name="dslim/bert-large-NER" \
            model_class="AutoModelForTokenClassification" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False
```

__init__(input, output, state, **kwargs)

Initializes the NamedEntityRecognitionBulk class with specified input, output, and state configurations. Sets up the NER model and tokenizer for bulk entity recognition tasks.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| input | BatchInput | The input data configuration. | required |
| output | BatchOutput | The output data configuration. | required |
| state | State | The state management for the API. | required |
| `**kwargs` | Any | Additional keyword arguments for extended functionality. | {} |

load_dataset(dataset_path, **kwargs)

Loads a dataset from the specified directory path. The method supports various data formats and structures, ensuring that the dataset is properly formatted for NER tasks.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_path | str | The path to the dataset directory. | required |
| `**kwargs` | Any | Additional keyword arguments to handle specific dataset loading scenarios. | {} |

Returns:

| Type | Description |
|------|-------------|
| Optional[Dataset] | The loaded dataset, or None if an error occurs during loading. |

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing an example.

```json
{"tokens": ["token1", "token2", ...]}
```

CSV

Should contain a 'tokens' column.

```csv
tokens
"['token1', 'token2', ...]"
```

Parquet

Should contain a 'tokens' column.

JSON

An array of dictionaries with 'tokens' keys.

```json
[{"tokens": ["token1", "token2", ...]}]
```

XML

Each 'record' element should contain a 'tokens' child element.

```xml
<record>
    <tokens>token1 token2 ...</tokens>
</record>
```

YAML

Each document should be a dictionary with a 'tokens' key.

```yaml
- tokens: ["token1", "token2", ...]
```

TSV

Should contain a 'tokens' column, with fields separated by tabs.

Excel (.xls, .xlsx)

Should contain a 'tokens' column.

SQLite (.db)

Should contain a table with a 'tokens' column.

Feather

Should contain a 'tokens' column.
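As an illustration, a JSONL input file with the expected 'tokens' structure can be produced with the standard library alone. The folder and file names below are hypothetical; only the one-JSON-object-per-line layout with a 'tokens' key is prescribed by the formats above.

```python
import json
from pathlib import Path

# Hypothetical input folder; the bulk class reads datasets from a directory.
input_dir = Path("./input")
input_dir.mkdir(parents=True, exist_ok=True)

examples = [
    {"tokens": ["Barack", "Obama", "visited", "Paris"]},
    {"tokens": ["Acme", "Corp", "is", "based", "in", "Berlin"]},
]

# JSONL: one JSON object per line, each with a 'tokens' key.
with open(input_dir / "dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Each line round-trips back to a dict with a 'tokens' list.
with open(input_dir / "dataset.jsonl") as f:
    rows = [json.loads(line) for line in f]
```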

recognize_entities(model_name, max_length=512, model_class='AutoModelForTokenClassification', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)

Performs bulk named entity recognition on the loaded dataset. The method processes the text in batches, applying the NER model to recognize entities.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_name | str | The name or path of the NER model. | required |
| max_length | int | The maximum sequence length for the tokenizer. | 512 |
| model_class | str | The class of the model, defaults to "AutoModelForTokenClassification". | 'AutoModelForTokenClassification' |
| tokenizer_class | str | The class of the tokenizer, defaults to "AutoTokenizer". | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference, defaults to False. | False |
| precision | str | Model computation precision, defaults to "float16". | 'float16' |
| quantization | int | Level of quantization for model size and speed optimization, defaults to 0. | 0 |
| device_map | str \| Dict \| None | Specific device configuration for computation, defaults to "auto". | 'auto' |
| max_memory | Dict | Maximum memory configuration for the devices. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model, defaults to False. | False |
| compile | bool | Whether to compile the model before inference, defaults to False. | False |
| awq_enabled | bool | Whether to enable AWQ optimization, defaults to False. | False |
| flash_attention | bool | Whether to use flash attention optimization, defaults to False. | False |
| batch_size | int | Number of documents to process simultaneously, defaults to 32. | 32 |
| notification_email | Optional[str] | Email address to notify upon completion, defaults to None. | None |
| `**kwargs` | Any | Arbitrary keyword arguments for additional configuration. | {} |

Returns:

| Name | Type | Description |
|------|------|-------------|
| None | None | The method processes the dataset and saves the predictions without returning any value. |
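The batch-wise processing described above (batch_size documents per step) can be sketched in plain Python. This is a simplified illustration, not the class's actual implementation: the `recognize_batch` function below is a hypothetical stand-in for the real NER model, tagging capitalized tokens just to make the batching flow concrete.

```python
from typing import Any, Dict, Iterator, List

def batched(examples: List[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

def recognize_batch(batch: List[Dict[str, Any]]) -> List[List[Dict[str, str]]]:
    # Hypothetical stand-in for model inference: tag capitalized tokens as entities.
    return [
        [{"token": t, "entity": "B-MISC" if t[:1].isupper() else "O"} for t in ex["tokens"]]
        for ex in batch
    ]

dataset = [
    {"tokens": ["Alice", "met", "Bob"]},
    {"tokens": ["in", "Paris"]},
    {"tokens": ["today"]},
]

# Accumulate per-token predictions batch by batch (the method uses batch_size=32 by default).
predictions: List[List[Dict[str, str]]] = []
for batch in batched(dataset, batch_size=2):
    predictions.extend(recognize_batch(batch))
```

In the real class, each batch would be tokenized, run through the loaded model, and the predictions written to the configured output folder instead of kept in memory.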