Named Entity Recognition Fine Tuner¶
Bases: OpenAIFineTuner
A bolt for fine-tuning OpenAI models on named entity recognition (NER) tasks.
This bolt extends the OpenAIFineTuner to handle the data loading and preparation specifics of NER.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | The batch input data. | required |
| output | BatchOutput | The output data. | required |
| state | State | The state manager. | required |
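Because the constructor takes these three objects directly, the bolt can also be wired up from Python rather than the CLI. Below is a minimal sketch; the import paths, the `NamedEntityRecognitionFineTuner` class name, and the `InMemoryState` helper are assumptions based on typical geniusrise usage, so verify them against your installed version.

```python
# Minimal sketch, not from the docs: constructing the bolt in Python.
# Import paths, InMemoryState, and the exact class name are assumptions.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_openai import NamedEntityRecognitionFineTuner

batch_input = BatchInput("./input", "geniusrise-test", "train")    # local dir, S3 bucket, S3 prefix
batch_output = BatchOutput("./output", "geniusrise-test", "model")
state = InMemoryState()

bolt = NamedEntityRecognitionFineTuner(input=batch_input, output=batch_output, state=state)
```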
CLI Usage:

```bash
genius NamedEntityRecognitionFineTuner rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder train \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder model \
    fine_tune \
        --args model_name=my_model tokenizer_name=my_tokenizer num_train_epochs=3 per_device_train_batch_size=8
```
YAML Configuration:

```yaml
version: "1"
bolts:
  my_fine_tuner:
    name: "NamedEntityRecognitionFineTuner"
    method: "fine_tune"
    args:
      model_name: "my_model"
      tokenizer_name: "my_tokenizer"
      num_train_epochs: 3
      per_device_train_batch_size: 8
      data_max_length: 512
    input:
      type: "batch"
      args:
        bucket: "my_bucket"
        folder: "my_dataset"
    output:
      type: "batch"
      args:
        bucket: "my_bucket"
        folder: "my_model"
    deploy:
      type: k8s
      args:
        kind: deployment
        name: my_fine_tuner
        context_name: arn:aws:eks:us-east-1:genius-dev:cluster/geniusrise-dev
        namespace: geniusrise
        image: geniusrise/geniusrise
        kube_config_path: ~/.kube/config
```
Supported Data Formats
- JSONL
- CSV
- Parquet
- JSON
- XML
- YAML
- TSV
- Excel (.xls, .xlsx)
- SQLite (.db)
- Feather
load_dataset(dataset_path, **kwargs)¶
Load a named entity recognition dataset from a directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| DatasetDict | Union[Dataset, DatasetDict, None] | The loaded dataset. |
Raises:

| Type | Description |
|---|---|
| Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:¶
Hugging Face Dataset¶
Dataset files saved by the Hugging Face datasets library.
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'tokens' and 'ner_tags' columns.
Parquet¶
Should contain 'tokens' and 'ner_tags' columns.
JSON¶
An array of dictionaries with 'tokens' and 'ner_tags' keys.
XML¶
Each 'record' element should contain 'tokens' and 'ner_tags' child elements.
YAML¶
Each document should be a dictionary with 'tokens' and 'ner_tags' keys.
TSV¶
Should contain 'tokens' and 'ner_tags' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'tokens' and 'ner_tags' columns.
SQLite (.db)¶
Should contain a table with 'tokens' and 'ner_tags' columns.
Feather¶
Should contain 'tokens' and 'ner_tags' columns.
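To make the expected structure concrete, here is a hypothetical JSONL file and a hedged sketch of loading it with the bolt constructed above. The directory path, tag ids, and label scheme are illustrative only, not part of the documented API.

```python
# Hypothetical data — tag ids and the label scheme are illustrative.
# Each JSONL line pairs a 'tokens' list with a parallel 'ner_tags' list:
#
#   {"tokens": ["John", "lives", "in", "London"], "ner_tags": [1, 0, 0, 2]}
#   {"tokens": ["Acme", "hired", "Mary"], "ner_tags": [3, 0, 1]}
#
# Given a directory of such files, load_dataset returns a Hugging Face dataset:
dataset = bolt.load_dataset("./input/train")
```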
prepare_fine_tuning_data(data, data_type)¶
Prepare the given data for fine-tuning.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | Union[Dataset, DatasetDict, Optional[Dataset]] | The dataset to prepare. | required |
| data_type | str | Either 'train' or 'eval' to specify the type of data. | required |
Raises:

| Type | Description |
|---|---|
| ValueError | If data_type is not 'train' or 'eval'. |
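Continuing the hypothetical bolt and dataset from the sketches above, preparing a split looks like the following; note that any `data_type` other than 'train' or 'eval' raises `ValueError`.

```python
# Hedged sketch: prepare the dataset loaded above for fine-tuning.
bolt.prepare_fine_tuning_data(dataset, "train")    # training split

# Passing anything other than 'train' or 'eval' raises ValueError:
# bolt.prepare_fine_tuning_data(dataset, "test")   # -> ValueError
```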