Natural Language Inference¶
Bases: TextBulk
The NLIBulk class provides functionality for large-scale natural language inference (NLI) processing using Hugging Face transformers. It allows users to load datasets, configure models, and perform inference on batches of premise-hypothesis pairs.
Attributes:
Name | Type | Description |
---|---|---|
input | BatchInput | Configuration and data inputs for the batch process. |
output | BatchOutput | Configurations for output data handling. |
state | State | State management for the inference task. |
Example CLI Usage:
```bash
genius NLIBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/nli \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/nli \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7-lol \
    infer \
        --args \
            model_name="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7" \
            model_class="AutoModelForSequenceClassification" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False
```
__init__(input, output, state, **kwargs)¶
Initializes the NLIBulk class with the specified input, output, and state configurations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The input data. | required |
output | BatchOutput | The output data. | required |
state | State | The state data. | required |
**kwargs |  | Additional keyword arguments. | {} |
infer(model_name, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)¶
Performs NLI inference on a loaded dataset using the specified model. The method processes the data in batches and saves the results to the configured output path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name | str | Name or path of the NLI model. | required |
max_length | int | Maximum length of the sequences for tokenization purposes. Defaults to 512. | 512 |
model_class | str | Class name of the model (e.g., "AutoModelForSequenceClassification"). Defaults to "AutoModelForSeq2SeqLM". | 'AutoModelForSeq2SeqLM' |
tokenizer_class | str | Class name of the tokenizer (e.g., "AutoTokenizer"). Defaults to "AutoTokenizer". | 'AutoTokenizer' |
use_cuda | bool | Whether to use CUDA for model inference. Defaults to False. | False |
precision | str | Precision for model computation (e.g., "float16"). Defaults to "float16". | 'float16' |
quantization | int | Level of quantization for optimizing model size and speed. Defaults to 0. | 0 |
device_map | str \| Dict \| None | Specific device(s) to use for computation. Defaults to "auto". | 'auto' |
max_memory | Dict | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | {0: '24GB'} |
torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
compile | bool | Whether to compile the model before inference. Defaults to False. | False |
awq_enabled | bool | Whether to enable AWQ optimization. Defaults to False. | False |
flash_attention | bool | Whether to use flash attention optimization. Defaults to False. | False |
batch_size | int | Number of premise-hypothesis pairs to process simultaneously. Defaults to 32. | 32 |
**kwargs | Any | Arbitrary keyword arguments for model and generation configurations. | {} |
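The `batch_size` parameter controls how many premise-hypothesis pairs are tokenized and scored together in one forward pass. As a minimal illustration of that batching step only (a hedged sketch, not the actual Geniusrise implementation; `batch_pairs` is a hypothetical helper name):

```python
# Hypothetical sketch of the batching behind `batch_size`:
# split premise-hypothesis pairs into chunks scored together.
from typing import Iterator, List, Tuple


def batch_pairs(
    pairs: List[Tuple[str, str]], batch_size: int = 32
) -> Iterator[List[Tuple[str, str]]]:
    """Yield successive batches of (premise, hypothesis) pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start : start + batch_size]


pairs = [(f"premise {i}", f"hypothesis {i}") for i in range(70)]
batches = list(batch_pairs(pairs, batch_size=32))
# 70 pairs at batch_size=32 -> three batches of 32, 32, and 6 pairs
```

Each batch would then be passed to the tokenizer and model in one call, which is where larger batch sizes trade memory for throughput.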
load_dataset(dataset_path, max_length=512, **kwargs)¶
Load an NLI dataset from a directory or file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory or file. | required |
max_length | int | Maximum length of text sequences for tokenization purposes. Defaults to 512. | 512 |
**kwargs |  | Additional keyword arguments. | {} |
Returns:
Name | Type | Description |
---|---|---|
Dataset | Optional[Dataset] | The loaded dataset. |
Raises:
Type | Description |
---|---|
Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:¶
Hugging Face Dataset¶
Dataset files saved by the Hugging Face datasets library.
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'premise' and 'hypothesis' columns.
Parquet¶
Should contain 'premise' and 'hypothesis' columns.
JSON¶
An array of dictionaries with 'premise' and 'hypothesis' keys.
XML¶
Each 'record' element should contain 'premise' and 'hypothesis' child elements.
YAML¶
Each document should be a dictionary with 'premise' and 'hypothesis' keys.
TSV¶
Should contain 'premise' and 'hypothesis' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'premise' and 'hypothesis' columns.
SQLite (.db)¶
Should contain a table with 'premise' and 'hypothesis' columns.
Feather¶
Should contain 'premise' and 'hypothesis' columns.