
Natural Language Inference

Bases: TextBulk

The NLIBulk class provides functionality for large-scale natural language inference (NLI) processing using Hugging Face transformers. It allows users to load datasets, configure models, and perform inference on batches of premise-hypothesis pairs.

Attributes:

| Name   | Type        | Description                                          |
|--------|-------------|------------------------------------------------------|
| input  | BatchInput  | Configuration and data inputs for the batch process. |
| output | BatchOutput | Configuration for output data handling.              |
| state  | State       | State management for the inference task.             |

Example CLI Usage:

genius NLIBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/nli \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/nli \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7-lol \
    infer \
        --args \
            model_name="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7" \
            model_class="AutoModelForSequenceClassification" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False

__init__(input, output, state, **kwargs)

Initializes the NLIBulk class with the specified input, output, and state configurations.

Parameters:

| Name     | Type        | Description                   | Default  |
|----------|-------------|-------------------------------|----------|
| input    | BatchInput  | The input data.               | required |
| output   | BatchOutput | The output data.              | required |
| state    | State       | The state data.               | required |
| **kwargs |             | Additional keyword arguments. | {}       |

infer(model_name, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)

Performs NLI inference on a loaded dataset using the specified model. The method processes the data in batches and saves the results to the configured output path.

Parameters:

| Name            | Type                | Description                                                                       | Default                 |
|-----------------|---------------------|-----------------------------------------------------------------------------------|-------------------------|
| model_name      | str                 | Name or path of the NLI model.                                                    | required                |
| max_length      | int                 | Maximum length of the sequences for tokenization purposes.                        | 512                     |
| model_class     | str                 | Class name of the model (e.g., "AutoModelForSequenceClassification").             | 'AutoModelForSeq2SeqLM' |
| tokenizer_class | str                 | Class name of the tokenizer (e.g., "AutoTokenizer").                              | 'AutoTokenizer'         |
| use_cuda        | bool                | Whether to use CUDA for model inference.                                          | False                   |
| precision       | str                 | Precision for model computation (e.g., "float16").                                | 'float16'               |
| quantization    | int                 | Level of quantization for optimizing model size and speed.                        | 0                       |
| device_map      | str \| Dict \| None | Specific device(s) to use for computation.                                        | 'auto'                  |
| max_memory      | Dict                | Maximum memory configuration for devices.                                         | {0: '24GB'}             |
| torchscript     | bool                | Whether to use a TorchScript-optimized version of the pre-trained language model. | False                   |
| compile         | bool                | Whether to compile the model before inference.                                    | False                   |
| awq_enabled     | bool                | Whether to enable AWQ optimization.                                               | False                   |
| flash_attention | bool                | Whether to use flash attention optimization.                                      | False                   |
| batch_size      | int                 | Number of premise-hypothesis pairs to process simultaneously.                     | 32                      |
| **kwargs        | Any                 | Arbitrary keyword arguments for model and generation configurations.              | {}                      |
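The batch_size parameter controls how many premise-hypothesis pairs are tokenized and run through the model at once. The chunking logic can be sketched in plain Python; the `batched` helper below is a hypothetical illustration of the idea, not the library's actual implementation:

```python
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)

def batched(pairs: List[Pair], batch_size: int = 32) -> Iterator[List[Pair]]:
    """Yield successive batches of premise-hypothesis pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

# 70 pairs with batch_size=32 yields batches of 32, 32, and 6.
pairs = [(f"premise {i}", f"hypothesis {i}") for i in range(70)]
print([len(b) for b in batched(pairs, batch_size=32)])  # [32, 32, 6]
```

Each batch would then be tokenized together (bounded by max_length) and passed to the model in a single forward pass, which is what makes larger batch sizes faster on GPU at the cost of memory.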


load_dataset(dataset_path, max_length=512, **kwargs)

Load a natural language inference (NLI) dataset from a directory or file.

Parameters:

| Name         | Type | Description                                                 | Default  |
|--------------|------|-------------------------------------------------------------|----------|
| dataset_path | str  | The path to the dataset directory or file.                  | required |
| max_length   | int  | Maximum length of text sequences for tokenization purposes. | 512      |
| **kwargs     |      | Additional keyword arguments.                               | {}       |

Returns:

| Name    | Type              | Description         |
|---------|-------------------|---------------------|
| Dataset | Optional[Dataset] | The loaded dataset. |

Raises:

| Type      | Description                                 |
|-----------|---------------------------------------------|
| Exception | If there was an error loading the dataset.  |

Supported Data Formats and Structures:

Hugging Face Dataset

Dataset files saved by the Hugging Face datasets library.

JSONL

Each line is a JSON object representing an example.

{"premise": "The premise text", "hypothesis": "The hypothesis text"}

CSV

Should contain 'premise' and 'hypothesis' columns.

premise,hypothesis
"The premise text","The hypothesis text"

Parquet

Should contain 'premise' and 'hypothesis' columns.

JSON

An array of dictionaries with 'premise' and 'hypothesis' keys.

[{"premise": "The premise text", "hypothesis": "The hypothesis text"}]

XML

Each 'record' element should contain 'premise' and 'hypothesis' child elements.

<record>
    <premise>The premise text</premise>
    <hypothesis>The hypothesis text</hypothesis>
</record>

YAML

Each document should be a dictionary with 'premise' and 'hypothesis' keys.

- premise: "The premise text"
  hypothesis: "The hypothesis text"

TSV

Should contain 'premise' and 'hypothesis' columns separated by tabs.

Excel (.xls, .xlsx)

Should contain 'premise' and 'hypothesis' columns.

SQLite (.db)

Should contain a table with 'premise' and 'hypothesis' columns.

Feather

Should contain 'premise' and 'hypothesis' columns.
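As a concrete example of the JSONL layout described above, the snippet below writes a small premise-hypothesis file with only the standard library and reads it back into records; the file name is arbitrary and chosen for illustration:

```python
import json
import tempfile
from pathlib import Path

rows = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person is making music."},
    {"premise": "The cat sleeps on the mat.", "hypothesis": "The cat is running."},
]

# Write one JSON object per line (JSONL).
path = Path(tempfile.mkdtemp()) / "nli.jsonl"
with path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back into premise-hypothesis records.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["premise"])  # A man is playing a guitar.
```

The same two-column structure (premise, hypothesis) carries over to the CSV, Parquet, TSV, Excel, SQLite, and Feather formats listed above.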