Natural Language Inference¶
Bases: TextBulk
The NLIBulk class provides functionality for large-scale natural language inference (NLI) processing using Hugging Face transformers. It allows users to load datasets, configure models, and perform inference on batches of premise-hypothesis pairs.
Attributes:
Name | Type | Description |
---|---|---|
input | BatchInput | Configuration and data inputs for the batch process. |
output | BatchOutput | Configurations for output data handling. |
state | State | State management for the inference task. |
Example CLI Usage:
```bash
genius NLIBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/nli \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/nli \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7-lol \
    infer \
        --args \
            model_name="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7" \
            model_class="AutoModelForSequenceClassification" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False
```
__init__(input, output, state, **kwargs)¶
Initializes the NLIBulk class with the specified input, output, and state configurations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The input data. | required |
output | BatchOutput | The output data. | required |
state | State | The state data. | required |
**kwargs |  | Additional keyword arguments. | {} |
infer(model_name, max_length=512, model_class='AutoModelForSeq2SeqLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, batch_size=32, notification_email=None, **kwargs)¶
Performs NLI inference on a loaded dataset using the specified model. The method processes the data in batches and saves the results to the configured output path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name | str | Name or path of the NLI model. | required |
max_length | int | Maximum length of the sequences for tokenization purposes. Defaults to 512. | 512 |
model_class | str | Class name of the model (e.g., "AutoModelForSequenceClassification"). Defaults to "AutoModelForSeq2SeqLM". | 'AutoModelForSeq2SeqLM' |
tokenizer_class | str | Class name of the tokenizer (e.g., "AutoTokenizer"). Defaults to "AutoTokenizer". | 'AutoTokenizer' |
use_cuda | bool | Whether to use CUDA for model inference. Defaults to False. | False |
precision | str | Precision for model computation (e.g., "float16"). Defaults to "float16". | 'float16' |
quantization | int | Level of quantization for optimizing model size and speed. Defaults to 0. | 0 |
device_map | str \| Dict \| None | Specific device(s) to use for computation. Defaults to "auto". | 'auto' |
max_memory | Dict | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | {0: '24GB'} |
torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
compile | bool | Whether to compile the model before inference. Defaults to False. | False |
awq_enabled | bool | Whether to enable AWQ optimization. Defaults to False. | False |
flash_attention | bool | Whether to use flash attention optimization. Defaults to False. | False |
batch_size | int | Number of premise-hypothesis pairs to process simultaneously. Defaults to 32. | 32 |
**kwargs | Any | Arbitrary keyword arguments for model and generation configurations. | {} |
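The `batch_size` parameter controls how many premise-hypothesis pairs are tokenized and scored together in one forward pass. As a minimal illustration of that batching step only (a hedged sketch, not the actual Geniusrise implementation; `batch_pairs` is a hypothetical helper name):

```python
# Hypothetical sketch of the batching behind `batch_size`:
# split premise-hypothesis pairs into chunks scored together.
from typing import Iterator, List, Tuple


def batch_pairs(
    pairs: List[Tuple[str, str]], batch_size: int = 32
) -> Iterator[List[Tuple[str, str]]]:
    """Yield successive batches of (premise, hypothesis) pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start : start + batch_size]


pairs = [(f"premise {i}", f"hypothesis {i}") for i in range(70)]
batches = list(batch_pairs(pairs, batch_size=32))
# 70 pairs at batch_size=32 -> three batches of 32, 32, and 6 pairs
```

Each batch would then be passed to the tokenizer and model in one call, which is where larger batch sizes trade memory for throughput.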
load_dataset(dataset_path, max_length=512, **kwargs)¶
Load an NLI dataset from a directory or file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | str | The path to the dataset directory or file. | required |
max_length | int | Maximum length of text sequences for tokenization purposes. Defaults to 512. | 512 |
**kwargs |  | Additional keyword arguments. | {} |
Returns:
Name | Type | Description |
---|---|---|
Dataset | Optional[Dataset] | The loaded dataset. |
Raises:
Type | Description |
---|---|
Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:¶
Hugging Face Dataset¶
Dataset files saved by the Hugging Face datasets library.
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'premise' and 'hypothesis' columns.
Parquet¶
Should contain 'premise' and 'hypothesis' columns.
JSON¶
An array of dictionaries with 'premise' and 'hypothesis' keys.
XML¶
Each 'record' element should contain 'premise' and 'hypothesis' child elements.
YAML¶
Each document should be a dictionary with 'premise' and 'hypothesis' keys.
TSV¶
Should contain 'premise' and 'hypothesis' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'premise' and 'hypothesis' columns.
SQLite (.db)¶
Should contain a table with 'premise' and 'hypothesis' columns.
Feather¶
Should contain 'premise' and 'hypothesis' columns.