Instruction Tuning

Bases: TextBulk

InstructionBulk is a class designed to perform bulk text generation tasks using Hugging Face's instruction-tuned language models. It is optimized for large-scale text generation, providing an efficient interface to state-of-the-art models for generating text from a set of instructions or prompts.

Attributes:

model (Any): The loaded, pre-trained instruction-tuned language model.

tokenizer (Any): The tokenizer for processing text compatible with the model.

Methods

load_dataset(dataset_path: str, max_length: int = 1024, **kwargs) -> Optional[Dataset]: Loads a dataset for text generation tasks from the specified directory.

perform(model_name: str, **kwargs: Any) -> None: Performs bulk text generation using the specified model and tokenizer.
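
These methods can also be driven directly from Python. The sketch below is a minimal, hypothetical example: the constructor signatures for BatchInput, BatchOutput, and InMemoryState, and the geniusrise_text import path, are assumptions that may differ across geniusrise versions.

# Minimal Python sketch (hypothetical) -- the BatchInput/BatchOutput/InMemoryState
# constructor signatures and import paths below are assumptions, not guarantees.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import InstructionBulk

batch_input = BatchInput("./input", "geniusrise-test", "input/chat")
batch_output = BatchOutput("./output", "geniusrise-test", "output/chat")
state = InMemoryState()

bulk = InstructionBulk(input=batch_input, output=batch_output, state=state)
bulk.perform(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    decoding_strategy="generate",
    generation_max_new_tokens=100,
    generation_do_sample=True,
)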

Example CLI Usage:

genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    perform \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            decoding_strategy="generate" \
            generation_max_new_tokens=100 \
            generation_do_sample=True

or using vLLM:

genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    perform_vllm \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            generation_temperature=0.7 \
            generation_top_p=1.0 \
            generation_n=1 \
            generation_max_tokens=50 \
            generation_stream=False \
            generation_presence_penalty=0.0 \
            generation_frequency_penalty=0.0

or using llama.cpp:

genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    perform_llama_cpp \
        --args \
            model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
            filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
            n_gpu_layers=35 \
            generation_temperature=0.7 \
            generation_top_p=0.95 \
            generation_top_k=40 \
            generation_max_tokens=50 \
            generation_repeat_penalty=1.1

__init__(input, output, state, **kwargs)

Initializes the InstructionBulk class with input, output, and state configurations for bulk text generation.

Parameters:

input (BatchInput, required): Configuration for input data handling.
output (BatchOutput, required): Configuration for output data handling.
state (State, required): State management for the text generation task.
**kwargs: Additional keyword arguments for extended functionalities.

load_dataset(dataset_path, max_length=1024, **kwargs)

Loads a dataset from the specified path. This method supports various data formats including JSON, CSV, Parquet, and others. It's designed to facilitate the bulk processing of text data for generation tasks.

Parameters:

dataset_path (str, required): Path to the directory containing the dataset files.
max_length (int, default: 1024): Maximum token length for text processing.
**kwargs: Additional keyword arguments for dataset loading.

Returns:

Optional[Dataset]: A Dataset object if loading is successful; otherwise, None.

Raises:

Exception: If an error occurs during dataset loading.

Supported Data Formats and Structures:

JSONL

Each line is a JSON object representing an example (see the sketch after this list).

{"instruction": "The instruction"}

CSV

Should contain an 'instruction' column.

instruction
"The instruction"

Parquet

Should contain an 'instruction' column.

JSON

An array of dictionaries, each with an 'instruction' key.

[{"instruction": "The instruction"}]

XML

Each 'record' element should contain an 'instruction' child element.

<record>
    <instruction>The instruction</instruction>
</record>

YAML

Each document should be a dictionary with an 'instruction' key.

- instruction: "The instruction"

TSV

Should contain an 'instruction' column, with values separated by tabs.

Excel (.xls, .xlsx)

Should contain an 'instruction' column.

SQLite (.db)

Should contain a table with an 'instruction' column.

Feather

Should contain an 'instruction' column.
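
To illustrate the JSONL layout above, a small input dataset can be produced like this. The directory and file names are illustrative assumptions; only the 'instruction' key is prescribed by the formats listed above.

# Write a tiny JSONL dataset in the documented layout.
# File and directory names here are illustrative assumptions.
import json
import os

os.makedirs("input/chat", exist_ok=True)
prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain bubble sort to a beginner.",
]
with open("input/chat/instructions.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"instruction": p}) + "\n")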

perform(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, decoding_strategy='generate', notification_email=None, **kwargs)

Performs text generation in bulk using a specified instruction-tuned model. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.

Parameters:

model_name (str, required): The name or path of the instruction-tuned model.
model_class (str, default: 'AutoModelForCausalLM'): The class of the language model.
tokenizer_class (str, default: 'AutoTokenizer'): The class of the tokenizer.
use_cuda (bool, default: False): Whether to use CUDA for model inference.
precision (str, default: 'float16'): Precision for model computation.
quantization (int, default: 0): Level of quantization for optimizing model size and speed.
device_map (str | Dict | None, default: 'auto'): Specific device to use for computation.
max_memory (Dict, default: {0: '24GB'}): Maximum memory configuration for devices.
torchscript (bool, default: False): Whether to use a TorchScript-optimized version of the pre-trained language model.
compile (bool, default: False): Whether to compile the model before inference.
awq_enabled (bool, default: False): Whether to enable AWQ optimization.
flash_attention (bool, default: False): Whether to use flash attention optimization.
decoding_strategy (str, default: 'generate'): Strategy for decoding the completion.
notification_email (Optional[str], default: None): Email to send notifications upon completion.
**kwargs (Any): Configuration and additional arguments for text generation, such as model class, tokenizer class, precision, device map, and other generation-related parameters.
Note

Additional arguments are passed directly to the model and tokenizer initialization and the generation method.
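
To make this concrete: judging by the CLI example above, arguments prefixed with generation_ map onto Hugging Face generate() parameters (generation_max_new_tokens=100 becomes max_new_tokens=100). The raw transformers equivalent is roughly the sketch below; the prefix-stripping rule is inferred from the examples, not a documented contract.

# Rough transformers equivalent of perform() for a single prompt (a sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain bubble sort to a beginner.", return_tensors="pt").to(model.device)
# generation_* CLI args are assumed to land here, minus the prefix.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))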

perform_llama_cpp(model, filename=None, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, notification_email=None, **kwargs)

Performs bulk text generation using a LLaMA-family model through the llama.cpp backend. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.

Parameters:

model (str, required): Path or identifier for the LLaMA model.
filename (Optional[str], default: None): Optional filename or glob pattern to match the model file.
local_dir (Optional[Union[str, os.PathLike[str]]], default: None): Local directory to save the model files.
n_gpu_layers (int, default: 0): Number of layers to offload to GPU.
split_mode (int, default: llama_cpp.LLAMA_SPLIT_LAYER): Split mode for distributing the model across GPUs.
main_gpu (int, default: 0): Main GPU index.
tensor_split (Optional[List[float]], default: None): Configuration for tensor splitting across GPUs.
vocab_only (bool, default: False): Whether to load only the vocabulary.
use_mmap (bool, default: True): Use memory-mapped files for model loading.
use_mlock (bool, default: False): Lock model data in RAM to prevent swapping.
kv_overrides (Optional[Dict[str, Union[bool, int, float]]], default: None): Key-value pairs for overriding the model config.
seed (int, default: llama_cpp.LLAMA_DEFAULT_SEED): Seed for random number generation.
n_ctx (int, default: 512): Number of context tokens for generation.
n_batch (int, default: 512): Batch size for processing.
n_threads (Optional[int], default: None): Number of threads for generation.
n_threads_batch (Optional[int], default: None): Number of threads for batch processing.
rope_scaling_type (Optional[int], default: llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED): Scaling type for RoPE.
rope_freq_base (float, default: 0.0): Base frequency for RoPE.
rope_freq_scale (float, default: 0.0): Frequency scaling for RoPE.
yarn_ext_factor (float, default: -1.0): YaRN extrapolation factor.
yarn_attn_factor (float, default: 1.0): YaRN attention factor.
yarn_beta_fast (float, default: 32.0): YaRN beta fast parameter.
yarn_beta_slow (float, default: 1.0): YaRN beta slow parameter.
yarn_orig_ctx (int, default: 0): Original context size for YaRN.
mul_mat_q (bool, default: True): Multiply matrices for queries.
logits_all (bool, default: False): Return logits for all tokens.
embedding (bool, default: False): Enable embedding mode.
offload_kqv (bool, default: True): Offload K, Q, V matrices to GPU.
last_n_tokens_size (int, default: 64): Size of the last_n_tokens buffer.
lora_base (Optional[str], default: None): Base model path for LoRA.
lora_scale (float, default: 1.0): Scale factor for LoRA adjustments.
lora_path (Optional[str], default: None): Path for LoRA adjustments.
numa (Union[bool, int], default: False): NUMA configuration.
chat_format (Optional[str], default: None): Chat format configuration.
chat_handler (Optional[llama_cpp.llama_chat_format.LlamaChatCompletionHandler], default: None): Handler for chat completions.
draft_model (Optional[llama_cpp.LlamaDraftModel], default: None): Draft model for speculative decoding.
tokenizer (Optional[PreTrainedTokenizerBase], default: None): Custom tokenizer instance.
verbose (bool, default: True): Enable verbose logging.
notification_email (Optional[str], default: None): Email to send notifications upon completion.
**kwargs: Additional arguments for model loading and text generation.
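
These parameters largely mirror llama-cpp-python's Llama API. For orientation, here is a direct llama-cpp-python sketch of the CLI example above; Llama.from_pretrained and the sampling keyword arguments are that library's API, and the [INST] prompt template is an assumption for Mistral-Instruct GGUF models.

# Direct llama-cpp-python sketch of the perform_llama_cpp CLI example.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=35,  # offload 35 layers to the GPU
    n_ctx=512,        # context window size
)
out = llm(
    "[INST] Explain bubble sort to a beginner. [/INST]",
    max_tokens=50,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repeat_penalty=1.1,  # values > 1.0 discourage repetition
)
print(out["choices"][0]["text"])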

perform_vllm(model_name, use_cuda=False, precision='float16', quantization=0, device_map='auto', vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, notification_email=None, batch_size=32, **kwargs)

Performs bulk text generation using the vLLM inference engine, with parameters for tuning model behavior, including quantization and parallel-processing settings. This method is designed to process large datasets efficiently by leveraging vLLM's capabilities for generating high-quality text completions based on provided prompts.

Parameters:

model_name (str, required): The name or path of the vLLM model to use for text generation.
use_cuda (bool, default: False): Flag indicating whether to use CUDA for GPU acceleration.
precision (str, default: 'float16'): Precision of computations, e.g. "float16" or "bfloat16".
quantization (int, default: 0): Level of quantization for model weights; 0 for none.
device_map (str | Dict | None, default: 'auto'): Specific device(s) to use for model inference.
vllm_tokenizer_mode (str, default: 'auto'): Mode of the tokenizer ("auto", "fast", or "slow").
vllm_download_dir (Optional[str], default: None): Directory to download and load the model and tokenizer.
vllm_load_format (str, default: 'auto'): Format to load the model, e.g. "auto" or "pt".
vllm_seed (int, default: 42): Seed for random number generation.
vllm_max_model_len (int, default: 1024): Maximum sequence length the model can handle.
vllm_enforce_eager (bool, default: False): Enforce eager execution instead of using optimization techniques.
vllm_max_context_len_to_capture (int, default: 8192): Maximum context length for CUDA graph capture.
vllm_block_size (int, default: 16): Block size for the caching mechanism.
vllm_gpu_memory_utilization (float, default: 0.9): Fraction of GPU memory to use.
vllm_swap_space (int, default: 4): Amount of swap space to use, in GiB.
vllm_sliding_window (Optional[int], default: None): Size of the sliding window for processing.
vllm_pipeline_parallel_size (int, default: 1): Number of pipeline-parallel groups.
vllm_tensor_parallel_size (int, default: 1): Number of tensor-parallel groups.
vllm_worker_use_ray (bool, default: False): Whether to use Ray for model workers.
vllm_max_parallel_loading_workers (Optional[int], default: None): Maximum number of workers for parallel loading.
vllm_disable_custom_all_reduce (bool, default: False): Disable the custom all-reduce kernel and fall back to NCCL.
vllm_max_num_batched_tokens (Optional[int], default: None): Maximum number of tokens processed in a single iteration.
vllm_max_num_seqs (int, default: 64): Maximum number of sequences processed in a single iteration.
vllm_max_paddings (int, default: 512): Maximum number of paddings added to a batch.
vllm_max_lora_rank (Optional[int], default: None): Maximum rank for LoRA adjustments.
vllm_max_loras (Optional[int], default: None): Maximum number of LoRA adjustments.
vllm_max_cpu_loras (Optional[int], default: None): Maximum number of LoRA adjustments stored on CPU.
vllm_lora_extra_vocab_size (int, default: 0): Additional vocabulary size for LoRA.
vllm_placement_group (Optional[dict], default: None): Ray placement group for distributed execution.
vllm_log_stats (bool, default: False): Whether to log statistics during model operation.
notification_email (Optional[str], default: None): Email to send notifications upon completion.
batch_size (int, default: 32): Number of prompts to process in each batch, for efficient memory usage.
**kwargs (Any): Additional keyword arguments for generation settings such as temperature, top_p, etc.

This method automates the loading of large datasets, generation of text completions, and saving results, facilitating efficient and scalable text generation tasks.
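
For orientation, the underlying vLLM calls look roughly like the sketch below. LLM and SamplingParams are vLLM's public API; mapping the generation_* CLI arguments onto SamplingParams fields is an assumption based on the example above.

# Rough vLLM equivalent of the perform_vllm CLI example (a sketch).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1", dtype="bfloat16", seed=42)
params = SamplingParams(
    temperature=0.7,
    top_p=1.0,
    n=1,
    max_tokens=50,
    presence_penalty=0.0,
    frequency_penalty=0.0,
)
for request_output in llm.generate(["Explain bubble sort to a beginner."], params):
    print(request_output.outputs[0].text)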