Instruction Tuning¶
Bases: TextBulk
InstructionBulk is a class designed to perform bulk text generation tasks using Hugging Face's instruction-tuned language models. It is optimized for large-scale text generation, providing an efficient interface to use state-of-the-art machine learning models for generating text based on a set of instructions or prompts.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model` | `Any` | The loaded, pre-trained instruction-tuned language model. |
| `tokenizer` | `Any` | The tokenizer for processing text compatible with the model. |
Methods

- `load_dataset(dataset_path: str, max_length: int = 1024, **kwargs) -> Optional[Dataset]`: Loads a dataset for text generation tasks from the specified directory.
- `perform(model_name: str, **kwargs: Any) -> None`: Performs bulk text generation using the specified model and tokenizer.
Example CLI Usage:

```bash
genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1-lol \
    perform \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            decoding_strategy="generate" \
            generation_max_new_tokens=100 \
            generation_do_sample=true
```
or using vLLM:

```bash
genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    perform_vllm \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            generation_temperature=0.7 \
            generation_top_p=1.0 \
            generation_n=1 \
            generation_max_tokens=50 \
            generation_stream=false \
            generation_presence_penalty=0.0 \
            generation_frequency_penalty=0.0
```
or using llama.cpp:

```bash
genius InstructionBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    perform_llama_cpp \
        --args \
            model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
            filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
            n_gpu_layers=35 \
            generation_temperature=0.7 \
            generation_top_p=0.95 \
            generation_top_k=40 \
            generation_max_tokens=50 \
            generation_repeat_penalty=0.1
```
__init__(input, output, state, **kwargs)¶

Initializes the InstructionBulk class with input, output, and state configurations for bulk text generation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input` | `BatchInput` | Configuration for input data handling. | required |
| `output` | `BatchOutput` | Configuration for output data handling. | required |
| `state` | `State` | State management for the text generation task. | required |
| `**kwargs` | | Additional keyword arguments for extended functionality. | `{}` |
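For programmatic use, the class can be instantiated directly, as in the minimal sketch below. The import paths and the `BatchInput`/`BatchOutput`/`InMemoryState` constructor arguments shown here are assumptions; check the geniusrise core API for the exact signatures.

```python
# Minimal sketch, not a verbatim API reference: import paths and the
# BatchInput/BatchOutput/InMemoryState argument order are assumptions.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import InstructionBulk

input_config = BatchInput("./input", "geniusrise-test", "input/chat")      # local dir, S3 bucket, S3 prefix (assumed)
output_config = BatchOutput("./output", "geniusrise-test", "output/chat")  # local dir, S3 bucket, S3 prefix (assumed)
state = InMemoryState()

instruction_bulk = InstructionBulk(input=input_config, output=output_config, state=state)
```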
load_dataset(dataset_path, max_length=1024, **kwargs)¶

Loads a dataset from the specified path. This method supports various data formats, including JSON, CSV, Parquet, and others, and is designed to facilitate bulk processing of text data for generation tasks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_path` | `str` | Path to the directory containing the dataset files. | required |
| `max_length` | `int` | Maximum token length for text processing. | `1024` |
| `**kwargs` | | Additional keyword arguments for dataset loading. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Optional[Dataset]` | A Dataset object if loading is successful; otherwise, None. |

Raises:

| Type | Description |
|---|---|
| `Exception` | If an error occurs during dataset loading. |
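As a quick illustration (a sketch, assuming an `InstructionBulk` instance built as in the constructor example above), loading a local directory of prompt files in one of the formats listed below might look like this:

```python
# Sketch only: "/data/instructions" is a hypothetical directory containing
# files in one of the supported formats described below.
dataset = instruction_bulk.load_dataset("/data/instructions", max_length=1024)
if dataset is not None:
    print(f"Loaded {len(dataset)} examples")
```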
Supported Data Formats and Structures:¶
JSONL¶
Each line is a JSON object representing an example.
CSV¶
Should contain 'instruction' columns.
Parquet¶
Should contain 'instruction' columns.
JSON¶
An array of dictionaries with 'instruction' keys.
XML¶
Each 'record' element should contain 'instruction' child elements.
YAML¶
Each document should be a dictionary with 'instruction' keys.
TSV¶
Should contain 'instruction' columns separated by tabs.
Excel (.xls, .xlsx)¶
Should contain 'instruction' columns.
SQLite (.db)¶
Should contain a table with 'instruction' columns.
Feather¶
Should contain 'instruction' columns.
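For reference, a JSONL dataset file with the expected 'instruction' field could look like the following (the prompt texts are purely illustrative):

```jsonl
{"instruction": "Summarize the following paragraph in one sentence: ..."}
{"instruction": "Translate 'Good morning' into French."}
```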
perform(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, decoding_strategy='generate', notification_email=None, **kwargs)¶
Performs text generation in bulk using a specified instruction-tuned model. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | The name or path of the instruction-tuned model. | required |
| `model_class` | `str` | The class of the language model. Defaults to "AutoModelForCausalLM". | `'AutoModelForCausalLM'` |
| `tokenizer_class` | `str` | The class of the tokenizer. Defaults to "AutoTokenizer". | `'AutoTokenizer'` |
| `use_cuda` | `bool` | Whether to use CUDA for model inference. Defaults to False. | `False` |
| `precision` | `str` | Precision for model computation. Defaults to "float16". | `'float16'` |
| `quantization` | `int` | Level of quantization for optimizing model size and speed. Defaults to 0. | `0` |
| `device_map` | `str \| Dict \| None` | Specific device to use for computation. Defaults to "auto". | `'auto'` |
| `max_memory` | `Dict` | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | `{0: '24GB'}` |
| `torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | `False` |
| `compile` | `bool` | Whether to compile the model before inference. Defaults to False. | `False` |
| `awq_enabled` | `bool` | Whether to enable AWQ optimization. Defaults to False. | `False` |
| `flash_attention` | `bool` | Whether to use flash attention optimization. Defaults to False. | `False` |
| `decoding_strategy` | `str` | Strategy for decoding the completion. Defaults to "generate". | `'generate'` |
| `**kwargs` | `Any` | Configuration and additional arguments for text generation such as model class, tokenizer class, precision, device map, and other generation-related parameters. | `{}` |
Note
Additional arguments are passed directly to the model and tokenizer initialization and the generation method.
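As a programmatic counterpart to the CLI example at the top of this page, a call might look like the sketch below. The argument values mirror that example and are illustrative; generation parameters are passed through `**kwargs` with the `generation_` prefix.

```python
# Sketch mirroring the CLI example; assumes `instruction_bulk` was constructed as shown earlier.
instruction_bulk.perform(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    quantization=0,
    device_map="auto",
    decoding_strategy="generate",
    generation_max_new_tokens=100,  # generation_* kwargs are forwarded to the generation call
    generation_do_sample=True,
)
```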
perform_llama_cpp(model, filename=None, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, notification_email=None, **kwargs)¶
Performs bulk text generation using the LLaMA model with llama.cpp backend. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Path or identifier for the LLaMA model. | required |
| `filename` | `Optional[str]` | Optional filename or glob pattern to match the model file. | `None` |
| `local_dir` | `Optional[Union[str, os.PathLike[str]]]` | Local directory to save the model files. | `None` |
| `n_gpu_layers` | `int` | Number of layers to offload to GPU. | `0` |
| `split_mode` | `int` | Split mode for distributing the model across GPUs. | `llama_cpp.LLAMA_SPLIT_LAYER` |
| `main_gpu` | `int` | Main GPU index. | `0` |
| `tensor_split` | `Optional[List[float]]` | Configuration for tensor splitting across GPUs. | `None` |
| `vocab_only` | `bool` | Whether to load only the vocabulary. | `False` |
| `use_mmap` | `bool` | Use memory-mapped files for model loading. | `True` |
| `use_mlock` | `bool` | Lock model data in RAM to prevent swapping. | `False` |
| `kv_overrides` | `Optional[Dict[str, Union[bool, int, float]]]` | Key-value pairs for overriding model config. | `None` |
| `seed` | `int` | Seed for random number generation. | `llama_cpp.LLAMA_DEFAULT_SEED` |
| `n_ctx` | `int` | Number of context tokens for generation. | `512` |
| `n_batch` | `int` | Batch size for processing. | `512` |
| `n_threads` | `Optional[int]` | Number of threads for generation. | `None` |
| `n_threads_batch` | `Optional[int]` | Number of threads for batch processing. | `None` |
| `rope_scaling_type` | `Optional[int]` | Scaling type for RoPE. | `llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED` |
| `rope_freq_base` | `float` | Base frequency for RoPE. | `0.0` |
| `rope_freq_scale` | `float` | Frequency scaling for RoPE. | `0.0` |
| `yarn_ext_factor` | `float` | YaRN extrapolation factor. | `-1.0` |
| `yarn_attn_factor` | `float` | YaRN attention factor. | `1.0` |
| `yarn_beta_fast` | `float` | YaRN beta fast parameter. | `32.0` |
| `yarn_beta_slow` | `float` | YaRN beta slow parameter. | `1.0` |
| `yarn_orig_ctx` | `int` | Original context size for YaRN. | `0` |
| `mul_mat_q` | `bool` | Multiply matrices for queries. | `True` |
| `logits_all` | `bool` | Return logits for all tokens. | `False` |
| `embedding` | `bool` | Enable embedding mode. | `False` |
| `offload_kqv` | `bool` | Offload K, Q, V matrices to GPU. | `True` |
| `last_n_tokens_size` | `int` | Size for the last_n_tokens buffer. | `64` |
| `lora_base` | `Optional[str]` | Base model path for LoRA. | `None` |
| `lora_scale` | `float` | Scale factor for LoRA adjustments. | `1.0` |
| `lora_path` | `Optional[str]` | Path for LoRA adjustments. | `None` |
| `numa` | `Union[bool, int]` | NUMA configuration. | `False` |
| `chat_format` | `Optional[str]` | Chat format configuration. | `None` |
| `chat_handler` | `Optional[llama_cpp.llama_chat_format.LlamaChatCompletionHandler]` | Handler for chat completions. | `None` |
| `draft_model` | `Optional[llama_cpp.LlamaDraftModel]` | Draft model for speculative decoding. | `None` |
| `tokenizer` | `Optional[PreTrainedTokenizerBase]` | Custom tokenizer instance. | `None` |
| `verbose` | `bool` | Enable verbose logging. | `True` |
| `notification_email` | `Optional[str]` | Email to send notifications upon completion. | `None` |
| `**kwargs` | | Additional arguments for model loading and text generation. | `{}` |
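Mirroring the llama.cpp CLI example above, a programmatic call could be sketched as follows; parameter values are illustrative and assume the same `InstructionBulk` instance as before:

```python
# Sketch mirroring the llama.cpp CLI example; parameter values are illustrative.
instruction_bulk.perform_llama_cpp(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_gpu_layers=35,   # offload 35 layers to the GPU
    n_ctx=4096,        # context window for generation (assumed value)
    generation_temperature=0.7,
    generation_top_p=0.95,
    generation_top_k=40,
    generation_max_tokens=50,
)
```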
perform_vllm(model_name, use_cuda=False, precision='float16', quantization=0, device_map='auto', vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, notification_email=None, batch_size=32, **kwargs)¶
Performs bulk text generation using the vLLM inference engine, with parameters for tuning model behavior, including quantization and parallel-processing settings. This method is designed to process large datasets efficiently by leveraging vLLM's capabilities for generating high-quality text completions from the provided prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | The name or path of the vLLM model to use for text generation. | required |
| `use_cuda` | `bool` | Flag indicating whether to use CUDA for GPU acceleration. | `False` |
| `precision` | `str` | Precision of computations, can be "float16", "bfloat16", etc. | `'float16'` |
| `quantization` | `int` | Level of quantization for model weights, 0 for none. | `0` |
| `device_map` | `str \| Dict \| None` | Specific device(s) to use for model inference. | `'auto'` |
| `vllm_tokenizer_mode` | `str` | Mode of the tokenizer ("auto", "fast", or "slow"). | `'auto'` |
| `vllm_download_dir` | `Optional[str]` | Directory to download and load the model and tokenizer. | `None` |
| `vllm_load_format` | `str` | Format to load the model, e.g., "auto", "pt". | `'auto'` |
| `vllm_seed` | `int` | Seed for random number generation. | `42` |
| `vllm_max_model_len` | `int` | Maximum sequence length the model can handle. | `1024` |
| `vllm_enforce_eager` | `bool` | Enforce eager execution instead of using optimization techniques. | `False` |
| `vllm_max_context_len_to_capture` | `int` | Maximum context length for CUDA graph capture. | `8192` |
| `vllm_block_size` | `int` | Block size for the caching mechanism. | `16` |
| `vllm_gpu_memory_utilization` | `float` | Fraction of GPU memory to use. | `0.9` |
| `vllm_swap_space` | `int` | Amount of swap space to use, in GiB. | `4` |
| `vllm_sliding_window` | `Optional[int]` | Size of the sliding window for processing. | `None` |
| `vllm_pipeline_parallel_size` | `int` | Number of pipeline parallel groups. | `1` |
| `vllm_tensor_parallel_size` | `int` | Number of tensor parallel groups. | `1` |
| `vllm_worker_use_ray` | `bool` | Whether to use Ray for model workers. | `False` |
| `vllm_max_parallel_loading_workers` | `Optional[int]` | Maximum number of workers for parallel loading. | `None` |
| `vllm_disable_custom_all_reduce` | `bool` | Disable the custom all-reduce kernel and fall back to NCCL. | `False` |
| `vllm_max_num_batched_tokens` | `Optional[int]` | Maximum number of tokens to be processed in a single iteration. | `None` |
| `vllm_max_num_seqs` | `int` | Maximum number of sequences to be processed in a single iteration. | `64` |
| `vllm_max_paddings` | `int` | Maximum number of paddings to be added to a batch. | `512` |
| `vllm_max_lora_rank` | `Optional[int]` | Maximum rank for LoRA adjustments. | `None` |
| `vllm_max_loras` | `Optional[int]` | Maximum number of LoRA adjustments. | `None` |
| `vllm_max_cpu_loras` | `Optional[int]` | Maximum number of LoRA adjustments stored on CPU. | `None` |
| `vllm_lora_extra_vocab_size` | `int` | Additional vocabulary size for LoRA. | `0` |
| `vllm_placement_group` | `Optional[dict]` | Ray placement group for distributed execution. | `None` |
| `vllm_log_stats` | `bool` | Whether to log statistics during model operation. | `False` |
| `notification_email` | `Optional[str]` | Email to send notifications upon completion. | `None` |
| `batch_size` | `int` | Number of prompts to process in each batch for efficient memory usage. | `32` |
| `**kwargs` | `Any` | Additional keyword arguments for generation settings like temperature, top_p, etc. | `{}` |
This method automates the loading of large datasets, generation of text completions, and saving results, facilitating efficient and scalable text generation tasks.
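A corresponding programmatic sketch for the vLLM path, mirroring the CLI example above with illustrative values:

```python
# Sketch mirroring the vLLM CLI example; parameter values are illustrative.
instruction_bulk.perform_vllm(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    use_cuda=True,
    precision="bfloat16",
    vllm_max_model_len=1024,
    vllm_gpu_memory_utilization=0.9,
    batch_size=32,
    generation_temperature=0.7,
    generation_top_p=1.0,
    generation_max_tokens=50,
)
```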