Language Model
Bases: TextBulk
LanguageModelBulk is designed for large-scale text generation with Hugging Face language models in bulk mode. It is particularly useful for tasks such as bulk content creation, summarization, or any other scenario where large datasets need to be processed with a language model.
Attributes:

| Name | Type | Description |
|---|---|---|
| model | Any | The loaded language model used for text generation. |
| tokenizer | Any | The tokenizer corresponding to the language model, used for processing input text. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration for the input data. | required |
| output | BatchOutput | Configuration for the output data. | required |
| state | State | State management for the API. | required |
| **kwargs | Any | Arbitrary keyword arguments for extended functionality. | {} |
CLI Usage Example:
genius LanguageModelBulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/lm \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/lm \
postgres \
--postgres_host 127.0.0.1 \
--postgres_port 5432 \
--postgres_user postgres \
--postgres_password postgres \
--postgres_database geniusrise \
--postgres_table state \
--id mistralai/Mistral-7B-Instruct-v0.1-lol \
complete \
--args \
model_name="mistralai/Mistral-7B-Instruct-v0.1" \
model_class="AutoModelForCausalLM" \
tokenizer_class="AutoTokenizer" \
use_cuda=True \
precision="bfloat16" \
quantization=0 \
device_map="auto" \
max_memory=None \
torchscript=False \
decoding_strategy="generate" \
generation_max_new_tokens=100 \
generation_do_sample=true
or using VLLM:
genius LanguageModelBulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/lm \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/lm \
none \
--id mistralai/Mistral-7B-v0.1 \
complete_vllm \
--args \
model_name="mistralai/Mistral-7B-v0.1" \
use_cuda=True \
precision="bfloat16" \
quantization=0 \
device_map="auto" \
vllm_enforce_eager=True \
generation_temperature=0.7 \
generation_top_p=1.0 \
generation_n=1 \
generation_max_tokens=50 \
generation_stream=false \
generation_presence_penalty=0.0 \
generation_frequency_penalty=0.0
or using llama.cpp:
genius LanguageModelBulk rise \
batch \
--input_s3_bucket geniusrise-test \
--input_s3_folder input/chat \
batch \
--output_s3_bucket geniusrise-test \
--output_s3_folder output/chat \
none \
complete_llama_cpp \
--args \
model="TheBloke/Mistral-7B-v0.1-GGUF" \
filename="mistral-7b-v0.1.Q4_K_M.gguf" \
n_gpu_layers=35 \
n_ctx=32768 \
generation_temperature=0.7 \
generation_top_p=0.95 \
generation_top_k=40 \
generation_max_tokens=50 \
generation_repeat_penalty=0.1
__init__(input, output, state, **kwargs)
Initializes the LanguageModelBulk object with the specified configurations for input, output, and state.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | BatchInput | Configuration and data inputs for the bulk process. | required |
| output | BatchOutput | Configurations for output data handling. | required |
| state | State | State management for the bulk process. | required |
| **kwargs | Any | Additional keyword arguments for extended configurations. | {} |
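For programmatic use outside the CLI, the object can be constructed directly. The sketch below is illustrative only: the import paths and the BatchInput/BatchOutput argument order are assumptions and may differ across geniusrise versions.

```python
# Illustrative sketch (not verbatim from the library): construct LanguageModelBulk in Python.
# The import paths and constructor argument order are assumptions; check your installed
# geniusrise / geniusrise-text version.
from geniusrise.core import BatchInput, BatchOutput, InMemoryState  # assumed import paths
from geniusrise_text import LanguageModelBulk  # assumed import path

input = BatchInput("./input/lm", "geniusrise-test", "input/lm")    # local dir, S3 bucket, S3 folder (assumed order)
output = BatchOutput("./output/lm", "geniusrise-test", "output/lm")
state = InMemoryState()

lm_bulk = LanguageModelBulk(input=input, output=output, state=state)
```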
complete(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, decoding_strategy='generate', notification_email=None, **kwargs)
Performs text completion on the loaded dataset using the specified model and tokenizer. The method handles the entire process, including model loading, text generation, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | The name of the language model to use for text completion. | required |
| model_class | str | The class of the language model. Defaults to "AutoModelForCausalLM". | 'AutoModelForCausalLM' |
| tokenizer_class | str | The class of the tokenizer. Defaults to "AutoTokenizer". | 'AutoTokenizer' |
| use_cuda | bool | Whether to use CUDA for model inference. Defaults to False. | False |
| precision | str | Precision for model computation. Defaults to "float16". | 'float16' |
| quantization | int | Level of quantization for optimizing model size and speed. Defaults to 0. | 0 |
| device_map | str \| Dict \| None | Specific device to use for computation. Defaults to "auto". | 'auto' |
| max_memory | Dict | Maximum memory configuration for devices. Defaults to {0: "24GB"}. | {0: '24GB'} |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False. | False |
| compile | bool | Whether to compile the model before inference. Defaults to False. | False |
| awq_enabled | bool | Whether to enable AWQ optimization. Defaults to False. | False |
| flash_attention | bool | Whether to use flash attention optimization. Defaults to False. | False |
| decoding_strategy | str | Strategy for decoding the completion. Defaults to "generate". | 'generate' |
| **kwargs | Any | Additional keyword arguments for text generation. | {} |
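With a LanguageModelBulk object in hand (see the constructor sketch above), a call to complete mirroring the first CLI example might look like the following. The keyword values are illustrative, and the generation_* keywords are assumed to be forwarded to the underlying text-generation call.

```python
# Illustrative call mirroring the first CLI example; values are examples, not requirements.
lm_bulk.complete(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    quantization=0,
    device_map="auto",
    max_memory=None,
    torchscript=False,
    decoding_strategy="generate",
    generation_max_new_tokens=100,  # generation_* kwargs are passed through to generation
    generation_do_sample=True,
)
```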
complete_llama_cpp(model, filename=None, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, notification_email=None, **kwargs)
Performs bulk text generation using the LLaMA model with llama.cpp backend. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str | Path or identifier for the LLaMA model. | required |
| filename | Optional[str] | Optional filename or glob pattern to match the model file. | None |
| local_dir | Optional[Union[str, os.PathLike[str]]] | Local directory to save the model files. | None |
| n_gpu_layers | int | Number of layers to offload to GPU. | 0 |
| split_mode | int | Split mode for distributing the model across GPUs. | llama_cpp.LLAMA_SPLIT_LAYER |
| main_gpu | int | Main GPU index. | 0 |
| tensor_split | Optional[List[float]] | Configuration for tensor splitting across GPUs. | None |
| vocab_only | bool | Whether to load only the vocabulary. | False |
| use_mmap | bool | Use memory-mapped files for model loading. | True |
| use_mlock | bool | Lock model data in RAM to prevent swapping. | False |
| kv_overrides | Optional[Dict[str, Union[bool, int, float]]] | Key-value pairs for overriding the model config. | None |
| seed | int | Seed for random number generation. | llama_cpp.LLAMA_DEFAULT_SEED |
| n_ctx | int | Number of context tokens for generation. | 512 |
| n_batch | int | Batch size for processing. | 512 |
| n_threads | Optional[int] | Number of threads for generation. | None |
| n_threads_batch | Optional[int] | Number of threads for batch processing. | None |
| rope_scaling_type | Optional[int] | Scaling type for RoPE. | llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED |
| rope_freq_base | float | Base frequency for RoPE. | 0.0 |
| rope_freq_scale | float | Frequency scaling for RoPE. | 0.0 |
| yarn_ext_factor | float | YaRN extrapolation factor. | -1.0 |
| yarn_attn_factor | float | YaRN attention factor. | 1.0 |
| yarn_beta_fast | float | YaRN beta fast parameter. | 32.0 |
| yarn_beta_slow | float | YaRN beta slow parameter. | 1.0 |
| yarn_orig_ctx | int | Original context size for YaRN. | 0 |
| mul_mat_q | bool | Multiply matrices for queries. | True |
| logits_all | bool | Return logits for all tokens. | False |
| embedding | bool | Enable embedding mode. | False |
| offload_kqv | bool | Offload K, Q, V matrices to GPU. | True |
| last_n_tokens_size | int | Size of the last_n_tokens buffer. | 64 |
| lora_base | Optional[str] | Base model path for LoRA. | None |
| lora_scale | float | Scale factor for LoRA adjustments. | 1.0 |
| lora_path | Optional[str] | Path for LoRA adjustments. | None |
| numa | Union[bool, int] | NUMA configuration. | False |
| chat_format | Optional[str] | Chat format configuration. | None |
| chat_handler | Optional[llama_cpp.llama_chat_format.LlamaChatCompletionHandler] | Handler for chat completions. | None |
| draft_model | Optional[llama_cpp.LlamaDraftModel] | Draft model for speculative decoding. | None |
| tokenizer | Optional[PreTrainedTokenizerBase] | Custom tokenizer instance. | None |
| verbose | bool | Enable verbose logging. | True |
| notification_email | Optional[str] | Email to send notifications upon completion. | None |
| **kwargs | | Additional arguments for model loading and text generation. | {} |
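A rough Python equivalent of the llama.cpp CLI example above follows. Argument values are illustrative, and the generation_* keywords are assumed to be forwarded to llama.cpp's completion call.

```python
# Illustrative call mirroring the llama.cpp CLI example above.
lm_bulk.complete_llama_cpp(
    model="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=35,   # offload 35 layers to the GPU
    n_ctx=32768,       # context window size
    generation_temperature=0.7,
    generation_top_p=0.95,
    generation_top_k=40,
    generation_max_tokens=50,
)
```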
complete_vllm(model_name, use_cuda=False, precision='float16', quantization=0, device_map='auto', vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, notification_email=None, batch_size=32, **kwargs)
Performs bulk text generation using the vLLM inference engine, with parameters for controlling model behavior, including quantization and parallel processing settings. This method is designed to process large datasets efficiently by leveraging vLLM's capabilities for generating high-quality text completions from the provided prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | The name or path of the vLLM model to use for text generation. | required |
| use_cuda | bool | Flag indicating whether to use CUDA for GPU acceleration. | False |
| precision | str | Precision of computations, can be "float16", "bfloat16", etc. | 'float16' |
| quantization | int | Level of quantization for model weights, 0 for none. | 0 |
| device_map | str \| Dict \| None | Specific device(s) to use for model inference. | 'auto' |
| vllm_tokenizer_mode | str | Mode of the tokenizer ("auto", "fast", or "slow"). | 'auto' |
| vllm_download_dir | Optional[str] | Directory to download and load the model and tokenizer. | None |
| vllm_load_format | str | Format to load the model, e.g., "auto", "pt". | 'auto' |
| vllm_seed | int | Seed for random number generation. | 42 |
| vllm_max_model_len | int | Maximum sequence length the model can handle. | 1024 |
| vllm_enforce_eager | bool | Enforce eager execution instead of using optimization techniques. | False |
| vllm_max_context_len_to_capture | int | Maximum context length for CUDA graph capture. | 8192 |
| vllm_block_size | int | Block size for the caching mechanism. | 16 |
| vllm_gpu_memory_utilization | float | Fraction of GPU memory to use. | 0.9 |
| vllm_swap_space | int | Amount of swap space to use in GiB. | 4 |
| vllm_sliding_window | Optional[int] | Size of the sliding window for processing. | None |
| vllm_pipeline_parallel_size | int | Number of pipeline parallel groups. | 1 |
| vllm_tensor_parallel_size | int | Number of tensor parallel groups. | 1 |
| vllm_worker_use_ray | bool | Whether to use Ray for model workers. | False |
| vllm_max_parallel_loading_workers | Optional[int] | Maximum number of workers for parallel loading. | None |
| vllm_disable_custom_all_reduce | bool | Disable custom all-reduce kernel and fall back to NCCL. | False |
| vllm_max_num_batched_tokens | Optional[int] | Maximum number of tokens to be processed in a single iteration. | None |
| vllm_max_num_seqs | int | Maximum number of sequences to be processed in a single iteration. | 64 |
| vllm_max_paddings | int | Maximum number of paddings to be added to a batch. | 512 |
| vllm_max_lora_rank | Optional[int] | Maximum rank for LoRA adjustments. | None |
| vllm_max_loras | Optional[int] | Maximum number of LoRA adjustments. | None |
| vllm_max_cpu_loras | Optional[int] | Maximum number of LoRA adjustments stored on CPU. | None |
| vllm_lora_extra_vocab_size | int | Additional vocabulary size for LoRA. | 0 |
| vllm_placement_group | Optional[dict] | Ray placement group for distributed execution. | None |
| vllm_log_stats | bool | Whether to log statistics during model operation. | False |
| notification_email | Optional[str] | Email to send notifications upon completion. | None |
| batch_size | int | Number of prompts to process in each batch for efficient memory usage. | 32 |
| **kwargs | Any | Additional keyword arguments for generation settings like temperature, top_p, etc. | {} |
This method automates the loading of large datasets, generation of text completions, and saving results, facilitating efficient and scalable text generation tasks.
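A rough Python equivalent of the vLLM CLI example above follows; values are illustrative, and the generation_* keywords are assumed to be forwarded as sampling settings.

```python
# Illustrative call mirroring the vLLM CLI example above.
lm_bulk.complete_vllm(
    model_name="mistralai/Mistral-7B-v0.1",
    use_cuda=True,
    precision="bfloat16",
    quantization=0,
    device_map="auto",
    vllm_enforce_eager=True,
    generation_temperature=0.7,
    generation_top_p=1.0,
    generation_n=1,
    generation_max_tokens=50,
    generation_presence_penalty=0.0,
    generation_frequency_penalty=0.0,
    batch_size=32,
)
```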
load_dataset(dataset_path, max_length=512, **kwargs)
Load a completion dataset from a directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset_path | str | The path to the dataset directory. | required |
| max_length | int | The maximum length for tokenization. Defaults to 512. | 512 |
| **kwargs | | Additional keyword arguments to pass to the underlying dataset loading functions. | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| Dataset | Optional[Dataset] | The loaded dataset. |

Raises:

| Type | Description |
|---|---|
| Exception | If there was an error loading the dataset. |
Supported Data Formats and Structures:

- Hugging Face datasets directory: the directory should contain 'dataset_info.json' and other related files saved by the datasets library.
- JSONL: each line is a JSON object representing an example.
- CSV: should contain a 'text' column.
- Parquet: should contain a 'text' column.
- JSON: an array of dictionaries with a 'text' key.
- XML: each 'record' element should contain a 'text' child element.
- YAML: each document should be a dictionary with a 'text' key.
- TSV: should contain a 'text' column separated by tabs.
- Excel (.xls, .xlsx): should contain a 'text' column.
- SQLite (.db): should contain a table with a 'text' column.
- Feather: should contain a 'text' column.
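As a minimal, illustrative sketch of preparing one of these formats (JSONL) and loading it: the directory layout and file name below are assumptions inferred from the notes above, not guaranteed by the library.

```python
# Illustrative sketch: write a tiny JSONL dataset with a 'text' field, then load it.
import json
import os

dataset_dir = "./input/lm"
os.makedirs(dataset_dir, exist_ok=True)
with open(os.path.join(dataset_dir, "data.jsonl"), "w") as f:
    for prompt in ["Write a haiku about the sea.", "Summarize the plot of Hamlet."]:
        f.write(json.dumps({"text": prompt}) + "\n")

# Returns an Optional[Dataset] per the Returns table above; may raise on malformed data.
dataset = lm_bulk.load_dataset(dataset_path=dataset_dir, max_length=512)
```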