Text Bulk
Bases: Bolt
TextBulk is a foundational class for enabling bulk processing of text with various generation models. It primarily focuses on using Hugging Face models to provide a robust and efficient framework for large-scale text generation tasks. The class supports various decoding strategies to generate text that can be tailored to specific needs or preferences.
Attributes:
Name | Type | Description
---|---|---
model | AutoModelForCausalLM | The language model for text generation.
tokenizer | AutoTokenizer | The tokenizer for preparing input data for the model.
Parameters:
Name | Type | Description | Default
---|---|---|---
input | BatchInput | Configuration and data inputs for the batch process. | required
output | BatchOutput | Configurations for output data handling. | required
state | State | State management for the Bolt. | required
**kwargs | | Arbitrary keyword arguments for extended configurations. | {}
Methods
text(**kwargs: Any) -> Dict[str, Any]: Provides an API endpoint for text generation functionality. Accepts various parameters for customizing the text generation process.
generate(prompt: str, decoding_strategy: str = "generate", **generation_params: Any) -> str: Generates text based on the provided prompt and parameters. Supports multiple decoding strategies for diverse applications.
The class serves as a versatile tool for text generation, supporting various models and configurations. It can be extended or used as is for efficient text generation tasks.
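The sketch below shows one way such a subclass might look. It is a minimal, hedged illustration: the import paths (`geniusrise`, `geniusrise_text.base`), the subclass name, the `complete` method, and the model name are assumptions for the example, not part of this class.

```python
from typing import Any, Dict, List

from geniusrise import BatchInput, BatchOutput, State  # core I/O and state types (paths assumed)
from geniusrise_text.base import TextBulk              # import path assumed


class PromptCompleter(TextBulk):
    """Hypothetical subclass that completes a batch of prompts."""

    def complete(self, prompts: List[str], **generation_params: Any) -> Dict[str, str]:
        # Load the model/tokenizer pair once, then reuse it for every prompt.
        self.model, self.tokenizer = self.load_models(
            model_name="mistralai/Mistral-7B-v0.1",      # illustrative model, not a library default
            tokenizer_name="mistralai/Mistral-7B-v0.1",
            use_cuda=True,
            precision="float16",
        )
        # generate() returns the completion text for a single prompt.
        return {
            prompt: self.generate(prompt=prompt, decoding_strategy="generate", **generation_params)
            for prompt in prompts
        }
```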
__init__(input, output, state, **kwargs)
Initializes the TextBulk with configurations and sets up logging. It prepares the environment for text generation tasks.
Parameters:
Name | Type | Description | Default
---|---|---|---
input | BatchInput | The input data configuration for the text generation task. | required
output | BatchOutput | The output data configuration for the results of the text generation. | required
state | State | The state configuration for the Bolt, managing its operational status. | required
**kwargs | | Additional keyword arguments for extended functionality and model configurations. | {}
generate(prompt, decoding_strategy='generate', **generation_params)
Generate text completion for the given prompt using the specified decoding strategy.
Parameters:
Name | Type | Description | Default
---|---|---|---
prompt | str | The prompt to generate text completion for. | required
decoding_strategy | str | The decoding strategy to use. Defaults to "generate". | 'generate'
**generation_params | Any | Additional parameters to pass to the decoding strategy. | {}
Returns:
Name | Type | Description
---|---|---
str | str | The generated text completion.
Raises:
Type | Description
---|---
Exception | If an error occurs during generation.
Supported decoding strategies and their additional parameters
- "generate": Uses the model's default generation method. (Parameters: max_length, num_beams, etc.)
- "greedy_search": Generates text using a greedy search decoding strategy. Parameters: max_length, eos_token_id, pad_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
- "contrastive_search": Generates text using contrastive search decoding strategy. Parameters: top_k, penalty_alpha, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, sequential.
- "sample": Generates text using a sampling decoding strategy. Parameters: do_sample, temperature, top_k, top_p, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
- "beam_search": Generates text using beam search decoding strategy. Parameters: num_beams, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
- "beam_sample": Generates text using beam search with sampling decoding strategy. Parameters: num_beams, temperature, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
- "group_beam_search": Generates text using group beam search decoding strategy. Parameters: num_beams, diversity_penalty, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
- "constrained_beam_search": Generates text using constrained beam search decoding strategy. Parameters: num_beams, max_length, constraints, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
All generation parameters
- max_length: Maximum length the generated tokens can have
- max_new_tokens: Maximum number of tokens to generate, ignoring prompt tokens
- min_length: Minimum length of the sequence to be generated
- min_new_tokens: Minimum number of tokens to generate, ignoring prompt tokens
- early_stopping: Stopping condition for beam-based methods
- max_time: Maximum time allowed for computation in seconds
- do_sample: Whether to use sampling for generation
- num_beams: Number of beams for beam search
- num_beam_groups: Number of groups for beam search to ensure diversity
- penalty_alpha: Balances model confidence and degeneration penalty in contrastive search
- use_cache: Whether the model should use past key/values attentions to speed up decoding
- temperature: Modulates next token probabilities
- top_k: Number of highest probability tokens to keep for top-k-filtering
- top_p: Smallest set of most probable tokens with cumulative probability >= top_p
- typical_p: Conditional probability of predicting a target token next
- epsilon_cutoff: Tokens with a conditional probability > epsilon_cutoff will be sampled
- eta_cutoff: Eta sampling, a hybrid of locally typical sampling and epsilon sampling
- diversity_penalty: Penalty subtracted from a beam's score if it generates a token same as any other group
- repetition_penalty: Penalty applied to repeated tokens (1.0 means no penalty)
- encoder_repetition_penalty: Penalty on sequences not in the original input
- length_penalty: Exponential penalty to the length for beam-based generation
- no_repeat_ngram_size: All ngrams of this size can only occur once
- bad_words_ids: List of token ids that are not allowed to be generated
- force_words_ids: List of token ids that must be generated
- renormalize_logits: Renormalize the logits after applying all logits processors
- constraints: Custom constraints for generation
- forced_bos_token_id: Token ID to force as the first generated token
- forced_eos_token_id: Token ID to force as the last generated token
- remove_invalid_values: Remove possible NaN and inf outputs
- exponential_decay_length_penalty: Exponentially increasing length penalty after a certain number of tokens
- suppress_tokens: Tokens that will be suppressed during generation
- begin_suppress_tokens: Tokens that will be suppressed at the beginning of generation
- forced_decoder_ids: Mapping from generation indices to token indices that will be forced
- sequence_bias: Maps a sequence of tokens to its bias term
- guidance_scale: Guidance scale for classifier free guidance (CFG)
- low_memory: Switch to sequential topk for contrastive search to reduce peak memory
- num_return_sequences: Number of independently computed returned sequences for each batch element
- output_attentions: Whether to return the attentions tensors of all layers
- output_hidden_states: Whether to return the hidden states of all layers
- output_scores: Whether to return the prediction scores
- return_dict_in_generate: Whether to return a ModelOutput instead of a plain tuple
- pad_token_id: The id of the padding token
- bos_token_id: The id of the beginning-of-sequence token
- eos_token_id: The id of the end-of-sequence token
- synced_gpus: Whether to continue the generation loop until max_length (needed for ZeRO stage 3)
- top_k (contrastive search): Size of the candidate set used for re-ranking
- sequential: Switch to sequential top-k hidden state computation in contrastive search to reduce peak memory
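To make the pairing between strategies and parameters concrete, here is a hedged sketch of two helper functions. The `bulk` argument is assumed to be an already-initialized TextBulk (or subclass) whose model and tokenizer were loaded via load_models(); the helper names and parameter values are illustrative, not part of the library.

```python
from typing import Any


def sample_completion(bulk: Any, prompt: str, **overrides: Any) -> str:
    """Nucleus sampling via the "sample" decoding strategy."""
    params = dict(do_sample=True, temperature=0.7, top_p=0.95, max_length=256)
    params.update(overrides)  # callers can override any generation parameter
    return bulk.generate(prompt=prompt, decoding_strategy="sample", **params)


def beam_completion(bulk: Any, prompt: str) -> str:
    """Deterministic decoding via the "beam_search" strategy."""
    return bulk.generate(
        prompt=prompt,
        decoding_strategy="beam_search",
        num_beams=4,       # wider beam explores more candidates
        max_length=256,
    )
```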
load_models(model_name, tokenizer_name, model_revision=None, tokenizer_revision=None, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, better_transformers=False, **model_args)
Loads and configures the specified model and tokenizer for text generation. It ensures the models are optimized for inference.
Parameters:
Name | Type | Description | Default
---|---|---|---
model_name | str | The name or path of the model to load. | required
tokenizer_name | str | The name or path of the tokenizer to load. | required
model_revision | Optional[str] | The specific model revision to load (e.g., a commit hash). | None
tokenizer_revision | Optional[str] | The specific tokenizer revision to load (e.g., a commit hash). | None
model_class | str | The class of the model to be loaded. | 'AutoModelForCausalLM'
tokenizer_class | str | The class of the tokenizer to be loaded. | 'AutoTokenizer'
use_cuda | bool | Flag to utilize CUDA for GPU acceleration. | False
precision | str | The desired precision for computations ("float32", "float16", etc.). | 'float16'
quantization | int | The bit level for model quantization (0 for none, 8 for 8-bit quantization). | 0
device_map | str \| Dict \| None | The specific device(s) to use for model operations. | 'auto'
max_memory | Dict | A dictionary defining the maximum memory to allocate for the model. | {0: '24GB'}
torchscript | bool | Flag to enable TorchScript for model optimization. | False
compile | bool | Flag to enable JIT compilation of the model. | False
awq_enabled | bool | Flag to enable AWQ (Activation-aware Weight Quantization). | False
flash_attention | bool | Flag to enable Flash Attention optimization for faster processing. | False
better_transformers | bool | Flag to enable Better Transformers optimization for faster processing. | False
**model_args | Any | Additional arguments to pass to the model during its loading. | {}
Returns:
Type | Description
---|---
Tuple[AutoModelForCausalLM, AutoTokenizer] | The loaded model and tokenizer, ready for text generation.
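As a rough illustration of how these options combine, the sketch below loads a model in half precision with an explicit per-GPU memory budget. The helper name, the model name, and the memory figure are placeholders; `bulk` is assumed to be an initialized TextBulk instance.

```python
from typing import Any


def load_hf_half_precision(bulk: Any) -> None:
    """Hedged sketch: load a Hugging Face model in float16 with a per-GPU memory cap."""
    bulk.model, bulk.tokenizer = bulk.load_models(
        model_name="mistralai/Mistral-7B-v0.1",     # placeholder Hugging Face ID
        tokenizer_name="mistralai/Mistral-7B-v0.1",
        use_cuda=True,
        precision="float16",
        quantization=0,                             # 0 = no quantization, 8 = 8-bit
        device_map="auto",
        max_memory={0: "24GB"},                     # cap GPU 0 at 24 GB
        better_transformers=False,
    )
```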
load_models_llama_cpp(model, filename, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, **kwargs)
Initializes and loads a LLaMA model with the llama.cpp backend, along with an optional tokenizer.
Parameters:
Name | Type | Description | Default
---|---|---|---
model | str | Hugging Face ID of the LLaMA model. | required
filename | Optional[str] | A filename or glob pattern to match the model file in the repo. | required
local_dir | Optional[Union[str, os.PathLike[str]]] | The local directory to save the model to. | None
n_gpu_layers | int | Number of layers to offload to GPU. | 0
split_mode | int | Split mode for distributing the model across GPUs. | llama_cpp.LLAMA_SPLIT_LAYER
main_gpu | int | Main GPU index. | 0
tensor_split | Optional[List[float]] | Tensor split configuration. | None
vocab_only | bool | Whether to load the vocabulary only. | False
use_mmap | bool | Use memory-mapped files for model loading. | True
use_mlock | bool | Lock model data in RAM. | False
kv_overrides | Optional[Dict[str, Union[bool, int, float]]] | Key-value pairs for model overrides. | None
seed | int | Random seed for initialization. | llama_cpp.LLAMA_DEFAULT_SEED
n_ctx | int | Number of context tokens. | 512
n_batch | int | Batch size for processing prompts. | 512
n_threads | Optional[int] | Number of threads for generation. | None
n_threads_batch | Optional[int] | Number of threads for batch processing. | None
rope_scaling_type | Optional[int] | RoPE scaling type. | llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED
rope_freq_base | float | Base frequency for RoPE. | 0.0
rope_freq_scale | float | Frequency scaling for RoPE. | 0.0
yarn_ext_factor | float | YaRN extrapolation mix factor. | -1.0
yarn_attn_factor | float | YaRN attention factor. | 1.0
yarn_beta_fast | float | YaRN beta fast parameter. | 32.0
yarn_beta_slow | float | YaRN beta slow parameter. | 1.0
yarn_orig_ctx | int | Original context size for YaRN. | 0
mul_mat_q | bool | Whether to use the experimental mul_mat_q kernels. | True
logits_all | bool | Return logits for all tokens. | False
embedding | bool | Enable embedding-only mode. | False
offload_kqv | bool | Offload K, Q, V matrices to GPU. | True
last_n_tokens_size | int | Size of the last_n_tokens buffer. | 64
lora_base | Optional[str] | Base model path for LoRA. | None
lora_scale | float | Scale factor for LoRA adjustments. | 1.0
lora_path | Optional[str] | Path to LoRA adjustments. | None
numa | Union[bool, int] | NUMA configuration. | False
chat_format | Optional[str] | Chat format configuration. | None
chat_handler | Optional[llama_cpp.LlamaChatCompletionHandler] | Handler for chat completions. | None
draft_model | Optional[llama_cpp.LlamaDraftModel] | Draft model for speculative decoding. | None
tokenizer | Optional[PreTrainedTokenizerBase] | Custom tokenizer instance. | None
verbose | bool | Enable verbose logging. | True
**kwargs | | Additional keyword arguments. | {}
Returns:
Type | Description
---|---
Tuple[LlamaCPP, Optional[PreTrainedTokenizerBase]] | The loaded LLaMA model and tokenizer.
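A hedged sketch of a typical call follows. The GGUF repository and filename pattern are placeholders, `n_gpu_layers` and `n_ctx` should be tuned to your hardware, and `bulk` is assumed to be an initialized TextBulk instance.

```python
from typing import Any, Tuple


def load_gguf_model(bulk: Any) -> Tuple[Any, Any]:
    """Hedged sketch: load a quantized GGUF model via the llama.cpp backend."""
    llama, tok = bulk.load_models_llama_cpp(
        model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # placeholder Hugging Face repo
        filename="*Q4_K_M.gguf",                         # glob matching a quantized GGUF file
        n_gpu_layers=35,    # layers to offload to the GPU (0 keeps everything on CPU)
        n_ctx=4096,         # context window in tokens
        n_batch=512,
        use_mmap=True,
        verbose=False,
    )
    return llama, tok
```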
load_models_vllm(model, tokenizer, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', seed=42, revision=None, tokenizer_revision=None, max_model_len=1024, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, block_size=16, gpu_memory_utilization=0.9, swap_space=4, cache_dtype='auto', sliding_window=None, pipeline_parallel_size=1, tensor_parallel_size=1, worker_use_ray=False, max_parallel_loading_workers=None, disable_custom_all_reduce=False, max_num_batched_tokens=None, max_num_seqs=64, max_paddings=512, device='cuda', max_lora_rank=None, max_loras=None, max_cpu_loras=None, lora_dtype=None, lora_extra_vocab_size=0, placement_group=None, log_stats=False, batched_inference=False)
Initializes and loads a model using vLLM with the specified configuration parameters.
Parameters:
Name | Type | Description | Default
---|---|---|---
model | str | Name or path of the Hugging Face model to use. | required
tokenizer | str | Name or path of the Hugging Face tokenizer to use. | required
tokenizer_mode | str | Tokenizer mode. "auto" uses the fast tokenizer if available; "slow" always uses the slow tokenizer. | 'auto'
trust_remote_code | bool | Trust remote code (e.g., from Hugging Face) when downloading the model and tokenizer. | True
download_dir | Optional[str] | Directory to download and load the weights; defaults to the Hugging Face cache directory. | None
load_format | str | The format of the model weights to load. Options include "auto", "pt", "safetensors", "npcache", "dummy". | 'auto'
dtype | Union[str, torch.dtype] | Data type for model weights and activations. Options include "auto", torch.float32, torch.float16, etc. | 'auto'
seed | int | Random seed for reproducibility. | 42
revision | Optional[str] | The specific model version to use. Can be a branch name, a tag name, or a commit id. | None
code_revision | Optional[str] | The specific revision to use for the model code on Hugging Face Hub. | required
tokenizer_revision | Optional[str] | The specific tokenizer version to use. | None
max_model_len | Optional[int] | Maximum length of a sequence (including prompt and output). If None, derived from the model. | 1024
quantization | Optional[str] | Quantization method used to quantize the model weights. If None, the weights are assumed to be unquantized. | None
enforce_eager | bool | Whether to enforce eager execution. If True, disables CUDA graphs and always executes the model in eager mode. | False
max_context_len_to_capture | Optional[int] | Maximum context length covered by CUDA graphs; larger contexts fall back to eager mode. | 8192
block_size | int | Size of a cache block in number of tokens. | 16
gpu_memory_utilization | float | Fraction of GPU memory to use for vLLM execution. | 0.9
swap_space | int | Size of the CPU swap space per GPU (in GiB). | 4
cache_dtype | str | Data type for KV cache storage. | 'auto'
sliding_window | Optional[int] | Configuration for sliding window, if applicable. | None
pipeline_parallel_size | int | Number of pipeline parallel groups. | 1
tensor_parallel_size | int | Number of tensor parallel groups. | 1
worker_use_ray | bool | Whether to use Ray for model workers. Required if either pipeline_parallel_size or tensor_parallel_size is greater than 1. | False
max_parallel_loading_workers | Optional[int] | Maximum number of workers for loading the model in parallel, to avoid RAM OOM. | None
disable_custom_all_reduce | bool | Disable the custom all-reduce kernel and fall back to NCCL. | False
max_num_batched_tokens | Optional[int] | Maximum number of tokens to be processed in a single iteration. | None
max_num_seqs | int | Maximum number of sequences to be processed in a single iteration. | 64
max_paddings | int | Maximum number of paddings to be added to a batch. | 512
device | str | Device configuration, typically "cuda" or "cpu". | 'cuda'
max_lora_rank | Optional[int] | Maximum rank for LoRA adjustments. | None
max_loras | Optional[int] | Maximum number of LoRA adjustments. | None
max_cpu_loras | Optional[int] | Maximum number of LoRA adjustments stored on CPU. | None
lora_dtype | Optional[torch.dtype] | Data type for LoRA parameters. | None
lora_extra_vocab_size | Optional[int] | Additional vocabulary size for LoRA. | 0
placement_group | Optional[PlacementGroup] | Ray placement group for distributed execution. Required for distributed execution. | None
log_stats | bool | Whether to log statistics during model operation. | False
Returns:
Name | Type | Description
---|---|---
LLMEngine | AsyncLLMEngine \| LLM | An instance of the LLM engine initialized with the given configurations.
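A hedged sketch of a single-GPU call follows. The helper name and model name are placeholders, the memory and parallelism settings should match your deployment, and `bulk` is assumed to be an initialized TextBulk instance.

```python
from typing import Any


def load_vllm_engine(bulk: Any) -> Any:
    """Hedged sketch: load a model with the vLLM backend on a single GPU."""
    engine = bulk.load_models_vllm(
        model="mistralai/Mistral-7B-v0.1",       # placeholder Hugging Face ID
        tokenizer="mistralai/Mistral-7B-v0.1",
        dtype="auto",
        max_model_len=1024,
        gpu_memory_utilization=0.9,   # fraction of GPU memory reserved for vLLM
        tensor_parallel_size=1,       # number of GPUs to shard the model across
        batched_inference=True,       # assumption: selects the offline batch engine over the async one
    )
    return engine
```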