
Text Bulk

Bases: Bolt

TextBulk is a foundational class for enabling bulk processing of text with various generation models. It primarily focuses on using Hugging Face models to provide a robust and efficient framework for large-scale text generation tasks. The class supports various decoding strategies to generate text that can be tailored to specific needs or preferences.

Attributes:
  • model (AutoModelForCausalLM): The language model for text generation.
  • tokenizer (AutoTokenizer): The tokenizer for preparing input data for the model.

Parameters:
  • input (BatchInput, required): Configuration and data inputs for the batch process.
  • output (BatchOutput, required): Configurations for output data handling.
  • state (State, required): State management for the Bolt.
  • **kwargs: Arbitrary keyword arguments for extended configurations.

Methods

text(**kwargs: Any) -> Dict[str, Any]: Provides an API endpoint for text generation functionality. Accepts various parameters for customizing the text generation process.

generate(prompt: str, decoding_strategy: str = "generate", **generation_params: Any) -> dict: Generates text based on the provided prompt and parameters. Supports multiple decoding strategies for diverse applications.

The class serves as a versatile tool for text generation, supporting various models and configurations. It can be extended or used as is for efficient text generation tasks.
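
The sketch below shows how the class is typically wired together end to end. The import paths and the BatchInput/BatchOutput/InMemoryState constructor arguments are assumptions based on common geniusrise usage, not values from this page, so treat it as an outline rather than a verbatim recipe.

```python
# Illustrative sketch only: import paths and constructor arguments are assumed
# and may differ across geniusrise versions.
from geniusrise import BatchInput, BatchOutput, InMemoryState  # assumed imports
from geniusrise_text import TextBulk                           # assumed import

batch_input = BatchInput("./input", "my-bucket", "batch-input")      # hypothetical folders/bucket
batch_output = BatchOutput("./output", "my-bucket", "batch-output")  # hypothetical folders/bucket
state = InMemoryState()

bulk = TextBulk(input=batch_input, output=batch_output, state=state)

# load_models returns the (model, tokenizer) pair; keep them on the instance
# so that generate() can use them.
bulk.model, bulk.tokenizer = bulk.load_models(
    model_name="mistralai/Mistral-7B-v0.1",
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    use_cuda=True,
)

print(bulk.generate(prompt="Once upon a time", max_new_tokens=64))
```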

__init__(input, output, state, **kwargs)

Initializes the TextBulk with configurations and sets up logging. It prepares the environment for text generation tasks.

Parameters:
  • input (BatchInput, required): The input data configuration for the text generation task.
  • output (BatchOutput, required): The output data configuration for the results of the text generation.
  • state (State, required): The state configuration for the Bolt, managing its operational status.
  • **kwargs: Additional keyword arguments for extended functionality and model configurations.

generate(prompt, decoding_strategy='generate', **generation_params)

Generate text completion for the given prompt using the specified decoding strategy.

Parameters:
  • prompt (str, required): The prompt to generate text completion for.
  • decoding_strategy (str, default 'generate'): The decoding strategy to use.
  • **generation_params (Any): Additional parameters to pass to the decoding strategy.

Returns:
  • str: The generated text completion.

Raises:
  • Exception: If an error occurs during generation.

Supported decoding strategies and their additional parameters
  • "generate": Uses the model's default generation method. (Parameters: max_length, num_beams, etc.)
  • "greedy_search": Generates text using a greedy search decoding strategy. Parameters: max_length, eos_token_id, pad_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
  • "contrastive_search": Generates text using contrastive search decoding strategy. Parameters: top_k, penalty_alpha, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, sequential.
  • "sample": Generates text using a sampling decoding strategy. Parameters: do_sample, temperature, top_k, top_p, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
  • "beam_search": Generates text using beam search decoding strategy. Parameters: num_beams, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
  • "beam_sample": Generates text using beam search with sampling decoding strategy. Parameters: num_beams, temperature, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
  • "group_beam_search": Generates text using group beam search decoding strategy. Parameters: num_beams, diversity_penalty, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
  • "constrained_beam_search": Generates text using constrained beam search decoding strategy. Parameters: num_beams, max_length, constraints, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus.
All generation parameters
  • max_length: Maximum length the generated tokens can have
  • max_new_tokens: Maximum number of tokens to generate, ignoring prompt tokens
  • min_length: Minimum length of the sequence to be generated
  • min_new_tokens: Minimum number of tokens to generate, ignoring prompt tokens
  • early_stopping: Stopping condition for beam-based methods
  • max_time: Maximum time allowed for computation in seconds
  • do_sample: Whether to use sampling for generation
  • num_beams: Number of beams for beam search
  • num_beam_groups: Number of groups for beam search to ensure diversity
  • penalty_alpha: Balances model confidence and degeneration penalty in contrastive search
  • use_cache: Whether the model should use past key/values attentions to speed up decoding
  • temperature: Modulates next token probabilities
  • top_k: Number of highest probability tokens to keep for top-k-filtering
  • top_p: Smallest set of most probable tokens with cumulative probability >= top_p
  • typical_p: Probability mass for locally typical sampling; keeps the smallest set of most locally typical tokens whose probabilities sum to at least typical_p
  • epsilon_cutoff: Tokens with a conditional probability > epsilon_cutoff will be sampled
  • eta_cutoff: Eta sampling, a hybrid of locally typical sampling and epsilon sampling
  • diversity_penalty: Penalty subtracted from a beam's score if it generates a token same as any other group
  • repetition_penalty: Penalty applied to tokens that have already appeared in the text (1.0 means no penalty)
  • encoder_repetition_penalty: Penalty on sequences not in the original input
  • length_penalty: Exponential penalty to the length for beam-based generation
  • no_repeat_ngram_size: All ngrams of this size can only occur once
  • bad_words_ids: List of token ids that are not allowed to be generated
  • force_words_ids: List of token ids that must be generated
  • renormalize_logits: Renormalize the logits after applying all logits processors
  • constraints: Custom constraints for generation
  • forced_bos_token_id: Token ID to force as the first generated token
  • forced_eos_token_id: Token ID to force as the last generated token
  • remove_invalid_values: Remove possible NaN and inf outputs
  • exponential_decay_length_penalty: Exponentially increasing length penalty after a certain number of tokens
  • suppress_tokens: Tokens that will be suppressed during generation
  • begin_suppress_tokens: Tokens that will be suppressed at the beginning of generation
  • forced_decoder_ids: Mapping from generation indices to token indices that will be forced
  • sequence_bias: Maps a sequence of tokens to its bias term
  • guidance_scale: Guidance scale for classifier free guidance (CFG)
  • low_memory: Switch to sequential topk for contrastive search to reduce peak memory
  • num_return_sequences: Number of independently computed returned sequences for each batch element
  • output_attentions: Whether to return the attentions tensors of all layers
  • output_hidden_states: Whether to return the hidden states of all layers
  • output_scores: Whether to return the prediction scores
  • return_dict_in_generate: Whether to return a ModelOutput instead of a plain tuple
  • pad_token_id: The id of the padding token
  • bos_token_id: The id of the beginning-of-sequence token
  • eos_token_id: The id of the end-of-sequence token
  • synced_gpus: Whether to continue running the generation loop until max_length is reached (needed for DeepSpeed ZeRO stage 3)
  • sequential: Switch to sequential top-k hidden state computation in contrastive search to reduce peak memory
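
To make the strategy-to-parameter mapping concrete, here is a small sketch of generate() calls. It assumes a TextBulk instance named bulk whose model and tokenizer are already loaded; the prompt and parameter values are placeholders.

```python
# Assumes `bulk` is a TextBulk whose model/tokenizer were loaded via load_models().
prompt = "Summarize the plot of Hamlet in one sentence:"

# Default strategy: delegates to the model's generate() method.
text_default = bulk.generate(prompt=prompt, decoding_strategy="generate", max_new_tokens=64)

# Sampling: temperature, top_k and top_p control the randomness of the output.
text_sampled = bulk.generate(
    prompt=prompt,
    decoding_strategy="sample",
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    max_length=128,
)

# Beam search: num_beams trades latency for higher-likelihood completions.
text_beam = bulk.generate(
    prompt=prompt,
    decoding_strategy="beam_search",
    num_beams=4,
    max_length=128,
)
```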

load_models(model_name, tokenizer_name, model_revision=None, tokenizer_revision=None, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, better_transformers=False, **model_args)

Loads and configures the specified model and tokenizer for text generation. It ensures the models are optimized for inference.

Parameters:
  • model_name (str, required): The name or path of the model to load.
  • tokenizer_name (str, required): The name or path of the tokenizer to load.
  • model_revision (Optional[str], default None): The specific model revision to load (e.g., a commit hash).
  • tokenizer_revision (Optional[str], default None): The specific tokenizer revision to load (e.g., a commit hash).
  • model_class (str, default 'AutoModelForCausalLM'): The class of the model to be loaded.
  • tokenizer_class (str, default 'AutoTokenizer'): The class of the tokenizer to be loaded.
  • use_cuda (bool, default False): Flag to utilize CUDA for GPU acceleration.
  • precision (str, default 'float16'): The desired precision for computations ("float32", "float16", etc.).
  • quantization (int, default 0): The bit level for model quantization (0 for none, 8 for 8-bit quantization).
  • device_map (str | Dict | None, default 'auto'): The specific device(s) to use for model operations.
  • max_memory (Dict, default {0: '24GB'}): A dictionary defining the maximum memory to allocate for the model.
  • torchscript (bool, default False): Flag to enable TorchScript for model optimization.
  • compile (bool, default False): Flag to enable JIT compilation of the model.
  • awq_enabled (bool, default False): Flag to enable AWQ (Activation-aware Weight Quantization).
  • flash_attention (bool, default False): Flag to enable Flash Attention optimization for faster processing.
  • better_transformers (bool, default False): Flag to enable Better Transformers optimization for faster processing.
  • **model_args (Any): Additional arguments to pass to the model during its loading.

Returns:
  • Tuple[AutoModelForCausalLM, AutoTokenizer]: The loaded model and tokenizer ready for text generation.
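
A hedged sketch of a typical call follows; the model name and resource settings are placeholders, and only parameters documented above are used.

```python
# Illustrative call; model name, precision and memory map are placeholders.
model, tokenizer = bulk.load_models(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.1",
    use_cuda=True,
    precision="float16",
    quantization=0,            # 0 = no quantization, 8 = 8-bit
    device_map="auto",
    max_memory={0: "24GB"},
)
bulk.model, bulk.tokenizer = model, tokenizer  # keep them on the instance for generate()
```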

load_models_llama_cpp(model, filename, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, **kwargs)

Initializes and loads a LLaMA model with the llama.cpp backend, along with an optional tokenizer.

Parameters:
  • model (str, required): Hugging Face ID of the LLaMA model.
  • filename (Optional[str], required): A filename or glob pattern to match the model file in the repo.
  • local_dir (Optional[Union[str, os.PathLike[str]]], default None): The local directory to save the model to.
  • n_gpu_layers (int, default 0): Number of layers to offload to GPU.
  • split_mode (int, default llama_cpp.LLAMA_SPLIT_LAYER): Split mode for distributing the model across GPUs.
  • main_gpu (int, default 0): Main GPU index.
  • tensor_split (Optional[List[float]], default None): Tensor split configuration.
  • vocab_only (bool, default False): Whether to load the vocabulary only.
  • use_mmap (bool, default True): Use memory-mapped files for model loading.
  • use_mlock (bool, default False): Lock model data in RAM.
  • kv_overrides (Optional[Dict[str, Union[bool, int, float]]], default None): Key-value pairs for model overrides.
  • seed (int, default llama_cpp.LLAMA_DEFAULT_SEED): Random seed for initialization.
  • n_ctx (int, default 512): Number of context tokens.
  • n_batch (int, default 512): Batch size for processing prompts.
  • n_threads (Optional[int], default None): Number of threads for generation.
  • n_threads_batch (Optional[int], default None): Number of threads for batch processing.
  • rope_scaling_type (Optional[int], default llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED): RoPE scaling type.
  • rope_freq_base (float, default 0.0): Base frequency for RoPE.
  • rope_freq_scale (float, default 0.0): Frequency scaling for RoPE.
  • yarn_ext_factor (float, default -1.0): YaRN extrapolation mix factor.
  • yarn_attn_factor (float, default 1.0): YaRN attention factor.
  • yarn_beta_fast (float, default 32.0): YaRN beta fast parameter.
  • yarn_beta_slow (float, default 1.0): YaRN beta slow parameter.
  • yarn_orig_ctx (int, default 0): Original context size for YaRN.
  • mul_mat_q (bool, default True): Whether to use the mul_mat_q kernels.
  • logits_all (bool, default False): Return logits for all tokens.
  • embedding (bool, default False): Enable embedding-only mode.
  • offload_kqv (bool, default True): Offload K, Q, V matrices to GPU.
  • last_n_tokens_size (int, default 64): Size of the last_n_tokens buffer.
  • lora_base (Optional[str], default None): Base model path for LoRA.
  • lora_scale (float, default 1.0): Scale factor for LoRA adjustments.
  • lora_path (Optional[str], default None): Path to LoRA adjustments.
  • numa (Union[bool, int], default False): NUMA configuration.
  • chat_format (Optional[str], default None): Chat format configuration.
  • chat_handler (Optional[llama_cpp.LlamaChatCompletionHandler], default None): Handler for chat completions.
  • draft_model (Optional[llama_cpp.LlamaDraftModel], default None): Draft model for speculative decoding.
  • tokenizer (Optional[PreTrainedTokenizerBase], default None): Custom tokenizer instance.
  • verbose (bool, default True): Enable verbose logging.
  • **kwargs: Additional keyword arguments.

Returns:
  • Tuple[LlamaCPP, Optional[PreTrainedTokenizerBase]]: The loaded LLaMA model and tokenizer.
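
The sketch below illustrates one plausible call; the Hugging Face repository ID and the GGUF filename pattern are placeholders, not values from this page.

```python
# Illustrative call; repo ID, filename glob and layer/context sizes are placeholders.
llama, tokenizer = bulk.load_models_llama_cpp(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="*Q4_K_M.gguf",   # glob matched against model files in the repo
    n_gpu_layers=35,           # offload most layers to the GPU
    n_ctx=4096,                # context window in tokens
    n_batch=512,
    verbose=False,
)
```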

load_models_vllm(model, tokenizer, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', seed=42, revision=None, tokenizer_revision=None, max_model_len=1024, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, block_size=16, gpu_memory_utilization=0.9, swap_space=4, cache_dtype='auto', sliding_window=None, pipeline_parallel_size=1, tensor_parallel_size=1, worker_use_ray=False, max_parallel_loading_workers=None, disable_custom_all_reduce=False, max_num_batched_tokens=None, max_num_seqs=64, max_paddings=512, device='cuda', max_lora_rank=None, max_loras=None, max_cpu_loras=None, lora_dtype=None, lora_extra_vocab_size=0, placement_group=None, log_stats=False, batched_inference=False)

Initializes and loads a model using vLLM with the specified configuration parameters.

Parameters:
  • model (str, required): Name or path of the Hugging Face model to use.
  • tokenizer (str, required): Name or path of the Hugging Face tokenizer to use.
  • tokenizer_mode (str, default 'auto'): Tokenizer mode. "auto" uses the fast tokenizer if available; "slow" always uses the slow tokenizer.
  • trust_remote_code (bool, default True): Trust remote code (e.g., from Hugging Face) when downloading the model and tokenizer.
  • download_dir (Optional[str], default None): Directory to download and load the weights; defaults to the Hugging Face cache directory.
  • load_format (str, default 'auto'): The format of the model weights to load. Options include "auto", "pt", "safetensors", "npcache", "dummy".
  • dtype (Union[str, torch.dtype], default 'auto'): Data type for model weights and activations. Options include "auto", torch.float32, torch.float16, etc.
  • seed (int, default 42): Random seed for reproducibility.
  • revision (Optional[str], default None): The specific model version to use. Can be a branch name, a tag name, or a commit id.
  • code_revision (Optional[str]): The specific revision to use for the model code on the Hugging Face Hub.
  • tokenizer_revision (Optional[str], default None): The specific tokenizer version to use.
  • max_model_len (Optional[int], default 1024): Maximum length of a sequence (including prompt and output). If None, it is derived from the model.
  • quantization (Optional[str], default None): Quantization method that was used to quantize the model weights. If None, the weights are assumed to be unquantized.
  • enforce_eager (bool, default False): Whether to enforce eager execution. If True, disables CUDA graphs and always executes the model in eager mode.
  • max_context_len_to_capture (Optional[int], default 8192): Maximum context length covered by CUDA graphs. Larger contexts fall back to eager mode.
  • block_size (int, default 16): Size of a cache block in number of tokens.
  • gpu_memory_utilization (float, default 0.9): Fraction of GPU memory to use for vLLM execution.
  • swap_space (int, default 4): Size of the CPU swap space per GPU (in GiB).
  • cache_dtype (str, default 'auto'): Data type for KV cache storage.
  • sliding_window (Optional[int], default None): Configuration for the sliding window, if applicable.
  • pipeline_parallel_size (int, default 1): Number of pipeline parallel groups.
  • tensor_parallel_size (int, default 1): Number of tensor parallel groups.
  • worker_use_ray (bool, default False): Whether to use Ray for model workers. Required if either pipeline_parallel_size or tensor_parallel_size is greater than 1.
  • max_parallel_loading_workers (Optional[int], default None): Maximum number of workers for loading the model in parallel, to avoid RAM OOM.
  • disable_custom_all_reduce (bool, default False): Disable the custom all-reduce kernel and fall back to NCCL.
  • max_num_batched_tokens (Optional[int], default None): Maximum number of tokens to be processed in a single iteration.
  • max_num_seqs (int, default 64): Maximum number of sequences to be processed in a single iteration.
  • max_paddings (int, default 512): Maximum number of paddings to be added to a batch.
  • device (str, default 'cuda'): Device configuration, typically "cuda" or "cpu".
  • max_lora_rank (Optional[int], default None): Maximum rank for LoRA adjustments.
  • max_loras (Optional[int], default None): Maximum number of LoRA adjustments.
  • max_cpu_loras (Optional[int], default None): Maximum number of LoRA adjustments stored on CPU.
  • lora_dtype (Optional[torch.dtype], default None): Data type for LoRA parameters.
  • lora_extra_vocab_size (Optional[int], default 0): Additional vocabulary size for LoRA.
  • placement_group (Optional[PlacementGroup], default None): Ray placement group for distributed execution. Required for distributed execution.
  • log_stats (bool, default False): Whether to log statistics during model operation.

Returns:
  • AsyncLLMEngine | LLM: An instance of the LLM engine initialized with the given configurations.
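
A hedged sketch of a call follows; the model and tokenizer names and the parallelism settings are placeholders. The batched_inference flag is taken from the signature above; the comment on which engine it selects is an inference from the return type, not a documented guarantee.

```python
# Illustrative call; names and resource settings are placeholders.
engine = bulk.load_models_vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    tokenizer="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="auto",
    max_model_len=1024,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    batched_inference=True,  # presumably selects the offline LLM engine rather than AsyncLLMEngine
)
```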