Base Fine Tuner

Bases: TextBulk

A class representing a Hugging Face API for generating text using a pre-trained language model.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model | Any | The pre-trained language model. |
| tokenizer | Any | The tokenizer used to preprocess input text. |
| model_name | str | The name of the pre-trained language model. |
| model_revision | Optional[str] | The revision of the pre-trained language model. |
| tokenizer_name | str | The name of the tokenizer used to preprocess input text. |
| tokenizer_revision | Optional[str] | The revision of the tokenizer used to preprocess input text. |
| model_class | str | The name of the class of the pre-trained language model. |
| tokenizer_class | str | The name of the class of the tokenizer used to preprocess input text. |
| use_cuda | bool | Whether to use a GPU for inference. |
| quantization | int | The level of quantization to use for the pre-trained language model. |
| precision | str | The precision to use for the pre-trained language model. |
| device_map | Union[str, Dict, None] | The mapping of devices to use for inference. |
| max_memory | Dict[int, str] | The maximum memory to use for inference. |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. |
| model_args | Any | Additional arguments to pass to the pre-trained language model. |

Methods

text(**kwargs: Any) -> Dict[str, Any]: Generates text based on the given prompt and decoding strategy.

listen(model_name: str, model_class: str = "AutoModelForCausalLM", tokenizer_class: str = "AutoTokenizer", use_cuda: bool = False, precision: str = "float16", quantization: int = 0, device_map: str | Dict | None = "auto", max_memory={0: "24GB"}, torchscript: bool = False, endpoint: str = "*", port: int = 3000, cors_domain: str = "http://localhost:3000", username: Optional[str] = None, password: Optional[str] = None, **model_args: Any) -> None: Starts a CherryPy server to listen for requests to generate text.

__init__(input, output, state)

Initializes a new instance of the TextAPI class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | BatchInput | The input data to process. | required |
| output | BatchOutput | The output data to process. | required |
| state | State | The state of the API. | required |
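A minimal construction sketch follows; the import path and the constructor arguments for BatchInput, BatchOutput, and the state class are assumptions about the surrounding framework, not part of this page's documented API.

```python
# Minimal construction sketch. The import path and constructor arguments are
# assumptions; only the input/output/state keyword names come from the docs.
from geniusrise.core import BatchInput, BatchOutput, InMemoryState

batch_input = BatchInput("./input", "my-bucket", "input-prefix")      # local folder plus object-store location (assumed)
batch_output = BatchOutput("./output", "my-bucket", "output-prefix")
state = InMemoryState()

api = TextAPI(input=batch_input, output=batch_output, state=state)
```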

listen(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, concurrent_queries=False, use_vllm=False, use_llama_cpp=False, vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, llama_cpp_filename=None, llama_cpp_n_gpu_layers=0, llama_cpp_split_mode=llama_cpp.LLAMA_SPLIT_LAYER, llama_cpp_tensor_split=None, llama_cpp_vocab_only=False, llama_cpp_use_mmap=True, llama_cpp_use_mlock=False, llama_cpp_kv_overrides=None, llama_cpp_seed=llama_cpp.LLAMA_DEFAULT_SEED, llama_cpp_n_ctx=2048, llama_cpp_n_batch=512, llama_cpp_n_threads=None, llama_cpp_n_threads_batch=None, llama_cpp_rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, llama_cpp_rope_freq_base=0.0, llama_cpp_rope_freq_scale=0.0, llama_cpp_yarn_ext_factor=-1.0, llama_cpp_yarn_attn_factor=1.0, llama_cpp_yarn_beta_fast=32.0, llama_cpp_yarn_beta_slow=1.0, llama_cpp_yarn_orig_ctx=0, llama_cpp_mul_mat_q=True, llama_cpp_logits_all=False, llama_cpp_embedding=False, llama_cpp_offload_kqv=True, llama_cpp_last_n_tokens_size=64, llama_cpp_lora_base=None, llama_cpp_lora_scale=1.0, llama_cpp_lora_path=None, llama_cpp_numa=False, llama_cpp_chat_format=None, llama_cpp_draft_model=None, llama_cpp_verbose=True, endpoint='*', port=3000, cors_domain='http://localhost:3000', username=None, password=None, **model_args)

Starts a CherryPy server to listen for requests to generate text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Name or identifier of the pre-trained model to be used. | required |
| model_class | str | Class name of the model to be used from the transformers library. | 'AutoModelForCausalLM' |
| tokenizer_class | str | Class name of the tokenizer to be used from the transformers library. | 'AutoTokenizer' |
| use_cuda | bool | Flag to enable CUDA for GPU acceleration. | False |
| precision | str | Specifies the precision configuration for PyTorch tensors, e.g., "float16". | 'float16' |
| quantization | int | Level of model quantization to reduce model size and inference time. | 0 |
| device_map | Union[str, Dict, None] | Maps model layers to specific devices for distributed inference. | 'auto' |
| max_memory | Dict[int, str] | Maximum memory allocation for the model on each device. | {0: '24GB'} |
| torchscript | bool | Enables the use of TorchScript for model optimization. | False |
| compile | bool | Enables model compilation for further optimization. | False |
| awq_enabled | bool | Enables Activation-aware Weight Quantization (AWQ) for model optimization. | False |
| flash_attention | bool | Utilizes Flash Attention optimizations for faster processing. | False |
| concurrent_queries | bool | Allows the server to handle multiple requests concurrently if True. | False |
| use_vllm | bool | Flag to enable vLLM integration for accelerated language model serving. | False |
| use_llama_cpp | bool | Flag to use llama.cpp integration for language model inference. | False |
| llama_cpp_filename | Optional[str] | The filename of the model file for llama.cpp. | None |
| llama_cpp_n_gpu_layers | int | Number of layers to offload to GPU in llama.cpp configuration. | 0 |
| llama_cpp_split_mode | int | Defines how the model is split across multiple GPUs in llama.cpp. | llama_cpp.LLAMA_SPLIT_LAYER |
| llama_cpp_tensor_split | Optional[List[float]] | Custom tensor split configuration for llama.cpp. | None |
| llama_cpp_vocab_only | bool | Loads only the vocabulary part of the model in llama.cpp. | False |
| llama_cpp_use_mmap | bool | Enables memory-mapped files for model loading in llama.cpp. | True |
| llama_cpp_use_mlock | bool | Locks the model in RAM to prevent swapping in llama.cpp. | False |
| llama_cpp_kv_overrides | Optional[Dict[str, Union[bool, int, float]]] | Key-value pairs for overriding default llama.cpp model parameters. | None |
| llama_cpp_seed | int | Seed for random number generation in llama.cpp. | llama_cpp.LLAMA_DEFAULT_SEED |
| llama_cpp_n_ctx | int | The number of context tokens for the model in llama.cpp. | 2048 |
| llama_cpp_n_batch | int | Batch size for processing prompts in llama.cpp. | 512 |
| llama_cpp_n_threads | Optional[int] | Number of threads for generation in llama.cpp. | None |
| llama_cpp_n_threads_batch | Optional[int] | Number of threads for batch processing in llama.cpp. | None |
| llama_cpp_rope_scaling_type | Optional[int] | Specifies the RoPE (Rotary Positional Embeddings) scaling type in llama.cpp. | llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED |
| llama_cpp_rope_freq_base | float | Base frequency for RoPE in llama.cpp. | 0.0 |
| llama_cpp_rope_freq_scale | float | Frequency scaling factor for RoPE in llama.cpp. | 0.0 |
| llama_cpp_yarn_ext_factor | float | Extrapolation mix factor for YaRN in llama.cpp. | -1.0 |
| llama_cpp_yarn_attn_factor | float | Attention factor for YaRN in llama.cpp. | 1.0 |
| llama_cpp_yarn_beta_fast | float | Beta fast parameter for YaRN in llama.cpp. | 32.0 |
| llama_cpp_yarn_beta_slow | float | Beta slow parameter for YaRN in llama.cpp. | 1.0 |
| llama_cpp_yarn_orig_ctx | int | Original context size for YaRN in llama.cpp. | 0 |
| llama_cpp_mul_mat_q | bool | Flag to use the mul_mat_q (quantized matrix multiplication) kernels in llama.cpp. | True |
| llama_cpp_logits_all | bool | Returns logits for all tokens when set to True in llama.cpp. | False |
| llama_cpp_embedding | bool | Enables embedding-only mode in llama.cpp. | False |
| llama_cpp_offload_kqv | bool | Offloads K, Q, V matrices to GPU in llama.cpp. | True |
| llama_cpp_last_n_tokens_size | int | Size of the last_n_tokens buffer in llama.cpp. | 64 |
| llama_cpp_lora_base | Optional[str] | Base model path for LoRA adjustments in llama.cpp. | None |
| llama_cpp_lora_scale | float | Scale factor for LoRA adjustments in llama.cpp. | 1.0 |
| llama_cpp_lora_path | Optional[str] | Path to the LoRA adjustments file in llama.cpp. | None |
| llama_cpp_numa | Union[bool, int] | NUMA configuration for llama.cpp. | False |
| llama_cpp_chat_format | Optional[str] | Specifies the chat format for llama.cpp. | None |
| llama_cpp_draft_model | Optional[llama_cpp.LlamaDraftModel] | Draft model for speculative decoding in llama.cpp. | None |
| endpoint | str | Network interface to bind the server to. | '*' |
| port | int | Port number to listen on for incoming requests. | 3000 |
| cors_domain | str | Specifies the domain to allow for Cross-Origin Resource Sharing (CORS). | 'http://localhost:3000' |
| username | Optional[str] | Username for basic authentication, if required. | None |
| password | Optional[str] | Password for basic authentication, if required. | None |
| **model_args | Any | Additional arguments to pass to the pre-trained language model or llama.cpp configuration. | {} |
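With an instance in hand (the `api` object from the construction sketch above), starting the server might look like the sketch below; the keyword names mirror the signature documented above, while the model name and resource values are illustrative.

```python
# Usage sketch: start the CherryPy text-generation server with a
# transformers-backed model. Model name and memory settings are illustrative.
api.listen(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="float16",
    device_map="auto",
    max_memory={0: "24GB"},
    endpoint="*",
    port=3000,
    cors_domain="http://localhost:3000",
    username="admin",    # together with password, enables basic authentication
    password="s3cr3t",
)
```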

text(**kwargs)

Generates text based on the given prompt and decoding strategy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| **kwargs | Any | Additional arguments to pass to the pre-trained language model. | {} |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dictionary containing the prompt, arguments, and generated text. |
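For illustration, a direct call might look like the following; the `prompt` and `decoding_strategy` keyword names are assumptions inferred from the description above, since text() simply forwards its keyword arguments to the model.

```python
# Illustrative call; the keyword names here are assumptions, as text() accepts
# arbitrary **kwargs. Per the docs, the result is a dict containing the prompt,
# the arguments, and the generated text.
result = api.text(
    prompt="Write a haiku about distributed inference.",
    decoding_strategy="generate",
    max_new_tokens=64,
)
print(result)
```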

validate_password(realm, username, password)

Validate the username and password against expected values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| realm | str | The authentication realm. | required |
| username | str | The provided username. | required |
| password | str | The provided password. | required |

Returns:

| Type | Description |
| --- | --- |
| bool | True if credentials are valid, False otherwise. |
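A behavioural sketch, assuming the server was started with username and password set; the realm string and credentials are illustrative values.

```python
# Illustrative check: with listen(..., username="admin", password="s3cr3t"),
# only that exact credential pair should validate.
assert api.validate_password("text-api", "admin", "s3cr3t") is True
assert api.validate_password("text-api", "admin", "wrong") is False
```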