Base Fine Tuner¶
Bases: TextBulk
A class representing an API for generating text with pre-trained Hugging Face language models.
Attributes:

Name | Type | Description |
---|---|---|
`model` | `Any` | The pre-trained language model. |
`tokenizer` | `Any` | The tokenizer used to preprocess input text. |
`model_name` | `str` | The name of the pre-trained language model. |
`model_revision` | `Optional[str]` | The revision of the pre-trained language model. |
`tokenizer_name` | `str` | The name of the tokenizer used to preprocess input text. |
`tokenizer_revision` | `Optional[str]` | The revision of the tokenizer used to preprocess input text. |
`model_class` | `str` | The class name of the pre-trained language model. |
`tokenizer_class` | `str` | The class name of the tokenizer used to preprocess input text. |
`use_cuda` | `bool` | Whether to use a GPU for inference. |
`quantization` | `int` | The level of quantization to use for the pre-trained language model. |
`precision` | `str` | The precision to use for the pre-trained language model. |
`device_map` | `Union[str, Dict, None]` | The mapping of devices to use for inference. |
`max_memory` | `Dict[int, str]` | The maximum memory to use for inference. |
`torchscript` | `bool` | Whether to use a TorchScript-optimized version of the pre-trained language model. |
`model_args` | `Any` | Additional arguments to pass to the pre-trained language model. |
Methods:

- `text(**kwargs: Any) -> Dict[str, Any]`: Generates text based on the given prompt and decoding strategy.
- `listen(model_name: str, model_class: str = "AutoModelForCausalLM", tokenizer_class: str = "AutoTokenizer", use_cuda: bool = False, precision: str = "float16", quantization: int = 0, device_map: str | Dict | None = "auto", max_memory={0: "24GB"}, torchscript: bool = False, endpoint: str = "*", port: int = 3000, cors_domain: str = "http://localhost:3000", username: Optional[str] = None, password: Optional[str] = None, **model_args: Any) -> None`: Starts a CherryPy server to listen for requests to generate text.
__init__(input, output, state)¶
Initializes a new instance of the TextAPI class.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`input` | `BatchInput` | The input data to process. | required |
`output` | `BatchOutput` | The output data to process. | required |
`state` | `State` | The state of the API. | required |
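For orientation, a minimal construction sketch follows. The `InMemoryState` class and the `BatchInput`/`BatchOutput` constructor arguments are assumptions about the geniusrise core API, not something this reference specifies; `TextAPI` is the class name given in the docstring above.

```python
# A minimal sketch, not a verified recipe: the BatchInput/BatchOutput argument
# order and the InMemoryState class are assumptions about the geniusrise core API.
from geniusrise import BatchInput, BatchOutput, InMemoryState

input = BatchInput("./input", "my-bucket", "api/input")      # hypothetical folders
output = BatchOutput("./output", "my-bucket", "api/output")  # hypothetical folders
state = InMemoryState()

api = TextAPI(input=input, output=output, state=state)
```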
listen(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, concurrent_queries=False, use_vllm=False, use_llama_cpp=False, vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, llama_cpp_filename=None, llama_cpp_n_gpu_layers=0, llama_cpp_split_mode=llama_cpp.LLAMA_SPLIT_LAYER, llama_cpp_tensor_split=None, llama_cpp_vocab_only=False, llama_cpp_use_mmap=True, llama_cpp_use_mlock=False, llama_cpp_kv_overrides=None, llama_cpp_seed=llama_cpp.LLAMA_DEFAULT_SEED, llama_cpp_n_ctx=2048, llama_cpp_n_batch=512, llama_cpp_n_threads=None, llama_cpp_n_threads_batch=None, llama_cpp_rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, llama_cpp_rope_freq_base=0.0, llama_cpp_rope_freq_scale=0.0, llama_cpp_yarn_ext_factor=-1.0, llama_cpp_yarn_attn_factor=1.0, llama_cpp_yarn_beta_fast=32.0, llama_cpp_yarn_beta_slow=1.0, llama_cpp_yarn_orig_ctx=0, llama_cpp_mul_mat_q=True, llama_cpp_logits_all=False, llama_cpp_embedding=False, llama_cpp_offload_kqv=True, llama_cpp_last_n_tokens_size=64, llama_cpp_lora_base=None, llama_cpp_lora_scale=1.0, llama_cpp_lora_path=None, llama_cpp_numa=False, llama_cpp_chat_format=None, llama_cpp_draft_model=None, llama_cpp_verbose=True, endpoint='*', port=3000, cors_domain='http://localhost:3000', username=None, password=None, **model_args)¶
Starts a CherryPy server to listen for requests to generate text.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model_name` | `str` | Name or identifier of the pre-trained model to be used. | required |
`model_class` | `str` | Class name of the model to be used from the transformers library. | `'AutoModelForCausalLM'` |
`tokenizer_class` | `str` | Class name of the tokenizer to be used from the transformers library. | `'AutoTokenizer'` |
`use_cuda` | `bool` | Flag to enable CUDA for GPU acceleration. | `False` |
`precision` | `str` | Specifies the precision configuration for PyTorch tensors, e.g., "float16". | `'float16'` |
`quantization` | `int` | Level of model quantization to reduce model size and inference time. | `0` |
`device_map` | `Union[str, Dict, None]` | Maps model layers to specific devices for distributed inference. | `'auto'` |
`max_memory` | `Dict[int, str]` | Maximum memory allocation for the model on each device. | `{0: '24GB'}` |
`torchscript` | `bool` | Enables the use of TorchScript for model optimization. | `False` |
`compile` | `bool` | Enables model compilation for further optimization. | `False` |
`awq_enabled` | `bool` | Enables Activation-aware Weight Quantization (AWQ) for model optimization. | `False` |
`flash_attention` | `bool` | Utilizes Flash Attention optimizations for faster processing. | `False` |
`concurrent_queries` | `bool` | Allows the server to handle multiple requests concurrently if True. | `False` |
`use_vllm` | `bool` | Flag to enable the vLLM inference engine integration. | `False` |
`use_llama_cpp` | `bool` | Flag to use llama.cpp integration for language model inference. | `False` |
`llama_cpp_filename` | `Optional[str]` | The filename of the model file for llama.cpp. | `None` |
`llama_cpp_n_gpu_layers` | `int` | Number of layers to offload to GPU in llama.cpp configuration. | `0` |
`llama_cpp_split_mode` | `int` | Defines how the model is split across multiple GPUs in llama.cpp. | `llama_cpp.LLAMA_SPLIT_LAYER` |
`llama_cpp_tensor_split` | `Optional[List[float]]` | Custom tensor split configuration for llama.cpp. | `None` |
`llama_cpp_vocab_only` | `bool` | Loads only the vocabulary part of the model in llama.cpp. | `False` |
`llama_cpp_use_mmap` | `bool` | Enables memory-mapped files for model loading in llama.cpp. | `True` |
`llama_cpp_use_mlock` | `bool` | Locks the model in RAM to prevent swapping in llama.cpp. | `False` |
`llama_cpp_kv_overrides` | `Optional[Dict[str, Union[bool, int, float]]]` | Key-value pairs for overriding default llama.cpp model parameters. | `None` |
`llama_cpp_seed` | `int` | Seed for random number generation in llama.cpp. | `llama_cpp.LLAMA_DEFAULT_SEED` |
`llama_cpp_n_ctx` | `int` | The number of context tokens for the model in llama.cpp. | `2048` |
`llama_cpp_n_batch` | `int` | Batch size for processing prompts in llama.cpp. | `512` |
`llama_cpp_n_threads` | `Optional[int]` | Number of threads for generation in llama.cpp. | `None` |
`llama_cpp_n_threads_batch` | `Optional[int]` | Number of threads for batch processing in llama.cpp. | `None` |
`llama_cpp_rope_scaling_type` | `Optional[int]` | Specifies the RoPE (Rotary Positional Embeddings) scaling type in llama.cpp. | `llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED` |
`llama_cpp_rope_freq_base` | `float` | Base frequency for RoPE in llama.cpp. | `0.0` |
`llama_cpp_rope_freq_scale` | `float` | Frequency scaling factor for RoPE in llama.cpp. | `0.0` |
`llama_cpp_yarn_ext_factor` | `float` | Extrapolation mix factor for YaRN in llama.cpp. | `-1.0` |
`llama_cpp_yarn_attn_factor` | `float` | Attention factor for YaRN in llama.cpp. | `1.0` |
`llama_cpp_yarn_beta_fast` | `float` | Beta fast parameter for YaRN in llama.cpp. | `32.0` |
`llama_cpp_yarn_beta_slow` | `float` | Beta slow parameter for YaRN in llama.cpp. | `1.0` |
`llama_cpp_yarn_orig_ctx` | `int` | Original context size for YaRN in llama.cpp. | `0` |
`llama_cpp_mul_mat_q` | `bool` | Flag to enable the mul_mat_q (quantized matrix multiplication) kernels in llama.cpp. | `True` |
`llama_cpp_logits_all` | `bool` | Returns logits for all tokens when set to True in llama.cpp. | `False` |
`llama_cpp_embedding` | `bool` | Enables embedding-only mode in llama.cpp. | `False` |
`llama_cpp_offload_kqv` | `bool` | Offloads K, Q, V matrices to GPU in llama.cpp. | `True` |
`llama_cpp_last_n_tokens_size` | `int` | Size of the last_n_tokens buffer in llama.cpp. | `64` |
`llama_cpp_lora_base` | `Optional[str]` | Base model path for LoRA adjustments in llama.cpp. | `None` |
`llama_cpp_lora_scale` | `float` | Scale factor for LoRA adjustments in llama.cpp. | `1.0` |
`llama_cpp_lora_path` | `Optional[str]` | Path to the LoRA adjustments file in llama.cpp. | `None` |
`llama_cpp_numa` | `Union[bool, int]` | NUMA configuration for llama.cpp. | `False` |
`llama_cpp_chat_format` | `Optional[str]` | Specifies the chat format for llama.cpp. | `None` |
`llama_cpp_draft_model` | `Optional[llama_cpp.LlamaDraftModel]` | Draft model for speculative decoding in llama.cpp. | `None` |
`endpoint` | `str` | Network interface to bind the server to. | `'*'` |
`port` | `int` | Port number to listen on for incoming requests. | `3000` |
`cors_domain` | `str` | Specifies the domain to allow for Cross-Origin Resource Sharing (CORS). | `'http://localhost:3000'` |
`username` | `Optional[str]` | Username for basic authentication, if required. | `None` |
`password` | `Optional[str]` | Password for basic authentication, if required. | `None` |
`**model_args` | `Any` | Additional arguments to pass to the pre-trained language model or llama.cpp configuration. | `{}` |
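A usage sketch, continuing the construction example above. Every value is illustrative; only the keyword names come from the signature above, and the model id is an example, not prescribed by this reference.

```python
# A hedged sketch of starting the server with a Hugging Face model.
api.listen(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",  # example model id, not prescribed
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="float16",
    device_map="auto",
    max_memory={0: "24GB"},
    endpoint="*",
    port=3000,
    username="admin",   # together with password, enables basic auth
    password="s3cr3t",
)
```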
text(**kwargs)¶
Generates text based on the given prompt and decoding strategy.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`**kwargs` | `Any` | Additional arguments to pass to the pre-trained language model. | `{}` |

Returns:

Type | Description |
---|---|
`Dict[str, Any]` | A dictionary containing the prompt, arguments, and generated text. |
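A hedged client-side sketch of calling this method over HTTP. The route path and JSON fields below are assumptions about typical generation kwargs, since this reference documents only `**kwargs`; only the response shape (prompt, arguments, generated text) is stated by the docstring.

```python
# Assumed route and payload fields, shown for illustration only.
import requests

response = requests.post(
    "http://localhost:3000/api/v1/text",  # hypothetical route path
    json={"prompt": "Once upon a time", "max_new_tokens": 64},  # assumed kwargs
    auth=("admin", "s3cr3t"),  # only when basic auth is enabled
)
print(response.json())
```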
validate_password(realm, username, password)¶
Validate the username and password against expected values.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`realm` | `str` | The authentication realm. | required |
`username` | `str` | The provided username. | required |
`password` | `str` | The provided password. | required |

Returns:

Type | Description |
---|---|
`bool` | True if credentials are valid, False otherwise. |
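This method matches the `checkpassword` signature expected by CherryPy's `auth_basic` tool. A sketch of how such a checker is typically wired into a CherryPy config; the wiring itself is illustrative, as `listen()` configures this internally when `username`/`password` are set.

```python
# Illustrative wiring only; the config keys are standard cherrypy.lib.auth_basic,
# and the realm string here is an assumption.
import cherrypy

conf = {
    "/": {
        "tools.auth_basic.on": True,
        "tools.auth_basic.realm": "geniusrise",
        "tools.auth_basic.checkpassword": api.validate_password,
    }
}
# cherrypy.quickstart(app, "/", conf)  # attach the checker to an application
```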