Base Fine Tuner

Bases: TextBulk

A class representing a Hugging Face API for generating text using a pre-trained language model.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model | Any | The pre-trained language model. |
| tokenizer | Any | The tokenizer used to preprocess input text. |
| model_name | str | The name of the pre-trained language model. |
| model_revision | Optional[str] | The revision of the pre-trained language model. |
| tokenizer_name | str | The name of the tokenizer used to preprocess input text. |
| tokenizer_revision | Optional[str] | The revision of the tokenizer used to preprocess input text. |
| model_class | str | The name of the class of the pre-trained language model. |
| tokenizer_class | str | The name of the class of the tokenizer used to preprocess input text. |
| use_cuda | bool | Whether to use a GPU for inference. |
| quantization | int | The level of quantization to use for the pre-trained language model. |
| precision | str | The precision to use for the pre-trained language model. |
| device_map | Union[str, Dict, None] | The mapping of devices to use for inference. |
| max_memory | Dict[int, str] | The maximum memory to use for inference. |
| torchscript | bool | Whether to use a TorchScript-optimized version of the pre-trained language model. |
| model_args | Any | Additional arguments to pass to the pre-trained language model. |

Methods

text(**kwargs: Any) -> Dict[str, Any]: Generates text based on the given prompt and decoding strategy.

listen(model_name: str, model_class: str = "AutoModelForCausalLM", tokenizer_class: str = "AutoTokenizer", use_cuda: bool = False, precision: str = "float16", quantization: int = 0, device_map: str | Dict | None = "auto", max_memory={0: "24GB"}, torchscript: bool = False, endpoint: str = "*", port: int = 3000, cors_domain: str = "http://localhost:3000", username: Optional[str] = None, password: Optional[str] = None, **model_args: Any) -> None: Starts a CherryPy server to listen for requests to generate text.

__init__(input, output, state)

Initializes a new instance of the TextAPI class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | BatchInput | The input data to process. | required |
| output | BatchOutput | The output data to process. | required |
| state | State | The state of the API. | required |
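A minimal construction sketch follows; the import path and the constructor arguments for BatchInput, BatchOutput, and the state class are assumptions about the surrounding framework, not part of this page's documented API.

```python
# Minimal construction sketch. The import path and constructor arguments are
# assumptions; only the input/output/state keyword names come from the docs.
from geniusrise.core import BatchInput, BatchOutput, InMemoryState

batch_input = BatchInput("./input", "my-bucket", "input-prefix")      # local folder plus object-store location (assumed)
batch_output = BatchOutput("./output", "my-bucket", "output-prefix")
state = InMemoryState()

api = TextAPI(input=batch_input, output=batch_output, state=state)
```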

listen(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, concurrent_queries=False, use_vllm=False, use_llama_cpp=False, vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, llama_cpp_filename=None, llama_cpp_n_gpu_layers=0, llama_cpp_split_mode=llama_cpp.LLAMA_SPLIT_LAYER, llama_cpp_tensor_split=None, llama_cpp_vocab_only=False, llama_cpp_use_mmap=True, llama_cpp_use_mlock=False, llama_cpp_kv_overrides=None, llama_cpp_seed=llama_cpp.LLAMA_DEFAULT_SEED, llama_cpp_n_ctx=2048, llama_cpp_n_batch=512, llama_cpp_n_threads=None, llama_cpp_n_threads_batch=None, llama_cpp_rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, llama_cpp_rope_freq_base=0.0, llama_cpp_rope_freq_scale=0.0, llama_cpp_yarn_ext_factor=-1.0, llama_cpp_yarn_attn_factor=1.0, llama_cpp_yarn_beta_fast=32.0, llama_cpp_yarn_beta_slow=1.0, llama_cpp_yarn_orig_ctx=0, llama_cpp_mul_mat_q=True, llama_cpp_logits_all=False, llama_cpp_embedding=False, llama_cpp_offload_kqv=True, llama_cpp_last_n_tokens_size=64, llama_cpp_lora_base=None, llama_cpp_lora_scale=1.0, llama_cpp_lora_path=None, llama_cpp_numa=False, llama_cpp_chat_format=None, llama_cpp_draft_model=None, llama_cpp_verbose=True, endpoint='*', port=3000, cors_domain='http://localhost:3000', username=None, password=None, **model_args)

Starts a CherryPy server to listen for requests to generate text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Name or identifier of the pre-trained model to be used. | required |
| model_class | str | Class name of the model to be used from the transformers library. | 'AutoModelForCausalLM' |
| tokenizer_class | str | Class name of the tokenizer to be used from the transformers library. | 'AutoTokenizer' |
| use_cuda | bool | Flag to enable CUDA for GPU acceleration. | False |
| precision | str | Specifies the precision configuration for PyTorch tensors, e.g., "float16". | 'float16' |
| quantization | int | Level of model quantization to reduce model size and inference time. | 0 |
| device_map | Union[str, Dict, None] | Maps model layers to specific devices for distributed inference. | 'auto' |
| max_memory | Dict[int, str] | Maximum memory allocation for the model on each device. | {0: '24GB'} |
| torchscript | bool | Enables the use of TorchScript for model optimization. | False |
| compile | bool | Enables model compilation for further optimization. | False |
| awq_enabled | bool | Enables Activation-aware Weight Quantization (AWQ) for model optimization. | False |
| flash_attention | bool | Utilizes Flash Attention optimizations for faster processing. | False |
| concurrent_queries | bool | Allows the server to handle multiple requests concurrently if True. | False |
| use_vllm | bool | Flag to enable vLLM integration for accelerated language model serving. | False |
| use_llama_cpp | bool | Flag to use llama.cpp integration for language model inference. | False |
| llama_cpp_filename | Optional[str] | The filename of the model file for llama.cpp. | None |
| llama_cpp_n_gpu_layers | int | Number of layers to offload to GPU in llama.cpp configuration. | 0 |
| llama_cpp_split_mode | int | Defines how the model is split across multiple GPUs in llama.cpp. | llama_cpp.LLAMA_SPLIT_LAYER |
| llama_cpp_tensor_split | Optional[List[float]] | Custom tensor split configuration for llama.cpp. | None |
| llama_cpp_vocab_only | bool | Loads only the vocabulary part of the model in llama.cpp. | False |
| llama_cpp_use_mmap | bool | Enables memory-mapped files for model loading in llama.cpp. | True |
| llama_cpp_use_mlock | bool | Locks the model in RAM to prevent swapping in llama.cpp. | False |
| llama_cpp_kv_overrides | Optional[Dict[str, Union[bool, int, float]]] | Key-value pairs for overriding default llama.cpp model parameters. | None |
| llama_cpp_seed | int | Seed for random number generation in llama.cpp. | llama_cpp.LLAMA_DEFAULT_SEED |
| llama_cpp_n_ctx | int | The number of context tokens for the model in llama.cpp. | 2048 |
| llama_cpp_n_batch | int | Batch size for processing prompts in llama.cpp. | 512 |
| llama_cpp_n_threads | Optional[int] | Number of threads for generation in llama.cpp. | None |
| llama_cpp_n_threads_batch | Optional[int] | Number of threads for batch processing in llama.cpp. | None |
| llama_cpp_rope_scaling_type | Optional[int] | Specifies the RoPE (Rotary Positional Embeddings) scaling type in llama.cpp. | llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED |
| llama_cpp_rope_freq_base | float | Base frequency for RoPE in llama.cpp. | 0.0 |
| llama_cpp_rope_freq_scale | float | Frequency scaling factor for RoPE in llama.cpp. | 0.0 |
| llama_cpp_yarn_ext_factor | float | Extrapolation mix factor for YaRN in llama.cpp. | -1.0 |
| llama_cpp_yarn_attn_factor | float | Attention factor for YaRN in llama.cpp. | 1.0 |
| llama_cpp_yarn_beta_fast | float | Beta fast parameter for YaRN in llama.cpp. | 32.0 |
| llama_cpp_yarn_beta_slow | float | Beta slow parameter for YaRN in llama.cpp. | 1.0 |
| llama_cpp_yarn_orig_ctx | int | Original context size for YaRN in llama.cpp. | 0 |
| llama_cpp_mul_mat_q | bool | Flag to use the mul_mat_q (quantized matrix multiplication) kernels in llama.cpp. | True |
| llama_cpp_logits_all | bool | Returns logits for all tokens when set to True in llama.cpp. | False |
| llama_cpp_embedding | bool | Enables embedding-only mode in llama.cpp. | False |
| llama_cpp_offload_kqv | bool | Offloads K, Q, V matrices to GPU in llama.cpp. | True |
| llama_cpp_last_n_tokens_size | int | Size of the last_n_tokens buffer in llama.cpp. | 64 |
| llama_cpp_lora_base | Optional[str] | Base model path for LoRA adjustments in llama.cpp. | None |
| llama_cpp_lora_scale | float | Scale factor for LoRA adjustments in llama.cpp. | 1.0 |
| llama_cpp_lora_path | Optional[str] | Path to the LoRA adjustments file in llama.cpp. | None |
| llama_cpp_numa | Union[bool, int] | NUMA configuration for llama.cpp. | False |
| llama_cpp_chat_format | Optional[str] | Specifies the chat format for llama.cpp. | None |
| llama_cpp_draft_model | Optional[llama_cpp.LlamaDraftModel] | Draft model for speculative decoding in llama.cpp. | None |
| endpoint | str | Network interface to bind the server to. | '*' |
| port | int | Port number to listen on for incoming requests. | 3000 |
| cors_domain | str | Specifies the domain to allow for Cross-Origin Resource Sharing (CORS). | 'http://localhost:3000' |
| username | Optional[str] | Username for basic authentication, if required. | None |
| password | Optional[str] | Password for basic authentication, if required. | None |
| **model_args | Any | Additional arguments to pass to the pre-trained language model or llama.cpp configuration. | {} |
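With an instance in hand (the `api` object from the construction sketch above), starting the server might look like the sketch below; the keyword names mirror the signature documented above, while the model name and resource values are illustrative.

```python
# Usage sketch: start the CherryPy text-generation server with a
# transformers-backed model. Model name and memory settings are illustrative.
api.listen(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="float16",
    device_map="auto",
    max_memory={0: "24GB"},
    endpoint="*",
    port=3000,
    cors_domain="http://localhost:3000",
    username="admin",    # together with password, enables basic authentication
    password="s3cr3t",
)
```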

text(**kwargs)

Generates text based on the given prompt and decoding strategy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| **kwargs | Any | Additional arguments to pass to the pre-trained language model. | {} |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dictionary containing the prompt, arguments, and generated text. |
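For illustration, a direct call might look like the following; the `prompt` and `decoding_strategy` keyword names are assumptions inferred from the description above, since text() simply forwards its keyword arguments to the model.

```python
# Illustrative call; the keyword names here are assumptions, as text() accepts
# arbitrary **kwargs. Per the docs, the result is a dict containing the prompt,
# the arguments, and the generated text.
result = api.text(
    prompt="Write a haiku about distributed inference.",
    decoding_strategy="generate",
    max_new_tokens=64,
)
print(result)
```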

validate_password(realm, username, password)

Validate the username and password against expected values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| realm | str | The authentication realm. | required |
| username | str | The provided username. | required |
| password | str | The provided password. | required |

Returns:

| Type | Description |
| --- | --- |
| bool | True if credentials are valid, False otherwise. |
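A behavioural sketch, assuming the server was started with username and password set; the realm string and credentials are illustrative values.

```python
# Illustrative check: with listen(..., username="admin", password="s3cr3t"),
# only that exact credential pair should validate.
assert api.validate_password("text-api", "admin", "s3cr3t") is True
assert api.validate_password("text-api", "admin", "wrong") is False
```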