Instruction Tuning
Bases: TextAPI
InstructionAPI is designed for generating text based on prompts using instruction-tuned language models. It serves as an interface to Hugging Face's pre-trained instruction-tuned models, providing a flexible API for various text generation tasks. It can be used in scenarios ranging from generating creative content to providing instructions or answers based on the prompts.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model | Any | The loaded instruction-tuned language model. |
| tokenizer | Any | The tokenizer for processing text suitable for the model. |
Methods

- complete(**kwargs: Any) -> Dict[str, Any]: Generates text based on the given prompt and decoding strategy.
- listen(**model_args: Any) -> None: Starts a server to listen for text generation requests.
CLI Usage Example:
```bash
genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="TheBloke/Mistral-7B-OpenOrca-AWQ" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            awq_enabled=True \
            flash_attention=True \
            endpoint="*" \
            port=3001 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"
```
Or using VLLM:
```bash
genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    listen \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            use_vllm=True \
            vllm_enforce_eager=True \
            vllm_max_model_len=1024 \
            concurrent_queries=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"
```
Or using llama.cpp:

```bash
genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            use_llama_cpp=True \
            llama_cpp_filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
            llama_cpp_n_gpu_layers=35 \
            llama_cpp_n_ctx=32768 \
            concurrent_queries=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"
```
__init__(input, output, state, **kwargs)
Initializes a new instance of the InstructionAPI class, setting up the necessary configurations for input, output, and state.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input | BatchInput | Configuration for the input data. | required |
| output | BatchOutput | Configuration for the output data. | required |
| state | State | The state of the API. | required |
| **kwargs | Any | Additional keyword arguments for extended functionality. | {} |
chat(**kwargs)
Handles chat interaction using the Hugging Face pipeline. This method enables conversational text generation, simulating a chat-like interaction based on user and system prompts.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| **kwargs | Any | Arbitrary keyword arguments containing 'user_prompt' and 'system_prompt'. | {} |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dictionary containing the user prompt, system prompt, and chat interaction results. |
Example CURL Request for chat interaction:
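A minimal sketch, assuming the chat route is exposed at /api/v1/chat (following the pattern of the other endpoints) and the server from the first CLI example above (port 3001); the prompts are illustrative only:

```bash
# Hypothetical request against an assumed /api/v1/chat route;
# 'user_prompt' and 'system_prompt' are the keyword arguments this method expects.
curl -X POST "http://localhost:3001/api/v1/chat" \
    -H "Content-Type: application/json" \
    -d '{
        "system_prompt": "You are a helpful assistant.",
        "user_prompt": "Write a haiku about autumn."
    }' | jq
```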
chat_llama_cpp(**kwargs)
Handles POST requests to generate chat completions using the llama.cpp engine. This method accepts various parameters for customizing the chat completion request, including messages, sampling settings, and more.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| messages | List[Dict[str, str]] | The chat messages for generating a response. | required |
| functions | Optional[List[Dict]] | A list of functions to use for the chat completion (advanced usage). | required |
| function_call | Optional[Dict] | A function call to use for the chat completion (advanced usage). | required |
| tools | Optional[List[Dict]] | A list of tools to use for the chat completion (advanced usage). | required |
| tool_choice | Optional[Dict] | A tool choice option for the chat completion (advanced usage). | required |
| temperature | float | The temperature to use for sampling, controlling randomness. | required |
| top_p | float | The nucleus sampling's top-p parameter, controlling diversity. | required |
| top_k | int | The top-k sampling parameter, limiting the token selection pool. | required |
| min_p | float | The minimum probability threshold for sampling. | required |
| typical_p | float | The typical-p parameter for locally typical sampling. | required |
| stream | bool | Flag to stream the results. | required |
| stop | Optional[Union[str, List[str]]] | Tokens or sequences where generation should stop. | required |
| seed | Optional[int] | Seed for random number generation to ensure reproducibility. | required |
| response_format | Optional[Dict] | Specifies the format of the generated response. | required |
| max_tokens | Optional[int] | Maximum number of tokens to generate. | required |
| presence_penalty | float | Penalty for token presence to discourage repetition. | required |
| frequency_penalty | float | Penalty for token frequency to discourage common tokens. | required |
| repeat_penalty | float | Penalty applied to tokens that are repeated. | required |
| tfs_z | float | Tail-free sampling parameter to adjust the likelihood of tail tokens. | required |
| mirostat_mode | int | Mirostat sampling mode for dynamic adjustments. | required |
| mirostat_tau | float | Tau parameter for mirostat sampling, controlling deviation. | required |
| mirostat_eta | float | Eta parameter for mirostat sampling, controlling adjustment speed. | required |
| model | Optional[str] | Specifies the model to use for generation. | required |
| logits_processor | Optional[List] | List of logits processors for advanced generation control. | required |
| grammar | Optional[Dict] | Specifies grammar rules for the generated text. | required |
| logit_bias | Optional[Dict[str, float]] | Adjustments to the logits of specified tokens. | required |
| logprobs | Optional[bool] | Whether to include log probabilities in the output. | required |
| top_logprobs | Optional[int] | Number of top log probabilities to include. | required |
Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dictionary containing the chat completion response or an error message. |
Example CURL Request:
```bash
curl -X POST "http://localhost:3000/api/v1/chat_llama_cpp" -H "Content-Type: application/json" -d '{
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "system", "content": "The capital of France is"}
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "top_k": 40,
    "max_tokens": 50
}'
```
chat_vllm(**kwargs)
Handles POST requests to generate chat completions using the vLLM inference engine. This method accepts various parameters for customizing the chat completion request, including message content, generation settings, and more.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| messages | List[Dict[str, str]] | The chat messages for generating a response. Each message should include a 'role' (either 'user' or 'system') and 'content'. | required |
| temperature | float | The sampling temperature. Defaults to 0.7. Higher values generate more random completions. | required |
| top_p | float | The nucleus sampling probability. Defaults to 1.0. A smaller value leads to higher diversity. | required |
| n | int | The number of completions to generate. Defaults to 1. | required |
| max_tokens | int | The maximum number of tokens to generate. Controls the length of the generated response. | required |
| stop | Union[str, List[str]] | Sequence(s) where the generation should stop. Can be a single string or a list of strings. | required |
| stream | bool | Whether to stream the response. Streaming may be useful for long completions. | required |
| presence_penalty | float | Adjusts the likelihood of tokens based on their presence in the conversation so far. Defaults to 0.0. | required |
| frequency_penalty | float | Adjusts the likelihood of tokens based on their frequency in the conversation so far. Defaults to 0.0. | required |
| logit_bias | Dict[str, float] | Adjustments to the logits of specified tokens, identified by token IDs as keys and adjustment values as values. | required |
| user | str | An identifier for the user making the request. Can be used for logging or customization. | required |
| best_of | int | Generates 'n' completions server-side and returns the best one. Higher values incur more computation cost. | required |
| top_k | int | Filters the generated tokens to the top-k tokens with the highest probabilities. Defaults to -1, which disables top-k filtering. | required |
| ignore_eos | bool | Whether to ignore the end-of-sentence token in generation. Useful for more fluid continuations. | required |
| use_beam_search | bool | Whether to use beam search instead of sampling for generation. Beam search can produce more coherent results. | required |
| stop_token_ids | List[int] | List of token IDs that should cause generation to stop. | required |
| skip_special_tokens | bool | Whether to skip special tokens (like padding or end-of-sequence tokens) in the output. | required |
| spaces_between_special_tokens | bool | Whether to insert spaces between special tokens in the output. | required |
| add_generation_prompt | bool | Whether to prepend the generation prompt to the output. | required |
| echo | bool | Whether to include the input prompt in the output. | required |
| repetition_penalty | float | Penalty applied to tokens that have been generated previously. Defaults to 1.0, which applies no penalty. | required |
| min_p | float | Sets a minimum threshold for token probabilities. Tokens with probabilities below this threshold are filtered out. | required |
| include_stop_str_in_output | bool | Whether to include the stop string(s) in the output. | required |
| length_penalty | float | Exponential penalty to the length for beam search. Only relevant if use_beam_search is True. | required |
Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | A dictionary with the chat completion response or an error message. |
Example CURL Request:
```bash
curl -X POST "http://localhost:3000/api/v1/chat_vllm" -H "Content-Type: application/json" -d '{
    "messages": [
        {"role": "user", "content": "Whats the weather like in London?"}
    ],
    "temperature": 0.7,
    "top_p": 1.0,
    "n": 1,
    "max_tokens": 50,
    "stream": false,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "logit_bias": {},
    "user": "example_user"
}'
```
complete(**kwargs)
Handles POST requests to generate text based on the given prompt and decoding strategy. It uses the pre-trained model specified in the setup to generate a completion for the input prompt.
Args:
**kwargs (Any): Arbitrary keyword arguments containing the 'prompt' and other parameters for text generation.
Returns:
Dict[str, Any]: A dictionary containing the original prompt and the generated completion.
Example CURL Requests:
```bash
/usr/bin/curl -X POST localhost:3001/api/v1/complete \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "<|system|> <|end|> <|user|> How do I sort a list in Python?<|end|> <|assistant|>",
        "decoding_strategy": "generate",
        "max_new_tokens": 100,
        "do_sample": true,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.95
    }' | jq
```
initialize_pipeline()
Lazy initialization of the Hugging Face pipeline for chat interaction.