Instruction Tuning

Bases: TextAPI

InstructionAPI is designed for generating text based on prompts using instruction-tuned language models. It serves as an interface to Hugging Face's pre-trained instruction-tuned models, providing a flexible API for various text generation tasks. It can be used in scenarios ranging from generating creative content to providing instructions or answers based on the prompts.

Attributes:

- model (Any): The loaded instruction-tuned language model.
- tokenizer (Any): The tokenizer for processing text suitable for the model.

Methods:

- complete(**kwargs: Any) -> Dict[str, Any]: Generates text based on the given prompt and decoding strategy.
- listen(**model_args: Any) -> None: Starts a server to listen for text generation requests.

CLI Usage Example:

genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="TheBloke/Mistral-7B-OpenOrca-AWQ" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="float16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            awq_enabled=True \
            flash_attention=True \
            endpoint="*" \
            port=3001 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

Or using vLLM:

genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    listen \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            use_vllm=True \
            vllm_enforce_eager=True \
            vllm_max_model_len=1024 \
            concurrent_queries=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

Or using llama.cpp:

genius InstructionAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="TheBloke/Mistral-7B-Instruct-v0.2-GGUF" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            use_llama_cpp=True \
            llama_cpp_filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf" \
            llama_cpp_n_gpu_layers=35 \
            llama_cpp_n_ctx=32768 \
            concurrent_queries=False \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

__init__(input, output, state, **kwargs)

Initializes a new instance of the InstructionAPI class, setting up the necessary configurations for input, output, and state.

Parameters:

- input (BatchInput, required): Configuration for the input data.
- output (BatchOutput, required): Configuration for the output data.
- state (State, required): The state of the API.
- **kwargs (Any, default {}): Additional keyword arguments for extended functionality.
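
For programmatic construction, the sketch below shows one plausible setup. The import locations and the BatchInput/BatchOutput/InMemoryState constructor signatures are assumptions about the geniusrise core package and are not documented on this page; the listen() keyword arguments mirror the first CLI example above.

```python
# Sketch only: import paths and constructor signatures below are assumptions.
from geniusrise import BatchInput, BatchOutput, InMemoryState  # assumed exports
from geniusrise_text import InstructionAPI  # assumed import path

input_cfg = BatchInput("./input", "none", "none")     # assumed (folder, bucket, s3_folder) order
output_cfg = BatchOutput("./output", "none", "none")  # assumed (folder, bucket, s3_folder) order
state = InMemoryState()                               # assumed no-argument constructor

api = InstructionAPI(input=input_cfg, output=output_cfg, state=state)

# Keyword arguments taken from the CLI example above.
api.listen(
    model_name="TheBloke/Mistral-7B-OpenOrca-AWQ",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="float16",
    device_map="auto",
    endpoint="*",
    port=3001,
)
```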

chat(**kwargs)

Handles chat interaction using the Hugging Face pipeline. This method enables conversational text generation, simulating a chat-like interaction based on user and system prompts.

Parameters:

- **kwargs (Any, default {}): Arbitrary keyword arguments containing 'user_prompt' and 'system_prompt'.

Returns:

- Dict[str, Any]: A dictionary containing the user prompt, system prompt, and chat interaction results.

Example CURL Request for chat interaction:

/usr/bin/curl -X POST localhost:3001/api/v1/chat \
    -H "Content-Type: application/json" \
    -d '{
        "user_prompt": "What is the capital of France?",
        "system_prompt": "The capital of France is"
    }' | jq

chat_llama_cpp(**kwargs)

Handles POST requests to generate chat completions using the llama.cpp engine. This method accepts various parameters for customizing the chat completion request, including messages, sampling settings, and more.

Parameters:

- messages (List[Dict[str, str]], required): The chat messages for generating a response.
- functions (Optional[List[Dict]], required): A list of functions to use for the chat completion (advanced usage).
- function_call (Optional[Dict], required): A function call to use for the chat completion (advanced usage).
- tools (Optional[List[Dict]], required): A list of tools to use for the chat completion (advanced usage).
- tool_choice (Optional[Dict], required): A tool choice option for the chat completion (advanced usage).
- temperature (float, required): The temperature to use for sampling, controlling randomness.
- top_p (float, required): The nucleus sampling's top-p parameter, controlling diversity.
- top_k (int, required): The top-k sampling parameter, limiting the token selection pool.
- min_p (float, required): The minimum probability threshold for sampling.
- typical_p (float, required): The typical-p parameter for locally typical sampling.
- stream (bool, required): Flag to stream the results.
- stop (Optional[Union[str, List[str]]], required): Tokens or sequences where generation should stop.
- seed (Optional[int], required): Seed for random number generation to ensure reproducibility.
- response_format (Optional[Dict], required): Specifies the format of the generated response.
- max_tokens (Optional[int], required): Maximum number of tokens to generate.
- presence_penalty (float, required): Penalty for token presence to discourage repetition.
- frequency_penalty (float, required): Penalty for token frequency to discourage common tokens.
- repeat_penalty (float, required): Penalty applied to tokens that are repeated.
- tfs_z (float, required): Tail-free sampling parameter to adjust the likelihood of tail tokens.
- mirostat_mode (int, required): Mirostat sampling mode for dynamic adjustments.
- mirostat_tau (float, required): Tau parameter for mirostat sampling, controlling deviation.
- mirostat_eta (float, required): Eta parameter for mirostat sampling, controlling adjustment speed.
- model (Optional[str], required): Specifies the model to use for generation.
- logits_processor (Optional[List], required): List of logits processors for advanced generation control.
- grammar (Optional[Dict], required): Specifies grammar rules for the generated text.
- logit_bias (Optional[Dict[str, float]], required): Adjustments to the logits of specified tokens.
- logprobs (Optional[bool], required): Whether to include log probabilities in the output.
- top_logprobs (Optional[int], required): Number of top log probabilities to include.

Returns:

- Dict[str, Any]: A dictionary containing the chat completion response or an error message.

Example CURL Request:

curl -X POST "http://localhost:3000/api/v1/chat_llama_cpp" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "system", "content": "The capital of France is"}
        ],
        "temperature": 0.2,
        "top_p": 0.95,
        "top_k": 40,
        "max_tokens": 50
    }'

chat_vllm(**kwargs)

Handles POST requests to generate chat completions using the vLLM engine. This method accepts various parameters for customizing the chat completion request, including message content, generation settings, and more.

Parameters:

- messages (List[Dict[str, str]], required): The chat messages for generating a response. Each message should include a 'role' (either 'user' or 'system') and 'content'.
- temperature (float, required): The sampling temperature. Defaults to 0.7. Higher values generate more random completions.
- top_p (float, required): The nucleus sampling probability. Defaults to 1.0. A smaller value leads to higher diversity.
- n (int, required): The number of completions to generate. Defaults to 1.
- max_tokens (int, required): The maximum number of tokens to generate. Controls the length of the generated response.
- stop (Union[str, List[str]], required): Sequence(s) where the generation should stop. Can be a single string or a list of strings.
- stream (bool, required): Whether to stream the response. Streaming may be useful for long completions.
- presence_penalty (float, required): Adjusts the likelihood of tokens based on their presence in the conversation so far. Defaults to 0.0.
- frequency_penalty (float, required): Adjusts the likelihood of tokens based on their frequency in the conversation so far. Defaults to 0.0.
- logit_bias (Dict[str, float], required): Adjustments to the logits of specified tokens, identified by token IDs as keys and adjustment values as values.
- user (str, required): An identifier for the user making the request. Can be used for logging or customization.
- best_of (int, required): Generates 'n' completions server-side and returns the best one. Higher values incur more computation cost.
- top_k (int, required): Filters the generated tokens to the top-k tokens with the highest probabilities. Defaults to -1, which disables top-k filtering.
- ignore_eos (bool, required): Whether to ignore the end-of-sentence token in generation. Useful for more fluid continuations.
- use_beam_search (bool, required): Whether to use beam search instead of sampling for generation. Beam search can produce more coherent results.
- stop_token_ids (List[int], required): List of token IDs that should cause generation to stop.
- skip_special_tokens (bool, required): Whether to skip special tokens (like padding or end-of-sequence tokens) in the output.
- spaces_between_special_tokens (bool, required): Whether to insert spaces between special tokens in the output.
- add_generation_prompt (bool, required): Whether to prepend the generation prompt to the output.
- echo (bool, required): Whether to include the input prompt in the output.
- repetition_penalty (float, required): Penalty applied to tokens that have been generated previously. Defaults to 1.0, which applies no penalty.
- min_p (float, required): Sets a minimum threshold for token probabilities. Tokens with probabilities below this threshold are filtered out.
- include_stop_str_in_output (bool, required): Whether to include the stop string(s) in the output.
- length_penalty (float, required): Exponential penalty to the length for beam search. Only relevant if use_beam_search is True.

Returns:

- Dict[str, Any]: A dictionary with the chat completion response or an error message.

Example CURL Request:

curl -X POST "http://localhost:3000/api/v1/chat_vllm" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "What is the weather like in London?"}
        ],
        "temperature": 0.7,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 50,
        "stream": false,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "logit_bias": {},
        "user": "example_user"
    }'

This request asks the vLLM engine to generate a completion for the provided chat context, with the specified generation settings.

complete(**kwargs)

Handles POST requests to generate text based on the given prompt and decoding strategy. It uses the pre-trained model specified in the setup to generate a completion for the input prompt.

Parameters:

- **kwargs (Any): Arbitrary keyword arguments containing the 'prompt' and other parameters for text generation.

Returns:

- Dict[str, Any]: A dictionary containing the original prompt and the generated completion.

Example CURL Request:

```bash
/usr/bin/curl -X POST localhost:3001/api/v1/complete \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "<|system|> <|end|> <|user|> How do I sort a list in Python?<|end|> <|assistant|>",
        "decoding_strategy": "generate",
        "max_new_tokens": 100,
        "do_sample": true,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.95
    }' | jq
```

initialize_pipeline()

Lazy initialization of the Hugging Face pipeline for chat interaction.
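
The pattern is the standard lazy-initialization idiom: the pipeline is built the first time it is needed and cached for subsequent chat() calls. The sketch below illustrates the idea only; it is not the library's actual implementation, and the class and attribute names are illustrative. It reuses the model and tokenizer attributes listed at the top of this page via the Hugging Face pipeline() factory.

```python
from typing import Any, Optional
from transformers import pipeline


class LazyPipelineExample:
    """Illustrative only: mirrors the lazy-init pattern described above."""

    def __init__(self, model: Any, tokenizer: Any):
        self.model = model          # the loaded instruction-tuned model (see Attributes)
        self.tokenizer = tokenizer  # the matching tokenizer
        self.hf_pipeline: Optional[Any] = None  # not built until first use

    def initialize_pipeline(self) -> Any:
        # Build the text-generation pipeline once, then reuse the cached instance.
        if self.hf_pipeline is None:
            self.hf_pipeline = pipeline(
                "text-generation", model=self.model, tokenizer=self.tokenizer
            )
        return self.hf_pipeline
```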