Host Language Models Using Geniusrise
Language modeling is the task every foundation model is trained on before being fine-tuned for downstream tasks such as chat. Plain language models are most useful for one-shot tasks, or for tasks that need fine-grained control over the output, e.g. forcing zero-shot classification by asking the model to emit only a single token. We'll dive into hosting a language model and interacting with your API using `curl` and `python-requests`.
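For instance, a one-token completion can coerce a plain language model into zero-shot classification. Here is a minimal sketch using the completion endpoint, payload fields, and credentials configured later in this guide:

```python
import requests

# Force a one-token completion so the model must answer with a single label.
# Endpoint, payload fields, and credentials match the transformers example below.
prompt = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    "Review: The battery died within a week.\n"
    "Sentiment:"
)
response = requests.post(
    "http://localhost:3000/api/v1/complete",
    json={"prompt": prompt, "max_new_tokens": 1, "do_sample": False},
    auth=("user", "password"),
)
print(response.json())
```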
Getting Started
Requirements
- Python 3.10 (available via your distribution's PPA or AUR packages, Homebrew, or the Windows installer).
- A GPU; most of the system targets NVIDIA GPUs.
- CUDA installed.
Optional: Set up a virtual environment:
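For example, with the built-in venv module (any virtual environment manager works):

```bash
# Create and activate a virtual environment
python3.10 -m venv venv
source venv/bin/activate
```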
First, ensure Geniusrise and its text component are installed:
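Assuming the standard PyPI package names for the framework and its text component (verify against the Geniusrise docs):

```bash
pip install geniusrise
pip install geniusrise-text
```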
Configuration File: genius.yml
The `genius.yml` file is the heart of your API setup. Here's a breakdown of its key parameters:
- version: Defines the configuration format version.
- bolts: A collection of components, each representing a specific API configuration.
- name: The identifier for your API.
- state: Manages model state, typically `type: none` for stateless operation.
- input and output: Define batch processing folders.
- method: Operation mode, usually `listen` for API services.
- args: Detailed model and server specifications.
There are three inference engines you can use to run language models, including chat models:
- PyTorch, via transformers
- vLLM
- llama.cpp
A few more alternatives exist that we do not support yet, e.g. Triton and TensorRT-LLM.
Here are example YAML configs for each of these inference engines:
Transformers
version: "1"
bolts:
my_bolt:
name: LanguageModelAPI
state:
type: none
input:
type: batch
args:
input_folder: ./input
output:
type: batch
args:
output_folder: ./output
method: listen
args:
model_name: "mistralai/Mistral-7B-v0.1"
model_class: AutoModelForCausalLM
tokenizer_class: AutoTokenizer
use_cuda: true
precision: float
device_map: cuda:0
endpoint: "0.0.0.0"
port: 3000
cors_domain: "http://localhost:3000"
username: user
password: password
vLLM
For serving models with vLLM in Geniusrise, adjust the `args` to accommodate engine-specific requirements, such as enforcing eager execution or capping the context length to manage memory more efficiently:
version: "1"
bolts:
my_bolt:
name: LanguageModelAPI
state:
type: none
input:
type: batch
args:
input_folder: ./input
output:
type: batch
args:
output_folder: ./output
method: listen
args:
model_name: TheBloke/Mistral-7B-v0.1-AWQ
use_cuda: True
precision: "float16"
device_map: "auto"
use_vllm: True
vllm_enforce_eager: True
vllm_max_model_len: 1024
endpoint: "*"
port: 3000
cors_domain: "http://localhost:3000"
username: "user"
password: "password"
llama.cpp
version: "1"
bolts:
my_bolt:
name: LanguageModelAPI
state:
type: none
input:
type: batch
args:
input_folder: ./input
output:
type: batch
args:
output_folder: ./output
method: listen
args:
model_name: TheBloke/Mistral-7B-v0.1-GGUF
use_cuda: True
use_llama_cpp: True
llama_cpp_filename: mistral-7b-v0.1.Q4_K_M.gguf
llama_cpp_n_gpu_layers: 35
llama_cpp_n_ctx: 4096
endpoint: "*"
port: 3000
cors_domain: "http://localhost:3000"
username: "user"
password: "password"
Launching Your API
Execute the following in your terminal:
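Assuming the Geniusrise CLI is on your PATH and `genius.yml` is in the current directory:

```bash
# Reads genius.yml from the current directory and starts the configured API
genius rise
```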
Interacting with Your API
Using curl for HTTP Requests
Example for transformers:
curl -X POST "http://localhost:3000/api/v1/complete" \
    -H "Content-Type: application/json" \
    -u "user:password" \
    -d '{
        "prompt": "## Write a short essay on the history of London\n\n",
        "decoding_strategy": "generate",
        "max_new_tokens": 1024,
        "do_sample": true
    }' | jq
For vLLM:
curl -v -X POST "http://localhost:3000/api/v1/complete_vllm" \
    -H "Content-Type: application/json" \
    -u "user:password" \
    -d '{
        "messages": ["What is the weather like in London?"],
        "temperature": 0.7,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 50,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "user": "example_user"
    }'
For llama.cpp:
curl -X POST "http://localhost:3000/api/v1/complete_llama_cpp" \
    -H "Content-Type: application/json" \
    -u "user:password" \
    -d '{
        "prompt": "What is the weather like in London?",
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,
        "max_tokens": 50,
        "repeat_penalty": 1.1
    }'
Python requests Example
Standard Language Model:
import requests

response = requests.post(
    "http://localhost:3000/api/v1/complete",
    json={"prompt": "Here is your prompt.", "max_new_tokens": 1024, "do_sample": True},
    auth=("user", "password"),
)
print(response.json())
vLLM Request:
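A sketch mirroring the vLLM curl example above, using the same endpoint and payload:

```python
import requests

# Same payload as the curl example for /api/v1/complete_vllm above
response = requests.post(
    "http://localhost:3000/api/v1/complete_vllm",
    json={
        "messages": ["What is the weather like in London?"],
        "temperature": 0.7,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 50,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "user": "example_user",
    },
    auth=("user", "password"),
)
print(response.json())
```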