Language Model

Bases: TextBulk

LanguageModelBulk is designed for large-scale text generation with Hugging Face language models, processing entire datasets in bulk. It is particularly useful for bulk content creation, summarization, or any other scenario where a large dataset needs to be run through a language model.

Attributes:

    model (Any): The loaded language model used for text generation.
    tokenizer (Any): The tokenizer corresponding to the language model, used for processing input text.

Parameters:

    input (BatchInput): Configuration for the input data. Required.
    output (BatchOutput): Configuration for the output data. Required.
    state (State): State management for the API. Required.
    **kwargs (Any): Arbitrary keyword arguments for extended functionality.

CLI Usage Examples:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    complete \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            decoding_strategy="generate" \
            generation_max_new_tokens=100 \
            generation_do_sample=True

or using VLLM:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    none \
    --id mistralai/Mistral-7B-v0.1 \
    complete_vllm \
        --args \
            model_name="mistralai/Mistral-7B-v0.1" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            vllm_enforce_eager=True \
            generation_temperature=0.7 \
            generation_top_p=1.0 \
            generation_n=1 \
            generation_max_tokens=50 \
            generation_stream=False \
            generation_presence_penalty=0.0 \
            generation_frequency_penalty=0.0

or using llama.cpp:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    complete_llama_cpp \
        --args \
            model="TheBloke/Mistral-7B-v0.1-GGUF" \
            filename="mistral-7b-v0.1.Q4_K_M.gguf" \
            n_gpu_layers=35 \
            n_ctx=32768 \
            generation_temperature=0.7 \
            generation_top_p=0.95 \
            generation_top_k=40 \
            generation_max_tokens=50 \
            generation_repeat_penalty=1.1

__init__(input, output, state, **kwargs)

Initializes the LanguageModelBulk object with the specified configurations for input, output, and state.

Parameters:

    input (BatchInput): Configuration and data inputs for the bulk process. Required.
    output (BatchOutput): Configurations for output data handling. Required.
    state (State): State management for the bulk process. Required.
    **kwargs (Any): Additional keyword arguments for extended configurations.
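
For programmatic use, here is a minimal initialization sketch. It assumes the BatchInput, BatchOutput, and InMemoryState constructors exported by geniusrise, and that LanguageModelBulk is importable from the geniusrise text package; the local folders and S3 locations are illustrative.

from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import LanguageModelBulk  # assumed import path

# Illustrative local staging folders and S3 locations.
batch_input = BatchInput("./input", "geniusrise-test", "input/lm")
batch_output = BatchOutput("./output", "geniusrise-test", "output/lm")
state = InMemoryState()

bolt = LanguageModelBulk(input=batch_input, output=batch_output, state=state)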

complete(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, decoding_strategy='generate', notification_email=None, **kwargs)

Performs text completion on the loaded dataset using the specified model and tokenizer. The method handles the entire process, including model loading, text generation, and saving the results.

Parameters:

    model_name (str): The name of the language model to use for text completion. Required.
    model_class (str): The class of the language model. Defaults to "AutoModelForCausalLM".
    tokenizer_class (str): The class of the tokenizer. Defaults to "AutoTokenizer".
    use_cuda (bool): Whether to use CUDA for model inference. Defaults to False.
    precision (str): Precision for model computation. Defaults to "float16".
    quantization (int): Level of quantization for optimizing model size and speed. Defaults to 0.
    device_map (str | Dict | None): Specific device to use for computation. Defaults to "auto".
    max_memory (Dict): Maximum memory configuration for devices. Defaults to {0: "24GB"}.
    torchscript (bool): Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False.
    compile (bool): Whether to compile the model before inference. Defaults to False.
    awq_enabled (bool): Whether to enable AWQ optimization. Defaults to False.
    flash_attention (bool): Whether to use flash attention optimization. Defaults to False.
    decoding_strategy (str): Strategy for decoding the completion. Defaults to "generate".
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    **kwargs (Any): Additional keyword arguments for text generation.
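
As a hedged illustration, the settings from the first CLI example map onto a programmatic call like the following; the generation_* keyword arguments are assumed to be forwarded to the underlying generation call.

bolt.complete(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    device_map="auto",
    decoding_strategy="generate",
    # generation_* kwargs are assumed to be passed through to generation
    generation_max_new_tokens=100,
    generation_do_sample=True,
)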

complete_llama_cpp(model, filename=None, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, notification_email=None, **kwargs)

Performs bulk text generation using the LLaMA model with llama.cpp backend. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.

Parameters:

    model (str): Path or identifier for the LLaMA model. Required.
    filename (Optional[str]): Optional filename or glob pattern to match the model file. Defaults to None.
    local_dir (Optional[Union[str, os.PathLike[str]]]): Local directory to save the model files. Defaults to None.
    n_gpu_layers (int): Number of layers to offload to GPU. Defaults to 0.
    split_mode (int): Split mode for distributing the model across GPUs. Defaults to llama_cpp.LLAMA_SPLIT_LAYER.
    main_gpu (int): Main GPU index. Defaults to 0.
    tensor_split (Optional[List[float]]): Configuration for tensor splitting across GPUs. Defaults to None.
    vocab_only (bool): Whether to load only the vocabulary. Defaults to False.
    use_mmap (bool): Use memory-mapped files for model loading. Defaults to True.
    use_mlock (bool): Lock model data in RAM to prevent swapping. Defaults to False.
    kv_overrides (Optional[Dict[str, Union[bool, int, float]]]): Key-value pairs for overriding model config. Defaults to None.
    seed (int): Seed for random number generation. Defaults to llama_cpp.LLAMA_DEFAULT_SEED.
    n_ctx (int): Number of context tokens for generation. Defaults to 512.
    n_batch (int): Batch size for processing. Defaults to 512.
    n_threads (Optional[int]): Number of threads for generation. Defaults to None.
    n_threads_batch (Optional[int]): Number of threads for batch processing. Defaults to None.
    rope_scaling_type (Optional[int]): Scaling type for RoPE. Defaults to llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED.
    rope_freq_base (float): Base frequency for RoPE. Defaults to 0.0.
    rope_freq_scale (float): Frequency scaling for RoPE. Defaults to 0.0.
    yarn_ext_factor (float): YaRN extrapolation factor. Defaults to -1.0.
    yarn_attn_factor (float): YaRN attention factor. Defaults to 1.0.
    yarn_beta_fast (float): YaRN beta fast parameter. Defaults to 32.0.
    yarn_beta_slow (float): YaRN beta slow parameter. Defaults to 1.0.
    yarn_orig_ctx (int): Original context size for YaRN. Defaults to 0.
    mul_mat_q (bool): Multiply matrices for queries. Defaults to True.
    logits_all (bool): Return logits for all tokens. Defaults to False.
    embedding (bool): Enable embedding mode. Defaults to False.
    offload_kqv (bool): Offload K, Q, V matrices to GPU. Defaults to True.
    last_n_tokens_size (int): Size of the last_n_tokens buffer. Defaults to 64.
    lora_base (Optional[str]): Base model path for LoRA. Defaults to None.
    lora_scale (float): Scale factor for LoRA adjustments. Defaults to 1.0.
    lora_path (Optional[str]): Path for LoRA adjustments. Defaults to None.
    numa (Union[bool, int]): NUMA configuration. Defaults to False.
    chat_format (Optional[str]): Chat format configuration. Defaults to None.
    chat_handler (Optional[llama_cpp.llama_chat_format.LlamaChatCompletionHandler]): Handler for chat completions. Defaults to None.
    draft_model (Optional[llama_cpp.LlamaDraftModel]): Draft model for speculative decoding. Defaults to None.
    tokenizer (Optional[PreTrainedTokenizerBase]): Custom tokenizer instance. Defaults to None.
    verbose (bool): Enable verbose logging. Defaults to True.
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    **kwargs: Additional arguments for model loading and text generation.
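
A sketch of the corresponding programmatic call, mirroring the llama.cpp CLI example above; the generation_* keyword arguments are assumed to be passed through to llama.cpp's completion call, and the model/filename pair is the illustrative GGUF checkpoint from that example.

bolt.complete_llama_cpp(
    model="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=35,  # offload most layers to the GPU
    n_ctx=32768,      # large context window
    generation_temperature=0.7,
    generation_top_p=0.95,
    generation_top_k=40,
    generation_max_tokens=50,
    generation_repeat_penalty=1.1,
)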

complete_vllm(model_name, use_cuda=False, precision='float16', quantization=0, device_map='auto', vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, notification_email=None, batch_size=32, **kwargs)

Performs bulk text generation using the vLLM inference engine, with parameters for tuning model behavior, including quantization and parallel-processing settings. This method is designed to process large datasets efficiently by leveraging vLLM's capabilities for generating high-quality text completions from the provided prompts.

Parameters:

    model_name (str): The name or path of the model to serve with vLLM for text generation. Required.
    use_cuda (bool): Flag indicating whether to use CUDA for GPU acceleration. Defaults to False.
    precision (str): Precision of computations; can be "float16", "bfloat16", etc. Defaults to "float16".
    quantization (int): Level of quantization for model weights, 0 for none. Defaults to 0.
    device_map (str | Dict | None): Specific device(s) to use for model inference. Defaults to "auto".
    vllm_tokenizer_mode (str): Mode of the tokenizer ("auto", "fast", or "slow"). Defaults to "auto".
    vllm_download_dir (Optional[str]): Directory to download and load the model and tokenizer. Defaults to None.
    vllm_load_format (str): Format to load the model, e.g., "auto", "pt". Defaults to "auto".
    vllm_seed (int): Seed for random number generation. Defaults to 42.
    vllm_max_model_len (int): Maximum sequence length the model can handle. Defaults to 1024.
    vllm_enforce_eager (bool): Enforce eager execution instead of using optimization techniques. Defaults to False.
    vllm_max_context_len_to_capture (int): Maximum context length for CUDA graph capture. Defaults to 8192.
    vllm_block_size (int): Block size for the caching mechanism. Defaults to 16.
    vllm_gpu_memory_utilization (float): Fraction of GPU memory to use. Defaults to 0.9.
    vllm_swap_space (int): Amount of swap space to use, in GiB. Defaults to 4.
    vllm_sliding_window (Optional[int]): Size of the sliding window for processing. Defaults to None.
    vllm_pipeline_parallel_size (int): Number of pipeline parallel groups. Defaults to 1.
    vllm_tensor_parallel_size (int): Number of tensor parallel groups. Defaults to 1.
    vllm_worker_use_ray (bool): Whether to use Ray for model workers. Defaults to False.
    vllm_max_parallel_loading_workers (Optional[int]): Maximum number of workers for parallel loading. Defaults to None.
    vllm_disable_custom_all_reduce (bool): Disable the custom all-reduce kernel and fall back to NCCL. Defaults to False.
    vllm_max_num_batched_tokens (Optional[int]): Maximum number of tokens to process in a single iteration. Defaults to None.
    vllm_max_num_seqs (int): Maximum number of sequences to process in a single iteration. Defaults to 64.
    vllm_max_paddings (int): Maximum number of paddings to add to a batch. Defaults to 512.
    vllm_max_lora_rank (Optional[int]): Maximum rank for LoRA adjustments. Defaults to None.
    vllm_max_loras (Optional[int]): Maximum number of LoRA adjustments. Defaults to None.
    vllm_max_cpu_loras (Optional[int]): Maximum number of LoRA adjustments stored on CPU. Defaults to None.
    vllm_lora_extra_vocab_size (int): Additional vocabulary size for LoRA. Defaults to 0.
    vllm_placement_group (Optional[dict]): Ray placement group for distributed execution. Defaults to None.
    vllm_log_stats (bool): Whether to log statistics during model operation. Defaults to False.
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    batch_size (int): Number of prompts to process in each batch, for efficient memory usage. Defaults to 32.
    **kwargs (Any): Additional keyword arguments for generation settings such as temperature, top_p, etc.

This method automates the loading of large datasets, generation of text completions, and saving results, facilitating efficient and scalable text generation tasks.
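
A minimal programmatic sketch based on the documented signature; the generation_* keyword arguments (temperature, top_p, etc.) are assumed to map onto vLLM's sampling parameters.

bolt.complete_vllm(
    model_name="mistralai/Mistral-7B-v0.1",
    use_cuda=True,
    precision="bfloat16",
    vllm_enforce_eager=True,
    batch_size=32,
    # assumed to be translated into vLLM sampling parameters
    generation_temperature=0.7,
    generation_top_p=1.0,
    generation_max_tokens=50,
)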

load_dataset(dataset_path, max_length=512, **kwargs)

Load a completion dataset from a directory.

Parameters:

    dataset_path (str): The path to the dataset directory. Required.
    max_length (int): The maximum length for tokenization. Defaults to 512.
    **kwargs: Additional keyword arguments to pass to the underlying dataset-loading functions.

Returns:

    Dataset (Optional[Dataset]): The loaded dataset.

Raises:

    Exception: If there was an error loading the dataset.

Supported Data Formats and Structures:

Dataset files saved by the Hugging Face datasets library

The directory should contain 'dataset_info.json' and other related files.

JSONL

Each line is a JSON object representing an example.

{"text": "The text content"}

CSV

Should contain a 'text' column.

text
"The text content"

Parquet

Should contain a 'text' column.

JSON

An array of dictionaries with a 'text' key.

[{"text": "The text content"}]

XML

Each 'record' element should contain a 'text' child element.

<record>
    <text>The text content</text>
</record>

YAML

Each document should be a dictionary with a 'text' key.

- text: "The text content"

TSV

Should contain a 'text' column, with fields separated by tabs.

Excel (.xls, .xlsx)

Should contain a 'text' column.

SQLite (.db)

Should contain a table with a 'text' column.

Feather

Should contain a 'text' column.
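
To make the JSONL format concrete, here is a sketch that writes a small dataset in the shape described above and then loads it; the path is illustrative, and load_dataset is called with its documented signature.

import json

# Write two examples in the documented JSONL shape: {"text": ...}
with open("./input/dataset.jsonl", "w") as f:
    for text in ["The text content", "Another example"]:
        f.write(json.dumps({"text": text}) + "\n")

dataset = bolt.load_dataset("./input", max_length=512)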