Language Model

Bases: TextBulk

LanguageModelBulk is designed for large-scale text generation with Hugging Face language models, processing entire datasets in bulk. It is particularly useful for bulk content creation, summarization, or any other scenario where a large dataset needs to be run through a language model.

Attributes:

    model (Any): The loaded language model used for text generation.
    tokenizer (Any): The tokenizer corresponding to the language model, used for processing input text.

Parameters:

    input (BatchInput): Configuration for the input data. Required.
    output (BatchOutput): Configuration for the output data. Required.
    state (State): State management for the API. Required.
    **kwargs (Any): Arbitrary keyword arguments for extended functionality.

CLI Usage Examples:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    postgres \
        --postgres_host 127.0.0.1 \
        --postgres_port 5432 \
        --postgres_user postgres \
        --postgres_password postgres \
        --postgres_database geniusrise \
        --postgres_table state \
    --id mistralai/Mistral-7B-Instruct-v0.1 \
    complete \
        --args \
            model_name="mistralai/Mistral-7B-Instruct-v0.1" \
            model_class="AutoModelForCausalLM" \
            tokenizer_class="AutoTokenizer" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            max_memory=None \
            torchscript=False \
            decoding_strategy="generate" \
            generation_max_new_tokens=100 \
            generation_do_sample=True

or using VLLM:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/lm \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/lm \
    none \
    --id mistralai/Mistral-7B-v0.1 \
    complete_vllm \
        --args \
            model_name="mistralai/Mistral-7B-v0.1" \
            use_cuda=True \
            precision="bfloat16" \
            quantization=0 \
            device_map="auto" \
            vllm_enforce_eager=True \
            generation_temperature=0.7 \
            generation_top_p=1.0 \
            generation_n=1 \
            generation_max_tokens=50 \
            generation_stream=False \
            generation_presence_penalty=0.0 \
            generation_frequency_penalty=0.0

or using llama.cpp:

genius LanguageModelBulk rise \
    batch \
        --input_s3_bucket geniusrise-test \
        --input_s3_folder input/chat \
    batch \
        --output_s3_bucket geniusrise-test \
        --output_s3_folder output/chat \
    none \
    complete_llama_cpp \
        --args \
            model="TheBloke/Mistral-7B-v0.1-GGUF" \
            filename="mistral-7b-v0.1.Q4_K_M.gguf" \
            n_gpu_layers=35 \
            n_ctx=32768 \
            generation_temperature=0.7 \
            generation_top_p=0.95 \
            generation_top_k=40 \
            generation_max_tokens=50 \
            generation_repeat_penalty=1.1

__init__(input, output, state, **kwargs)

Initializes the LanguageModelBulk object with the specified configurations for input, output, and state.

Parameters:

    input (BatchInput): Configuration and data inputs for the bulk process. Required.
    output (BatchOutput): Configurations for output data handling. Required.
    state (State): State management for the bulk process. Required.
    **kwargs (Any): Additional keyword arguments for extended configurations.
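
For programmatic use, here is a minimal initialization sketch. It assumes the BatchInput, BatchOutput, and InMemoryState constructors exported by geniusrise, and that LanguageModelBulk is importable from the geniusrise text package; the local folders and S3 locations are illustrative.

from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_text import LanguageModelBulk  # assumed import path

# Illustrative local staging folders and S3 locations.
batch_input = BatchInput("./input", "geniusrise-test", "input/lm")
batch_output = BatchOutput("./output", "geniusrise-test", "output/lm")
state = InMemoryState()

bolt = LanguageModelBulk(input=batch_input, output=batch_output, state=state)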

complete(model_name, model_class='AutoModelForCausalLM', tokenizer_class='AutoTokenizer', use_cuda=False, precision='float16', quantization=0, device_map='auto', max_memory={0: '24GB'}, torchscript=False, compile=False, awq_enabled=False, flash_attention=False, decoding_strategy='generate', notification_email=None, **kwargs)

Performs text completion on the loaded dataset using the specified model and tokenizer. The method handles the entire process, including model loading, text generation, and saving the results.

Parameters:

    model_name (str): The name of the language model to use for text completion. Required.
    model_class (str): The class of the language model. Defaults to "AutoModelForCausalLM".
    tokenizer_class (str): The class of the tokenizer. Defaults to "AutoTokenizer".
    use_cuda (bool): Whether to use CUDA for model inference. Defaults to False.
    precision (str): Precision for model computation. Defaults to "float16".
    quantization (int): Level of quantization for optimizing model size and speed. Defaults to 0.
    device_map (str | Dict | None): Specific device to use for computation. Defaults to "auto".
    max_memory (Dict): Maximum memory configuration for devices. Defaults to {0: "24GB"}.
    torchscript (bool): Whether to use a TorchScript-optimized version of the pre-trained language model. Defaults to False.
    compile (bool): Whether to compile the model before inference. Defaults to False.
    awq_enabled (bool): Whether to enable AWQ optimization. Defaults to False.
    flash_attention (bool): Whether to use flash attention optimization. Defaults to False.
    decoding_strategy (str): Strategy for decoding the completion. Defaults to "generate".
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    **kwargs (Any): Additional keyword arguments for text generation.
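
As a hedged illustration, the settings from the first CLI example map onto a programmatic call like the following; the generation_* keyword arguments are assumed to be forwarded to the underlying generation call.

bolt.complete(
    model_name="mistralai/Mistral-7B-Instruct-v0.1",
    model_class="AutoModelForCausalLM",
    tokenizer_class="AutoTokenizer",
    use_cuda=True,
    precision="bfloat16",
    device_map="auto",
    decoding_strategy="generate",
    # generation_* kwargs are assumed to be passed through to generation
    generation_max_new_tokens=100,
    generation_do_sample=True,
)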

complete_llama_cpp(model, filename=None, local_dir=None, n_gpu_layers=0, split_mode=llama_cpp.LLAMA_SPLIT_LAYER, main_gpu=0, tensor_split=None, vocab_only=False, use_mmap=True, use_mlock=False, kv_overrides=None, seed=llama_cpp.LLAMA_DEFAULT_SEED, n_ctx=512, n_batch=512, n_threads=None, n_threads_batch=None, rope_scaling_type=llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED, rope_freq_base=0.0, rope_freq_scale=0.0, yarn_ext_factor=-1.0, yarn_attn_factor=1.0, yarn_beta_fast=32.0, yarn_beta_slow=1.0, yarn_orig_ctx=0, mul_mat_q=True, logits_all=False, embedding=False, offload_kqv=True, last_n_tokens_size=64, lora_base=None, lora_scale=1.0, lora_path=None, numa=False, chat_format=None, chat_handler=None, draft_model=None, tokenizer=None, verbose=True, notification_email=None, **kwargs)

Performs bulk text generation using the LLaMA model with llama.cpp backend. This method handles the entire process, including model loading, prompt processing, text generation, and saving the results.

Parameters:

    model (str): Path or identifier for the LLaMA model. Required.
    filename (Optional[str]): Optional filename or glob pattern to match the model file. Defaults to None.
    local_dir (Optional[Union[str, os.PathLike[str]]]): Local directory to save the model files. Defaults to None.
    n_gpu_layers (int): Number of layers to offload to GPU. Defaults to 0.
    split_mode (int): Split mode for distributing the model across GPUs. Defaults to llama_cpp.LLAMA_SPLIT_LAYER.
    main_gpu (int): Main GPU index. Defaults to 0.
    tensor_split (Optional[List[float]]): Configuration for tensor splitting across GPUs. Defaults to None.
    vocab_only (bool): Whether to load only the vocabulary. Defaults to False.
    use_mmap (bool): Use memory-mapped files for model loading. Defaults to True.
    use_mlock (bool): Lock model data in RAM to prevent swapping. Defaults to False.
    kv_overrides (Optional[Dict[str, Union[bool, int, float]]]): Key-value pairs for overriding model config. Defaults to None.
    seed (int): Seed for random number generation. Defaults to llama_cpp.LLAMA_DEFAULT_SEED.
    n_ctx (int): Number of context tokens for generation. Defaults to 512.
    n_batch (int): Batch size for processing. Defaults to 512.
    n_threads (Optional[int]): Number of threads for generation. Defaults to None.
    n_threads_batch (Optional[int]): Number of threads for batch processing. Defaults to None.
    rope_scaling_type (Optional[int]): Scaling type for RoPE. Defaults to llama_cpp.LLAMA_ROPE_SCALING_UNSPECIFIED.
    rope_freq_base (float): Base frequency for RoPE. Defaults to 0.0.
    rope_freq_scale (float): Frequency scaling for RoPE. Defaults to 0.0.
    yarn_ext_factor (float): YaRN extrapolation factor. Defaults to -1.0.
    yarn_attn_factor (float): YaRN attention factor. Defaults to 1.0.
    yarn_beta_fast (float): YaRN beta fast parameter. Defaults to 32.0.
    yarn_beta_slow (float): YaRN beta slow parameter. Defaults to 1.0.
    yarn_orig_ctx (int): Original context size for YaRN. Defaults to 0.
    mul_mat_q (bool): Multiply matrices for queries. Defaults to True.
    logits_all (bool): Return logits for all tokens. Defaults to False.
    embedding (bool): Enable embedding mode. Defaults to False.
    offload_kqv (bool): Offload K, Q, V matrices to GPU. Defaults to True.
    last_n_tokens_size (int): Size of the last_n_tokens buffer. Defaults to 64.
    lora_base (Optional[str]): Base model path for LoRA. Defaults to None.
    lora_scale (float): Scale factor for LoRA adjustments. Defaults to 1.0.
    lora_path (Optional[str]): Path for LoRA adjustments. Defaults to None.
    numa (Union[bool, int]): NUMA configuration. Defaults to False.
    chat_format (Optional[str]): Chat format configuration. Defaults to None.
    chat_handler (Optional[llama_cpp.llama_chat_format.LlamaChatCompletionHandler]): Handler for chat completions. Defaults to None.
    draft_model (Optional[llama_cpp.LlamaDraftModel]): Draft model for speculative decoding. Defaults to None.
    tokenizer (Optional[PreTrainedTokenizerBase]): Custom tokenizer instance. Defaults to None.
    verbose (bool): Enable verbose logging. Defaults to True.
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    **kwargs: Additional arguments for model loading and text generation.
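
A sketch of the corresponding programmatic call, mirroring the llama.cpp CLI example above; the generation_* keyword arguments are assumed to be passed through to llama.cpp's completion call, and the model/filename pair is the illustrative GGUF checkpoint from that example.

bolt.complete_llama_cpp(
    model="TheBloke/Mistral-7B-v0.1-GGUF",
    filename="mistral-7b-v0.1.Q4_K_M.gguf",
    n_gpu_layers=35,  # offload most layers to the GPU
    n_ctx=32768,      # large context window
    generation_temperature=0.7,
    generation_top_p=0.95,
    generation_top_k=40,
    generation_max_tokens=50,
    generation_repeat_penalty=1.1,
)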

complete_vllm(model_name, use_cuda=False, precision='float16', quantization=0, device_map='auto', vllm_tokenizer_mode='auto', vllm_download_dir=None, vllm_load_format='auto', vllm_seed=42, vllm_max_model_len=1024, vllm_enforce_eager=False, vllm_max_context_len_to_capture=8192, vllm_block_size=16, vllm_gpu_memory_utilization=0.9, vllm_swap_space=4, vllm_sliding_window=None, vllm_pipeline_parallel_size=1, vllm_tensor_parallel_size=1, vllm_worker_use_ray=False, vllm_max_parallel_loading_workers=None, vllm_disable_custom_all_reduce=False, vllm_max_num_batched_tokens=None, vllm_max_num_seqs=64, vllm_max_paddings=512, vllm_max_lora_rank=None, vllm_max_loras=None, vllm_max_cpu_loras=None, vllm_lora_extra_vocab_size=0, vllm_placement_group=None, vllm_log_stats=False, notification_email=None, batch_size=32, **kwargs)

Performs bulk text generation using the vLLM inference engine, with parameters for tuning model behavior, including quantization and parallel-processing settings. This method is designed to process large datasets efficiently by leveraging vLLM's capabilities for generating high-quality text completions from the provided prompts.

Parameters:

    model_name (str): The name or path of the model to serve with vLLM for text generation. Required.
    use_cuda (bool): Flag indicating whether to use CUDA for GPU acceleration. Defaults to False.
    precision (str): Precision of computations; can be "float16", "bfloat16", etc. Defaults to "float16".
    quantization (int): Level of quantization for model weights, 0 for none. Defaults to 0.
    device_map (str | Dict | None): Specific device(s) to use for model inference. Defaults to "auto".
    vllm_tokenizer_mode (str): Mode of the tokenizer ("auto", "fast", or "slow"). Defaults to "auto".
    vllm_download_dir (Optional[str]): Directory to download and load the model and tokenizer. Defaults to None.
    vllm_load_format (str): Format to load the model, e.g., "auto", "pt". Defaults to "auto".
    vllm_seed (int): Seed for random number generation. Defaults to 42.
    vllm_max_model_len (int): Maximum sequence length the model can handle. Defaults to 1024.
    vllm_enforce_eager (bool): Enforce eager execution instead of using optimization techniques. Defaults to False.
    vllm_max_context_len_to_capture (int): Maximum context length for CUDA graph capture. Defaults to 8192.
    vllm_block_size (int): Block size for the caching mechanism. Defaults to 16.
    vllm_gpu_memory_utilization (float): Fraction of GPU memory to use. Defaults to 0.9.
    vllm_swap_space (int): Amount of swap space to use, in GiB. Defaults to 4.
    vllm_sliding_window (Optional[int]): Size of the sliding window for processing. Defaults to None.
    vllm_pipeline_parallel_size (int): Number of pipeline parallel groups. Defaults to 1.
    vllm_tensor_parallel_size (int): Number of tensor parallel groups. Defaults to 1.
    vllm_worker_use_ray (bool): Whether to use Ray for model workers. Defaults to False.
    vllm_max_parallel_loading_workers (Optional[int]): Maximum number of workers for parallel loading. Defaults to None.
    vllm_disable_custom_all_reduce (bool): Disable the custom all-reduce kernel and fall back to NCCL. Defaults to False.
    vllm_max_num_batched_tokens (Optional[int]): Maximum number of tokens to process in a single iteration. Defaults to None.
    vllm_max_num_seqs (int): Maximum number of sequences to process in a single iteration. Defaults to 64.
    vllm_max_paddings (int): Maximum number of paddings to add to a batch. Defaults to 512.
    vllm_max_lora_rank (Optional[int]): Maximum rank for LoRA adjustments. Defaults to None.
    vllm_max_loras (Optional[int]): Maximum number of LoRA adjustments. Defaults to None.
    vllm_max_cpu_loras (Optional[int]): Maximum number of LoRA adjustments stored on CPU. Defaults to None.
    vllm_lora_extra_vocab_size (int): Additional vocabulary size for LoRA. Defaults to 0.
    vllm_placement_group (Optional[dict]): Ray placement group for distributed execution. Defaults to None.
    vllm_log_stats (bool): Whether to log statistics during model operation. Defaults to False.
    notification_email (Optional[str]): Email to send notifications upon completion. Defaults to None.
    batch_size (int): Number of prompts to process in each batch, for efficient memory usage. Defaults to 32.
    **kwargs (Any): Additional keyword arguments for generation settings such as temperature, top_p, etc.

This method automates the loading of large datasets, generation of text completions, and saving results, facilitating efficient and scalable text generation tasks.
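
A minimal programmatic sketch based on the documented signature; the generation_* keyword arguments (temperature, top_p, etc.) are assumed to map onto vLLM's sampling parameters.

bolt.complete_vllm(
    model_name="mistralai/Mistral-7B-v0.1",
    use_cuda=True,
    precision="bfloat16",
    vllm_enforce_eager=True,
    batch_size=32,
    # assumed to be translated into vLLM sampling parameters
    generation_temperature=0.7,
    generation_top_p=1.0,
    generation_max_tokens=50,
)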

load_dataset(dataset_path, max_length=512, **kwargs)

Load a completion dataset from a directory.

Parameters:

    dataset_path (str): The path to the dataset directory. Required.
    max_length (int): The maximum length for tokenization. Defaults to 512.
    **kwargs: Additional keyword arguments to pass to the underlying dataset-loading functions.

Returns:

    Dataset (Optional[Dataset]): The loaded dataset.

Raises:

    Exception: If there was an error loading the dataset.

Supported Data Formats and Structures:

Dataset files saved by the Hugging Face datasets library

The directory should contain 'dataset_info.json' and other related files.

JSONL

Each line is a JSON object representing an example.

{"text": "The text content"}

CSV

Should contain a 'text' column.

text
"The text content"

Parquet

Should contain a 'text' column.

JSON

An array of dictionaries with a 'text' key.

[{"text": "The text content"}]

XML

Each 'record' element should contain a 'text' child element.

<record>
    <text>The text content</text>
</record>

YAML

Each document should be a dictionary with a 'text' key.

- text: "The text content"

TSV

Should contain a 'text' column, with fields separated by tabs.

Excel (.xls, .xlsx)

Should contain a 'text' column.

SQLite (.db)

Should contain a table with a 'text' column.

Feather

Should contain a 'text' column.
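
To make the JSONL format concrete, here is a sketch that writes a small dataset in the shape described above and then loads it; the path is illustrative, and load_dataset is called with its documented signature.

import json

# Write two examples in the documented JSONL shape: {"text": ...}
with open("./input/dataset.jsonl", "w") as f:
    for text in ["The text content", "Another example"]:
        f.write(json.dumps({"text": text}) + "\n")

dataset = bolt.load_dataset("./input", max_length=512)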