Speech to Text

Bases: AudioAPI

SpeechToTextAPI is a subclass of AudioAPI designed for speech-to-text models. It extends AudioAPI to handle speech-to-text processing with several ASR model families, including Wav2Vec2, Whisper, faster-whisper, and Seamless.


Attributes:

Name       Type             Description
model      AutoModelForCTC  The speech-to-text model.
processor  AutoProcessor    The processor to prepare input audio data for the model.

Methods:

transcribe(audio_input: bytes) -> str: Transcribes the given audio input to text using the speech-to-text model.

Example CLI Usage:

genius SpeechToTextAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
        --id facebook/wav2vec2-large-960h-lv60-self \
    listen \
        --args \
            model_name="facebook/wav2vec2-large-960h-lv60-self" \
            model_class="Wav2Vec2ForCTC" \
            processor_class="Wav2Vec2Processor" \
            use_cuda=True \
            precision="float32" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            compile=True \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

or using whisper.cpp:

genius SpeechToTextAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="large" \
            use_whisper_cpp=True \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

__init__(input, output, state, **kwargs)

Initializes the SpeechToTextAPI with configurations for speech-to-text processing.


Parameters:

Name      Type         Description                    Default
input     BatchInput   The input data configuration.  required
output    BatchOutput  The output data configuration. required
state     State        The state configuration.       required
**kwargs               Additional keyword arguments.  {}



asr_pipeline(**kwargs)

Transcribes the input audio using the Hugging Face automatic speech recognition (ASR) pipeline.

This method runs a pre-trained ASR model through the Hugging Face pipeline and is served at /api/v1/asr_pipeline. It expects a JSON body with 'audio_file' containing the base64-encoded audio data, along with settings such as 'model_sampling_rate' and 'chunk_length_s'.

Parameters:

Name      Type  Description                                                                                     Default
**kwargs  Any   Arbitrary keyword arguments, typically containing 'audio_file' with the base64-encoded audio.  {}

Returns:

Type            Description
Dict[str, Any]  A dictionary containing the transcription results.

Example CURL Request for transcription:

(base64 -w 0 sample.flac | awk '{print "{\"audio_file\": \""$0"\", \"model_sampling_rate\": 16000, \"chunk_length_s\": 60}"}' > /tmp/payload.json)
curl -X POST http://localhost:3000/api/v1/asr_pipeline \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/payload.json | jq
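
The base64 and awk step above can also be sketched in Python. This is a minimal illustration, not part of the library: `build_asr_payload` is a hypothetical helper name, and the field names and default values are simply the ones used in the curl example.

```python
import base64
import json


def build_asr_payload(
    audio_bytes: bytes,
    model_sampling_rate: int = 16000,
    chunk_length_s: int = 60,
) -> str:
    """Return the JSON request body as a string (hypothetical helper)."""
    payload = {
        # The audio is sent as base64 text, matching `base64 -w 0` in the shell example
        "audio_file": base64.b64encode(audio_bytes).decode("ascii"),
        "model_sampling_rate": model_sampling_rate,
        "chunk_length_s": chunk_length_s,
    }
    return json.dumps(payload)


# Stand-in bytes; in practice these would be read from a file such as sample.flac
body = build_asr_payload(b"\x00\x01\x02\x03")
print(body)
```

The resulting string can be written to /tmp/payload.json and sent with the same curl command.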


Lazy initialization of the ASR Hugging Face pipeline.

process_faster_whisper(audio_input, model_sampling_rate, chunk_size, generate_args)

Processes audio input with the faster-whisper model.


Parameters:

Name                 Type            Description                              Default
audio_input          bytes           The audio input for transcription.       required
model_sampling_rate  int             The sampling rate of the model.          required
chunk_size           int             The size of audio chunks to process.     required
generate_args        Dict[str, Any]  Additional arguments for transcription.  required

Returns:

Type            Description
Dict[str, Any]  A dictionary containing the transcription results.

process_seamless(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)

Process audio input with the Seamless model.

process_wav2vec2(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size)

Process audio input with the Wav2Vec2 model.

process_whisper(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)

Process audio input with the Whisper model.
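
Several of the process_* methods take chunk_size and overlap_size. The sketch below shows one common way such parameters are used to split a long signal into overlapping windows. It is an assumption for illustration only: `split_with_overlap` is a hypothetical function, both sizes are assumed to be measured in samples, and the library's actual chunking logic may differ.

```python
from typing import List, Sequence


def split_with_overlap(
    samples: Sequence[float], chunk_size: int, overlap_size: int
) -> List[Sequence[float]]:
    """Split samples into windows of chunk_size that share overlap_size samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far each new window advances
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_size])
        # Stop once a window has reached the end of the signal
        if start + chunk_size >= len(samples):
            break
    return chunks


# Ten samples, windows of four, one sample shared between neighbours
print(split_with_overlap(list(range(10)), chunk_size=4, overlap_size=1))
```

Under the same assumption, at a 16000 Hz sampling rate a chunk_size of 1280000 would correspond to 80 seconds of audio and an overlap_size of 213333 to roughly 13.3 seconds.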


transcribe()

API endpoint to transcribe the given audio input to text using the speech-to-text model. Expects a JSON input with 'audio_file' as a key containing the base64 encoded audio data.

Returns:

Type            Description
Dict[str, str]  A dictionary containing the transcribed text.

Example CURL Request for transcription:

(base64 -w 0 sample.flac | awk '{print "{\"audio_file\": \""$0"\", \"model_sampling_rate\": 16000, \"chunk_size\": 1280000, \"overlap_size\": 213333, \"do_sample\": true, \"num_beams\": 4, \"temperature\": 0.6, \"tgt_lang\": \"eng\"}"}' > /tmp/payload.json)
curl -X POST http://localhost:3000/api/v1/transcribe \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/payload.json | jq
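
The same request can be sketched from Python using only the standard library. The helper names `build_request` and `post_transcription` are hypothetical, the URL and credentials are the placeholders from the curl example, and nothing is assumed about the response beyond it being JSON.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_request(
    url: str, payload: Dict[str, Any], username: str, password: str
) -> urllib.request.Request:
    """Build a JSON POST request with HTTP basic auth (the equivalent of curl -u)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def post_transcription(
    url: str, payload: Dict[str, Any], username: str, password: str, timeout: float = 300.0
) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_request(url, payload, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Stand-in audio bytes; field names follow the curl example
payload = {
    "audio_file": base64.b64encode(b"\x00\x01\x02\x03").decode("ascii"),
    "model_sampling_rate": 16000,
    "chunk_size": 1280000,
    "overlap_size": 213333,
}
request = build_request("http://localhost:3000/api/v1/transcribe", payload, "user", "password")
print(request.get_method(), request.full_url)
```

With the API running, `post_transcription("http://localhost:3000/api/v1/transcribe", payload, "user", "password")` would send the request and return the decoded response.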