Speech to Text

Bases: AudioAPI

SpeechToTextAPI is a subclass of AudioAPI designed for speech-to-text models. It extends AudioAPI with transcription support for several ASR model families, including Wav2Vec2, Whisper, faster-whisper, Seamless, and whisper.cpp.

Attributes:

    model (AutoModelForCTC): The speech-to-text model.
    processor (AutoProcessor): The processor that prepares input audio data for the model.

Methods

transcribe(audio_input: bytes) -> str: Transcribes the given audio input to text using the speech-to-text model.

Example CLI Usage:

genius SpeechToTextAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
        --id facebook/wav2vec2-large-960h-lv60-self \
    listen \
        --args \
            model_name="facebook/wav2vec2-large-960h-lv60-self" \
            model_class="Wav2Vec2ForCTC" \
            processor_class="Wav2Vec2Processor" \
            use_cuda=True \
            precision="float32" \
            quantization=0 \
            device_map="cuda:0" \
            max_memory=None \
            torchscript=False \
            compile=True \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

or using whisper.cpp:

genius SpeechToTextAPI rise \
    batch \
        --input_folder ./input \
    batch \
        --output_folder ./output \
    none \
    listen \
        --args \
            model_name="large" \
            use_whisper_cpp=True \
            endpoint="*" \
            port=3000 \
            cors_domain="http://localhost:3000" \
            username="user" \
            password="password"

__init__(input, output, state, **kwargs)

Initializes the SpeechToTextAPI with configurations for speech-to-text processing.

Parameters:

    input (BatchInput): The input data configuration. Required.
    output (BatchOutput): The output data configuration. Required.
    state (State): The state configuration. Required.
    **kwargs: Additional keyword arguments. Default: {}.
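For programmatic use outside the genius CLI, construction looks roughly like the sketch below. The import paths and constructor signatures are assumptions based on geniusrise conventions (BatchInput/BatchOutput taking a local folder plus S3 bucket and prefix, InMemoryState for local state) and may need adjusting to your installed versions:

import os

# Assumed imports; verify against your geniusrise / geniusrise_audio install.
from geniusrise import BatchInput, BatchOutput, InMemoryState
from geniusrise_audio import SpeechToTextAPI

api = SpeechToTextAPI(
    input=BatchInput("./input", "my-bucket", "s3/input"),      # assumed signature
    output=BatchOutput("./output", "my-bucket", "s3/output"),  # assumed signature
    state=InMemoryState(),
)
# Model selection and server options are supplied separately, as the
# listen --args in the CLI examples above show.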

asr_pipeline(**kwargs)

Transcribes the input audio using a Hugging Face automatic-speech-recognition pipeline.

This method wraps a pre-trained ASR model in the Hugging Face pipeline API to convert speech to text. It is suitable for processing various kinds of audio content.

Parameters:

    **kwargs (Any): Arbitrary keyword arguments, typically containing 'audio_file' (base64-encoded audio) plus options such as 'model_sampling_rate' and 'chunk_length_s'. Default: {}.

Returns:

    Dict[str, Any]: A dictionary containing the transcription results.

Example CURL Request for transcription:

(base64 -w 0 sample.flac | awk '{print "{\"audio_file\": \""$0"\", \"model_sampling_rate\": 16000, \"chunk_length_s\": 60}"}' > /tmp/payload.json)
curl -X POST http://localhost:3000/api/v1/asr_pipeline \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/payload.json | jq
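The same request can be made from Python; a minimal client sketch, assuming the server from the CLI examples above is listening on localhost:3000 with the user/password credentials shown:

import base64
import requests

# Read and base64-encode the audio, mirroring the CURL example above.
with open("sample.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "audio_file": audio_b64,
    "model_sampling_rate": 16000,
    "chunk_length_s": 60,
}

response = requests.post(
    "http://localhost:3000/api/v1/asr_pipeline",
    json=payload,
    auth=("user", "password"),  # credentials from the CLI example
    timeout=300,
)
print(response.json())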

initialize_pipeline()

Lazily initializes the Hugging Face ASR pipeline.

process_faster_whisper(audio_input, model_sampling_rate, chunk_size, generate_args)

Processes audio input with the faster-whisper model.

Parameters:

    audio_input (bytes): The audio input for transcription. Required.
    model_sampling_rate (int): The sampling rate of the model. Required.
    chunk_size (int): The size of audio chunks to process. Required.
    generate_args (Dict[str, Any]): Additional arguments for transcription. Required.

Returns:

    Dict[str, Any]: A dictionary containing the transcription results.
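For orientation, here is a standalone sketch of the kind of call this method makes, using the faster-whisper package directly. The checkpoint name and options are illustrative; the method itself receives the model through the API's configuration, and its internal chunking may differ:

from faster_whisper import WhisperModel

# Illustrative model size and device settings.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() accepts a path, a file-like object, or a numpy array and
# returns a generator of segments plus metadata such as detected language.
segments, info = model.transcribe("sample.flac", beam_size=5)

transcription = " ".join(segment.text.strip() for segment in segments)
print(info.language, transcription)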

process_seamless(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)

Processes audio input with the Seamless (SeamlessM4T) model.

process_wav2vec2(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size)

Processes audio input with the Wav2Vec2 model.
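A minimal sketch of chunked CTC decoding with Wav2Vec2 via transformers, to illustrate how chunk_size and overlap_size interact; the boundary stitching here is naive, and the real method's logic may differ. The chunk_size and overlap_size values match the transcribe CURL example further below:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

def transcribe_chunked(waveform, sampling_rate=16000, chunk_size=1280000, overlap_size=213333):
    """Transcribe a 1-D float waveform in overlapping chunks (illustrative)."""
    texts = []
    step = chunk_size - overlap_size
    for start in range(0, len(waveform), step):
        chunk = waveform[start : start + chunk_size]
        inputs = processor(chunk, sampling_rate=sampling_rate, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(ids)[0])
    # Naive concatenation; real overlap handling would deduplicate boundary text.
    return " ".join(texts)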

process_whisper(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)

Processes audio input with the Whisper model.
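Similarly, a hedged sketch of Whisper inference through transformers, assuming a 16 kHz mono waveform; the generate_args parameter corresponds to keyword arguments of model.generate:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Illustrative checkpoint; the API loads whatever model_name is configured.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def transcribe_whisper(waveform, sampling_rate=16000, **generate_args):
    # Convert the waveform to the log-mel input features Whisper expects.
    features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
    with torch.no_grad():
        generated_ids = model.generate(features, **generate_args)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# e.g. transcribe_whisper(audio, num_beams=4, do_sample=True, temperature=0.6)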

transcribe()

API endpoint to transcribe the given audio input to text using the speech-to-text model. Expects a JSON input with 'audio_file' as a key containing the base64 encoded audio data.

Returns:

    Dict[str, str]: A dictionary containing the transcribed text.

Example CURL Request for transcription:

(base64 -w 0 sample.flac | awk '{print "{\"audio_file\": \""$0"\", \"model_sampling_rate\": 16000, \"chunk_size\": 1280000, \"overlap_size\": 213333, \"do_sample\": true, \"num_beams\": 4, \"temperature\": 0.6, \"tgt_lang\": \"eng\"}"}' > /tmp/payload.json)
curl -X POST http://localhost:3000/api/v1/transcribe \
    -H "Content-Type: application/json" \
    -u user:password \
    -d @/tmp/payload.json | jq
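The awk one-liner simply wraps the base64 audio in a JSON payload; an equivalent Python sketch that is easier to modify (same endpoint and credentials as above):

import base64
import json

with open("sample.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "audio_file": audio_b64,
    "model_sampling_rate": 16000,
    "chunk_size": 1280000,
    "overlap_size": 213333,
    "do_sample": True,
    "num_beams": 4,
    "temperature": 0.6,
    "tgt_lang": "eng",
}

with open("/tmp/payload.json", "w") as f:
    json.dump(payload, f)
# Then POST it with curl as shown above, or with requests as in the
# asr_pipeline example.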