Speech to Text
Bases: AudioAPI
SpeechToTextAPI is a subclass of AudioAPI designed for speech-to-text models. It extends AudioAPI to handle transcription with various automatic speech recognition (ASR) models.
Attributes:
Name | Type | Description |
---|---|---|
model | AutoModelForCTC | The speech-to-text model. |
processor | AutoProcessor | The processor to prepare input audio data for the model. |
Methods
transcribe(audio_input: bytes) -> str: Transcribes the given audio input to text using the speech-to-text model.
Example CLI Usage:
genius SpeechToTextAPI rise \
batch \
--input_folder ./input \
batch \
--output_folder ./output \
none \
--id facebook/wav2vec2-large-960h-lv60-self \
listen \
--args \
model_name="facebook/wav2vec2-large-960h-lv60-self" \
model_class="Wav2Vec2ForCTC" \
processor_class="Wav2Vec2Processor" \
use_cuda=True \
precision="float32" \
quantization=0 \
device_map="cuda:0" \
max_memory=None \
torchscript=False \
compile=True \
endpoint="*" \
port=3000 \
cors_domain="http://localhost:3000" \
username="user" \
password="password"
Or, using whisper.cpp:
genius SpeechToTextAPI rise \
batch \
--input_folder ./input \
batch \
--output_folder ./output \
none \
listen \
--args \
model_name="large" \
use_whisper_cpp=True \
endpoint="*" \
port=3000 \
cors_domain="http://localhost:3000" \
username="user" \
password="password"
__init__(input, output, state, **kwargs)
Initializes the SpeechToTextAPI with configurations for speech-to-text processing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input | BatchInput | The input data configuration. | required |
output | BatchOutput | The output data configuration. | required |
state | State | The state configuration. | required |
**kwargs | | Additional keyword arguments. | {} |
asr_pipeline(**kwargs)
Transcribes audio using the Hugging Face automatic speech recognition (ASR) pipeline.
This method uses a pre-trained ASR model to convert speech in the input audio into text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
**kwargs | Any | Arbitrary keyword arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dictionary containing the transcription results. |
Example CURL Request for transcription:
initialize_pipeline()
Lazily initializes the Hugging Face ASR pipeline.
process_faster_whisper(audio_input, model_sampling_rate, chunk_size, generate_args)
Processes audio input with the faster-whisper model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
audio_input | bytes | The audio input for transcription. | required |
model_sampling_rate | int | The sampling rate of the model. | required |
chunk_size | int | The size of audio chunks to process. | required |
generate_args | Dict[str, Any] | Additional arguments for transcription. | required |
Returns:
Type | Description |
---|---|
Dict[str, Any] | A dictionary containing the transcription results. |
process_seamless(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)
Processes audio input with the Seamless model.
process_wav2vec2(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size)
Processes audio input with the Wav2Vec2 model.
process_whisper(audio_input, model_sampling_rate, processor_args, chunk_size, overlap_size, generate_args)
Processes audio input with the Whisper model.
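The chunk_size and overlap_size parameters shared by the process_* methods suggest that long audio is split into overlapping windows before inference. The sketch below illustrates that idea only; it is not the library's implementation. The function name split_with_overlap is hypothetical, and it assumes both sizes are measured in samples.

```python
from typing import List, Sequence


def split_with_overlap(
    samples: Sequence[float], chunk_size: int, overlap_size: int
) -> List[Sequence[float]]:
    """Split samples into windows of chunk_size that share overlap_size samples.

    Hypothetical helper: the real process_* methods may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")

    step = chunk_size - overlap_size  # how far each new window advances
    chunks: List[Sequence[float]] = []
    start = 0
    while start < len(samples):
        chunks.append(samples[start:start + chunk_size])
        if start + chunk_size >= len(samples):
            break  # this window already reaches the end of the audio
        start += step
    return chunks
```

With ten samples, chunk_size=4 and overlap_size=1, each window starts three samples after the previous one, so neighbouring windows share one sample. The overlap gives the model context across chunk boundaries, at the cost of having to reconcile duplicated text when the partial transcripts are merged.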
transcribe()
API endpoint that transcribes the given audio input to text using the speech-to-text model. Expects a JSON body with an 'audio_file' key containing the base64-encoded audio data.
Returns:
Type | Description |
---|---|
Dict[str, str] | A dictionary containing the transcribed text. |
Example CURL Request for transcription:
(base64 -w 0 sample.flac | awk '{print "{\"audio_file\": \""$0"\", \"model_sampling_rate\": 16000, \"chunk_size\": 1280000, \"overlap_size\": 213333, \"do_sample\": true, \"num_beams\": 4, \"temperature\": 0.6, \"tgt_lang\": \"eng\"}"}' > /tmp/payload.json)
curl -X POST http://localhost:3000/api/v1/transcribe \
-H "Content-Type: application/json" \
-u user:password \
-d @/tmp/payload.json | jq
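The same request can be prepared from Python. The following is a minimal sketch using only the standard library; the helper names build_payload and build_request are hypothetical, and the URL, credentials, and option keys are taken from the curl example above rather than from a documented client API.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_payload(audio_bytes: bytes, **options: Any) -> Dict[str, Any]:
    """Base64-encode the audio and merge in optional request fields."""
    payload: Dict[str, Any] = {
        "audio_file": base64.b64encode(audio_bytes).decode("ascii")
    }
    payload.update(options)  # e.g. model_sampling_rate, chunk_size
    return payload


def build_request(
    url: str, payload: Dict[str, Any], username: str, password: str
) -> urllib.request.Request:
    """Create a JSON POST request with HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, read the audio file as bytes, call build_payload(audio, model_sampling_rate=16000), pass the result to build_request("http://localhost:3000/api/v1/transcribe", payload, "user", "password"), and hand the request to send. Keeping payload construction separate from the network call makes the encoding step easy to check without a server.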