GPUStack supports running both Speech-to-Text and Text-to-Speech models. Speech-to-Text models convert audio inputs in various languages into written text, while Text-to-Speech models transform written text into natural and expressive speech.
In this guide, we will walk you through deploying and using Speech-to-Text and Text-to-Speech models in GPUStack.
Before you begin, ensure that you have the following:
Follow these steps to deploy the model from the Model Catalog:

1. Navigate to the Model Catalog page in the GPUStack UI.
2. Select Speech-to-Text in the category filter, then select the Whisper-Large-V3-Turbo model.
3. Click the Save button to deploy the model.

After deployment, you can monitor the deployment's status on the Deployments page. Once the deployment succeeds, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground.
In the Speech to Text playground:

1. Click the Upload button to upload an audio file, or click the Microphone button to record audio.
2. Click the Generate Text Content button to generate the transcription.

You can also use the API to get streaming transcriptions. Here's an example using curl:
# Replace ${SERVER_URL} with your GPUStack server URL and ${YOUR_GPUSTACK_API_KEY} with your API key.
curl ${SERVER_URL}/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
-F model="whisper-large-v3-turbo" \
-F file="@/path/to/audio-file;type=audio/mpeg" \
-F language="en" \
-F stream="true"
This will return streaming transcription results as they become available.
!!! note
    Streaming transcription is only supported when the Speech-to-Text model is deployed on the `vLLM` backend. The `VoxBox` backend does not support streaming.
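
When the model runs on a backend without streaming support, the same request can be sent without the `stream` field. The sketch below wraps both variants in a small shell function. The names `audio_mime_type` and `transcribe` are hypothetical helpers, not part of GPUStack, and the extension-to-MIME mapping lists only a few common audio types; which formats a given deployment accepts is an assumption to verify.

```shell
# Hypothetical helpers. SERVER_URL and YOUR_GPUSTACK_API_KEY must be set,
# as in the curl example above.

# Map a (lowercase) file extension to a MIME type for the multipart "file" field.
audio_mime_type() {
  case "${1##*.}" in
    mp3)  echo "audio/mpeg" ;;
    wav)  echo "audio/wav" ;;
    flac) echo "audio/flac" ;;
    ogg)  echo "audio/ogg" ;;
    *)    echo "application/octet-stream" ;;
  esac
}

# Usage: transcribe <audio-file> [true|false]
# The second argument toggles streaming (default: false).
# The body runs in a subshell so its variables do not leak into the caller.
transcribe() (
  file="$1"
  stream="${2:-false}"
  [ -f "$file" ] || { echo "no such file: $file" >&2; exit 1; }

  # Content-Type is omitted on purpose: with -F, curl sets the multipart
  # header (including the boundary) itself.
  set -- \
    -H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
    -F model="whisper-large-v3-turbo" \
    -F file="@${file};type=$(audio_mime_type "$file")" \
    -F language="en"

  if [ "$stream" = "true" ]; then
    set -- "$@" -F stream="true" --no-buffer
  fi

  curl "${SERVER_URL}/v1/audio/transcriptions" "$@"
)
```

With these definitions, `transcribe /path/to/audio-file.mp3` sends a non-streaming request, and `transcribe /path/to/audio-file.mp3 true` requests streaming output.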
Follow these steps to deploy the model from the Model Catalog:

1. Navigate to the Model Catalog page in the GPUStack UI.
2. Select Text-to-Speech in the category filter, then select the Qwen3-TTS-12Hz-1.7B-CustomVoice model.
3. Click the Save button to deploy the model.

After deployment, you can monitor the deployment's status on the Deployments page. Once the deployment succeeds, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground.
In the Text to Speech playground:

1. Select a voice from the Voice dropdown.
2. Enter Instructions to guide the model to generate the desired style of speech.
3. Click the Submit button to generate the audio.

You can also use the API to get streaming audio output. Here's an example using curl:
# Replace ${SERVER_URL} with your GPUStack server URL and ${YOUR_GPUSTACK_API_KEY} with your API key.
# The apostrophe in the input text is written as '\'' so that it does not end the single-quoted string.
curl ${SERVER_URL}/v1/audio/speech \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
-d '{
"model": "qwen3-tts-12hz-1.7b-customvoice",
"voice": "Vivian",
"task_type": "CustomVoice",
"language": "Auto",
"input": "Good one. Okay, fine, I'\''m just gonna leave this sock monkey here. Goodbye.",
"stream": true,
"response_format": "pcm"
}' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
This streams the audio output and plays it with the `play` command (part of SoX). The audio is raw PCM at a 24 kHz sample rate, with 16-bit signed encoding and a single (mono) channel.
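
Because raw PCM carries no header, the consumer has to supply the sample rate, bit depth, and channel count, and the duration follows directly from the byte count: 24,000 samples/s × 2 bytes × 1 channel = 48,000 bytes per second. As a small worked example (the function name `pcm_duration_ms` is a hypothetical helper), the length of a saved stream can be estimated like this:

```shell
# Estimate the duration, in milliseconds, of a raw PCM file.
# Defaults match the format described above: 24 kHz, 16-bit, mono.
# Usage: pcm_duration_ms <file> [sample_rate] [bits_per_sample] [channels]
pcm_duration_ms() (
  bytes=$(wc -c < "$1") || exit 1
  rate="${2:-24000}"
  bits="${3:-16}"
  channels="${4:-1}"
  bytes_per_second=$(( $rate * $bits / 8 * $channels ))
  echo $(( $bytes * 1000 / $bytes_per_second ))
)
```

For instance, if the response is saved with curl's `--output speech.pcm` instead of being piped to `play`, then `pcm_duration_ms speech.pcm` prints the clip length; a 96,000-byte file corresponds to 2000 ms.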
!!! note
    Streaming speech is only supported when the Text-to-Speech model is deployed on the `vLLM` backend. The `VoxBox` backend does not support streaming.
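
Request text often contains apostrophes or double quotes, which are easy to get wrong inside a single-quoted `-d '...'` argument. One way to sidestep shell quoting is to build the JSON body in a function and pass it to curl on standard input. The sketch below does that under a few assumptions: `speech_payload` and `speak` are hypothetical helper names, the escaping covers only backslashes and double quotes (not newlines or control characters), and the request fields are copied from the curl example above.

```shell
# Print a JSON request body for /v1/audio/speech.
# Usage: speech_payload <text> [true|false]   (second argument: stream, default true)
speech_payload() (
  # Escape backslashes first, then double quotes, so the text fits in a JSON string.
  text=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  stream="${2:-true}"
  cat <<EOF
{
  "model": "qwen3-tts-12hz-1.7b-customvoice",
  "voice": "Vivian",
  "task_type": "CustomVoice",
  "language": "Auto",
  "input": "${text}",
  "stream": ${stream},
  "response_format": "pcm"
}
EOF
)

# Send the payload to the server and write the audio bytes to stdout.
# Usage: speak <text> [true|false]
speak() {
  speech_payload "$1" "${2:-true}" | curl "${SERVER_URL}/v1/audio/speech" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
    --data-binary @- --no-buffer
}
```

With these helpers, the text can be passed as an ordinary double-quoted shell argument, for example `speak "Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye." | play -t raw -r 24000 -e signed -b 16 -c 1 -`. Passing `false` as the second argument drops streaming for backends that do not support it.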
GPUStack also supports voice cloning with Text-to-Speech models. Here's how to use it:

1. Navigate to the Model Catalog page in the GPUStack UI.
2. Select Text-to-Speech in the category filter, then select the Qwen3-TTS-12Hz-1.7B-Base model.
3. Click the Save button to deploy the model.

Once the deployment succeeds, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground. Then follow these steps:

1. In the Reference Audio field, upload an audio file or input an audio URL to provide the reference voice for cloning. For example, you can input the URL https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-0115/APS-en_33.wav.
2. Select the Use Speaker Embedding Only (no ICL) option.
3. Enter the text to synthesize, for example: Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye.
4. Click the Submit button to generate the speech with the cloned voice.