Update 04/19/2026

Grok Speech to Text and Text to Speech APIs

Grok
xAI

April 17, 2026

Fast and accurate. Natural, expressive voices. Simple pricing. Multilingual support.

Today, we are excited to announce two powerful standalone audio APIs: Grok Speech to Text (STT) and Grok Text to Speech (TTS). Both are built on the same stack that powers Grok Voice, Tesla vehicles, and Starlink customer support.

These standalone endpoints make it straightforward for developers to integrate high-quality speech features into any application, whether you're creating voice agents, real-time transcription tools, accessibility solutions, podcasts, or interactive audio experiences.

Speech to Text
High accuracy, low latency.

Generate transcripts from large audio files in milliseconds via our REST API
Transcribe speech in real time with our lowest-latency WebSocket API
We’ve added powerful features like word-level timestamps, speaker diarization, and multichannel support. The API also includes intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more.
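One common use of word-level timestamps is caption generation. The sketch below groups timestamped words into SRT cues. The word shape used here (a dict with "text", "start", and "end" in seconds) is an assumption made for illustration, not the documented response format, so adapt the field names to what the API actually returns.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group timestamped words into numbered SRT caption cues.

    Each word is assumed (hypothetically) to look like
    {"text": "hello", "start": 0.0, "end": 0.4}.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        start, end = chunk[0]["start"], chunk[-1]["end"]
        cues.append(
            f"{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "start": 0.0, "end": 0.4},
        {"text": "world", "start": 0.5, "end": 0.9},
    ]
    print(words_to_srt(sample))
```

Splitting on a fixed word count keeps the sketch short. A production captioner would more likely break on pauses or punctuation.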

Pricing
We keep pricing straightforward and predictable: Speech to Text is $0.10 per hour for batch and $0.20 per hour for streaming. Full details and current rate limits are available in the xAI API console.
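For budgeting, the rates above translate into a quick estimate. This sketch assumes cost scales linearly with audio duration. Actual billing increments and rounding are not stated here, so treat the result as an approximation.

```python
from decimal import Decimal

# Rates quoted above, in USD per hour of audio.
STT_RATE_PER_HOUR = {
    "batch": Decimal("0.10"),
    "streaming": Decimal("0.20"),
}


def stt_cost(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate Speech to Text cost in USD.

    Assumes linear proration by duration (an assumption, not a
    documented billing rule).
    """
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * STT_RATE_PER_HOUR[mode]).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    print(stt_cost(100 * 3600, "batch"))       # 100 hours of batch audio
    print(stt_cost(90 * 60, "streaming"))      # a 90-minute live stream
```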

Enterprise-Grade Transcription
Grok STT is evaluated against the top commercial models on phone calls, meetings, video/podcasts, and telephony. It excels at entity recognition and in business domains such as medical, legal, and financial.

Most transcription models give you raw spoken words. Grok Speech to Text goes further.

When you enable formatting, the API performs advanced Inverse Text Normalization that intelligently converts spoken language into proper structured output.
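To make the idea concrete, here is a deliberately tiny, rule-based illustration of what Inverse Text Normalization does. It is a toy written for this post, not how Grok Speech to Text works internally: it only handles number words below one thousand, plus the units "dollars" and "percent", in unpunctuated text.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
# Unit words that attach to the preceding number.
SUFFIXES = {"dollars": "${}", "percent": "{}%"}


def toy_itn(text: str) -> str:
    """Convert simple spoken numbers to digits (toy illustration only)."""
    out = []
    value = None  # number currently being accumulated, if any
    for token in text.split():
        word = token.lower()
        if word in UNITS:
            value = (value or 0) + UNITS[word]
        elif word in TENS:
            value = (value or 0) + TENS[word]
        elif word == "hundred" and value is not None:
            value *= 100
        elif value is not None and word in SUFFIXES:
            out.append(SUFFIXES[word].format(value))
            value = None
        else:
            if value is not None:  # flush a finished number
                out.append(str(value))
                value = None
            out.append(token)
    if value is not None:
        out.append(str(value))
    return " ".join(out)


if __name__ == "__main__":
    print(toy_itn("the invoice total is two hundred fifty dollars due in thirty days"))
```

Real Inverse Text Normalization also has to cover dates, times, phone numbers, ordinals, and ambiguous phrasing, which is why it is usually handled by the transcription service and not by hand-written rules like these.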

Multilingual fluency
The Grok Speech to Text API offers strong multilingual support across 25+ languages, so you can switch languages seamlessly without missing a beat.

Multichannel & Diarization (Speaker Identification)
Transcribe multichannel audio files for perfect speaker separation with the same API.

Detect speakers in both pre-recorded audio and real-time streams, with word-level speaker IDs provided by diarization.
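Word-level speaker IDs are usually easier to read once they are merged into speaker turns. The sketch below shows one way to do that. The word shape (a dict with "text", "speaker", "start", and "end") is a hypothetical stand-in for whatever the API returns.

```python
from itertools import groupby
from typing import Dict, List


def speaker_turns(words: List[Dict]) -> List[Dict]:
    """Merge consecutive words from the same speaker into one turn.

    Each word is assumed (hypothetically) to look like
    {"text": "hi", "speaker": 0, "start": 0.0, "end": 0.3}.
    """
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        run = list(group)
        turns.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": " ".join(w["text"] for w in run),
        })
    return turns


if __name__ == "__main__":
    sample = [
        {"text": "Hi", "speaker": 0, "start": 0.0, "end": 0.3},
        {"text": "there", "speaker": 0, "start": 0.3, "end": 0.6},
        {"text": "Hello", "speaker": 1, "start": 0.9, "end": 1.3},
        {"text": "Thanks", "speaker": 0, "start": 1.5, "end": 1.9},
    ]
    for turn in speaker_turns(sample):
        print(f"Speaker {turn['speaker']}: {turn['text']}")
```

Because `groupby` only merges adjacent items, a speaker who talks again later starts a new turn, which matches how a transcript is normally laid out.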

Text to Speech
Fast, natural, and expressive voices with Speech Tags.

Turn long-form text into speech with our REST API
Generate speech in real time with our WebSocket API

Fine-Grained Control
Add natural prosody and emotion using simple inline and wrapping speech tags: [laugh], [sigh], [whisper], <emphasis>, <slow>, <pause>, and many more. These controls let you create engaging, lifelike delivery without complex markup.
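As a small illustration, the helpers below assemble tagged text before it is sent for synthesis. Two things here are assumptions and not confirmed syntax: that square-bracket tags are the inline ones and angle-bracket tags are the wrapping ones, and that a wrapping tag closes with a matching `</name>`. The helper names are hypothetical as well.

```python
def inline_tag(name: str) -> str:
    """Return an inline speech tag such as [laugh] (assumed syntax)."""
    return f"[{name}]"


def wrap_tag(name: str, text: str) -> str:
    """Wrap text in a speech tag such as <emphasis>...</emphasis>.

    The closing-tag form is an assumption made for illustration.
    """
    return f"<{name}>{text}</{name}>"


def build_script() -> str:
    """Assemble a short tagged script from plain and tagged pieces."""
    parts = [
        "Well,",
        inline_tag("sigh"),
        "that was",
        wrap_tag("emphasis", "not"),
        "what I expected.",
        inline_tag("laugh"),
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_script())
```

Building tags through small helpers keeps tag names in one place, which makes a script easier to adjust if the supported tag list changes.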

Pricing
Text to Speech is priced at $4.20 per 1 million characters, with straightforward usage-based billing and no hidden fees.
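The per-character rate makes cost estimates simple arithmetic. This sketch assumes every character in the input counts toward billing, including whitespace and any speech tags. How characters are actually counted is not stated here, so treat the figure as an estimate.

```python
from decimal import Decimal

# Rate quoted above, in USD per 1 million characters.
TTS_RATE_PER_MILLION_CHARS = Decimal("4.20")


def tts_cost(char_count: int) -> Decimal:
    """Estimate Text to Speech cost in USD for a given character count.

    Assumes all characters are billed equally (an assumption, not a
    documented billing rule).
    """
    return Decimal(char_count) / Decimal(1_000_000) * TTS_RATE_PER_MILLION_CHARS


if __name__ == "__main__":
    script = "Welcome back to the show. " * 1000
    print(len(script), "characters, estimated cost in USD:", tts_cost(len(script)))
```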