
Kalpa Labs recently launched!
Founded by Prashant Shishodia & Gautam Jha
Prashant led full-stack ML for Google Assistant, training and evaluating models and scaling smaller Gemini variants to billions of queries per month.
Gautam built nanosecond-latency infrastructure for high-frequency trading.
Today’s speech stack is fragmented. You need different models and vendors for STT, TTS, voice design, conversational agents, dubbing, even music. That “Curse of Specialization” creates brittle workflows, poor context carryover, and zero “system-prompt” steerability. Meanwhile, LLMs proved that one generalist model + in-context learning unlocks entirely new use cases.
What’s missing in current speech models:
The founders are disciples of Sutton’s Bitter Lesson: performance eventually comes from scaling compute, data, and simple, general methods. They believe speech is where text was in 2019 - constrained by small models, fixed task boundaries, and narrow post-training. The upside is to do for speech what GPT-3/ChatGPT did for text: one model, in-context learning, and steerability.
Audio tokenization (RVQ) and decoding stacks are ripe for redesign - big efficiency wins are still on the table.
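For readers less familiar with RVQ, here is a minimal sketch of how residual vector quantization turns one latent audio frame into a stack of discrete tokens. The codebook count and sizes below are illustrative assumptions, not Kalpa Labs’ actual configuration.

```python
# Minimal residual vector quantization (RVQ) sketch. Codebook count and sizes
# are illustrative assumptions, not any specific model's configuration.
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one latent audio frame into one token per codebook."""
    residual = frame.copy()
    tokens = []
    for codebook in codebooks:                       # codebook: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code at this stage
        tokens.append(idx)
        residual = residual - codebook[idx]          # next stage quantizes the residual
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 128)) for _ in range(8)]  # 8 codebooks over a 128-dim latent
frame = rng.standard_normal(128)
print(rvq_encode(frame, codebooks))   # 8 tokens for a single latent frame
```

Because each stage quantizes the residual left by the previous one, every latent frame emits one token per codebook, which is why audio token counts balloon so quickly.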
Audio is token-hungry. With typical RVQ, 1s of audio ≈ 100–400 tokens. Flattened token streams like Orpheus’s cap useful context: 8K tokens buys only about 90s of audio. Approaches like CSM-1B help with context but still decode 32+ audio tokens per step, throttling inference.
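To make those numbers concrete, here is a back-of-envelope budget for a flattened RVQ stream. The frame rate and codebook count are assumptions chosen at the low end of the 100–400 tokens/s range quoted above; real tokenizers vary.

```python
# Back-of-envelope token budget for a flattened RVQ stream.
# Frame rate and codebook count are illustrative assumptions.
frame_rate_hz = 12.5                                      # latent frames per second
codebooks_per_frame = 8                                   # tokens per frame when flattened
tokens_per_second = frame_rate_hz * codebooks_per_frame   # 100 tokens/s

context_tokens = 8_000
audio_seconds = context_tokens / tokens_per_second        # ~80 s of audio fits in an 8K window
print(f"{tokens_per_second:.0f} tokens/s -> {audio_seconds:.0f} s of audio per 8K context")

# Grouping a frame's tokens into one decode step shortens the sequence,
# but each step must still emit every codebook's token for that frame.
decode_steps_per_second = frame_rate_hz                   # one step per frame
tokens_emitted_per_step = 32                              # e.g. 32 codebooks per step
print(f"{decode_steps_per_second:.1f} steps/s, {tokens_emitted_per_step} tokens per step")
```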
Kalpa Labs removed the long-audio bottleneck, making audio roughly as cheap to train on as text while preserving long-range context. Practically, that means you’ll be able to generate hours of audio in one shot and use speech models with very long interleaved text and audio system prompts.
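As a purely hypothetical illustration of what an interleaved text and audio system prompt could look like, the sketch below treats a prompt as alternating spans of text tokens and audio tokens in one sequence. The Span structure and all token ids are invented for this example and are not Kalpa Labs’ API.

```python
# Hypothetical illustration only: a long interleaved text+audio system prompt
# represented as a single sequence of modality-tagged spans.
from dataclasses import dataclass

@dataclass
class Span:
    modality: str        # "text" or "audio"
    tokens: list[int]    # ids in that modality's vocabulary

system_prompt = [
    Span("text",  [101, 2023, 2003]),          # e.g. "match the style of this clip:"
    Span("audio", list(range(5000, 5800))),    # a reference recording as audio tokens
    Span("text",  [1996, 2279, 2742]),         # e.g. further instructions on tone and pacing
]

total_tokens = sum(len(span.tokens) for span in system_prompt)
print(f"{len(system_prompt)} spans, {total_tokens} tokens in the interleaved prompt")
```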
At Kalpa Labs they’ve pretrained speech models from 800M to 4.8B params on 2M hours of mixed-domain audio.
Cost Efficiency: As a result of Kalpa Labs’ efficient architecture, their 800M-parameter model took less than $1000 to train. For comparison, Kokoro-82M, despite being 10x smaller and trained on 1000x less data, also took $1000 to train.
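A quick normalization of that comparison, using only the numbers quoted in this post and params × hours of audio as a rough compute proxy; Kokoro’s data volume is inferred from the “1000x less data” figure relative to the 2M hours above.

```python
# Normalizing the training-cost comparison. "params * hours" is only a rough
# compute proxy; Kokoro's data volume is inferred from "1000x less data".
kalpa_params, kalpa_hours, kalpa_cost = 800e6, 2e6, 1000          # <= $1000
kokoro_params, kokoro_hours, kokoro_cost = 82e6, 2e6 / 1000, 1000

kalpa_per_dollar = kalpa_params * kalpa_hours / kalpa_cost
kokoro_per_dollar = kokoro_params * kokoro_hours / kokoro_cost
print(f"params*hours per dollar: Kalpa {kalpa_per_dollar:.2e} vs Kokoro {kokoro_per_dollar:.2e}")
print(f"ratio: ~{kalpa_per_dollar / kokoro_per_dollar:.0f}x")     # ~10,000x: 10x params * 1000x data at the same cost
```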
Kalpa Labs’ larger base models already show visible signs of emergent behaviour. Some of these behaviours are described below, but the extent of such emergent capabilities is still under investigation.
Please check out the audio samples and the comparison with ElevenLabs V3 on Notion.
Disfluency & repetition handling
Text: “This is a sentence we want our speech model to, to speak.”
Contextual identity/accents
“I am a software engineer living in Bangalore.” → natural Indian English
“I just moved to Shanghai for a new role.” → adapts toward Chinese-English prosody.
Note that there is no hardcoded voice switch; this behaviour emerges from context.
Prosodic context awareness without explicit tags
“I said we could try only once.”
“I said we could ONLY try once.”
“I said we could ONLY try ONCE?”
Stress and intonation match intent without explicit tags like <laugh>, <gasp>, <surprised>.
Voice diversity
These base models can mimic a wide range of speakers/accents beyond a fixed dropdown of voices.
What the founders are building
A single speech generalist you can steer like an LLM.
They’re solving this by: