Kalpa Labs Launches: Scaling Generalist Speech Models

February 9, 2026

Kalpa Labs recently launched!

Launch YC: Kalpa Labs: Scaling Generalist Speech Models

"Scaling Foundational Speech Models for In-Context Learning & Instruction Following"

TL;DR: Kalpa Labs is building a generalist speech model - one system that does speech-to-text, text-to-speech, speech-in/speech-out reasoning, and cross-modal tasks with LLM-level steerability and context engineering. See the demo of emergent capabilities.

Founded by Prashant Shishodia & Gautam Jha

About the Founders

Prashant led full-stack ML for Google Assistant, where he trained and evaluated models and scaled smaller Gemini variants to billions of queries per month.

Gautam built nanosecond-latency infrastructure for high-frequency trading.

Problem

Today’s speech stack is fragmented. You need different models and vendors for STT, TTS, voice design, conversational agents, dubbing, even music. That “Curse of Specialization” creates brittle workflows, poor context carryover, and zero “system-prompt” steerability. Meanwhile, LLMs proved that one generalist model + in-context learning unlocks entirely new use cases.

What’s missing in current speech models:

  1. Contextual awareness: They ignore emotional/prosodic cues and can’t adapt delivery to the spoken context history. Their response is the same whether the user says the sentence while crying, laughing, or in a flat, neutral tone.
  2. Instruction following: There’s no real analogue to LLM system prompts like “Speak slower for older callers,” “Pronounce ‘WORA’ like ‘wah-rah’,” or “Use a neutral US accent unless the user is in India.”
  3. Generalism: Separate models for STT/TTS/music/voice cloning make cross-specialization tasks impossible. Example: “sing a song in my voice with the following lyrics and the tone from this audio clip”.

The founders are disciples of Sutton’s Bitter Lesson: performance eventually comes from scaling compute, data, and simple, general methods. They believe speech is where text was in 2019 - constrained by small models, fixed task boundaries, and narrow post-training. The upside is to do for speech what GPT-3/ChatGPT did for text: one model, in-context learning, and steerability.

Audio tokenization (RVQ) and decoding stacks are ripe for redesign - big efficiency wins are still on the table.

What’s Hard & What Kalpa Labs changed

Audio is token-hungry. With typical RVQ, 1s of audio ≈ 100–400 tokens. Flattened token streams like Orpheus cap useful context: generating 90s of audio already consumes roughly 8K tokens. Approaches like CSM-1B help with context but still decode 32+ audio tokens per step, throttling inference.
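To make that budget concrete, here is a rough back-of-envelope sketch using only the approximate token rates quoted above; the numbers are illustrative assumptions, not measurements of any particular model:

```python
# Back-of-envelope audio-token budget, using the approximate rates quoted above.
# All figures are illustrative assumptions, not measurements of a specific model.

CONTEXT_WINDOW = 8_192  # tokens in a typical flattened-stream setup


def seconds_in_context(tokens_per_second: float, context_tokens: int = CONTEXT_WINDOW) -> float:
    """Seconds of audio that fit in the context at a given audio-token rate."""
    return context_tokens / tokens_per_second


for rate in (100, 200, 400):  # ~tokens per second of audio under typical RVQ
    print(f"{rate:>3} tok/s -> {seconds_in_context(rate):6.1f} s of audio per 8K context")

# At ~90 tok/s (8K tokens for ~90 s, as with flattened streams like Orpheus),
# long-form generation and long spoken context hit the wall almost immediately.
```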

Kalpa Labs removed the long-audio bottleneck, making audio roughly as cheap to train on as text while preserving long-range context. Practically, that means you’ll be able to generate hours of audio in one shot and use speech models with very long interleaved text-and-audio system prompts.

The Progress

Kalpa Labs has pretrained speech models from 800M to 4.8B parameters on 2M hours of mixed-domain audio.

Cost Efficiency: As a result of Kalpa Labs’ efficient architecture, their 800M-parameter model took less than $1000 to train. For comparison, Kokoro-82M, despite being roughly 10x smaller, took about $1000 to train on roughly 1000x less data.

Emergent Behavior

Kalpa Labs’ larger base models already show visible signs of emergent behaviour. Some of these behaviours are described below, but the extent of such emergent capabilities is still under investigation.

Please check out the audio samples and the comparison with ElevenLabs V3 on Notion.

Disfluency & repetition handling

Text: “This is a sentence we want our speech model to, to speak.”

Contextual identity/accents

“I am a software engineer living in Bangalore.” → natural Indian English

“I just moved to Shanghai for a new role.” → adapts toward Chinese-English prosody.

Note that there is no hardcoded voice switch; this emerges from context.

Prosodic context awareness without explicit tags

“I said we could try only once.”

“I said we could ONLY try once.”

“I said we could ONLY try ONCE?”

Stress and intonation match intent without explicit tags like <laugh>, <gasp>, <surprised>.

Voice diversity

These base models can mimic a wide range of speakers/accents beyond a fixed dropdown of voices.

What the founders are building

A single speech generalist you can steer like an LLM:

  • Speech-in / speech-out with system prompts and audio examples: “adopt this tone; pronounce these domain terms like this” (see the sketch after this list).
  • Generalist Model: STT, TTS, conversational agents, dubbing, and cross-specialization (e.g., “sing this verse in my voice,” “explain this chart out loud in a calm tone”).
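To make the steerability idea concrete, here is a purely hypothetical sketch of what an interleaved text-and-audio system prompt could look like. Kalpa Labs has not published an API, so every name below (`SpeechPrompt`, `model.generate`, the file names) is an illustrative assumption, not their interface:

```python
# Hypothetical sketch only: Kalpa Labs has not published an API.
# Every class, function, and file name below is an illustrative assumption.
from dataclasses import dataclass, field


@dataclass
class SpeechPrompt:
    """An interleaved text-and-audio context for steering one generalist speech model."""
    system: str                                                  # LLM-style system instructions
    audio_examples: list[str] = field(default_factory=list)      # reference clips ("adopt this tone")
    turns: list[tuple[str, str]] = field(default_factory=list)   # (modality, content) pairs


prompt = SpeechPrompt(
    system=(
        "Speak slower for older callers. "
        "Pronounce 'WORA' like 'wah-rah'. "
        "Use a neutral US accent unless the user is in India."
    ),
    audio_examples=["reference_tone.wav"],
    turns=[
        ("audio", "user_question.wav"),  # speech in
        ("text", "Answer briefly, then ask one follow-up question."),
    ],
)

# A single generalist endpoint would take this context and return speech out,
# e.g. waveform = model.generate(prompt) - again, a hypothetical interface.
```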

They’re solving this by:

  1. Scale: keep pushing model size, data diversity, and post-training task complexity.
  2. Align: instruction-following for speech - system prompts + audio-in-context.
  3. Distill: real-time deployment targets with graceful latency/quality trade-offs.

The Asks

  • Have a speech use case that current models fail at? Tell the founders; they’ll try their best to solve your edge cases.
  • Running speech in production and have strong opinions on what current speech models get right or wrong? They would love to hear them. Coffee’s on them.
  • Compute partners: If you can offer clusters ≥128×B200, the founders are eager to collaborate.
  • Early pilots: If you want to try out these models in early access and help them navigate the future of speech models - they’d love to chat.

Learn More

🌐 Visit kalpalabs.ai to learn more.
🤝 You can reach out to the founders via email, X (Prashant | Gautam | KalpaLabs), and LinkedIn (Prashant | Gautam | KalpaLabs).

👣 Follow Kalpa Labs on LinkedIn & X.