
Founded by Prady Modukuru, Prajwal K R, and Rudrabha Mukhopadhyay
They’ve built a state-of-the-art lip-sync model – and they’re building towards real-time face-to-face conversations w/ AI indistinguishable from humans 🦾
Try Sync's playground here: https://app.synclabs.so/playground
Theoretically, their models can support any language: they learn mappings from phonemes to visemes (the most basic units, or "tokens", of how the sounds we make correspond to the mouth shapes that produce them). It's simple, but it's a start towards learning a foundational understanding of humans from video.
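For intuition, here's a toy phoneme-to-viseme lookup in Python. This is purely illustrative: Sync's models learn this mapping from video rather than reading a hand-written table, and the phoneme symbols (ARPAbet) and viseme labels below are assumptions chosen for the example.

```python
# Toy illustration of the phoneme -> viseme idea (NOT Sync's actual model,
# which learns the mapping end-to-end from video). Phoneme symbols follow
# ARPAbet; the viseme groupings are a simplified, hand-picked set.

PHONEME_TO_VISEME = {
    # bilabials: lips pressed together
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    # labiodentals: lower lip against upper teeth
    "F": "lip_teeth", "V": "lip_teeth",
    # rounded vowels / glides
    "UW": "rounded", "OW": "rounded", "W": "rounded",
    # open vowels
    "AA": "open", "AE": "open", "AH": "open",
}

def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to the mouth shapes that (roughly) produce it."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# "man" -> M AE N : lips close, jaw opens, then a tongue-driven sound with a neutral mouth
print(phonemes_to_visemes(["M", "AE", "N"]))
# ['lips_closed', 'open', 'neutral']
```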
Check out how they used it to dub the entire 2-hour Tucker Carlson interview with Putin speaking fluent English.
Imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.
Real-time at the edge takes us further: live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.
Keyboards / mice are lossy + low-bandwidth. Human communication is rich and goes beyond just the words we say. What if we could compute through face-to-face interaction?
Maybe embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way. This thread of research is exciting.
Powerful models small enough to run at the edge could unlock a lot:
e.g.
Extreme compression for face-to-face video streaming
Enhanced, spatial-aware transcription w/ lip-reading
Detecting deepfakes in the wild
On-device real-time video translation (see the sketch after this list)
etc.
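To make that last item concrete, here's a hedged sketch of how on-device real-time video translation could chain together: transcribe a short window of audio, translate it, synthesize dubbed speech in the speaker's voice, then lip-sync the frames to the new audio. Every function name below is a hypothetical placeholder (none of this is Sync's API); the point is where a lip-sync model slots into the pipeline.

```python
# Hypothetical sketch of an on-device translation + lip-sync pipeline.
# Every component below is a stubbed placeholder, not Sync's API.

from dataclasses import dataclass

@dataclass
class Chunk:
    video_frames: list   # raw frames for this short window
    audio_samples: list  # matching audio for this window

def transcribe(audio_samples, src_lang):      # e.g. a small on-device ASR model
    ...

def translate(text, src_lang, tgt_lang):      # e.g. a compact MT model
    ...

def synthesize(text, tgt_lang, voice_ref):    # TTS cloned to the speaker's voice
    ...

def lip_sync(video_frames, dubbed_audio):     # re-render the mouth region to match the new audio
    ...

def translate_stream(chunks, src_lang="ja", tgt_lang="en"):
    """Process short windows of a live feed so latency stays interactive."""
    for chunk in chunks:
        text = transcribe(chunk.audio_samples, src_lang)
        dubbed_text = translate(text, src_lang, tgt_lang)
        dubbed_audio = synthesize(dubbed_text, tgt_lang, voice_ref=chunk.audio_samples)
        yield lip_sync(chunk.video_frames, dubbed_audio), dubbed_audio
```

Chunking the stream into short windows is what keeps latency interactive, at the cost of some cross-sentence context for the translator.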