Modulate

AI/ML
Data & Analytics
Enterprise
SaaS
Boston

Velma, by Modulate, is a frontier model for real conversation. Enterprises use it to fight fraud, ensure compliance, and protect users; developers use it to power more accurate real-time agents, transcribe noisy audio more reliably, and detect the subtle vocal cues of deepfakes.

Story

For decades, the dream of conversational AI has been simple: a system that listens the way people do. Over the last three years, the field has made extraordinary progress on the language half of that problem. Large language models can read transcripts, summarize them, and respond in fluent, helpful prose. It's a remarkable foundation, and it's reshaped what's possible.

But conversation isn't just language. It's also sound.

Anyone who has been on the phone with a frustrated customer, or felt the shift in tone when a friend says "I'm fine," knows that the words are only part of the signal. The pauses, the prosody, the stress in someone's voice, the breath before a difficult sentence — these carry meaning that a transcript can't preserve. And as voice becomes the primary interface to AI agents handling support, sales, claims, and triage, the gap between reading a conversation and hearing one is becoming the difference between an AI deployment that works and one that doesn't.

The Opportunity: Voice Needs Its Own Architecture

Voice is multidimensional. A frustrated customer and a confused one can say the same six words and mean entirely different things. Catching that difference requires a system that processes audio as audio — tone, pause, prosody, background, cadence — rather than collapsing it into text first and reasoning over the text.

That's a different architectural problem than the one LLMs were designed to solve. It calls for something purpose-built.

The next layer of conversational AI isn't a bigger language model. It's an architecture that was built to listen.

Velma: An Ensemble Built for Voice

In late 2025, Modulate's research team had a breakthrough that reframed the company. The result is Velma — a proprietary voice intelligence platform built on a novel Ensemble Listening Model (ELM) architecture purpose-designed for the structure of sound.

Instead of one massive generalist, Velma orchestrates over 100 specialized models, each analyzing a different dimension of voice — emotion, prosody, deception, escalation, synthetic voice detection, speaker behavior, conversational context — and fuses their outputs through a time-aligned orchestration layer into a single, explainable interpretation of what was actually said. Five layers of analysis. Auditable at every step. No black box.
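
To make the idea of time-aligned fusion concrete, here is a minimal sketch in Python. It is an illustration only, not Modulate's implementation: the `Signal` record, the `fuse` function, the model names, the one-second window, and the 0.5 score threshold are all hypothetical. It shows how outputs from several specialist models could be grouped by time window while keeping the evidence behind each label, which is what makes the result auditable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One output from a (hypothetical) specialist model."""
    model: str    # which specialist produced it, e.g. "emotion"
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str    # what the specialist detected
    score: float  # confidence between 0 and 1


def fuse(signals, window=1.0, threshold=0.5):
    """Group signals into fixed time windows and keep per-window evidence."""
    buckets = defaultdict(list)
    for s in signals:
        first = int(s.start // window)
        # Subtract a tiny epsilon so a signal ending exactly on a window
        # boundary is not counted in the following window.
        last = int(max(s.start, s.end - 1e-9) // window)
        for w in range(first, last + 1):
            buckets[w].append(s)

    timeline = []
    for w in sorted(buckets):
        # Only confident signals contribute to the fused interpretation.
        evidence = [s for s in buckets[w] if s.score >= threshold]
        if evidence:
            timeline.append({
                "window": (w * window, (w + 1) * window),
                "labels": sorted({s.label for s in evidence}),
                # Provenance: which model said what, and how confidently.
                "evidence": [(s.model, s.label, s.score) for s in evidence],
            })
    return timeline


# Illustrative inputs; the values are made up for the example.
signals = [
    Signal("emotion", 0.0, 1.5, "frustrated", 0.8),
    Signal("prosody", 1.0, 2.0, "rising_pitch", 0.7),
    Signal("synthetic_voice", 0.0, 2.0, "synthetic", 0.1),
]
timeline = fuse(signals)
for entry in timeline:
    print(entry["window"], entry["labels"])
```

In this toy run the low-confidence synthetic-voice signal is dropped, and the second window carries both the emotion and the prosody labels, each traceable to the model that produced it. A production system would presumably weight and cross-check signals rather than apply a single threshold.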

The performance speaks for itself:

  • #1 globally on Hugging Face's speech deepfake detection leaderboard at 98.9% accuracy.
  • #1 globally on the AMI dataset for noisy, multi-speaker transcription — the hardest real-world voice benchmark there is.
  • ~30% higher accuracy on conversation understanding than today's leading generalist models.
  • ~$0.13 per 1,000 minutes of voice processed — 10 to 100x more cost-efficient than typical hyperscaler voice pipelines.
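
For a sense of scale, the pricing figure above can be converted into other units with simple arithmetic. The short Python sketch below uses only the quoted ~$0.13 per 1,000 minutes and the quoted 10 to 100x multiple. The implied comparison range assumes that "10 to 100x more cost-efficient" means a comparable pipeline costs 10 to 100 times as much per minute, which is an interpretation rather than a stated figure.

```python
# Quoted figure: roughly $0.13 per 1,000 minutes of voice processed.
PRICE_PER_1000_MIN = 0.13

per_minute = PRICE_PER_1000_MIN / 1000  # dollars per minute
per_hour = per_minute * 60              # dollars per hour of audio

# Assumption: "10 to 100x more cost-efficient" is read as the comparison
# pipeline costing 10x to 100x as much for the same 1,000 minutes.
implied_low = PRICE_PER_1000_MIN * 10
implied_high = PRICE_PER_1000_MIN * 100

print(f"per hour of audio: ${per_hour:.4f}")
print(f"implied comparison cost per 1,000 min: ${implied_low:.2f} to ${implied_high:.2f}")
```

Under that reading, an hour of audio costs under one cent, and the implied comparison range is roughly $1.30 to $13.00 per 1,000 minutes.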

Velma didn't appear from nowhere. The ELM architecture has been hardened in production for years inside ToxMod, Modulate's live voice-moderation system embedded in Call of Duty, Grand Theft Auto Online, and Rainbow Six Siege: the noisiest, most adversarial real-time voice environments that exist. Today Velma runs at staggering scale: 300M+ users, 250M+ hours of voice data, and 700 more conversations arriving every minute, across customers including Activision Blizzard, DoorDash, Epic Games, Meta, and Microsoft. To date, the platform has protected over 40M consumers from fraud, harassment, and abuse, and safeguarded $100M+ in enterprise value. You can hear it yourself at m-demo.in.

The strategic shift is that Velma is no longer a feature inside a moderation product. It's the foundation — and Modulate is becoming the voice intelligence layer that enterprises building real-world conversational AI will increasingly depend on.

Why We Invested

Hyperplane led Modulate's pre-seed in 2017 on a single thesis: voice would become the primary interface to AI. For that thesis to pay off, three things had to be true at the same time. Nine years later, they finally are:

  • Voice is the interface. AI agents are picking up the phone for support, sales, claims, and triage. Every Fortune 500 voice deployment runs through audio, and the volume is compounding daily.
  • The complementary layer is needed. Language models handle text beautifully. The next leap in conversational AI comes from pairing them with systems built natively for sound.
  • The architecture exists. ELMs aren't a research project. They already top public benchmarks for speech deepfake detection and noisy multi-speaker transcription, and they already run in production at the hardest scale.

Mike and Carter met at MIT when Mike cracked a physics problem Carter was whiteboarding in a hallway. Fast friends and faster collaborators, they've spent nearly a decade building the world's deepest expertise in voice intelligence. Carter's optimization work delivers orders-of-magnitude improvements in cost and performance; before Modulate, he was at NASA's Jet Propulsion Laboratory building ML systems optimized to fly on spacecraft. Mike shapes how the Fortune 500 and Capitol Hill think about deploying AI responsibly, sitting on working groups alongside leaders from across the industry. They've been right about voice for nine years. They'll be right for the next nine.

Human beings talk. Modulate's AI can finally listen.