microsoft/VibeVoice

Simon Willison

Microsoft released VibeVoice, a Whisper-style open-source speech-to-text model with built-in speaker diarization, MIT licensed and available via MLX for local Mac inference.

Categories: Model Releases, OSS & Tools

Excerpt

<p><strong><a href="https://github.com/microsoft/VibeVoice">microsoft/VibeVoice</a></strong></p>
<p>VibeVoice is Microsoft's Whisper-style audio model for speech-to-text, MIT licensed and with speaker diarization built into the model.</p>
<p>Microsoft released it on January 21st, 2026 but I hadn't tried it until today. Here's a one-liner to run it on a Mac with <code>uv</code>, <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a> (by Prince Canuma) and the 5.71GB <a href="https://huggingface.co/mlx-community/VibeVoice-ASR-4bit">mlx-community/VibeVoice-ASR-4bit</a> MLX conversion of the <a href="https://huggingface.co/microsoft/VibeVoice-ASR/tree/main">17.3GB VibeVoice-ASR</a> model, in this case against a downloaded copy of my recent <a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/">podcast appearance with Lenny Rachitsky</a>:</p>
<pre><code>uv run --with mlx-audio python -m mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio lenny.mp3 --output-path lenny \
  --format json --verbose --max-tokens 32768
</code></pre>
<p><img alt="" src="https://static.simonwillison.net/static/2026/vibevoice-terminal.jpg" /></p>
<p>The tool reported back:</p>
<pre><code>Processing time: 524.79 seconds
Prompt: 26615 tokens, 50.718 tokens-per-sec
Generation: 20248 tokens, 38.585 tokens-per-sec
Peak memory: 30.44 GB
</code></pre>
<p>So that's 8 minutes 45 seconds for an hour of audio (running on a 128GB M5 Max MacBook Pro).</p>
<p>I've tested it against <code>.wav</code> and <code>.mp3</code> files and they both worked fine.</p>
<p>If you omit <code>--max-tokens</code> it defaults to 8192, which is enough for about 25 minutes of audio. I discovered that through trial-and-error and quadrupled it to guarantee I'd get the full hour.</p>
<p>That command reported using 30.44GB of RAM at peak, but in Activity Monitor I observed 61.5GB of usage during the prefill stage and around 18GB during the generating phase.</p>
<p>Here's <a href="https://gist.g