microsoft/VibeVoice

Simon Willison · Apr 27, 2026

Microsoft released VibeVoice, a Whisper-style open-source speech-to-text model with built-in speaker diarization, MIT licensed and available via MLX for local Mac inference.

Categories: Model Releases, OSS & Tools

Excerpt

<a href="https://github.com/microsoft/VibeVoice">microsoft/VibeVoice</a>

VibeVoice is Microsoft's Whisper-style audio model for speech-to-text, MIT licensed and with speaker diarization built into the model. Microsoft released it on January 21st, 2026 but I hadn't tried it until today.

Here's a one-liner to run it on a Mac with <code>uv</code>, <a href="https://github.com/Blaizzy/mlx-audio">mlx-audio</a> (by Prince Canuma) and the 5.71GB <a href="https://huggingface.co/mlx-community/VibeVoice-ASR-4bit">mlx-community/VibeVoice-ASR-4bit</a> MLX conversion of the <a href="https://huggingface.co/microsoft/VibeVoice-ASR/tree/main">17.3GB VibeVoice-ASR</a> model, in this case against a downloaded copy of my recent <a href="https://simonwillison.net/2026/Apr/2/lennys-podcast/">podcast appearance with Lenny Rachitsky</a>:

<pre><code>uv run --with mlx-audio python -m mlx_audio.stt.generate \
  --model mlx-community/VibeVoice-ASR-4bit \
  --audio lenny.mp3 --output-path lenny \
  --format json --verbose --max-tokens 32768
</code></pre>

<img alt="" src="https://static.simonwillison.net/static/2026/vibevoice-terminal.jpg" />

The tool reported back:

<pre><code>Processing time: 524.79 seconds
Prompt: 26615 tokens, 50.718 tokens-per-sec
Generation: 20248 tokens, 38.585 tokens-per-sec
Peak memory: 30.44 GB
</code></pre>

So that's 8 minutes 45 seconds for an hour of audio (running on a 128GB M5 Max MacBook Pro).

I've tested it against <code>.wav</code> and <code>.mp3</code> files and they both worked fine.

If you omit <code>--max-tokens</code> it defaults to 8192, which is enough for about 25 minutes of audio. I discovered that through trial-and-error and quadrupled it to guarantee I'd get the full hour.

That command reported using 30.44GB of RAM at peak, but in Activity Monitor I observed 61.5GB of usage during the prefill stage and around 18GB during the generating phase.

Here's <a href="https://gist.g

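The timing and token figures quoted in the excerpt can be sanity-checked with a few lines of Python. This is a rough sketch rather than anything from the original post: it assumes the episode is exactly 60 minutes long (the post only says "an hour of audio"), and the helper names are made up for illustration.

```python
# Figures copied from the tool output quoted in the excerpt.
PROCESSING_SECONDS = 524.79
GENERATED_TOKENS = 20248
DEFAULT_MAX_TOKENS = 8192

# Assumption: the audio is treated as exactly 60 minutes long.
AUDIO_MINUTES = 60.0


def format_duration(seconds: float) -> str:
    """Render seconds as 'M minutes S seconds', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} minutes {secs} seconds"


def tokens_per_minute(generated_tokens: int, audio_minutes: float) -> float:
    """Average number of generated tokens per minute of audio."""
    return generated_tokens / audio_minutes


def minutes_covered(max_tokens: int, rate: float) -> float:
    """Minutes of audio a token budget covers at the given tokens-per-minute rate."""
    return max_tokens / rate


rate = tokens_per_minute(GENERATED_TOKENS, AUDIO_MINUTES)
speedup = AUDIO_MINUTES * 60 / PROCESSING_SECONDS

print(format_duration(PROCESSING_SECONDS))
print(f"{rate:.0f} generated tokens per minute of audio")
print(f"default budget covers about {minutes_covered(DEFAULT_MAX_TOKENS, rate):.0f} minutes")
print(f"about {speedup:.1f}x faster than real time")
```

Under that assumption the default budget of 8192 tokens works out to roughly 24 minutes of audio, in line with the "about 25 minutes" found by trial and error, and 32768 is exactly four times the default.
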
Read at source: https://simonwillison.net/2026/Apr/27/vibevoice/#atom-everything