vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference
Pure C++/ggml port of Microsoft VibeVoice enables local TTS with voice cloning running on CPU/CUDA/Metal/Vulkan without Python dependencies.
Excerpt
A few weeks ago I shipped [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp), a pure-C++ ggml port of [Microsoft VibeVoice](https://github.com/microsoft/VibeVoice), the TTS model with voice cloning. Wanted to post a follow-up here because we're at a point where the engine has grown well past "first-pass port" and into something other people might actually want to run.
This work was brought to you with <3 from the [LocalAI](https://github.com/mudler/LocalAI) team!
What it does:
* TTS with pre-converted voice prompts: works with any of upstream's .pt voices (ours or yours, converted via `scripts/convert_voice_to_gguf.py`). Give it a ~30 s reference clip and it generates 24 kHz speech in the cloned voice. Pre-converted GGUFs (including the 0.5B realtime model) ship on [https://huggingface.co/mudler/vibevoice.cpp-models](https://huggingface.co/mudler/vibevoice.cpp-models)
* Long-form ASR with speaker diarization: the 7B-parameter model returns JSON segments `{start, end, speaker, content}`. Tested on up to 17 minutes of audio in one shot.
Backends: CPU, CUDA, Metal, Vulkan, and hipBLAS via ggml's backend dispatch. Ships as a single binary or as `libvibevoice.so` + a flat C ABI for embedding (purego/cgo/dlopen-friendly).
Numbers:

| Run | Inference | RTF | Peak RSS |
|---|---|---|---|
| 68 s sample, CUDA Q4_K (GB10) | 28 s | 0.41 | ~6 GB |
| 68 s sample, CPU Q4_K (R9) | 150 s | 2.20 | ~8 GB |
| 17 min audio, CPU Q8_0 | 1929 s | 1.94 | ~26 GB |
Read at source: https://www.reddit.com/r/LocalLLaMA/comments/1t48fkt/vibevoicecpp_microsoft_vibevoice_tts_longform_asr/