vibevoice.cpp: Microsoft VibeVoice (TTS + long-form ASR with diarization) ported to ggml/C++, runs on CPU/CUDA/Metal/Vulkan, no Python at inference

· r/LocalLLaMA ·

Pure C++/ggml port of Microsoft VibeVoice enables local TTS with voice cloning running on CPU/CUDA/Metal/Vulkan without Python dependencies.

Categories: OSS & Tools

Excerpt

A few weeks ago I shipped [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp), a pure-C++ ggml port of Microsoft VibeVoice (the text-to-speech model with voice cloning, https://github.com/microsoft/VibeVoice). I wanted to post a follow-up here because the engine has grown well past "first-pass port" and into something other people might actually want to run. This work was brought to you with <3 from the [LocalAI](https://github.com/mudler/LocalAI) team!

What it does:

* TTS with pre-converted voice prompts (any of upstream's .pt voices, ours or yours, converted via scripts/convert\_voice\_to\_gguf.py): give it a 30s reference clip, generate 24kHz speech in the cloned voice. Ships pre-converted GGUFs (0.5B realtime model) at [https://huggingface.co/mudler/vibevoice.cpp-models](https://huggingface.co/mudler/vibevoice.cpp-models)
* Long-form ASR with speaker diarization: 7B-parameter model, returns JSON segments {start, end, speaker, content}. Tested on up to 17 minutes of audio in one shot.

Backends: CPU (baseline), CUDA, Metal, Vulkan, hipBLAS via ggml's backend dispatch. Single binary, or libvibevoice.so + flat C ABI for embedding (purego/cgo/dlopen-friendly).

Numbers:

| Run | Inference | RTF | Peak RSS |
|---|---|---|---|
| 68s sample, CUDA Q4_K (GB10) | 28 s | 0.41 | ~6 GB |
| 68s sample, CPU Q4_K (R9) | 150 s | 2.20 | ~8 GB |
| 17min audio, CPU Q8_0 | 1929 s | 1.94 | ~26 G |
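For anyone sanity-checking the numbers above, here is a minimal sketch. It assumes RTF means wall-clock inference time divided by audio duration, which the post does not spell out but which matches the 68s rows. The function name and the `runs` list are illustrative only and are not part of vibevoice.cpp.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """Inference time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return inference_s / audio_s


# Figures quoted in the post: (label, inference seconds, audio seconds).
# Assumption: the "68s sample" is exactly 68 seconds long.
runs = [
    ("68s sample, CUDA Q4_K (GB10)", 28.0, 68.0),
    ("68s sample, CPU Q4_K (R9)", 150.0, 68.0),
]

for label, inference_s, audio_s in runs:
    print(f"{label}: RTF {real_time_factor(inference_s, audio_s):.2f}")

# The long-form row gives inference time and RTF, so the audio length can be
# recovered by inverting the same formula.
implied_audio_s = 1929.0 / 1.94
print(f"17min row implies about {implied_audio_s / 60:.1f} minutes of audio")
```

Recomputed values agree with the table only to within rounding of the quoted inference times. Under this definition, the CUDA run is faster than real time and both CPU runs are roughly twice as slow as real time.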

Discussions