Detailed Introduction
VoxCPM is an open-source, tokenizer-free text-to-speech (TTS) system released by OpenBMB. It models speech in a continuous acoustic space to enable context-aware, highly expressive synthesis and supports zero-shot voice cloning from a short reference audio clip. Built on a MiniCPM-4 backbone, VoxCPM provides training and inference pipelines, pretrained weights, and an interactive demo on Hugging Face for quick evaluation.
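A minimal quickstart sketch follows. The package name `voxcpm`, the `VoxCPM.from_pretrained` entry point, the checkpoint id `openbmb/VoxCPM-0.5B`, the `generate` signature, and the 16 kHz output rate are assumptions made for illustration; check the repository README for the exact interface.

```python
# Illustrative quickstart (hypothetical API names; see the repository for the exact interface).
import soundfile as sf          # pip install soundfile
from voxcpm import VoxCPM       # assumed package/class name

# Download pretrained weights from Hugging Face (checkpoint id is an assumption).
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Synthesize a waveform from plain text; output assumed to be a 1-D float array.
wav = model.generate(text="VoxCPM models speech in a continuous acoustic space.")

# Save at an assumed 16 kHz sample rate.
sf.write("output.wav", wav, 16000)
```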
Main Features
- Context-aware expressiveness: generates prosody and speaking style that match the semantic content by modeling continuous acoustic representations.
- True-to-life voice cloning: accurate zero-shot cloning that captures timbre, prosody, and fine-grained speaker characteristics from brief reference audio (see the sketch after this list).
- Efficient inference: engineering optimizations enable streaming synthesis with low real-time factor (RTF) on consumer GPUs.
- Open-source release: code, checkpoints, and examples are published under the Apache-2.0 license on GitHub and Hugging Face.
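For the zero-shot cloning feature above, a hedged usage sketch, assuming the same hypothetical `generate` interface additionally accepts a reference clip and its transcript:

```python
# Illustrative zero-shot cloning call (parameter names are assumptions).
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# A few seconds of reference audio plus its transcript condition the target voice.
wav = model.generate(
    text="This sentence is rendered in the reference speaker's voice.",
    prompt_wav_path="reference.wav",                  # short reference clip (assumed parameter)
    prompt_text="Transcript of the reference clip.",  # assumed parameter
)
sf.write("cloned.wav", wav, 16000)  # assumed 16 kHz output
```

Streaming synthesis and RTF measurement depend on the inference interface exposed by the release; consult the repository for the supported options.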
Use Cases
VoxCPM is suited to high-fidelity, context-sensitive synthesis tasks such as voice assistants, media dubbing, linguistic research, and prototyping TTS for low-resource languages. Researchers can experiment with novel synthesis techniques, and engineers can quickly integrate the pretrained models and demo services into applications.
Technical Features
VoxCPM employs tokenizer-free continuous acoustic modeling combined with hierarchical language modeling and finite scalar quantization (FSQ) constraints to decouple semantic and acoustic representations. The system uses a diffusion-based autoregressive generation pipeline built on MiniCPM-4 and provides training recipes, inference interfaces, and example scripts; the interactive demo and model downloads are available via the project page and Hugging Face, and a technical report is available on arXiv.
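The data flow described above can be sketched as follows. Module names, dimensions, and the simplified FSQ and acoustic-head stand-ins are illustrative only and are not the actual VoxCPM implementation; the real system uses a MiniCPM-4 language model and a diffusion-based acoustic generator.

```python
# Conceptual sketch: text -> semantic LM -> FSQ-constrained latents -> continuous acoustic features.
import torch
import torch.nn as nn

class FSQBottleneck(nn.Module):
    """Finite scalar quantization: snap each latent dim to a small set of levels.
    Acts as an information bottleneck between the semantic and acoustic stages."""
    def __init__(self, levels: int = 8):
        super().__init__()
        self.levels = levels

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = torch.tanh(z)                              # bound latents to (-1, 1)
        q = torch.round((z + 1) / 2 * (self.levels - 1))
        q = q / (self.levels - 1) * 2 - 1              # map quantized values back to [-1, 1]
        return z + (q - z).detach()                    # straight-through estimator

class VoxCPMSketch(nn.Module):
    """Hypothetical pipeline: semantic LM states are FSQ-constrained, then mapped
    to continuous acoustic latents by an acoustic head (diffusion head stand-in)."""
    def __init__(self, vocab: int = 32000, d_model: int = 512, d_acoustic: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.semantic_lm = nn.TransformerEncoder(       # stand-in for the MiniCPM-4 backbone
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.fsq = FSQBottleneck()
        self.acoustic_head = nn.Sequential(             # stand-in for the diffusion-based generator
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_acoustic))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.semantic_lm(self.embed(token_ids))     # context-aware semantic states
        z = self.fsq(h)                                 # decoupling constraint
        return self.acoustic_head(z)                    # continuous acoustic latents

feats = VoxCPMSketch()(torch.randint(0, 32000, (1, 16)))
print(feats.shape)  # torch.Size([1, 16, 64])
```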