VoxCPM

A tokenizer-free text-to-speech (TTS) system enabling context-aware speech generation and true-to-life zero-shot voice cloning.

Detailed Introduction

VoxCPM is an open-source tokenizer-free text-to-speech (TTS) system released by OpenBMB. It models speech in a continuous acoustic space to enable context-aware, highly expressive synthesis and supports zero-shot voice cloning from short reference audio. Built on a MiniCPM-4 backbone, VoxCPM provides training and inference pipelines, pretrained weights, and an interactive demo on Hugging Face for quick evaluation.

Main Features

  • Context-aware expressiveness: generates prosody and speaking style that match the semantic content by modeling continuous acoustic representations.
  • True-to-life voice cloning: accurate zero-shot cloning capturing timbre, prosody, and fine-grained characteristics from brief reference audio.
  • Efficient inference: engineering optimizations enable streaming synthesis with a low real-time factor (RTF) on consumer GPUs.
  • Open-source release: code, checkpoints, and examples are published under the Apache-2.0 license on GitHub and Hugging Face.
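
The real-time factor mentioned above is the ratio of synthesis time to the duration of the audio produced, so values below 1.0 mean speech is generated faster than it plays back. The following is a minimal sketch of how it could be measured. The `synthesize` callable and the `sample_rate` argument are hypothetical placeholders for whatever TTS function is under test, not part of VoxCPM's API.

```python
import time


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int) -> float:
    """Time one call to a TTS function and report its RTF.

    `synthesize` is a hypothetical callable that takes text and returns
    a sequence of mono audio samples at `sample_rate` Hz.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

For example, 0.5 seconds of compute for 2.0 seconds of audio gives `real_time_factor(0.5, 2.0) == 0.25`.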

Use Cases

VoxCPM is suitable for high-fidelity and context-sensitive synthesis tasks such as voice assistants, media dubbing, linguistic research, and prototyping TTS for low-resource languages. Researchers can experiment with novel synthesis techniques, and engineers can quickly integrate pretrained models and demo services.

Technical Features

VoxCPM employs tokenizer-free continuous acoustic modeling combined with hierarchical language modeling and FSQ (finite scalar quantization) constraints to decouple semantics and acoustics. The system uses a diffusion-autoregressive pipeline built on MiniCPM-4 and provides training recipes, inference interfaces, and example scripts. A demo and model downloads are available via the project page and Hugging Face, and the design is described in an arXiv report.
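
To give a feel for what an FSQ constraint does, here is a generic sketch of finite scalar quantization: each dimension of a latent vector is squashed into a bounded range and rounded to one of a small, fixed number of integer levels. This illustrates the general technique only. The function names, the restriction to odd level counts, and the example level sizes are assumptions made for this sketch, and it does not reproduce how VoxCPM applies FSQ internally.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of `levels[i]` integer values.

    Simplified illustration: only odd level counts are supported, so the
    codes are symmetric around zero, e.g. 5 levels -> {-2, -1, 0, 1, 2}.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels < 3 or n_levels % 2 == 0:
            raise ValueError("this sketch supports odd level counts >= 3")
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest integer level
    return codes


def codebook_size(levels):
    """Number of distinct codes implied by the per-dimension level counts."""
    return math.prod(levels)
```

With three dimensions of five levels each, `fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5])` returns `[0, 2, -2]`, and `codebook_size([5, 5, 5])` is 125. The implicit codebook is the product of the per-dimension level counts, so no learned codebook lookup is needed.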
