Text-to-speech models

16 models · ranked by HuggingFace downloads

Kokoro-82M

Kokoro-82M is a compact 82-million-parameter text-to-speech model built on the StyleTTS 2 architecture, targeting natural-sounding English speech synthesis at a size that runs on a CPU or a modest GPU. Released under Apache 2.0 with a HuggingFace DOI, it gained attention as a high-quality open TTS model at a significantly smaller scale than most alternatives. It supports multiple English voice styles.

16,925,704 ↓ · 6,372 ♡

XTTS-v2

XTTS-v2 is Coqui's multilingual text-to-speech model, supporting 17 languages and voice cloning from a short audio sample. It uses a GPT-style decoder to generate speech tokens, which enables zero-shot speaker cloning without fine-tuning. The model was released before Coqui's closure and remains available under the Coqui Public Model License (CPML), a non-standard license that restricts commercial use.

9,578,491 ↓ · 3,610 ♡

Qwen3-TTS-12Hz-1.7B-CustomVoice

Qwen3-TTS-12Hz-1.7B-CustomVoice is the 1.7B-parameter CustomVoice variant of Qwen's TTS family, operating at a 12Hz token rate. It synthesizes natural-sounding speech from text and is the larger counterpart to the 0.6B CustomVoice model in this list.

2,122,898 ↓ · 1,622 ♡

chatterbox

Chatterbox is Resemble AI's open-source text-to-speech model offering voice cloning and expressive speech synthesis. It is designed as a production-grade TTS system with controllable prosody and emotion.

2,016,229 ↓ · 1,641 ♡

OmniVoice

OmniVoice from k2-fsa is a multilingual speech model targeting end-to-end ASR and voice processing tasks. It is published as part of the k2/Lhotse/sherpa-onnx ecosystem for server and edge speech applications.

1,829,342 ↓ · 1,058 ♡

Qwen3-TTS-12Hz-0.6B-CustomVoice

Qwen3-TTS-12Hz-0.6B-CustomVoice is the 0.6B variant of Qwen's TTS family, focused on voice customization from reference audio. With a 12Hz token rate and 0.6B parameters, it is designed for constrained environments where the full 1.7B model is too heavy. It supports 9 languages, including CJK and major European languages.

1,016,366 ↓ · 157 ♡
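
The size and token-rate figures quoted for the Qwen3-TTS variants can be turned into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes the nominal parameter counts from the model names (0.6B and 1.7B), 16-bit weights at 2 bytes per parameter, and a token rate of exactly 12 frames per second of audio. The helper names `weight_bytes` and `frames_for_audio` are hypothetical, and real memory use will be higher once activations, caches, and the audio codec are included.

```python
# Rough sizing for the 0.6B and 1.7B variants (assumptions, not measured values).

def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Approximate size of the weights alone; 2 bytes/param assumes fp16 or bf16."""
    return num_params * bytes_per_param


def frames_for_audio(seconds: float, frame_rate_hz: float = 12.0) -> int:
    """Token frames needed for a clip, assuming a fixed frame rate."""
    return round(seconds * frame_rate_hz)


GIB = 1024 ** 3

# Nominal parameter counts taken from the model names; actual counts may differ.
for name, params in [("0.6B", 600_000_000), ("1.7B", 1_700_000_000)]:
    size_gib = weight_bytes(params) / GIB
    print(f"{name}: ~{size_gib:.1f} GiB of weights in 16-bit precision")

print(frames_for_audio(10.0), "token frames for 10 s of audio at 12 Hz")
```

Under these assumptions the 0.6B variant needs roughly a third of the weight memory of the 1.7B one, which is the gap the description refers to as "too heavy" for constrained environments.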

indic-parler-tts

indic-parler-tts is AI4Bharat's extension of Parler-TTS to Indian languages and English. As in Parler-TTS, the voice and delivery are steered by a natural-language description supplied alongside the text to be spoken.

879,755 ↓ · 245 ♡

Qwen3-TTS-12Hz-0.6B-Base

Qwen3-TTS-12Hz-0.6B-Base is the 0.6B-parameter base model of Qwen's TTS family, operating at a 12Hz token rate. It generates audio directly from text tokens, enabling low-latency speech synthesis without a separate vocoder stage.

813,494 ↓ · 249 ♡

VibeVoice-Realtime-0.5B

VibeVoice-Realtime-0.5B is the 0.5B-parameter real-time variant of Microsoft's VibeVoice TTS family. It synthesizes natural-sounding speech from text and is aimed at low-latency, streaming use.

737,556 ↓ · 1,233 ♡

Qwen3-TTS-12Hz-1.7B-VoiceDesign

Qwen3-TTS VoiceDesign is a 1.7B text-to-speech model operating at a 12Hz token rate, designed to support custom voice creation alongside standard TTS. It covers multiple languages and generates expressive speech from text input. It is Apache-2.0 licensed and part of Qwen's audio model family.

728,897 ↓ · 362 ♡

F5-TTS

F5-TTS is a non-autoregressive text-to-speech model that uses flow matching with a Diffusion Transformer (DiT) to synthesize speech from text. It supports zero-shot voice cloning from a short reference clip.

655,125 ↓ · 1,179 ♡

MOSS-TTS

MOSS-TTS is OpenMOSS's multilingual text-to-speech model supporting 20 languages including Chinese, English, German, Japanese, Korean, Russian, and Hebrew. It uses a delay-based autoregressive architecture (moss_tts_delay) for high-quality speech synthesis with natural prosody. Apache-2.0 licensing makes it a viable open alternative to commercial TTS APIs for multilingual applications.

534,515 ↓ · 402 ♡

Kokoro-82M-v1.0-ONNX

Kokoro-82M-v1.0-ONNX is the lightweight 82M-parameter Kokoro text-to-speech model converted to ONNX by the HuggingFace ONNX community, enabling browser-based and edge TTS via Transformers.js. It uses the StyleTTS 2 architecture, which separates style and content representations to produce expressive speech without large acoustic models. The ONNX conversion allows direct client-side inference without a server.

528,748 ↓ · 232 ♡
