Pyannote speaker-diarization-3.1 is a complete speaker diarization pipeline from pyannote.audio that answers 'who spoke when' in an audio recording. It segments audio into speaker-homogeneous regions, clusters them by speaker identity using embedding models, and outputs timestamped speaker labels. Used in meeting transcription, podcast editing, and call center analytics.
8,496,857 ↓ · 2,401 ♡
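The timestamped speaker labels described above can be post-processed with ordinary Python. The following is a minimal sketch, assuming the pipeline output has already been flattened into `(start, end, speaker)` tuples. The tuple layout, the `speaking_time` helper, and the `SPEAKER_00` style labels are illustrative assumptions, not the pyannote.audio API.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum the duration of every turn per speaker.

    `turns` is an iterable of (start_seconds, end_seconds, speaker_label)
    tuples; this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical diarization result for a 10-second clip.
turns = [
    (0.0, 4.5, "SPEAKER_00"),
    (4.5, 6.0, "SPEAKER_01"),
    (6.0, 10.0, "SPEAKER_00"),
]
totals = speaking_time(turns)
print(totals)
```

Per-speaker totals like these are a common first step for meeting or call-center analytics built on diarization output.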
WhisperKit CoreML is a collection of Whisper speech recognition models exported to Apple's CoreML format by Argmax, enabling on-device ASR on Apple Silicon (iPhone, iPad, Mac) without network calls. The models run via the WhisperKit framework, which handles chunking, VAD, and decoding on-device. Designed for iOS/macOS applications requiring offline transcription.
8,387,494 ↓ · 193 ♡
Whisper Large-v3-Turbo is a pruned, fine-tuned version of Whisper Large-v3 that cuts the decoder from 32 layers to 4, keeping most of the large model's transcription accuracy at substantially lower inference cost. It covers the same 99 languages, with minor quality degradation relative to Large-v3 in some of them. MIT licensed and directly compatible with the Hugging Face Transformers ASR pipeline.
7,853,551 ↓ · 3,104 ♡
whisper-base is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
6,352,714 ↓ · 271 ♡
Whisper Large-v3 is OpenAI's full-size ASR model supporting 99 languages. Per its model card, it was trained on 1 million hours of weakly labeled audio plus 4 million hours of audio pseudo-labeled with Whisper Large-v2 (the original Whisper release used 680,000 hours). It delivers strong transcription accuracy across languages at the cost of significant inference compute. Apache 2.0 licensed. The Large-v3-Turbo variant (a pruned, fine-tuned version with fewer decoder layers) provides similar quality at lower cost for most use cases.
5,977,766 ↓ · 5,846 ♡
wav2vec2-large-xlsr-53-japanese converts spoken audio to written text. It was trained on large multilingual speech corpora and supports chunked inference for long recordings.
5,959,856 ↓ · 59 ♡
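Chunked inference for long recordings usually means cutting the waveform into overlapping windows, transcribing each window, and stitching the results. The sketch below shows only the windowing step. The function name `chunk_bounds` and its parameters are hypothetical and are not taken from any library; the Hugging Face ASR pipeline exposes a comparable idea through its `chunk_length_s` argument.

```python
def chunk_bounds(num_samples, chunk_len, overlap):
    """Return (start, end) sample indices of overlapping chunks.

    Consecutive chunks share `overlap` samples so that words cut at a
    chunk edge appear whole in at least one chunk.
    """
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    step = chunk_len - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# Toy example: 10 samples, chunks of 4 samples, 1 sample of overlap.
bounds = chunk_bounds(10, 4, 1)
print(bounds)
```

In practice the lengths would be expressed in samples at 16 kHz (for example, 30 seconds is 480,000 samples), and the overlapping regions are reconciled when the per-chunk transcripts are merged.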
wav2vec2-large-xlsr-53-polish transcribes Polish audio to text using a wav2vec2 encoder with a CTC output layer. It processes raw 16 kHz audio waveforms and outputs text with optional timestamps.
3,949,805 ↓ · 12 ♡
A Russian-language ASR model fine-tuned from Facebook's wav2vec2-large-xlsr-53 (cross-lingual pre-training on 53 languages) on the Mozilla Common Voice 6.0 Russian dataset. Produces Russian text transcriptions from audio using CTC decoding. Community-contributed under Apache 2.0.
3,463,019 ↓ · 75 ♡
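CTC decoding, used by the wav2vec2 fine-tunes in this list, turns one prediction per audio frame into text by collapsing repeated tokens and then dropping the blank token. The following is a minimal greedy-decoding sketch. The toy vocabulary, the blank id of 0, and the function name are assumptions for illustration; real checkpoints ship their own vocabulary and tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive duplicate ids, then remove CTC blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical three-symbol vocabulary and per-frame argmax ids.
vocab = {0: "<blank>", 1: "д", 2: "а"}
frames = [1, 1, 0, 2, 2, 2, 0, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank token is what lets CTC emit a genuinely doubled letter: the frame sequence `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.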
Wav2Vec2 XLSR-53 Large fine-tuned on Mozilla Common Voice 6 Dutch data for Dutch automatic speech recognition. Part of Jonatas Grosman's systematic XLSR fine-tuning series covering multiple languages. Apache-2.0 licensed with published evaluation results.
3,451,832 ↓ · 15 ♡
wav2vec2-indonesian-javanese-sundanese converts spoken audio to written text. It was trained on large multilingual speech corpora and supports chunked inference for long recordings.
3,413,717 ↓ · 15 ♡
A pretrained voice activity detection pipeline from pyannote.audio, identifying speech segments in audio streams. It is trained on AMI, DIHARD, and VoxConverse corpora and outputs timestamped speech/non-speech labels.
3,303,479 ↓ · 236 ♡
MMS-300M-1130-forced-aligner is Meta's 300M parameter wav2vec2-based model fine-tuned for forced phoneme-level alignment across 1,130 languages. It takes audio and a text transcript as input and outputs word- or phoneme-level timestamps, enabling subtitle synchronization and linguistic documentation at scale. The CC-BY-NC-4.0 license restricts commercial deployment.
3,265,689 ↓ · 91 ♡
wav2vec2-large-xlsr-53-portuguese is an XLSR-53 model fine-tuned on Portuguese Common Voice data for automatic speech recognition using CTC decoding on 16 kHz mono audio. It achieves competitive word error rates on both European and Brazilian Portuguese test sets. Part of the community XLSR fine-tuning effort from the 2021 HuggingFace strong speech event.
3,191,379 ↓ · 54 ♡
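Word error rate (WER), the metric cited for several models here, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch, assuming whitespace tokenization and no text normalization; the function name is illustrative, and published WER figures typically apply casing and punctuation normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of three reference words.
wer = word_error_rate("o gato preto", "o gato")
print(wer)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.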
A community-supported speaker diarization pipeline from pyannote.audio that segments multi-speaker audio into per-speaker turns. It combines voice activity detection, speaker embedding, and clustering steps into a single callable pipeline.
3,156,125 ↓ · 587 ♡
Wav2Vec2 XLSR-53 Large (pre-trained on 53 languages) fine-tuned for Greek ASR by Jonatas Grosman, part of their extensive series of language-specific ASR models. Provides a practical open Greek speech recognition model fine-tuned from a strong multilingual backbone.
3,009,670 ↓ · 4 ♡
wav2vec2-large-xlsr-53-arabic converts spoken audio to written text. It was trained on large multilingual speech corpora and supports chunked inference for long recordings.
2,787,941 ↓ · 54 ♡
Whisper-small is OpenAI's 244M-parameter multilingual speech recognition model, covering 99 languages with reasonable accuracy. It balances quality and inference speed, performing significantly better than tiny/base while running on modest hardware.
2,744,535 ↓ · 569 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Hungarian speech recognition, trained on Mozilla Common Voice. The model converts Hungarian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
2,574,739 ↓ · 10 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Telugu speech recognition, trained on OpenSLR. The model converts Telugu audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
2,202,189 ↓ · 5 ♡
Fine-tuned a wav2vec2 backbone for Romanian speech recognition, trained on Mozilla Common Voice. The model converts Romanian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
2,151,707 ↓ · 7 ♡
Fine-tuned a wav2vec2 backbone for Swedish speech recognition, trained on Mozilla Common Voice and NST. The model converts Swedish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
2,028,894 ↓ · 13 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Persian speech recognition, trained on Mozilla Common Voice. The model converts Persian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,971,447 ↓ · 26 ♡
whisper-tiny transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.
1,852,074 ↓ · 433 ♡
A wav2vec2 300M model fine-tuned for Filipino (Tagalog) ASR using the XLS-R multilingual pretrained backbone. One of the few open Filipino speech recognition models available.
1,782,593 ↓ · 2 ♡
Wav2Vec2-large-xlsr-hindi transcribes Hindi audio to text using a wav2vec2 encoder with a CTC output layer. It processes raw 16 kHz audio waveforms and outputs text with optional timestamps.
1,772,746 ↓ · 12 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Urdu speech recognition, trained on Mozilla Common Voice. The model converts Urdu audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,769,429 ↓ · 13 ♡
Fine-tuned a wav2vec2 backbone for Tamil speech recognition, trained on available speech corpora. The model converts Tamil audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,739,258 ↓ · 4 ♡
wav2vec2-large-xlsr-53-th converts spoken audio to written text. It was trained on large multilingual speech corpora and supports chunked inference for long recordings.
1,690,093 ↓ · 28 ♡
Qwen3-ASR 1.7B is Alibaba's 1.7B parameter automatic speech recognition model supporting multiple languages. It is designed as a production-grade ASR model with strong multilingual performance at a compact size.
1,655,144 ↓ · 889 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Czech speech recognition, trained on Mozilla Common Voice. The model converts Czech audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,495,070 ↓ · 3 ♡
Voxtral-Mini-4B-Realtime-2602 is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,491,258 ↓ · 886 ♡
faster-whisper-base transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.
1,429,313 ↓ · 30 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Finnish speech recognition, trained on Mozilla Common Voice. The model converts Finnish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,426,086 ↓ · 1 ♡
wav2vec2-large-xlsr-53-chinese-zh-cn is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,419,807 ↓ · 134 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Hebrew speech recognition, trained on available speech corpora. The model converts Hebrew audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,398,070 ↓ · 6 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for speech recognition; the target language and training corpora are not identified in this listing. The model converts speech audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,347,905 ↓ · 5 ♡
Fine-tuned a wav2vec2 backbone for Vietnamese speech recognition, trained on VLSP 2020. The model converts Vietnamese audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,342,328 ↓ · 2 ♡
wav2vec2-base-960h converts spoken English audio to written text. It is the base Wav2Vec2 model, pretrained and fine-tuned on 960 hours of LibriSpeech audio sampled at 16 kHz, and supports chunked inference for long recordings.
1,260,737 ↓ · 398 ♡
parakeet-tdt-0.6b-v3 transcribes audio to text using a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder. It outputs word sequences with optional timestamps.
1,256,195 ↓ · 46 ♡
Fine-tuned a 1B-parameter wav2vec2 backbone for Norwegian Bokmål speech recognition, trained on available speech corpora. The model converts Norwegian Bokmål audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,236,478 ↓ · 0 ♡
Wav2Vec2 Large pretrained on LV-60 (the 60,000-hour Libri-Light English audio set) and fine-tuned on multilingual Common Voice data for phoneme recognition using eSpeak phoneme labels. Produces phoneme-level output rather than word transcription, making it useful for phonetics research and pronunciation assessment rather than standard ASR.
1,215,591 ↓ · 69 ♡
w2v-xls-r-uk is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,183,683 ↓ · 8 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Bengali speech recognition, trained on OpenSLR. The model converts Bengali audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,152,790 ↓ · 10 ♡
faster-whisper-large-v3 is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,152,731 ↓ · 602 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Danish speech recognition, trained on FTSpeech. The model converts Danish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,108,396 ↓ · 0 ♡
faster-whisper-small is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,063,325 ↓ · 34 ♡
Fine-tuned a wav2vec2 backbone for Nepali speech recognition, trained on OpenSLR and Mozilla Common Voice. The model converts Nepali audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,062,079 ↓ · 8 ♡
Fine-tuned a wav2vec2 backbone for Croatian speech recognition, trained on ParlaSpeech. The model converts Croatian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
1,029,793 ↓ · 3 ♡
faster-whisper-tiny.en is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
1,021,151 ↓ · 10 ♡
NB-Wav2Vec2 1B for Nynorsk is the Norwegian National Library's 1B-parameter wav2vec2 model fine-tuned for automatic speech recognition in Nynorsk (New Norwegian). One of very few dedicated Nynorsk ASR models publicly available.
997,937 ↓ · 0 ♡
Fine-tuned a wav2vec2 backbone for Malayalam speech recognition, trained on available speech corpora. The model converts Malayalam audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
987,635 ↓ · 7 ♡
parakeet-ctc-1.1b is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
952,031 ↓ · 50 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Swahili speech recognition, trained on Mozilla Common Voice. The model converts Swahili audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
879,998 ↓ · 3 ♡
Fine-tuned a wav2vec2 backbone for Catalan speech recognition, trained on Mozilla Common Voice. The model converts Catalan audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
873,669 ↓ · 1 ♡
Qwen3-ASR-0.6B is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
866,077 ↓ · 305 ♡
speaker-diarization is a speaker diarization pipeline rather than a transcription model: it takes an audio recording and outputs timestamped speaker labels indicating who spoke when. Accuracy varies with audio quality; background noise and overlapping speech reduce performance.
844,292 ↓ · 1,286 ♡
MLX-format conversion of NVIDIA's Parakeet-TDT 0.6B ASR model, optimized for on-device inference on Apple Silicon. Parakeet-TDT is a FastConformer-based model trained on 64k hours of English audio and achieves competitive WER on LibriSpeech.
806,822 ↓ · 43 ♡
distil-large-v3 transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.
798,261 ↓ · 376 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Estonian speech recognition, trained on Mozilla Common Voice. The model converts Estonian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
749,481 ↓ · 1 ♡
VibeVoice-ASR transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.
744,782 ↓ · 1,184 ♡
Fine-tuned a wav2vec2 backbone for Arabic speech recognition, trained on available speech corpora. The model converts Arabic audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
725,726 ↓ · 1,008 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Slovak speech recognition, trained on Mozilla Common Voice. The model converts Slovak audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
724,128 ↓ · 0 ♡
wav2vec2-large-xlsr-korean is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
721,969 ↓ · 56 ♡
Fine-tuned a wav2vec2 backbone for Lithuanian speech recognition, trained on Mozilla Common Voice. The model converts Lithuanian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
719,625 ↓ · 2 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Basque speech recognition, trained on Mozilla Common Voice. The model converts Basque audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
716,054 ↓ · 1 ♡
Fine-tuned a wav2vec2 backbone for Sanskrit speech recognition, trained on available speech corpora. The model converts Sanskrit audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
707,813 ↓ · 4 ♡
faster-whisper-tiny converts spoken audio to written text. It was trained on large multilingual speech corpora and supports chunked inference for long recordings.
652,853 ↓ · 23 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Welsh speech recognition, trained on Mozilla Common Voice. The model converts Welsh audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
640,938 ↓ · 0 ♡
Fine-tuned a wav2vec2 backbone for Khmer speech recognition, trained on OpenSLR and Mozilla Common Voice. The model converts Khmer audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
614,229 ↓ · 2 ♡
Wav2Vec2 XLS-R 300M fine-tuned on Mozilla Common Voice 7 Turkish data for Turkish automatic speech recognition. XLS-R is Meta's cross-lingual speech representation model; this checkpoint adapts it to Turkish via CTC fine-tuning. CC-BY-4.0 licensed.
591,324 ↓ · 15 ♡
HuBERT-Large fine-tuned on LibriSpeech 960h for English automatic speech recognition. HuBERT uses offline clustering of audio features as pseudo-labels during pretraining, achieving strong ASR quality. Apache-2.0 licensed, it's a foundational ASR model from Meta.
569,271 ↓ · 76 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Punjabi speech recognition, trained on Mozilla Common Voice. The model converts Punjabi audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
568,244 ↓ · 4 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Bulgarian speech recognition, trained on Mozilla Common Voice. The model converts Bulgarian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
563,260 ↓ · 1 ♡
Granite Speech 4.1 2B is IBM's compact speech-language model combining an ASR encoder with a 2B language model decoder. It handles transcription and speech-grounded question answering within a single architecture, targeting enterprise speech analytics use cases.
533,312 ↓ · 143 ♡
Fine-tuned Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Sinhala speech recognition, trained on available speech corpora. The model converts Sinhala audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
526,301 ↓ · 0 ♡
A Whisper-based automatic speech recognition model fine-tuned for Bemba, a Bantu language spoken primarily in Zambia. The fine-tune adapts OpenAI's Whisper architecture to Bemba phonology and vocabulary, a language with very limited prior ASR coverage. Evaluation data and training details are sparse, so users should benchmark on their own domain audio before production use.
524,421 ↓ · 0 ♡
Fine-tuned Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Marathi speech recognition, trained on OpenSLR. The model converts Marathi audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It was produced during the XLSR Fine-Tuning Week or similar community events, targeting languages underrepresented in commercial ASR offerings.
522,512 ↓ · 2 ♡
Granite Speech 3.3-2B is IBM's 2B ASR model supporting five languages (English, French, German, Spanish, Portuguese), using a Granite encoder-decoder architecture. It's positioned for multilingual transcription in enterprise settings. Apache-2.0 licensed with published evaluation results.
507,326 ↓ · 55 ♡
wav2vec2-large-xlsr-53-slovenian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
500,396 ↓ · 0 ♡
Phi-4-Multimodal-Instruct is Microsoft's compact multimodal model handling text, audio, and images in a single instruction-tuned model. Based on Phi-4-Mini, it covers 23 languages and supports speech recognition, speech translation, and visual QA. MIT-licensed, which permits commercial use.
492,814 ↓ · 1,604 ♡
wav2vec2-large-xlsr-kn is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
485,697 ↓ · 1 ♡
reverb-diarization-v1 is a speaker diarization model rather than a transcription model: it outputs timestamped speaker labels for multi-speaker audio. Accuracy varies with audio quality; background noise and overlapping speech reduce performance.
471,481 ↓ · 13 ♡
whisper-medium transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.
469,757 ↓ · 284 ♡
seamless-m4t-v2-large is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
458,890 ↓ · 989 ♡
wav2vec2-large-xlsr-53-german is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
457,416 ↓ · 8 ♡
Qwen3-ForcedAligner-0.6B is a forced alignment model from the Qwen3 ASR family, designed to align audio segments to text transcripts at the phoneme or word level. At 0.6B parameters it's compact for deployment in audio processing pipelines. Apache-2.0 licensed.
449,032 ↓ · 144 ♡
Faster-Whisper is SYSTRAN's CTranslate2-optimized conversion of OpenAI Whisper, enabling up to 4× faster inference with lower memory use. The medium variant (769M parameters) balances multilingual ASR accuracy with throughput.
446,802 ↓ · 52 ♡
wav2vec2-BERT-cantonese is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
442,320 ↓ · 6 ♡
parakeet-tdt-0.6b-v2 is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
417,980 ↓ · 1,500 ♡
wav2vec2-large-xlsr-kazakh is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
401,543 ↓ · 19 ♡
wav2vec2-large-xlsr-galician is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
397,246 ↓ · 2 ♡
wav2vec2-xls-r-300m-pashto is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
376,615 ↓ · 0 ♡
wav2vec2-large-xlsr-latvian-cv is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
368,939 ↓ · 3 ♡
Parakeetkit-Pro is Argmax's optimised packaging of NVIDIA's Parakeet ASR model in CoreML format for Apple Silicon, distributed via the WhisperKit framework. It delivers high-accuracy English transcription on-device with Metal acceleration, positioning itself as a pro-tier local ASR option for macOS applications. The Parakeet architecture is a FastConformer model from NVIDIA trained on 64k+ hours of English speech.
359,906 ↓ · 4 ♡
wav2vec2-large-xlsr-georgian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
351,678 ↓ · 1 ♡
wav2vec2-large-xls-r-300m-armenian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
351,066 ↓ · 0 ♡
wav2vec2-large-xlsr-gu is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
347,787 ↓ · 0 ♡
Wav2Vec2 XLSR-53 fine-tuned on Common Voice for 53-language phoneme recognition using eSpeak labels, producing phoneme sequences rather than word transcriptions. Useful for linguistic and phonetics applications requiring language-agnostic phoneme extraction. Apache-2.0 licensed.
335,043 ↓ · 49 ♡
pyannote/speaker-diarization-3.0 is the third major release of the popular pyannote.audio diarization pipeline, combining a speaker segmentation model with a speaker embedding model for 'who spoke when' labeling of audio recordings.
322,363 ↓ · 218 ♡
parakeet-tdt-0.6b-v3 is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
317,246 ↓ · 855 ♡
speakerkit-pro targets speaker diarization, labeling who spoke when in a recording, rather than speech-to-text transcription. Its output is a set of timestamped speaker turns that can be combined with a separate ASR model's transcript.
314,854 ↓ · 20 ♡
An MLX-format conversion of NVIDIA's Parakeet TDT-CTC 110M, an English ASR model built on the FastConformer architecture and trained by NVIDIA using the NeMo framework. The MLX conversion enables native Apple Silicon inference. Parakeet TDT-CTC uses a Token-and-Duration Transducer with CTC decoding, which provides fast greedy decoding without beam search overhead.
310,064 ↓ · 1 ♡
whisper-large-v3-turbo-german is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
305,152 ↓ · 57 ♡
wav2vec2-large-xlsr-53-icelandic-ep30-967h is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
305,109 ↓ · 3 ♡
mms-1b-all is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.
303,978 ↓ · 199 ♡
parakeet-tdt-0.6b-v3-coreml is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
302,950 ↓ · 42 ♡
wav2vec2-large-xls-r-300m-albanian-colab is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
300,894 ↓ · 1 ♡
canary-1b-flash is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
299,525 ↓ · 272 ♡
overlapped-speech-detection detects regions of audio where two or more speakers talk at the same time and outputs timestamped overlap segments. It does not produce transcriptions; it is typically used alongside diarization and ASR models.
294,754 ↓ · 56 ♡
T-one is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
293,020 ↓ · 90 ♡
wav2vec2-cv-be is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.
231,739 ↓ · 1 ♡