automatic speech recognition models

111 models · ranked by HuggingFace downloads

speaker-diarization-3.1

Pyannote speaker-diarization-3.1 is a complete speaker diarization pipeline from pyannote.audio that answers 'who spoke when' in an audio recording. It segments audio into speaker-homogeneous regions, clusters them by speaker identity using embedding models, and outputs timestamped speaker labels. Used in meeting transcription, podcast editing, and call center analytics.

8,496,857 ↓ · 2,401 ♡
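
The "who spoke when" output described for speaker-diarization-3.1 is a list of timestamped speaker labels. As a minimal sketch of how such output is often post-processed, the snippet below merges consecutive segments from the same speaker into longer turns. The `Segment` type, the `merge_turns` helper, and the sample data are hypothetical illustrations and are not part of pyannote.audio.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label such as "SPEAKER_00"


def merge_turns(segments, max_gap=0.5):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Extend the current turn instead of starting a new one.
            turns[-1] = Segment(last.start, max(last.end, seg.end), last.speaker)
        else:
            turns.append(Segment(seg.start, seg.end, seg.speaker))
    return turns


# Hypothetical diarization output for a short two-speaker recording.
segments = [
    Segment(0.0, 2.1, "SPEAKER_00"),
    Segment(2.3, 4.0, "SPEAKER_00"),
    Segment(4.2, 6.5, "SPEAKER_01"),
    Segment(9.0, 10.0, "SPEAKER_01"),
]
turns = merge_turns(segments)
for t in turns:
    print(f"{t.start:6.1f}s - {t.end:6.1f}s  {t.speaker}")
```

The `max_gap` threshold is a design choice: a larger value gives fewer, longer turns for transcripts, while a smaller value preserves pauses for editing tools.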

whisperkit-coreml

WhisperKit CoreML is a collection of Whisper speech recognition models exported to Apple's CoreML format by Argmax, enabling on-device ASR on Apple Silicon (iPhone, iPad, Mac) without network calls. The models run via the WhisperKit framework, which handles chunking, VAD, and decoding on-device. Designed for iOS/macOS applications requiring offline transcription.

8,387,494 ↓ · 193 ♡

whisper-large-v3-turbo

Whisper Large-v3-Turbo is a pruned and fine-tuned version of Whisper Large-v3: the decoder is cut from 32 layers to 4, which keeps most of the large model's transcription accuracy at substantially lower inference cost. It supports the same 99 languages as Large-v3, though quality can drop slightly on some of them. MIT licensed and directly compatible with HuggingFace's Whisper inference pipeline.

7,853,551 ↓ · 3,104 ♡

whisper-base

whisper-base is OpenAI's 74M-parameter multilingual Whisper checkpoint, one size above whisper-tiny. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and strong accents reduce performance.

6,352,714 ↓ · 271 ♡

whisper-large-v3

Whisper Large-v3 is OpenAI's full-size ASR model supporting 99 languages. It was trained on about 1 million hours of weakly labeled audio plus 4 million hours of audio pseudo-labeled with Whisper Large-v2; the 680,000-hour figure belongs to the original Whisper release. It delivers strong multilingual transcription accuracy at the cost of significant inference compute. Apache 2.0 licensed. The Large-v3-Turbo variant (a pruned, fine-tuned version with fewer decoder layers) provides similar quality at lower cost for most use cases.

5,977,766 ↓ · 5,846 ♡

wav2vec2-large-xlsr-53-japanese

wav2vec2-large-xlsr-53-japanese is a Japanese ASR model fine-tuned from wav2vec2-large-xlsr-53, a checkpoint pre-trained on speech in 53 languages. It expects 16 kHz audio and supports chunked inference for long recordings.

5,959,856 ↓ · 59 ♡
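
Chunked inference, mentioned for wav2vec2-large-xlsr-53-japanese, means splitting a long recording into overlapping windows, transcribing each window, and stitching the results. The sketch below only computes the window boundaries. The function name, the 30-second window, and the 5-second overlap are assumptions chosen for illustration, not values taken from the model card.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping chunks covering the signal."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reaches the end of the recording
        start += step
    return bounds


# 70 seconds of 16 kHz audio -> three overlapping windows.
bounds = chunk_bounds(70 * 16_000)
print(bounds)
```

The overlap gives the model acoustic context on both sides of each boundary, so words cut by a window edge can be recovered from the neighbouring chunk.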

wav2vec2-large-xlsr-53-polish

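wav2vec2-large-xlsr-53-polish is a Polish ASR model fine-tuned from wav2vec2-large-xlsr-53. Like other wav2vec2 models it is an encoder with a CTC head rather than an encoder-decoder: it processes raw 16 kHz waveforms and emits character sequences, from which approximate timestamps can be derived.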

3,949,805 ↓ · 12 ♡

wav2vec2-large-xlsr-53-russian

A Russian-language ASR model fine-tuned from Facebook's wav2vec2-large-xlsr-53 (cross-lingual pre-training on 53 languages) on the Russian portion of Mozilla Common Voice 6.0. Produces Russian text transcriptions from audio using CTC decoding. Community-contributed under Apache 2.0.

3,463,019 ↓ · 75 ♡
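
The CTC decoding mentioned for the Russian model can be illustrated with its simplest variant, greedy decoding: take the most likely token per frame, collapse consecutive repeats, then drop the blank token. The tiny vocabulary and frame sequence below are made up for illustration and do not reflect the model's actual tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then remove blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)


# Hypothetical vocabulary: id 0 is the CTC blank.
vocab = {0: "", 1: "в", 2: "а", 3: "н"}

# Per-frame argmax ids. The blank between the two runs of 3 keeps the double "н".
frames = [1, 1, 2, 0, 3, 3, 0, 3, 2, 2]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank token is what lets CTC emit genuinely repeated characters: without the blank between the two runs of `3`, they would collapse into a single letter.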

wav2vec2-large-xlsr-53-dutch

Wav2Vec2 XLSR-53 Large fine-tuned on Mozilla Common Voice 6 Dutch data for Dutch automatic speech recognition. Part of Jonatas Grosman's systematic XLSR fine-tuning series covering multiple languages. Apache-2.0 licensed with published evaluation results.

3,451,832 ↓ · 15 ♡

wav2vec2-indonesian-javanese-sundanese

wav2vec2-indonesian-javanese-sundanese is a wav2vec2 ASR model covering Indonesian, Javanese, and Sundanese. It converts spoken audio to written text and supports chunked inference for long recordings.

3,413,717 ↓ · 15 ♡

voice-activity-detection

A pretrained voice activity detection pipeline from pyannote.audio, identifying speech segments in audio streams. It is trained on AMI, DIHARD, and VoxConverse corpora and outputs timestamped speech/non-speech labels.

3,303,479 ↓ · 236 ♡
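
The timestamped speech/non-speech labels produced by a voice activity detection pipeline are typically derived from per-frame speech scores. The sketch below shows one simple way to turn such scores into segments with a fixed threshold. The 20 ms frame step, the 0.5 threshold, and the scores are illustrative assumptions, not pyannote.audio defaults.

```python
def frames_to_segments(speech_probs, frame_ms=20, threshold=0.5):
    """Convert per-frame speech probabilities into (start_ms, end_ms) speech segments."""
    segments = []
    start = None
    for i, p in enumerate(speech_probs):
        if p >= threshold and start is None:
            start = i                      # speech begins
        elif p < threshold and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None                   # speech ends
    if start is not None:
        # Speech continues to the end of the input.
        segments.append((start * frame_ms, len(speech_probs) * frame_ms))
    return segments


# Hypothetical frame-level scores from a VAD model.
probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.9]
speech = frames_to_segments(probs)
print(speech)
```

Production pipelines usually add hysteresis (separate onset and offset thresholds) and minimum-duration filters on top of this basic thresholding.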

mms-300m-1130-forced-aligner

MMS-300M-1130-forced-aligner is a 300M-parameter wav2vec2-based forced-alignment model derived from Meta's MMS (Massively Multilingual Speech) project, covering 1,130 languages. It takes audio and a text transcript as input and outputs word- or phoneme-level timestamps, enabling subtitle synchronization and linguistic documentation at scale. The CC-BY-NC-4.0 license restricts commercial deployment.

3,265,689 ↓ · 91 ♡
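
For the subtitle synchronization use case mentioned for the forced aligner, word-level timestamps can be grouped into SubRip (SRT) cues. This is a minimal sketch: the `(text, start, end)` tuple layout and the helper names are assumptions for illustration, not the aligner's actual output format.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (text, start_s, end_s) word tuples into numbered SRT cues."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w[0] for w in group)
        cues.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)


# Hypothetical aligned words: (text, start seconds, end seconds).
words = [("hello", 0.32, 0.61), ("world", 0.70, 1.15), ("again", 3661.5, 3662.0)]
srt = words_to_srt(words, max_words=2)
print(srt)
```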

wav2vec2-large-xlsr-53-portuguese

wav2vec2-large-xlsr-53-portuguese is an XLSR-53 model fine-tuned on Portuguese Common Voice data for automatic speech recognition, using CTC decoding on 16 kHz mono audio. It achieves competitive word error rates on both European and Brazilian Portuguese test sets. Part of the community XLSR fine-tuning effort that grew out of Hugging Face's 2021 speech fine-tuning event.

3,191,379 ↓ · 54 ♡
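
Word error rate, the metric cited for the Portuguese model, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain dynamic-programming version without text normalization; the example sentences are invented and are not taken from any test set.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("está" -> "esta") and one deletion ("na") over 5 reference words.
wer = word_error_rate("o gato está na mesa", "o gato esta mesa")
print(wer)
```

Published WER figures usually apply normalization (lowercasing, punctuation removal) before scoring, so raw numbers from a sketch like this are not directly comparable.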

speaker-diarization-community-1

A community-supported speaker diarization pipeline from pyannote.audio that segments multi-speaker audio into per-speaker turns. It combines voice activity detection, speaker embedding, and clustering steps into a single callable pipeline.

3,156,125 ↓ · 587 ♡

wav2vec2-large-xlsr-53-greek

wav2vec2-large-xlsr-53 (XLSR-53, pre-trained on 53 languages) fine-tuned for Greek ASR by Jonatas Grosman, as part of an extensive series of language-specific ASR models. It offers a practical open Greek speech recognition model built on a strong multilingual backbone.

3,009,670 ↓ · 4 ♡

wav2vec2-large-xlsr-53-arabic

wav2vec2-large-xlsr-53-arabic is an Arabic ASR model fine-tuned from wav2vec2-large-xlsr-53, a checkpoint pre-trained on speech in 53 languages. It expects 16 kHz audio and supports chunked inference for long recordings.

2,787,941 ↓ · 54 ♡

whisper-small

Whisper-small is OpenAI's 244M-parameter multilingual speech recognition model, covering 99 languages with reasonable accuracy. It balances quality and inference speed, performing significantly better than tiny/base while running on modest hardware.

2,744,535 ↓ · 569 ♡

wav2vec2-large-xlsr-53-hungarian

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Hungarian speech recognition, trained on Mozilla Common Voice. The model converts Hungarian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

2,574,739 ↓ · 10 ♡
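
Many entries on this page, including the Hungarian model above, are described as compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. The sketch below shows what such a call typically looks like. The `transcribe` wrapper is a hypothetical helper, `model_id` must be the full `owner/name` repository ID (this listing shows only the name), and running it requires the `transformers` package plus an audio backend such as ffmpeg.

```python
def transcribe(audio_path, model_id, chunk_length_s=30):
    """Transcribe one audio file with the transformers ASR pipeline."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,                  # full repository ID, e.g. "<owner>/<model-name>"
        chunk_length_s=chunk_length_s,   # enables chunked inference on long files
    )
    result = asr(audio_path)
    return result["text"]
```

A call would look like `transcribe("recording.wav", "<owner>/wav2vec2-large-xlsr-53-hungarian")`, with the owner filled in from the model's registry page.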

wav2vec2-large-xlsr-53-telugu

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Telugu speech recognition, trained on OpenSLR data. The model converts Telugu audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

2,202,189 ↓ · 5 ♡

romanian-wav2vec2

A wav2vec2 model fine-tuned for Romanian speech recognition, trained on Mozilla Common Voice. The model converts Romanian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

2,151,707 ↓ · 7 ♡

wav2vec2-large-voxrex-swedish

A wav2vec2 model fine-tuned for Swedish speech recognition, trained on Mozilla Common Voice and NST. The model converts Swedish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from the XLSR Fine-Tuning Week or a similar community effort; the listing does not confirm its origin.

2,028,894 ↓ · 13 ♡

wav2vec2-large-xlsr-53-persian

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Persian speech recognition, trained on Mozilla Common Voice. The model converts Persian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

1,971,447 ↓ · 26 ♡

whisper-tiny

whisper-tiny is the smallest OpenAI Whisper checkpoint (39M parameters). It uses an encoder-decoder Transformer: 16 kHz audio is converted to a log-Mel spectrogram, encoded, and decoded into text with optional timestamps.

1,852,074 ↓ · 433 ♡

filipino-wav2vec2-l-xls-r-300m-official

A wav2vec2 300M model fine-tuned for Filipino (Tagalog) ASR using the XLS-R multilingual pretrained backbone. One of the few open Filipino speech recognition models available.

1,782,593 ↓ · 2 ♡

Wav2Vec2-large-xlsr-hindi

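Wav2Vec2-large-xlsr-hindi is a Hindi ASR model fine-tuned from a large XLSR wav2vec2 checkpoint. It is an encoder with a CTC head rather than an encoder-decoder: it processes raw 16 kHz waveforms and emits character sequences, from which approximate timestamps can be derived.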

1,772,746 ↓ · 12 ♡

wav2vec2-large-xls-r-300m-Urdu

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Urdu speech recognition, trained on Mozilla Common Voice. The model converts Urdu audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

1,769,429 ↓ · 13 ♡

vakyansh-wav2vec2-tamil-tam-250

A wav2vec2 model fine-tuned for Tamil speech recognition, trained on speech corpora that the listing does not name. The model converts Tamil audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from the XLSR Fine-Tuning Week or a similar community effort; the listing does not confirm its origin.

1,739,258 ↓ · 4 ♡

wav2vec2-large-xlsr-53-th

wav2vec2-large-xlsr-53-th is a Thai ASR model (`th` is the language code for Thai) fine-tuned from wav2vec2-large-xlsr-53, a checkpoint pre-trained on speech in 53 languages. It expects 16 kHz audio and supports chunked inference for long recordings.

1,690,093 ↓ · 28 ♡

Qwen3-ASR-1.7B

Qwen3-ASR 1.7B is Alibaba's 1.7B parameter automatic speech recognition model supporting multiple languages. It is designed as a production-grade ASR model with strong multilingual performance at a compact size.

1,655,144 ↓ · 889 ♡

wav2vec2-xls-r-300m-cs-250

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Czech speech recognition, trained on Mozilla Common Voice. The model converts Czech audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

1,495,070 ↓ · 3 ♡

Voxtral-Mini-4B-Realtime-2602

Voxtral-Mini-4B-Realtime-2602 is an ASR model that accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and accents reduce performance.

1,491,258 ↓ · 886 ♡

faster-whisper-base

faster-whisper-base is a CTranslate2 conversion of OpenAI's whisper-base for use with the faster-whisper library. It keeps Whisper's encoder-decoder architecture, taking 16 kHz audio and outputting text with optional timestamps.

1,429,313 ↓ · 30 ♡

wav2vec2-large-xlsr-53-finnish

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Finnish speech recognition, trained on Mozilla Common Voice. The model converts Finnish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

1,426,086 ↓ · 1 ♡

wav2vec2-large-xlsr-53-chinese-zh-cn

wav2vec2-large-xlsr-53-chinese-zh-cn is a Mandarin Chinese (zh-CN) ASR model fine-tuned from wav2vec2-large-xlsr-53. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies with audio quality; background noise and strong accents reduce performance.

1,419,807 ↓ · 134 ♡

wav2vec2-xls-r-300m-hebrew

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Hebrew speech recognition, trained on speech corpora that the listing does not name. The model converts Hebrew audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

1,398,070 ↓ · 6 ♡

wav2vec2-xls-r-300m-mixed

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for speech recognition. The name suggests training on a mixed dataset, but the listing names neither the target language nor the training corpora. The model converts audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

1,347,905 ↓ · 5 ♡

wav2vec2-base-vi-vlsp2020

A wav2vec2 base model fine-tuned for Vietnamese speech recognition, trained on VLSP 2020 data. The model converts Vietnamese audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from the XLSR Fine-Tuning Week or a similar community effort; the listing does not confirm its origin.

1,342,328 ↓ · 2 ♡

wav2vec2-base-960h

wav2vec2-base-960h is Facebook's base wav2vec2 model, pre-trained and fine-tuned on 960 hours of LibriSpeech. It is an English-only CTC model for 16 kHz audio, not a multilingual one, and supports chunked inference for long recordings.

1,260,737 ↓ · 398 ♡

parakeet-tdt-0.6b-v3

parakeet-tdt-0.6b-v3 is a 0.6B-parameter Parakeet model that transcribes audio to text. Rather than a Whisper-style encoder-decoder, Parakeet-TDT pairs a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder, which also yields timestamps.

1,256,195 ↓ · 46 ♡

nb-wav2vec2-1b-bokmaal-v2

A 1B-parameter wav2vec2 model fine-tuned for Norwegian Bokmål speech recognition, trained on speech corpora that the listing does not name. The model converts Norwegian Bokmål audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. Its name places it in the same NB-Wav2Vec2 family as the Nynorsk model listed on this page.

1,236,478 ↓ · 0 ♡

wav2vec2-lv-60-espeak-cv-ft

Wav2Vec2 Large pre-trained on LV-60 (the 60,000-hour Libri-Light English corpus) and fine-tuned on multilingual Common Voice data for phoneme recognition using eSpeak phoneme labels. Produces phoneme-level output rather than word transcription, making it useful for phonetics research and pronunciation assessment rather than standard ASR.

1,215,591 ↓ · 69 ♡

w2v-xls-r-uk

w2v-xls-r-uk is, judging by its name, a Ukrainian ASR model built on an XLS-R wav2vec2 backbone. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies with audio quality; background noise and strong accents reduce performance.

1,183,683 ↓ · 8 ♡

wav2vec2-xls-r-300m-bengali

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Bengali speech recognition, trained on OpenSLR data. The model converts Bengali audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

1,152,790 ↓ · 10 ♡

faster-whisper-large-v3

faster-whisper-large-v3 is a CTranslate2 conversion of OpenAI's Whisper Large-v3 for use with the faster-whisper library. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and strong accents reduce performance.

1,152,731 ↓ · 602 ♡

wav2vec2-xls-r-300m-ftspeech

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Danish speech recognition, trained on FTSpeech. The model converts Danish audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from a community fine-tuning event; the listing does not confirm its origin.

1,108,396 ↓ · 0 ♡

faster-whisper-small

faster-whisper-small is a CTranslate2 conversion of OpenAI's whisper-small for use with the faster-whisper library. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and strong accents reduce performance.

1,063,325 ↓ · 34 ♡

wav2vec2-xlsr-nepali

A wav2vec2 model fine-tuned for Nepali speech recognition, trained on OpenSLR and Mozilla Common Voice data. The model converts Nepali audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

1,062,079 ↓ · 8 ♡

wav2vec2-xls-r-parlaspeech-hr

A wav2vec2 XLS-R model fine-tuned for Croatian speech recognition, trained on ParlaSpeech. The model converts Croatian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from a community fine-tuning event; the listing does not confirm its origin.

1,029,793 ↓ · 3 ♡

faster-whisper-tiny.en

faster-whisper-tiny.en is a CTranslate2 conversion of OpenAI's English-only whisper-tiny.en for use with the faster-whisper library. It accepts 16 kHz audio and outputs English transcriptions. Accuracy varies with audio quality; background noise and strong accents reduce performance.

1,021,151 ↓ · 10 ♡

nb-wav2vec2-1b-nynorsk

NB-Wav2Vec2 1B for Nynorsk is the Norwegian National Library's 1B-parameter wav2vec2 model fine-tuned for automatic speech recognition in Nynorsk (New Norwegian). One of very few dedicated Nynorsk ASR models publicly available.

997,937 ↓ · 0 ♡

wav2vec2-large-xlsr-malayalam

A wav2vec2 model fine-tuned for Malayalam speech recognition, trained on speech corpora that the listing does not name. The model converts Malayalam audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

987,635 ↓ · 7 ♡

parakeet-ctc-1.1b

parakeet-ctc-1.1b is a 1.1B-parameter model from NVIDIA's Parakeet family that uses CTC decoding. It accepts 16 kHz audio and outputs English transcriptions. Accuracy varies with audio quality; background noise and strong accents reduce performance.

952,031 ↓ · 50 ♡

wav2vec2-large-xlsr-mvc-swahili

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Swahili speech recognition, trained on Mozilla Common Voice. The model converts Swahili audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

879,998 ↓ · 3 ♡

wav2vec2-large-xlsr-catala

A wav2vec2 model fine-tuned for Catalan speech recognition, trained on Mozilla Common Voice. The model converts Catalan audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

873,669 ↓ · 1 ♡

Qwen3-ASR-0.6B

Qwen3-ASR-0.6B is the 0.6B-parameter sibling of Qwen3-ASR-1.7B in Alibaba's Qwen3-ASR family. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies by language and audio quality; background noise and strong accents reduce performance.

866,077 ↓ · 305 ♡

speaker-diarization

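speaker-diarization is a speaker diarization pipeline rather than a transcription model: it takes an audio recording and outputs timestamped speaker labels ("who spoke when"), not text. It appears to be the earlier, unversioned release of the speaker-diarization-3.0 and 3.1 pipelines listed on this page.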

844,292 ↓ · 1,286 ♡

parakeet-tdt-0.6b-v2

MLX-format conversion of NVIDIA's Parakeet-TDT 0.6B ASR model, optimized for on-device inference on Apple Silicon. Parakeet-TDT is a FastConformer-based model trained on 64k hours of English audio and achieves competitive WER on LibriSpeech.

806,822 ↓ · 43 ♡

distil-large-v3

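distil-large-v3 is a distilled version of Whisper Large-v3 from the Distil-Whisper project, targeting English transcription. It keeps Whisper's encoder-decoder architecture but shrinks the decoder to two layers for faster inference, and outputs text with optional timestamps.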

798,261 ↓ · 376 ♡

wav2vec2-large-xlsr-53-estonian

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Estonian speech recognition, trained on Mozilla Common Voice. The model converts Estonian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

749,481 ↓ · 1 ♡

VibeVoice-ASR

VibeVoice-ASR transcribes audio to text using an encoder-decoder architecture. It processes raw audio waveforms and outputs word sequences with optional timestamps.

744,782 ↓ · 1,184 ♡

cohere-transcribe-03-2026

cohere-transcribe-03-2026 is an automatic-speech-recognition model available on HuggingFace; its name points to a Cohere transcription model dated March 2026 rather than a wav2vec2 fine-tune. Architecture, supported languages, and training data are not given in this listing. Details are sourced from the public model registry.

725,726 ↓ · 1,008 ♡

wav2vec2-xls-r-300m-sk-cv8

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Slovak speech recognition, trained on Mozilla Common Voice. The model converts Slovak audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

724,128 ↓ · 0 ♡

wav2vec2-large-xlsr-korean

wav2vec2-large-xlsr-korean is a Korean ASR model fine-tuned from a large XLSR wav2vec2 checkpoint. It accepts 16 kHz audio and outputs transcriptions. Accuracy varies with audio quality; background noise and strong accents reduce performance.

721,969 ↓ · 56 ♡

wav2vec2-large-xlsr-lithuanian

A wav2vec2 model fine-tuned for Lithuanian speech recognition, trained on Mozilla Common Voice. The model converts Lithuanian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

719,625 ↓ · 2 ♡

wav2vec2-large-xlsr-53-basque

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Basque speech recognition, trained on Mozilla Common Voice. The model converts Basque audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

716,054 ↓ · 1 ♡

vakyansh-wav2vec2-sanskrit-sam-60

A wav2vec2 model fine-tuned for Sanskrit speech recognition, trained on speech corpora that the listing does not name. The model converts Sanskrit audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It may have come from the XLSR Fine-Tuning Week or a similar community effort; the listing does not confirm its origin.

707,813 ↓ · 4 ♡

faster-whisper-tiny

faster-whisper-tiny is a CTranslate2 conversion of OpenAI's multilingual whisper-tiny for use with the faster-whisper library. It converts spoken audio to written text and handles long recordings by processing them in windows.

652,853 ↓ · 23 ♡

wav2vec2-large-xls-r-300m-welsh

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Welsh speech recognition, trained on Mozilla Common Voice. The model converts Welsh audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

640,938 ↓ · 0 ♡

wav2vec2-xlsr-khmer

A wav2vec2 model fine-tuned for Khmer speech recognition, trained on OpenSLR and Mozilla Common Voice data. The model converts Khmer audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

614,229 ↓ · 2 ♡

wav2vec2-xls-r-300m-cv7-turkish

Wav2Vec2 XLS-R 300M fine-tuned on Mozilla Common Voice 7 Turkish data for Turkish automatic speech recognition. XLS-R is Meta's cross-lingual speech representation model; this checkpoint adapts it to Turkish via CTC fine-tuning. CC-BY-4.0 licensed.

591,324 ↓ · 15 ♡

hubert-large-ls960-ft

HuBERT-Large fine-tuned on LibriSpeech 960h for English automatic speech recognition. HuBERT uses offline clustering of audio features as pseudo-labels during pretraining, achieving strong ASR quality. Apache-2.0 licensed, it's a foundational ASR model from Meta.

569,271 ↓ · 76 ♡

wav2vec2-large-xlsr-53-punjabi

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Punjabi speech recognition, trained on Mozilla Common Voice. The model converts Punjabi audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

568,244 ↓ · 4 ♡

wav2vec2-large-xls-r-300m-bg-d2

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Bulgarian speech recognition, trained on Mozilla Common Voice. The model converts Bulgarian audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

563,260 ↓ · 1 ♡

granite-speech-4.1-2b

Granite Speech 4.1 2B is IBM's compact speech-language model combining an ASR encoder with a 2B language model decoder. It handles transcription and speech-grounded question answering within a single architecture, targeting enterprise speech analytics use cases.

533,312 ↓ · 143 ♡

wav2vec2-large-xls-r-300m-sinhala-low-LR-part1

A fine-tune of Facebook's wav2vec2-xls-r-300m cross-lingual backbone for Sinhala speech recognition, trained on speech corpora that the listing does not name. The model converts Sinhala audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from a community fine-tuning event, such as the XLSR Fine-Tuning Week or a later one, targeting languages underrepresented in commercial ASR offerings.

526,301 ↓ · 0 ♡

whisper-bemba-stt

A Whisper-based automatic speech recognition model fine-tuned for Bemba, a Bantu language spoken primarily in Zambia. The fine-tune adapts OpenAI's Whisper architecture to Bemba phonology and vocabulary, a language with very limited prior ASR coverage. Evaluation data and training details are sparse, so users should benchmark on their own domain audio before production use.

524,421 ↓ · 0 ♡

wav2vec2-large-xlsr-marathi

A fine-tune of Facebook's wav2vec2-large-xlsr-53 (300M parameters) for Marathi speech recognition, trained on OpenSLR data. The model converts Marathi audio to text and is compatible with the Hugging Face `pipeline('automatic-speech-recognition')` API. It appears to come from the XLSR Fine-Tuning Week or a similar community event targeting languages underrepresented in commercial ASR offerings.

522,512 ↓ · 2 ♡

granite-speech-3.3-2b

Granite Speech 3.3-2B is IBM's 2B ASR model supporting 5 languages (English, French, German, Spanish, Portuguese), using a Granite encoder-decoder architecture. It is positioned for multilingual transcription in enterprise settings. Apache-2.0 licensed, with published evaluation results.

507,326 ↓ · 55 ♡

wav2vec2-large-xlsr-53-slovenian

wav2vec2-large-xlsr-53-slovenian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

500,396 ↓ · 0 ♡

Phi-4-multimodal-instruct

Phi-4-Multimodal-Instruct is Microsoft's compact multimodal model handling text, audio, images, and video in a single instruction-tuned model. Based on Phi-4-Mini, it covers 23 languages and supports speech recognition, speech translation, and visual QA. MIT-licensed, so commercial use is permitted.

492,814 ↓ · 1,604 ♡

wav2vec2-large-xlsr-kn

wav2vec2-large-xlsr-kn is an open-source automatic-speech-recognition model available on HuggingFace; the `kn` suffix is the language code for Kannada. Details are sourced from the public model registry.

485,697 ↓ · 1 ♡

reverb-diarization-v1

reverb-diarization-v1 is a speaker diarization model rather than a transcription model: it takes an audio recording and outputs timestamped speaker labels, not text. Its results depend on audio quality; background noise and overlapping speech reduce performance.

471,481 ↓ · 13 ♡

whisper-medium

whisper-medium is OpenAI's 769M-parameter multilingual Whisper checkpoint. It uses an encoder-decoder Transformer: 16 kHz audio is converted to a log-Mel spectrogram, encoded, and decoded into text with optional timestamps.

469,757 ↓ · 284 ♡

seamless-m4t-v2-large

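seamless-m4t-v2-large is Meta's SeamlessM4T v2 model, a single multilingual system that handles speech recognition as well as speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation. It is released under the non-commercial CC-BY-NC-4.0 license. Details are sourced from the public model registry.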

458,890 ↓ · 989 ♡

wav2vec2-large-xlsr-53-german

wav2vec2-large-xlsr-53-german is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

457,416 ↓ · 8 ♡

Qwen3-ForcedAligner-0.6B

Qwen3-ForcedAligner-0.6B is a forced alignment model from the Qwen3 ASR family, designed to align audio segments to text transcripts at the phoneme or word level. At 0.6B parameters it's compact for deployment in audio processing pipelines. Apache-2.0 licensed.

449,032 ↓ · 144 ♡

faster-whisper-medium

Faster-Whisper is SYSTRAN's CTranslate2-optimized conversion of OpenAI Whisper, reported to run up to 4× faster than the reference implementation while using less memory. The medium variant (769M) balances multilingual ASR accuracy with throughput.

446,802 ↓ · 52 ♡

wav2vec2-BERT-cantonese

wav2vec2-BERT-cantonese is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

442,320 ↓ · 6 ♡

parakeet-tdt-0.6b-v2

parakeet-tdt-0.6b-v2 is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

417,980 ↓ · 1,500 ♡

wav2vec2-large-xlsr-kazakh

wav2vec2-large-xlsr-kazakh is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

401,543 ↓ · 19 ♡

wav2vec2-large-xlsr-galician

wav2vec2-large-xlsr-galician is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

397,246 ↓ · 2 ♡

wav2vec2-xls-r-300m-pashto

wav2vec2-xls-r-300m-pashto is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

376,615 ↓ · 0 ♡

wav2vec2-large-xlsr-latvian-cv

wav2vec2-large-xlsr-latvian-cv is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

368,939 ↓ · 3 ♡

parakeetkit-pro

Parakeetkit-Pro is Argmax's optimised packaging of NVIDIA's Parakeet ASR model in CoreML format for Apple Silicon, distributed via the WhisperKit framework. It delivers high-accuracy English transcription on-device with Metal acceleration, positioning itself as a pro-tier local ASR option for macOS applications. The Parakeet architecture is a FastConformer model from NVIDIA trained on 64k+ hours of English speech.

359,906 ↓ · 4 ♡

wav2vec2-large-xlsr-georgian

wav2vec2-large-xlsr-georgian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

351,678 ↓ · 1 ♡

wav2vec2-large-xls-r-300m-armenian

wav2vec2-large-xls-r-300m-armenian is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

351,066 ↓ · 0 ♡

wav2vec2-large-xlsr-gu

wav2vec2-large-xlsr-gu is an open-source automatic-speech-recognition model available on HuggingFace; the `gu` suffix is the language code for Gujarati. Details are sourced from the public model registry.

347,787 ↓ · 0 ♡

wav2vec2-xlsr-53-espeak-cv-ft

Wav2Vec2 XLSR-53 (pre-trained on 53 languages) fine-tuned on Common Voice for multilingual phoneme recognition using eSpeak labels, producing phoneme sequences rather than word transcriptions. Useful for linguistic and phonetics applications requiring language-agnostic phoneme extraction. Apache-2.0 licensed.

335,043 ↓ · 49 ♡

speaker-diarization-3.0

pyannote/speaker-diarization-3.0 is the third major release of the popular pyannote audio diarization pipeline, combining a speaker segmentation model with a speaker embedding model for 'who spoke when' labeling of audio recordings.

322,363 ↓ · 218 ♡

parakeet-tdt-0.6b-v3

parakeet-tdt-0.6b-v3 is an open-source automatic-speech-recognition model available on HuggingFace. Details are sourced from the public model registry.

317,246 ↓ · 855 ♡

speakerkit-pro

speakerkit-pro is, judging by its name and its sibling parakeetkit-pro on this page, a package for on-device speaker tasks such as diarization rather than a transcription model. The listing gives no further details on its architecture or training data.

314,854 ↓ · 20 ♡

parakeet-tdt_ctc-110m

An MLX-format conversion of NVIDIA's Parakeet TDT-CTC 110M, an English ASR model built on the FastConformer architecture and trained by NVIDIA using the NeMo framework. The MLX conversion enables native Apple Silicon inference. Parakeet TDT-CTC is a hybrid model with both a Token-and-Duration Transducer (TDT) decoder and a CTC decoder, either of which supports fast greedy decoding without beam search overhead.

310,064 ↓ · 1 ♡