audio classification models

13 models · ranked by HuggingFace downloads

clap-htsat-fused

LAION's CLAP (Contrastive Language-Audio Pretraining) model, pairing an HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder so that audio and text land in a shared embedding space. The "fused" variant adds feature fusion for handling variable-length audio input. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval from natural-language descriptions, without task-specific labeled audio data.

16,636,514 ↓ · 106 ♡
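
Zero-shot classification in a shared audio-text embedding space comes down to comparing one audio embedding against the embeddings of candidate text labels. The sketch below illustrates only that scoring step. The embeddings are hand-written placeholders, not real CLAP outputs, and the function names and the temperature value are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities.

    `temperature` is a hypothetical value, not a setting taken from CLAP.
    """
    labels = list(text_embs)
    logits = [cosine(audio_emb, text_embs[label]) / temperature for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Placeholder embeddings; a real model would produce these from audio and text.
audio_emb = [0.9, 0.1, 0.2]
text_embs = {
    "a dog barking": [1.0, 0.0, 0.1],
    "rain falling": [0.0, 1.0, 0.0],
    "a person speaking": [0.1, 0.2, 1.0],
}

scores = zero_shot_scores(audio_emb, text_embs)
best = max(scores, key=scores.get)
print(best)
```

Because the labels are free text, new classes can be added by writing a new description rather than retraining, which is what makes the approach zero-shot.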

wav2vec2-large-robust-24-ft-age-gender

A wav2vec 2.0 large-robust model fine-tuned, as the name indicates, to predict speaker age and gender from raw speech audio. The "24" appears to refer to the number of transformer layers in this variant.

979,819 ↓ · 54 ♡

wav2vec2-large-robust-12-ft-emotion-msp-dim

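A wav2vec 2.0 large-robust model fine-tuned for dimensional speech emotion recognition on MSP-Podcast (the "msp-dim" in the name). Rather than discrete class labels, it predicts continuous arousal, dominance, and valence scores. The "12" indicates a variant pruned to 12 transformer layers.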

724,853 ↓ · 169 ♡

wav2vec2-large-xlsr-53-gender-recognition-librispeech

A wav2vec2-large-xlsr-53 model fine-tuned for speaker gender recognition on LibriSpeech, as the name indicates. XLSR-53 is the multilingual wav2vec 2.0 checkpoint pretrained on speech in 53 languages.

478,319 ↓ · 47 ♡

MERT-v1-330M

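MERT-v1-330M is a self-supervised music understanding model with roughly 330M parameters. It is primarily a pretrained encoder: its representations are typically fed to a lightweight downstream head for tasks such as music tagging or genre classification, rather than used as a ready-made classifier.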

435,871 ↓ · 89 ♡

ast-finetuned-audioset-10-10-0.4593

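An Audio Spectrogram Transformer (AST) fine-tuned on AudioSet for sound event tagging: it takes a spectrogram of the clip and predicts AudioSet labels such as speech, music, or animal sounds. The trailing 0.4593 in the name appears to be the checkpoint's reported mean average precision (mAP) on AudioSet.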

431,468 ↓ · 359 ♡
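
AudioSet clips can carry several labels at once, so taggers trained on it are usually read out with an independent sigmoid per label rather than a single softmax. The sketch below shows that read-out step only. The logits are invented placeholder values, and `tag_clip` and its threshold are hypothetical names and settings, not part of any library.

```python
import math

def sigmoid(z):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tag_clip(logits, threshold=0.5):
    """Return (label, probability) pairs at or above `threshold`, highest first.

    `threshold` is an illustrative default; real systems tune it per label.
    """
    probs = {label: sigmoid(z) for label, z in logits.items()}
    kept = {label: p for label, p in probs.items() if p >= threshold}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder logits for one clip; a real model emits one logit per label.
logits = {"Speech": 2.1, "Music": 0.4, "Dog": -3.0, "Rain": -0.2}

tags = tag_clip(logits)
for label, prob in tags:
    print(f"{label}: {prob:.3f}")
```

Here both "Speech" and "Music" clear the threshold, which a softmax read-out would not express cleanly because it forces the labels to compete.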

emotion-recognition-wav2vec2-IEMOCAP

A wav2vec 2.0 model fine-tuned for speech emotion recognition on the IEMOCAP dataset. It takes raw speech audio and predicts a discrete emotion label for each utterance.

426,887 ↓ · 188 ♡

MuQ-large-msd-iter

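MuQ-large-msd-iter is a self-supervised music representation model from the MuQ family. The "msd" in the name suggests training on the Million Song Dataset. Like other pretrained music encoders, it is typically used as a feature extractor for downstream tasks rather than as a standalone classifier.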

399,750 ↓ · 24 ♡

WeSpeaker-ResNet34-LM-MLX

An MLX conversion of WeSpeaker's ResNet34 speaker embedding model for Apple Silicon. WeSpeaker-ResNet34 generates d-vector speaker embeddings used for speaker verification and diarization tasks.

344,789 ↓ · 2 ♡
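
Speaker verification with embeddings usually means comparing two fixed-length vectors and accepting the pair if they are similar enough. The sketch below illustrates that decision with cosine similarity. The vectors are short hand-written placeholders rather than real WeSpeaker outputs, and `same_speaker` and its threshold are hypothetical and would need tuning on held-out trials.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the pair as one speaker if similarity reaches `threshold`.

    The default threshold is an assumption for illustration only.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold

# Placeholder embeddings; real speaker embeddings have far more dimensions.
enrolled = [0.2, 0.8, 0.1, 0.5]
same_voice = [0.25, 0.75, 0.05, 0.55]
other_voice = [0.9, -0.1, 0.4, -0.3]

accept_same = same_speaker(enrolled, same_voice)
accept_other = same_speaker(enrolled, other_voice)
print(accept_same, accept_other)
```

Diarization builds on the same comparison: embeddings are extracted per speech segment and then clustered by similarity instead of being checked pairwise against one enrolled speaker.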

open-vakgyata

open-vakgyata is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.

333,799 ↓ · 3 ♡

hubert-large-speech-emotion-recognition-russian-dusha-finetuned

A HuBERT-large model fine-tuned for speech emotion recognition in Russian on the Dusha dataset, as the name indicates.

327,156 ↓ · 15 ♡

wav2vec-vm-finetune

wav2vec-vm-finetune maps audio waveforms to class labels. Trained on labeled audio datasets for tasks like language identification and speaker recognition.

322,931 ↓ · 12 ♡

music_genres_classification

music_genres_classification classifies music clips by genre, as the name indicates, assigning each input a genre label.

308,873 ↓ · 39 ♡