LAION's CLAP (Contrastive Language-Audio Pretraining) model pairs an HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder to align audio and text in a shared embedding space. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval using natural language descriptions without task-specific labeled audio data.
16,636,514 ↓ · 106 ♡
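The zero-shot classification described for CLAP comes down to comparing one audio embedding against the embeddings of candidate text prompts. Below is a minimal sketch in plain Python, assuming the embeddings have already been produced by the two encoders. The vectors, the prompt strings, and the `logit_scale` value are made-up placeholders for illustration, not CLAP outputs or CLAP's actual parameters.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(audio_emb, text_embs, labels, logit_scale=10.0):
    """Rank text prompts by cosine similarity to a single audio embedding.

    logit_scale is a hypothetical temperature; a trained model learns its own.
    """
    audio = l2_normalize(audio_emb)
    sims = []
    for text_emb in text_embs:
        text = l2_normalize(text_emb)
        sims.append(sum(a * t for a, t in zip(audio, text)))
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(labels, probs))

# Toy 3-dimensional embeddings (placeholders, not real encoder outputs).
labels = ["a dog barking", "rain falling", "a piano playing"]
text_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
audio_emb = [0.9, 0.2, 0.1]

scores = zero_shot_classify(audio_emb, text_embs, labels)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the labels are ordinary text prompts, the label set can be changed at inference time without retraining, which is what makes the approach zero-shot.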
wav2vec2-large-robust-24-ft-age-gender is a wav2vec 2.0 model (large-robust, 24 transformer layers) fine-tuned to estimate a speaker's age and gender directly from the raw speech waveform.
979,819 ↓ · 54 ♡
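wav2vec2-large-robust-12-ft-emotion-msp-dim is a wav2vec 2.0 model (large-robust, pruned to 12 transformer layers) fine-tuned on MSP-Podcast for dimensional speech emotion recognition. It takes a raw waveform and outputs continuous arousal, dominance, and valence scores rather than discrete class labels.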
724,853 ↓ · 169 ♡
wav2vec2-large-xlsr-53-gender-recognition-librispeech is a wav2vec 2.0 XLSR-53 model fine-tuned on LibriSpeech to classify speaker gender from speech. It is open source and available on Hugging Face.
478,319 ↓ · 47 ♡
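MERT-v1-330M is an open-source, 330M-parameter self-supervised music understanding model available on Hugging Face. It produces general-purpose music representations that downstream classifiers can use for tasks such as tagging and genre recognition.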
435,871 ↓ · 89 ♡
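ast-finetuned-audioset-10-10-0.4593 is an Audio Spectrogram Transformer (AST) fine-tuned on AudioSet. It classifies audio spectrograms into AudioSet sound-event categories such as speech, music, and environmental sounds; the name suffix records the patch strides (10, 10) and the reported mean average precision (0.4593).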
431,468 ↓ · 359 ♡
emotion-recognition-wav2vec2-IEMOCAP is a wav2vec 2.0 model fine-tuned on the IEMOCAP dataset to classify the emotion expressed in a speech utterance. It operates on the raw waveform and predicts a discrete emotion label.
426,887 ↓ · 188 ♡
MuQ-large-msd-iter is an open-source music representation model available on Hugging Face; the name suggests training on the Million Song Dataset (MSD). Details are sourced from the public model registry.
399,750 ↓ · 24 ♡
An MLX conversion of WeSpeaker's ResNet34 speaker embedding model for Apple Silicon. The model maps a speech segment to a fixed-dimensional speaker embedding used for speaker verification and diarization.
344,789 ↓ · 2 ♡
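Speaker verification with embeddings of this kind is commonly done by scoring two embeddings with cosine similarity and comparing the score against a threshold. The following is a minimal sketch under that assumption. The three-dimensional vectors and the `threshold` value are hypothetical placeholders, not WeSpeaker outputs or recommended settings; in practice the threshold is tuned on held-out trials.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy embeddings (placeholders, not real model outputs).
enrolled = [0.8, 0.1, 0.6]
trial_same = [0.7, 0.2, 0.65]
trial_other = [-0.3, 0.9, 0.1]

print(same_speaker(enrolled, trial_same))
print(same_speaker(enrolled, trial_other))
```

Diarization builds on the same similarity score by clustering per-segment embeddings instead of comparing a single pair.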
open-vakgyata is an open-source audio-classification model available on Hugging Face. Details are sourced from the public model registry.
333,799 ↓ · 3 ♡
hubert-large-speech-emotion-recognition-russian-dusha-finetuned is a HuBERT-large model fine-tuned on the Dusha dataset for Russian speech emotion recognition. It is open source and available on Hugging Face.
327,156 ↓ · 15 ♡
wav2vec-vm-finetune is a fine-tuned wav2vec model that maps audio waveforms to class labels. Models of this kind are trained on labeled audio datasets for tasks such as language identification and speaker recognition.
322,931 ↓ · 12 ♡
music_genres_classification classifies music clips by genre, predicting a genre label for each audio input.
308,873 ↓ · 39 ♡