LAION's CLAP (Contrastive Language-Audio Pretraining) model pairs an HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder to align audio and text in a shared embedding space. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval from natural-language descriptions, without task-specific labeled audio data.
18,153,697 ↓ · 82 ♡
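The zero-shot classification step described for CLAP can be sketched as follows. This is a minimal illustration, not the model's actual code: the embeddings are made-up 3-dimensional vectors standing in for real encoder outputs, and the names `l2_normalize`, `zero_shot_scores`, and the `temperature` parameter are hypothetical. It shows only the scoring idea: compare one audio embedding against the text embedding of each candidate label by cosine similarity, then turn the similarities into probabilities with a softmax.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(audio_emb, text_embs, temperature=1.0):
    """Return a probability per label from audio/text embedding similarity.

    audio_emb: embedding of one audio clip (assumed to come from an audio encoder).
    text_embs: dict mapping label text to its embedding (assumed to come from a text encoder).
    temperature: hypothetical scaling factor; smaller values sharpen the distribution.
    """
    audio = l2_normalize(audio_emb)
    logits = {}
    for label, emb in text_embs.items():
        text = l2_normalize(emb)
        cosine = sum(a * t for a, t in zip(audio, text))
        logits[label] = cosine / temperature

    # Numerically stable softmax over the label logits.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy embeddings, invented for illustration only.
audio_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a dog barking": [1.0, 0.0, 0.0],
    "rain falling": [0.0, 1.0, 0.0],
    "a piano playing": [0.0, 0.0, 1.0],
}

scores = zero_shot_scores(audio_embedding, label_embeddings, temperature=0.1)
best_label = max(scores, key=scores.get)
print(best_label)
```

Because the labels are ordinary text, new classes can be added by embedding a new description; no audio has to be labeled or retrained on, which is what makes the approach zero-shot.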
wav2vec2-large-robust-24-ft-age-gender is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
1,048,077 ↓ · 50 ♡
wav2vec2-large-robust-12-ft-emotion-msp-dim is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
1,025,791 ↓ · 159 ♡
wav2vec-vm-finetune is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
869,595 ↓ · 11 ♡
emotion-recognition-wav2vec2-IEMOCAP is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
607,362 ↓ · 184 ♡
music_genres_classification is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
577,056 ↓ · 38 ♡
ast-finetuned-audioset-10-10-0.4593 is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
574,720 ↓ · 353 ♡
WeSpeaker-ResNet34-LM-MLX is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
325,817 ↓ · 2 ♡
hubert-large-speech-emotion-recognition-russian-dusha-finetuned is an open-source audio-classification model available on HuggingFace. Details are sourced from the public model registry.
300,655 ↓ · 15 ♡