audio text to text models

4 models · ranked by HuggingFace downloads

ultravox-v0_5-llama-3_2-1b

ultravox-v0_5-llama-3_2-1b is released without a specific pipeline. Common uses include feature extraction, encoder probing, and domain-specific fine-tuning.

1,090,906 ↓ · 85 ♡

Qwen2-Audio-7B-Instruct is Alibaba's multimodal model handling audio and text inputs, capable of audio analysis, speech-to-text transcription, and audio-grounded Q&A. It's instruction-tuned for dialog about audio content. Apache-2.0 licensed and compatible with the Transformers qwen2_audio model type.

719,063 ↓ · 540 ♡

VibeVoice-ASR-HF

VibeVoice-ASR is Microsoft's HuggingFace-packaged automatic speech recognition model, likely a Whisper-style or custom encoder-decoder ASR system targeting informal or conversational speech. The 'Vibe' branding suggests orientation toward natural conversational audio.

655,297 ↓ · 151 ♡

ultravox-v0_6-llama-3_1-8b

ultravox-v0_6-llama-3_1-8b is an open-source audio-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

508,866 ↓ · 6 ♡