Any-to-any models

22 models · ranked by Hugging Face downloads

gemma-4-E4B-it

Gemma 4 E4B-it is Google DeepMind's edge-optimized, 4-billion-parameter any-to-any multimodal model from the Gemma 4 family, designed for deployment on mobile and edge devices rather than servers. The 'any-to-any' pipeline_tag indicates multimodal input and output capability beyond standard image-text-to-text. Apache-2.0 licensed.

6,138,750 ↓ · 1,269 ♡

gemma-4-E2B-it

Gemma 4 E2B is Google's efficient 2B-parameter multimodal model, instruction-tuned for both image-text and text-only prompts. It targets edge and on-device deployment where a sub-3B footprint is necessary.

2,390,353 ↓ · 767 ♡

Qwen3-Omni-30B-A3B-Instruct

Qwen3-Omni-30B-A3B-Instruct handles multiple input and output modalities including text, images, and audio within a single unified architecture.

2,020,526 ↓ · 943 ♡

gemma-4-12B-it

gemma-4-12B-it is Google's Gemma 4 multimodal (text + image) instruction-tuned model. It accepts both text and image inputs and produces text, making it suitable for document analysis, visual Q&A, and structured data extraction. Released under Apache-2.0, it targets users who need a capable VLM without access restrictions.

1,696,240 ↓ · 1,108 ♡

Qwen2.5-Omni-3B

Qwen2.5-Omni-3B handles multiple input and output modalities including text, images, and audio within a single unified architecture.

1,667,766 ↓ · 336 ♡

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 processes and generates across multiple modalities, enabling cross-modal reasoning in a single model call.

1,369,439 ↓ · 143 ♡

gemma-4-12B-it-qat-w4a16-ct

gemma-4-12B-it-qat-w4a16-ct is a quantization-aware-trained (QAT) version of Google's Gemma 4 multimodal (text + image) instruction-tuned model, with weights prepared for W4A16 deployment (4-bit weights, 16-bit activations). Its 12B parameters are stored at lower precision for deployment on memory-constrained hardware or Apple Silicon, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,270,771 ↓ · 29 ♡

gemma-4-E4B-it-MLX-4bit

A 4-bit MLX quantization of Google's Gemma 4 E4B instruct model (an efficient 4B-equivalent MoE variant) for Apple Silicon. Targets developers who want Gemma 4 running locally on MacBook-class hardware.

1,083,883 ↓ · 12 ♡

gemma-4-E4B-it-MLX-8bit

An 8-bit MLX quantization of Google's Gemma 4 E4B instruct model for Apple Silicon. Higher quality than the 4-bit variant at the cost of roughly double the memory, targeting M2/M3 Pro or Max class machines.

1,056,151 ↓ · 7 ♡

gemma-4-E4B-it-MLX-6bit

gemma-4-E4B-it-MLX-6bit is an MLX 6-bit quantized version of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model, optimized for Apple Silicon inference. Its parameters are stored at lower precision for deployment on memory-constrained hardware, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,039,621 ↓ · 3 ♡

gemma-4-E4B-it-MLX-5bit

gemma-4-E4B-it-MLX-5bit is an MLX 5-bit quantized version of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model, optimized for Apple Silicon inference. Its parameters are stored at lower precision for deployment on memory-constrained hardware, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,038,809 ↓ · 0 ♡
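
The MLX variants listed above differ mainly in bits per weight, and the 8-bit entry describes itself as needing roughly double the memory of the 4-bit one. A minimal sketch of that arithmetic follows. It assumes a flat 4 billion stored parameters purely for illustration (not a published figure for these checkpoints) and counts weights only, ignoring quantization scales, activations, and the KV cache. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Ignores quantization metadata (scales, zero points), activations,
    and KV cache, so real usage will be higher.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Assumption for illustration only: 4e9 stored parameters.
ASSUMED_PARAMS = 4e9

if __name__ == "__main__":
    for bits in (4, 5, 6, 8, 16):
        gib = weight_memory_gib(ASSUMED_PARAMS, bits)
        print(f"{bits:>2}-bit weights: ~{gib:.2f} GiB")
```

Because the estimate is linear in bits per weight, the 8-bit figure is exactly twice the 4-bit figure under these assumptions; actual downloads may deviate because some layers are often kept at higher precision.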

Qwen2.5-Omni-7B

Qwen2.5-Omni-7B is a multimodal model accepting diverse input types and producing outputs across text, vision, and audio modalities.

722,512 ↓ · 1,910 ♡

gemma-4-E4B

gemma-4-E4B is a multimodal model accepting diverse input types and producing outputs across text, vision, and audio modalities.

596,652 ↓ · 323 ♡

gemma-4-31B-it-assistant

gemma-4-31B-it-assistant is an open-source any-to-any model available on Hugging Face. Details are sourced from the public model registry.

489,708 ↓ · 304 ♡

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Nemotron-3 Nano Omni is NVIDIA's multimodal reasoning model — 30B total parameters with 3B active per token — that extends the Nemotron-H architecture to support any-to-any input and output modalities including audio, image, and text. The Reasoning variant includes a thinking mode for extended chain-of-thought. It runs in BF16 full precision, targeting multi-GPU H100/H200 deployments.

445,501 ↓ · 357 ♡
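
The "30B-A3B" naming in the entry above pairs total parameters with parameters active per token. As a rough sketch of what those two numbers imply: the 30e9 and 3e9 values come from the description, BF16 is taken as 2 bytes per parameter, and the helper names are hypothetical. Weight memory is driven by the total count, while per-token compute tracks the active count.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of parameters used for each token in a sparse (MoE) model."""
    return active_params / total_params


def bf16_weight_gib(total_params: float) -> float:
    """Weights-only memory in GiB at BF16 (2 bytes per parameter)."""
    return total_params * 2 / 2**30


TOTAL_PARAMS = 30e9   # from the listing: 30B total
ACTIVE_PARAMS = 3e9   # from the listing: 3B active per token

if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active per token: {frac:.0%} of parameters")
    print(f"BF16 weights: ~{bf16_weight_gib(TOTAL_PARAMS):.1f} GiB")
```

All experts still have to be resident in memory, so the footprint follows the total parameter count even though only a fraction is used per token. The estimate excludes activations and KV cache.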

gemma-4-12B-it-qat-q4_0-gguf

gemma-4-12B-it-qat-q4_0-gguf is an open-source any-to-any model available on Hugging Face. Details are sourced from the public model registry.

441,974 ↓ · 178 ♡

OneThinker-SFT-Qwen3-8B

OneThinker-SFT is a Qwen3-8B model fine-tuned by OneThink with supervised fine-tuning (SFT) on a vision-language task mixture, using the Qwen3-VL architecture for any-to-any multimodal output. Apache-2.0 licensed.

431,837 ↓ · 4 ♡

MiniCPM-o-2_6

MiniCPM-o 2.6 is an omnimodal 8B model from OpenBMB supporting speech, image, and text inputs with real-time audio output. It targets on-device multimodal scenarios, particularly mobile and edge deployments, with end-to-end speech conversation capability.

424,139 ↓ · 1,292 ♡

gemma-4-E2B

Gemma 4 E2B is Google's 2B edge model from the Gemma 4 family, designed for on-device deployment with multimodal any-to-any capability. The 'E' prefix indicates edge-optimized: a smaller memory footprint and lower latency are prioritized over raw capability. It supports image and text input and output in a single model.

391,355 ↓ · 352 ♡
