AI Tools.

Image-text-to-text models

199 models · ranked by HuggingFace downloads

gemma-4-26B-A4B-it

Gemma 4-26B-A4B-IT is Google DeepMind's instruction-tuned Mixture-of-Experts (MoE) vision-language model, with 26 billion total parameters and approximately 4 billion active per token. Because only ~4B parameters are activated per forward pass, per-token compute is much lower than for a dense 26B model, although all 26B weights must still be held in memory. Apache 2.0 licensed.

12,607,949 ↓ · 1,165 ♡

gemma-4-31B-it

Gemma 4-31B-IT is Google DeepMind's 31-billion-parameter instruction-tuned vision-language model from the Gemma 4 family, supporting both image and text inputs. It offers strong multimodal reasoning at open-weight scale, and its Apache 2.0 license permits commercial deployment. It uses the gemma4 architecture, which succeeds the earlier Gemma generations.

11,126,418 ↓ · 3,041 ♡

Qwen3.5-4B

Qwen3.5-4B is Alibaba Cloud's 4-billion-parameter instruction-tuned vision-language model from the Qwen3.5 series, fine-tuned from Qwen3.5-4B-Base for multimodal conversational tasks. It handles image and text inputs at a scale deployable on consumer GPUs with 8-12GB VRAM. Apache 2.0 licensed.

9,557,891 ↓ · 667 ♡

Qwen3.5-9B

Qwen3.5-9B is a 9-billion-parameter instruction-tuned vision-language model from Alibaba Cloud's Qwen3.5 series, fine-tuned from Qwen3.5-9B-Base for multimodal conversational tasks. It accepts image and text inputs for visual reasoning, document understanding, and grounded question answering. Apache 2.0 licensed.

9,463,589 ↓ · 1,586 ♡

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL-7B-Instruct is Alibaba Cloud's 7-billion-parameter vision-language model from the Qwen2.5-VL series, accepting image and video inputs alongside text for visual question answering, document understanding, and grounding tasks. It supports multiple image resolutions dynamically and shows improved OCR and document reasoning compared to the earlier Qwen-VL series. Apache 2.0 licensed.

8,081,428 ↓ · 1,588 ♡

Qwen3-VL-8B-Instruct

Qwen3-VL-8B-Instruct is Alibaba Cloud's 8-billion-parameter vision-language model from the Qwen3-VL series, extending the VL line with improved visual reasoning and document understanding. It targets mid-tier server GPU deployment where 2B VLMs are insufficient and 30B+ is impractical. Apache 2.0 licensed.

7,347,992 ↓ · 966 ♡

Qwen3.6-27B

Qwen3.6-27B is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

5,989,156 ↓ · 1,765 ♡

Qwen3.6-27B-FP8

FP8-quantized version of Qwen3.6-27B, a dense multimodal model, for H100/H200 serving. Weight memory drops from ~54GB (BF16) to approximately 27GB while maintaining near-BF16 quality on most benchmarks.

5,904,658 ↓ · 278 ♡
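
The ~54GB and ~27GB figures in the Qwen3.6-27B-FP8 entry follow from simple arithmetic on parameter count and bits per weight. The sketch below is an illustration only: the function name `weight_memory_gb` is hypothetical, the 27B count is taken from the model name, and the estimate covers weights alone (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count assumed from the "27B" in the model name.
PARAMS_27B = 27e9

for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS_27B, bits):.1f} GB")
```

Real deployments need additional headroom on top of these numbers, so they are best read as lower bounds.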

Qwen3.6-35B-A3B-FP8

FP8-quantized version of Qwen3.6-35B-A3B for deployment on hardware with FP8 support (H100/H200). Reduces memory footprint and inference latency compared to BF16 with minimal quality degradation on most benchmarks.

5,459,912 ↓ · 277 ♡

Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is Alibaba's 3B parameter vision-language model from the Qwen2.5-VL series, supporting image and video frame understanding alongside text instruction-following. It targets edge and mobile deployment where 7B+ VL models are too memory-intensive, while maintaining reasonable accuracy on OCR, chart reading, and visual QA. Instruction-tuned for conversational use.

5,336,318 ↓ · 660 ♡

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35B total parameters but only ~3B active per token, so per-token compute is closer to that of a small dense model while all 35B weights contribute capacity. It handles image and text inputs and is competitive with dense 14–20B models on standard benchmarks.

5,058,494 ↓ · 2,189 ♡

gemma-4-26B-A4B-it-AWQ-4bit

An AWQ 4-bit quantized version of Gemma 4's 26B MoE model (4B active parameters), reducing the memory footprint for local deployment on consumer hardware. Community-produced quantization targeting vLLM and other AWQ-capable runtimes; llama.cpp uses the separate GGUF format.

4,549,055 ↓ · 81 ♡

Qwen3-VL-32B-Instruct

Qwen3-VL-32B-Instruct is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

4,391,541 ↓ · 206 ♡

Qwen3-VL-4B-Instruct

Qwen3-VL 4B is Alibaba's compact vision-language instruction model supporting image and video understanding at 4B scale. It targets use cases where Qwen2-VL-7B quality is acceptable but deployment must fit tighter memory constraints.

4,288,240 ↓ · 398 ♡

Qwen2-VL-2B-Instruct

Qwen2-VL-2B-Instruct is a 2B parameter vision-language model from Alibaba's Qwen team, supporting image and video understanding alongside text instruction-following. At 2B parameters it runs on consumer GPUs while retaining competitive OCR, chart reading, and visual QA accuracy. It is the instruction-tuned version of the Qwen2-VL-2B base.

3,879,947 ↓ · 510 ♡

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

3,812,636 ↓ · 2,050 ♡

llava-1.5-7b-hf

LLaVA 1.5 7B connects a CLIP ViT-L/14@336 vision encoder to Vicuna 7B via a simple MLP projection. It was a state-of-the-art open multimodal model at release and remains widely used as a baseline for vision-language research.

3,220,104 ↓ · 366 ♡
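
To make the "simple MLP projection" in the LLaVA 1.5 entry concrete, the sketch below counts the image patches produced by a ViT-L/14 encoder at 336×336 and the parameters in a two-layer MLP projector. The function names are hypothetical, and the hidden sizes (1024 for the vision encoder, 4096 for the 7B language model) and the two-layer depth are assumptions based on the commonly published configurations, not figures stated in this listing.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def mlp_projector_params(vision_dim: int, text_dim: int, num_layers: int = 2) -> int:
    """Weights plus biases for a stack of linear layers mapping vision_dim to text_dim."""
    dims = [vision_dim] + [text_dim] * num_layers
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


# 336 / 14 = 24 patches per side, so 24 * 24 image tokens reach the language model.
print(num_patches(336, 14))
# Assumed dimensions: 1024 (vision encoder) -> 4096 (language model).
print(mlp_projector_params(1024, 4096))
```

Under these assumptions the projector is tiny compared with the 7B language model, which is why the design is described as simple.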

Kimi-K2.6

Kimi-K2.6 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

2,657,296 ↓ · 1,469 ♡

Florence-2-base

Florence-2-base combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

2,600,621 ↓ · 379 ♡

DeepSeek-OCR-2

DeepSeek-OCR-2 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

2,598,142 ↓ · 992 ♡

Qwen3.5-0.8B

Qwen 3.5 0.8B is Alibaba's smallest production language model in the 3.5 series, designed for on-device and edge inference. Despite its size, it supports the same instruction format as larger Qwen models and is suitable for simple classification, extraction, and short-form generation.

2,467,794 ↓ · 582 ♡

Qwen3.5-27B

Qwen3.5-27B is a dense image-text-to-text model from Alibaba, the largest dense Qwen3.5 checkpoint in this listing, for users who need more capacity than the 9B model without moving to the MoE variants. It handles both vision and language instructions.

2,428,009 ↓ · 987 ♡

gemma-3-12b-it

Gemma 3 12B is Google's mid-size instruction-tuned model in the Gemma 3 family, designed to balance capability and deployment cost. It handles both text and image inputs and is positioned between the 4B and 27B variants.

2,377,662 ↓ · 761 ♡

DeepSeek-OCR

DeepSeek OCR is a vision-language model from DeepSeek optimized specifically for optical character recognition from natural scene and document images. It aims to handle mixed layouts, multi-language text, and complex typographic scenarios.

2,320,342 ↓ · 3,283 ♡

Qwen3.5-35B-A3B

Qwen3.5-35B-A3B is a 35B total parameter mixture-of-experts multimodal model from Alibaba, with approximately 3B active parameters per token during inference. It combines vision and language understanding for image captioning, visual QA, and document analysis tasks at lower compute cost than a dense 35B model. Apache 2.0 licensed.

2,284,090 ↓ · 1,446 ♡

Qwen3-VL-2B-Instruct

Qwen3-VL-2B-Instruct is a 2-billion-parameter vision-language model from Alibaba Cloud that jointly processes images and text for visual question answering, captioning, and document understanding. Its 2B scale positions it as one of the smaller instruction-tuned VLMs capable of zero-shot visual reasoning. Apache 2.0 licensed.

2,168,395 ↓ · 427 ♡

chandra-ocr-2

chandra-ocr-2 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

2,000,546 ↓ · 391 ♡

Qwen2-VL-7B-Instruct

Qwen2-VL 7B is Alibaba's second-generation vision-language model, instruction-tuned to follow text+image prompts. It handles variable-resolution inputs natively and scores competitively against GPT-4V on standard multimodal benchmarks at the 7B scale.

1,981,248 ↓ · 1,281 ♡

Phi-3.5-vision-instruct

Phi-3.5-vision-instruct is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

1,826,191 ↓ · 736 ♡

Qwen3.6-27B-int4-AutoRound

AutoRound INT4 quantization of Qwen3.6-27B using 4-bit weights in groups of 128 (W4G128) with 16-bit activations (W4A16). AutoRound uses signed gradient descent to minimize quantization error and often outperforms GPTQ at the same bit-width. Includes a multi-token prediction (MTP) head, which can serve as the draft mechanism for speculative decoding and increase throughput in runtimes that support it.

1,825,535 ↓ · 115 ♡
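
The W4G128 layout in the AutoRound entry means each group of 128 weights shares its own scale and offset. The sketch below shows that grouping with plain round-to-nearest quantization; it is not AutoRound, which additionally tunes the rounding with signed gradient descent, and all function names here are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant group: any scale reproduces it exactly
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize_grouped(weights, group_size=128, bits=4):
    """Quantize a flat weight list group by group, one (scale, offset) per group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Toy weights: 256 values, so two groups of 128.
weights = [i / 255 - 0.5 for i in range(256)]
groups = quantize_grouped(weights)
restored = [w for codes, scale, lo in groups for w in dequantize_group(codes, scale, lo)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_err)
```

Smaller groups track local weight ranges more closely at the cost of storing more scales, which is the trade-off the group size of 128 settles.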

Kimi-K2.5

Kimi-K2.5 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,805,271 ↓ · 2,823 ♡

Qwen2-VL-7B-Instruct-AWQ

Qwen2-VL-7B-Instruct-AWQ processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

1,802,523 ↓ · 49 ♡

Qwen3-VL-235B-A22B-Instruct

Qwen3-VL-235B-A22B-Instruct combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,784,968 ↓ · 394 ♡

moondream2

Moondream2 is a 1.9B parameter vision-language model designed to be the smallest model that can meaningfully answer questions about images. It pairs a SigLIP vision encoder with a Phi-1.5 language backbone and achieves surprising capability at its size.

1,777,982 ↓ · 1,420 ♡

Qwen3.5-2B

Qwen3.5-2B combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,700,327 ↓ · 314 ♡

InternVL2-2B

InternVL2-2B processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

1,659,411 ↓ · 80 ♡

Qwen3.6-27B-AWQ-INT4

Qwen3.6-27B-AWQ-INT4 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,652,380 ↓ · 83 ♡

Qwen3.6-35B-A3B-AWQ-4bit

AWQ 4-bit quantization of Qwen3.6-35B-A3B, a mixture-of-experts model that activates approximately 3B parameters per token despite 35B total parameters. The cyankiwi quantization uses compressed-tensors format compatible with vLLM. MoE architecture means memory footprint scales with total parameters, not active ones.

1,625,050 ↓ · 79 ♡
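
The point in the Qwen3.6-35B-A3B-AWQ-4bit entry, that MoE memory follows total parameters while compute follows active ones, can be sketched numerically. The class and method names below are hypothetical, the dense comparison model is invented for illustration, and the "2 FLOPs per active parameter per token" rule is a common rough approximation rather than a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSize:
    name: str
    total_params: float   # every weight that must be resident in memory
    active_params: float  # weights actually used for one token

    def weight_memory_gb(self, bits_per_param: float) -> float:
        """Weight storage depends on total parameters, not active ones."""
        return self.total_params * bits_per_param / 8 / 1e9

    def approx_flops_per_token(self) -> float:
        """Rough forward-pass cost: about 2 FLOPs per active parameter."""
        return 2 * self.active_params


moe = ModelSize("Qwen3.6-35B-A3B", total_params=35e9, active_params=3e9)
dense = ModelSize("hypothetical dense 35B", total_params=35e9, active_params=35e9)

for m in (moe, dense):
    print(m.name, f"{m.weight_memory_gb(4):.1f} GB at 4-bit",
          f"{m.approx_flops_per_token():.1e} FLOPs/token")
```

Both models need the same weight memory at a given bit-width; only the per-token compute differs.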

gemma-4-E4B-it

gemma-4-E4B-it is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

1,587,967 ↓ · 22 ♡

gemma-3-4b-it

Gemma 3 4B Instruct is Google's compact instruction-following model, targeting deployment on single-GPU and edge devices. It covers both text and image inputs and is suitable for conversational AI applications with moderate resource constraints.

1,572,582 ↓ · 1,373 ♡

Qwen3.5-35B-A3B-FP8

Qwen3.5-35B-A3B-FP8 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

1,525,499 ↓ · 150 ♡

Qwen3.5-27B-FP8

Qwen3.5-27B-FP8 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,513,857 ↓ · 134 ♡

Qwen3.5-122B-A10B-FP8

Qwen3.5-122B-A10B-FP8 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,452,980 ↓ · 106 ♡

gemma-4-26B-A4B-it-GGUF

gemma-4-26B-A4B-it-GGUF is Unsloth's GGUF quantization of Google's Gemma 4 26B mixture-of-experts instruction-tuned multimodal model. Only ~4B parameters are active per token, but all 26B must be loaded, so the quantized files target roughly 16–24GB of VRAM depending on the quantization level, while retaining vision and text understanding capabilities. GGUF format provides llama.cpp and Ollama compatibility for local self-hosted deployment.

1,449,270 ↓ · 892 ♡

Qwen2.5-VL-7B-Instruct-AWQ

Qwen2.5-VL-7B-Instruct-AWQ is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

1,232,692 ↓ · 105 ♡

gemma-3-27b-it

gemma-3-27b-it combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

1,225,290 ↓ · 1,981 ♡

gemma-4-31B-it-FP8-block

gemma-4-31B-it-FP8-block is an FP8-quantized version of Google's Gemma 4 31B multimodal (text + image) instruction-tuned model, for reduced VRAM on supported GPU backends (vLLM, llm-compressor). The 31B parameters are stored as lower-precision weights for deployment on memory-constrained hardware, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,201,658 ↓ · 33 ♡

gemma-4-12b-it-GGUF

gemma-4-12b-it-GGUF is a quantized GGUF build, for llama.cpp, LM Studio, and compatible runtimes, of Google's Gemma 4 multimodal (text + image) instruction-tuned model. Its parameters are stored as lower-precision weights for deployment on memory-constrained hardware or Apple Silicon, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,182,658 ↓ · 665 ♡

Llama-3.1-Nemotron-Nano-VL-8B-V1

Llama-3.1-Nemotron-Nano-VL-8B-V1 is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

1,171,362 ↓ · 181 ♡

Qwen3.5-9B-GGUF

Qwen3.5-9B-GGUF processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

1,021,791 ↓ · 697 ♡

Qwen3.6-35B-A3B-GGUF

Unsloth's GGUF-converted and optionally quantized version of Qwen3.6-35B-A3B, optimized for local inference via llama.cpp and Ollama. Unsloth applies custom quantization recipes to reduce size while minimizing quality loss.

1,017,926 ↓ · 1,254 ♡

SmolVLM-256M-Instruct

SmolVLM-256M-Instruct combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

958,339 ↓ · 367 ♡

llava-onevision-qwen2-0.5b-ov-hf

llava-onevision-qwen2-0.5b-ov-hf combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

942,186 ↓ · 55 ♡

Qwen3.5-122B-A10B

Qwen3.5-122B-A10B combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

934,724 ↓ · 575 ♡

Qwen3.6-27B-MTP-GGUF

Unsloth's GGUF quantisation of Qwen3.6-27B with Multi-Token Prediction (MTP) heads, enabling speculative decoding with compatible runtimes like llama.cpp. MTP allows the model to predict multiple future tokens per step, increasing throughput on CPU and single-GPU machines. Unsloth applies imatrix-based importance weighting to reduce quality loss in lower-bit GGUF variants.

923,833 ↓ · 793 ♡
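
The throughput benefit of MTP-based speculative decoding described in the Qwen3.6-27B-MTP-GGUF entry depends on how many drafted tokens the main model accepts. The sketch below uses a simplified textbook model that assumes each drafted token is accepted independently with the same probability; the function name is hypothetical and real acceptance rates vary with the prompt and runtime.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `acceptance_rate`, and that one token from the main model
    is always emitted (the correction or the bonus token).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.8, 0.95):
    print(rate, round(expected_tokens_per_step(rate, draft_len=3), 3))
```

With no drafted tokens accepted the result falls back to one token per step, which is ordinary autoregressive decoding.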

gemma-4-E4B-it-GGUF

gemma-4-E4B-it-GGUF is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

903,533 ↓ · 511 ♡

Qwen3.5-9B-NVFP4

AxionML's NVFP4 quantization of Qwen3.5-9B using NVIDIA's ModelOpt toolkit, targeting sglang and vLLM serving. Qwen3.5-9B is a multimodal model with image-text input capability; the NVFP4 format enables deployment at reduced memory cost. Native FP4 tensor-core throughput requires NVIDIA GPUs with FP4 support (the Blackwell generation); Hopper GPUs such as the H100 offer FP8 but not FP4 tensor cores. ModelOpt derives the weight scaling factors through calibration.

899,150 ↓ · 17 ♡

Qwen3.5-4B-GGUF

Qwen3.5-4B-GGUF processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

884,398 ↓ · 287 ♡

gemma-4-31B-it-AWQ-4bit

gemma-4-31B-it-AWQ-4bit is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

863,064 ↓ · 49 ♡

Qwen3.5-397B-A17B-FP8

Qwen3.5-397B-A17B-FP8 is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

861,790 ↓ · 179 ♡

Qwen3.6-35B-A3B-MTP-GGUF

Unsloth's GGUF quantization of Qwen3.6-35B-A3B, a sparse MoE model with 35B total parameters but only ~3B active per token, enhanced with Multi-Token Prediction heads for speculative decoding. The imatrix calibration in Unsloth's quantization pipeline reduces perplexity loss compared to uncalibrated GGUF. Because all 35B parameters must be loaded, this large multimodal model fits on consumer hardware only through aggressive quantization.

857,099 ↓ · 562 ♡

Qwen3.5-9B-AWQ-4bit

AWQ 4-bit quantization of Qwen3.5-9B, a dense image-text-to-text model. At 9B parameters with AWQ INT4, inference requires roughly 6-8 GB VRAM, placing it within reach of RTX 3080/4070-class cards. The compressed-tensors format is vLLM-native.

844,925 ↓ · 33 ♡

deepseek-vl2-tiny

deepseek-vl2-tiny is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

825,421 ↓ · 248 ♡

gemma-4-E2B-it-GGUF

gemma-4-E2B-it-GGUF combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

813,669 ↓ · 245 ♡

Qwen3-VL-2B-Instruct-GGUF

Qwen3-VL-2B-Instruct-GGUF is Unsloth's GGUF distribution of Qwen3-VL-2B-Instruct, making the 2B vision-language model directly usable in llama.cpp, LM Studio, and Ollama. At 2B parameters, it is intended for on-device or memory-constrained deployment scenarios where a capable VLM must run locally. Quantization options (Q4, Q5, Q8) allow further trade-offs between quality and memory.

807,354 ↓ · 33 ♡

LFM2.5-VL-450M

LFM2.5-VL-450M is LiquidAI's 450M-parameter multimodal edge model from the LFM2.5-VL family, supporting 10 languages and designed for on-device deployment on mobile and embedded hardware. It uses LiquidAI's custom LFM2 architecture (not a standard transformer) for efficient inference at the sub-500M scale. Despite the small size, it handles image+text inputs across English, Japanese, Korean, French, Spanish, German, Arabic, Chinese, Portuguese, and others.

771,726 ↓ · 187 ♡

EXAONE-4.5-33B

EXAONE-4.5-33B processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

757,019 ↓ · 163 ♡

olmOCR-2-7B-1025-FP8

olmOCR-2-7B-1025-FP8 is AllenAI's FP8-quantized vision-language model for optical character recognition and document understanding, fine-tuned from Qwen2.5-VL-7B. It is optimized for extracting text from PDFs, research papers, and complex document layouts including tables, equations, and multi-column formats. The FP8 quantization roughly halves weight memory relative to BF16, leaving more room for batching on a single GPU.

753,813 ↓ · 241 ♡

MiniCPM-V-4.6

MiniCPM-V-4.6 is OpenBMB's lightweight on-device multimodal model, optimized for image+text tasks at a small parameter count. Version 4.6 targets improved document OCR, mathematical diagram understanding, and multilingual captioning within the constraints of mobile or edge deployment. It is compatible with deployment via llama.cpp or the MiniCPM-specific inference stack.

751,380 ↓ · 1,121 ♡

Qwen3.6-27B-GGUF

Qwen3.6-27B-GGUF is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

750,150 ↓ · 809 ♡

llava-v1.6-mistral-7b-hf

llava-v1.6-mistral-7b-hf is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

747,661 ↓ · 310 ♡

Qwen3.6-27B-MLX-4bit

Qwen3.6-27B-MLX-4bit is an MLX 4-bit quantized version of Qwen3.6-27B, packaged by lmstudio-community for inference on Apple Silicon via the MLX framework. MLX quantization converts the model to integer weights while preserving floating-point activations, enabling the 27B model to run within 16-24GB unified memory on M2/M3 Pro or Ultra configurations. Intended for use in LM Studio or direct MLX inference.

744,421 ↓ · 5 ♡

Qwen3.6-27B-NVFP4

Unsloth's NVFP4 (NVIDIA FP4) quantization of Qwen3.6-27B, targeting inference on NVIDIA GPUs with FP4 hardware support. Native FP4 compute arrived with the Blackwell generation; Hopper (H100/H200) and Ada Lovelace GPUs provide FP8 but not FP4 tensor cores, so the significant throughput gains over BF16 apply on FP4-capable hardware.

743,390 ↓ · 85 ♡

Molmo2-8B

Molmo2-8B is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

740,056 ↓ · 189 ♡

InternVL2-1B

InternVL2-1B processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

736,039 ↓ · 82 ♡

blip2-opt-2.7b

blip2-opt-2.7b processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

734,780 ↓ · 446 ♡

gemma-4-26B-A4B-it-QAT-MLX-4bit

gemma-4-26B-A4B-it-QAT-MLX-4bit is a set of MLX 4-bit quantized weights, optimized for Apple Silicon inference, of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model. The 26B parameters are stored as lower-precision weights for deployment on Apple Silicon, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

719,520 ↓ · 1 ♡

dots.mocr

dots.mocr is RedNote's multimodal OCR model based on a custom Transformer architecture, designed for high-accuracy text extraction from documents including complex layouts, tables, formulas, and mixed Chinese-English content. It goes beyond standard OCR by understanding document structure, making it suitable for parsing invoices, forms, and academic papers in both Chinese and English.

718,353 ↓ · 132 ♡

Qwen3.6-27B-MLX-8bit

Qwen3.6-27B-MLX-8bit is an MLX 8-bit quantized version of Qwen3.6-27B, packaged by lmstudio-community for inference on Apple Silicon via the MLX framework. MLX quantization converts the model to integer weights while preserving floating-point activations. At 8 bits per weight the 27B model's weights alone occupy roughly 27GB, so this variant needs a Mac with 32GB or more of unified memory. Intended for use in LM Studio or direct MLX inference.

709,144 ↓ · 2 ♡

Qwen3.5-4B-AWQ-4bit

AWQ 4-bit quantization of Qwen3.5-4B, a dense multimodal model supporting image-text-to-text tasks. At 4B parameters with AWQ compression, inference fits within ~4 GB VRAM, making it accessible on mid-range consumer cards. The compressed-tensors format targets vLLM serving.

704,152 ↓ · 15 ♡

Qwen3.5-35B-A3B-GPTQ-Int4

Qwen3.5-35B-A3B-GPTQ-Int4 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

700,414 ↓ · 89 ♡

Llama-4-Scout-17B-16E-Instruct

Llama 4 Scout is one of Meta's first MoE models in the Llama series: about 17B parameters are active per token, routed across 16 experts, out of roughly 109B in total. The instruct variant follows instructions and handles image-text inputs natively, supporting 12 languages. Scout targets deployments where multimodal capability is needed at a lower active-parameter cost than large dense Llama 3 models.

683,261 ↓ · 1,309 ♡

diffusiongemma-26B-A4B-it

diffusiongemma-26B-A4B-it is Google's experimental diffusion-based language model built on the Gemma 4 MoE architecture, applying masked diffusion to text generation instead of autoregressive decoding. At 26B total parameters (about 4B active per token, per the A4B suffix), it explores whether diffusion LMs can match autoregressive quality on instruction-following tasks. It accepts text and image inputs and produces text through iterative denoising.

673,464 ↓ · 1,025 ♡

Qwen3-VL-8B-Instruct-FP8

Qwen3-VL-8B-Instruct-FP8 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

666,843 ↓ · 72 ♡

Qwen3.6-27B-MLX-6bit

Qwen3.6-27B-MLX-6bit is an MLX 6-bit quantized version of Qwen3.6-27B, packaged by lmstudio-community for inference on Apple Silicon via the MLX framework. MLX quantization converts the model to integer weights while preserving floating-point activations. At 6 bits per weight the 27B model's weights alone occupy roughly 20GB, so it needs more unified memory than the 4-bit build and is a tight fit at 24GB. Intended for use in LM Studio or direct MLX inference.

660,946 ↓ · 0 ♡

Qwen3.6-27B-MLX-5bit

Qwen3.6-27B-MLX-5bit is an MLX 5-bit quantized version of Qwen3.6-27B, packaged by lmstudio-community for inference on Apple Silicon via the MLX framework. MLX quantization converts the model to integer weights while preserving floating-point activations. At 5 bits per weight the 27B model's weights alone occupy roughly 17GB, which puts it at the upper end of the 16-24GB unified memory range on M2/M3 Pro or Ultra configurations. Intended for use in LM Studio or direct MLX inference.

659,075 ↓ · 0 ♡

Cosmos-Reason2-2B

Cosmos-Reason2-2B is NVIDIA's 2B visual reasoning model from the Cosmos series, fine-tuned from Qwen3-VL-2B for physical world understanding tasks. It is trained to reason about spatial relationships, object interactions, and temporal dynamics in images and videos, targeting robotics and autonomous system perception research. Despite the 2B scale, the Cosmos training pipeline includes extensive world-model data.

656,689 ↓ · 104 ♡

SmolVLM2-500M-Video-Instruct

SmolVLM2-500M-Video-Instruct processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

643,599 ↓ · 152 ♡

Qwen3.5-397B-A17B

Qwen3.5-397B-A17B combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

641,540 ↓ · 1,515 ♡

gemma-4-31B-it-GGUF

gemma-4-31B-it-GGUF processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

640,946 ↓ · 495 ♡

Gemma-4-E4B-Uncensored-HauhauCS-Aggressive

Gemma-4-E4B-Uncensored-HauhauCS-Aggressive processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

628,886 ↓ · 811 ♡

Florence-2-large

Florence-2-large is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

618,487 ↓ · 1,820 ♡

gemma-4-26B-A4B-it-qat-GGUF

gemma-4-26B-A4B-it-qat-GGUF is a quantized GGUF build, for llama.cpp, LM Studio, and compatible runtimes, of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model. The 26B parameters are stored as lower-precision weights for deployment on memory-constrained hardware or Apple Silicon, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

616,828 ↓ · 187 ♡

medgemma-27b-it

medgemma-27b-it is Google's 27B medical vision-language model, fine-tuned from Gemma 3 on radiology reports, chest X-rays, histopathology slides, dermatology images, and ophthalmology fundus photographs. It is designed for medical image interpretation research, not clinical deployment. The model accepts image+text input and outputs clinical-style text descriptions, differentials, or structured findings.

594,626 ↓ · 368 ♡

granite-docling-258M

granite-docling-258M is a 258M-parameter vision-language model fine-tuned specifically for document understanding tasks within the Docling pipeline. It handles OCR, layout parsing, table extraction, formula recognition, and chart reading in a single inference pass. The model is built on the Idefics3 architecture and integrates directly with the open-source Docling library.

592,037 ↓ · 1,184 ♡

Qwen3-VL-30B-A3B-Instruct

Qwen3-VL-30B-A3B-Instruct processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

588,774 ↓ · 579 ♡

paligemma-3b-pt-224

PaliGemma 3B pretrained at 224×224 resolution is Google's compact vision-language model checkpoint before instruction fine-tuning. The PT (pretrained) variant is intended as a foundation for task-specific fine-tuning rather than direct deployment.

584,140 ↓ · 483 ♡

UI-TARS-1.5-7B

UI-TARS-1.5-7B is ByteDance's 7B GUI agent model built on Qwen2.5-VL, fine-tuned for autonomous interaction with graphical user interfaces. It can interpret screenshots, identify UI elements, and generate action sequences (click, type, scroll) to complete computer tasks from natural language instructions. Version 1.5 improves over 1.0 on web-based task completion and cross-platform generalization.

582,237 ↓ · 565 ♡

Qwen3.5-27B-AWQ-4bit

A 4-bit AWQ quantization of Qwen3.5-27B, a multimodal model combining image and text understanding at 27B parameters. AWQ uses activation statistics to identify the most important weight channels and rescales them before quantization, reducing accuracy loss compared to plain round-to-nearest quantization. The result fits in significantly less GPU memory than the BF16 checkpoint while remaining compatible with vLLM and transformers backends.

572,866 ↓ · 41 ♡

GLM-4.1V-9B-Thinking

GLM-4.1V-9B-Thinking is Zhipu AI's 9B-parameter vision-language model with an integrated chain-of-thought reasoning module. The 'Thinking' variant explicitly generates internal reasoning steps before producing final answers, improving performance on complex visual question answering and multi-step visual reasoning tasks. It supports English and Chinese natively.

570,941 ↓ · 776 ♡

Qwen3.6-27B-Uncensored-HauhauCS-Aggressive

An 'aggressive' uncensored GGUF variant of Qwen3.6-27B, with refusal behavior removed via abliteration. Available in imatrix-calibrated GGUF quantizations. Removing refusals affects the model's ability to decline harmful requests, and this community modification ships without safety evaluation.

566,788 ↓ · 458 ♡

Qwen3.5-9B-AWQ

Qwen3.5-9B-AWQ is a 4-bit AWQ quantization of Qwen3.5-9B, packaged for vLLM deployment. Qwen3.5-9B is a multimodal model in Alibaba's Qwen3.5 series, and the 9B size targets a balance of quality and throughput. AWQ (Activation-aware Weight Quantization) uses activation statistics from calibration data to limit output degradation, which makes the checkpoint a candidate for memory-constrained serving.

554,765 ↓ · 22 ♡

tiny-Qwen2_5_VLForConditionalGeneration

tiny-Qwen2_5_VLForConditionalGeneration processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

554,263 ↓ · 0 ♡

gemma-4-31B-it-unsloth-bnb-4bit

gemma-4-31B-it-unsloth-bnb-4bit combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

551,869 ↓ · 19 ♡

Qwen3-VL-4B-Thinking

Qwen3-VL-4B-Thinking combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

542,482 ↓ · 111 ♡

Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF

A 27B-parameter GGUF quantization of Qwen3.6 fine-tuned for creative writing, fiction, and code, with safety refusals abliterated as part of the Heretic series. The imatrix quantization keeps perplexity closer to the unquantized model than naive integer quantization at the same bit-width.

523,948 ↓ · 333 ♡

Qwen3-VL-32B-Thinking-FP8

Qwen3-VL-32B-Thinking-FP8 is a 32B FP8-quantized Qwen3 vision-language model with extended reasoning ('Thinking') mode, enabling multi-step chain-of-thought for complex visual analysis tasks. FP8 quantization allows it to run on a single 80GB GPU rather than requiring a multi-GPU setup for the full BF16 model. The Thinking mode produces visible reasoning traces before the final answer, improving accuracy on math, logic, and diagram interpretation.

513,940 ↓ · 26 ♡

MinerU2.5-Pro-2604-1.2B

MinerU2.5-Pro is a 1.2B-parameter vision-language model from OpenDataLab fine-tuned for high-accuracy document parsing, built on the Qwen2-VL backbone. It handles mixed Chinese-English documents, extracting structured content from PDFs including formulas, tables, and figures. Version 2.5-Pro targets production-grade accuracy improvements over the earlier MinerU pipeline.

510,982 ↓ · 156 ♡

Qwen2.5-VL-72B-Instruct

Qwen2.5-VL-72B-Instruct is Qwen's 72B vision-language model, the largest in the Qwen2.5-VL series, handling image, video, and text inputs with a 32K token context window. At 72B scale it targets document understanding, complex visual reasoning, and structured extraction from multi-page documents. It supports bounding box output for grounded visual answers.

509,711 ↓ · 629 ♡

Step3-VL-10B

Step3-VL-10B is StepFun's 10B-parameter vision-language model with a custom transformer architecture (step_robotics). It targets multimodal understanding tasks including image captioning, visual QA, and document reading. The model uses safetensors weights with custom inference code and is positioned as a mid-size VLM in StepFun's model family.

501,621 ↓ · 409 ♡

medgemma-1.5-4b-it

medgemma-1.5-4b-it is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

485,173 ↓ · 688 ♡

Qwen3.6-27B-AWQ

QuantTrio's AWQ 4-bit quantization of Qwen3.6-27B, a dense (non-MoE) multimodal model supporting image and text inputs. Tagged for vLLM serving with compressed-tensors compatibility. Qwen3.5/3.6 dense variants trade MoE routing complexity for more predictable latency.

482,913 ↓ · 17 ♡

Qwen3.5-35B-A3B-AWQ-4bit

Qwen3.5-35B-A3B-AWQ-4bit processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

480,489 ↓ · 45 ♡

Qwen3.5-9B-MLX-4bit

Qwen3.5-9B-MLX-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

469,601 ↓ · 2 ♡

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

An abliterated (refusal-removed) GGUF fine-tune of Qwen3.6-35B-A3B, produced via a Wasserstein-distance-guided weight adjustment technique to remove model refusal behaviour. The 'uncensored' label means safety filters have been deliberately removed. This is a community research release targeting users who need the model to engage with content that safety-tuned models decline.

468,102 ↓ · 96 ♡

medgemma-4b-it

MedGemma-4B-it is Google's 4B instruction-tuned multimodal model specialized for medical image and text understanding, covering radiology, dermatology, pathology, and ophthalmology. It accepts medical images (chest X-rays, skin images, histology slides, fundus photos) paired with clinical questions. It is not cleared for clinical decision support and is intended for research and development only.

466,523 ↓ · 975 ♡

Qwen2.5-VL-32B-Instruct

Qwen2.5-VL-32B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

458,836 ↓ · 491 ♡

Qwen3.6-35B-A3B-GPTQ-Int4

Qwen3.6-35B-A3B-GPTQ-Int4 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

454,597 ↓ · 24 ♡

gemma-3-4b-pt

Gemma-3-4B-PT is Google's 4B pre-trained (base, non-instruction-tuned) multimodal model from the Gemma-3 family, accepting image and text inputs. As a base model it requires fine-tuning or careful prompting for task-specific use. The Gemma license applies.

440,651 ↓ · 155 ♡

gemma-4-E4B-it-unsloth-bnb-4bit

gemma-4-E4B-it-unsloth-bnb-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

437,464 ↓ · 22 ♡

gemma-4-31B

Gemma 4 31B is Google's base (non-instruct) multimodal language model with image-text-to-text capability, released under Apache 2.0. As a base model, it is intended for fine-tuning and research rather than direct deployment as a chat assistant. The 31B parameter count puts it in a tier where it competes with Mistral-Medium and Llama 3.1 70B in terms of raw capability before instruction tuning.

433,117 ↓ · 422 ♡

Qwen3.6-35B-A3B-AWQ

Qwen3.6-35B-A3B-AWQ processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

432,502 ↓ · 26 ♡

InternVL2-8B

InternVL2-8B is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

429,442 ↓ · 187 ♡

Idefics3-8B-Llama3

Idefics3-8B-Llama3 is HuggingFace's open multimodal model combining a SigLIP vision encoder with a Llama 3 8B language backbone. It is designed to follow instructions over interleaved image-text inputs and was released alongside training infrastructure on the HuggingFace Hub. The model is positioned as a fully open (weights + training code + datasets) alternative to commercial VLMs.

427,686 ↓ · 304 ♡

gemma-3-27b-it-GPTQ-4b-128g

gemma-3-27b-it-GPTQ-4b-128g is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

425,125 ↓ · 44 ♡

Qwen3.6-35B-A3B-MLX-8bit

Qwen3.6-35B-A3B-MLX-8bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

419,474 ↓ · 0 ♡

Qwen3.6-35B-A3B-MLX-4bit

Qwen3.6-35B-A3B-MLX-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

418,170 ↓ · 1 ♡

Qwen3.5-9B-DeepSeek-V4-Flash-GGUF

A GGUF-quantised Qwen3.5-9B fine-tuned with DeepSeek V4 Flash distillation — the model has been trained on reasoning traces from a larger DeepSeek teacher to improve chain-of-thought quality at 9B scale. It targets multilingual reasoning with long-context CoT traces, supporting English, Chinese, Korean, Japanese, Spanish, and Russian. The GGUF format enables llama.cpp local inference.

415,934 ↓ · 232 ♡

gemma-4-31B-it-qat-w4a16-ct

gemma-4-31B-it-qat-w4a16-ct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

413,910 ↓ · 27 ♡

Phi-3.5-vision-instruct-int8-ov

Phi-3.5-vision-instruct-int8-ov is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

413,139 ↓ · 2 ♡

MinerU2.5-2509-1.2B

MinerU2.5-2509-1.2B is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

409,174 ↓ · 356 ♡

Qwen3.5-9B-MLX-8bit

Qwen3.5-9B-MLX-8bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

407,052 ↓ · 0 ♡

Qwen3.6-35B-A3B-MLX-6bit

Qwen3.6-35B-A3B-MLX-6bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

397,892 ↓ · 0 ♡

NVIDIA-Nemotron-Parse-v1.1

NVIDIA-Nemotron-Parse-v1.1 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

382,564 ↓ · 169 ♡

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

380,877 ↓ · 122 ♡

Qwopus3.6-27B-v1-preview-GGUF

Qwopus3.6-27B is a GGUF-quantised preview fine-tune of Qwen3.6-27B positioned as a 'Claude Opus-style' reasoning and instruction model — the name blends Qwen and Opus. It targets advanced instruction following, multilingual reasoning, and multimodal (vision-language) tasks. As a v1 preview, evaluation is still community-driven and production use should be preceded by task-specific benchmarking.

375,239 ↓ · 125 ♡

Qwen3-VL-8B-Thinking

Qwen3-VL-8B-Thinking is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

374,948 ↓ · 210 ♡

SmolVLM-500M-Instruct

SmolVLM-500M-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

372,795 ↓ · 195 ♡

Qwen3.5-27B-GPTQ-Int4

Official Alibaba GPTQ INT4 quantization of Qwen3.5-27B, a dense multimodal model for image and text tasks. GPTQ INT4 reduces memory to approximately 15-18 GB, making the model accessible on A100 or RTX 4090-class hardware. Apache-2.0 licensed.

370,018 ↓ · 55 ♡
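
The 15-18 GB figure quoted for the GPTQ INT4 checkpoint above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only and assumes exactly 27 billion parameters; quantization scales, any layers left in higher precision, the KV cache, and activations all add to this, which is presumably why the quoted range sits above the raw 4-bit number. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


bf16_gb = weight_memory_gb(27, 16)  # 16-bit baseline: 2 bytes per weight
int4_gb = weight_memory_gb(27, 4)   # 4-bit weights: half a byte per weight

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"INT4 weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights alone shrink by a factor of four, from roughly 54 GB to roughly 13.5 GB, before any runtime overhead is added.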

gemma-3n-E2B-it

Gemma-3n-E2B-it is Google's instruction-tuned 2B edge model from the Gemma-3n family, combining image, audio, video, and text understanding in a single model. The 'n' suffix indicates the next-generation architecture with per-layer embeddings for efficiency. Gemma license applies — allows research and commercial use with restrictions.

368,852 ↓ · 304 ♡

gemma-4-31B-it-MLX-8bit

An 8-bit MLX quantization of Google's Gemma 4 31B instruct model, prepared by the LM Studio community for Apple Silicon local inference. Gemma 4 31B is a dense instruction-tuned model targeting the mid-to-high capability tier.

367,599 ↓ · 2 ♡

Qwen3-VL-4B-Instruct-FP8

Qwen3-VL-4B-Instruct-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

363,078 ↓ · 63 ♡

Qwen3.6-27B-AWQ-BF16-INT4

Qwen3.6-27B-AWQ-BF16-INT4 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

359,795 ↓ · 38 ♡

Qwen3.5-122B-A10B-AWQ-4bit

Qwen3.5-122B-A10B-AWQ-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

358,118 ↓ · 38 ♡

Qwen2.5-VL-7B-Instruct-GGUF

Qwen2.5-VL-7B-Instruct-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

356,460 ↓ · 189 ♡

gemma-3-27b-it-quantized.w4a16

gemma-3-27b-it-quantized.w4a16 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

353,484 ↓ · 13 ♡

kanana-1.5-v-3b-instruct

Kanana-1.5-V is Kakao's 3B vision-language instruct model, part of their Kanana model family targeting Korean-English bilingual multimodal tasks. It is optimized for practical deployment at 3B parameters while retaining usable visual understanding.

353,198 ↓ · 55 ♡

InternVL3-8B-AWQ

InternVL3-8B-AWQ is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

348,390 ↓ · 8 ♡

Qwen3-VL-32B-Instruct-FP8

Qwen3-VL-32B-Instruct-FP8 combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

348,280 ↓ · 46 ♡

SmolVLM2-2.2B-Instruct

SmolVLM2-2.2B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

346,122 ↓ · 321 ♡

Qwopus3.6-27B-v2-MTP-GGUF

Qwopus3.6-27B-v2-MTP-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

344,928 ↓ · 330 ♡

LightOnOCR-2-1B

LightOnOCR-2-1B is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

344,735 ↓ · 691 ♡

Qwen3.5-35B-A3B-AWQ

An AWQ 4-bit quantization of Qwen3.5-35B-A3B (a 35B MoE with 3B active parameters) by QuantTrio, enabling memory-efficient inference on single high-VRAM GPUs. Because every expert must stay resident in memory, the 4-bit quantization applies to all 35B parameters, even though only about 3B are active per token.

343,576 ↓ · 18 ♡
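
For an MoE checkpoint like the one above, memory and compute scale with different parameter counts. The sketch below is a rough illustration under simplifying assumptions: memory is driven by total parameters (all experts are loaded), per-token compute by active parameters, and overheads such as quantization scales and the KV cache are ignored. `moe_footprint` is a hypothetical helper written for this example.

```python
def moe_footprint(total_params_b, active_params_b, bits_per_weight):
    """Return (weight memory in GB, fraction of parameters active per token)."""
    # Billions of parameters times bytes per weight gives decimal gigabytes.
    weight_gb = total_params_b * bits_per_weight / 8
    active_fraction = active_params_b / total_params_b
    return weight_gb, active_fraction


# 35B total parameters, 3B active per token, 4-bit weights.
weight_gb, active_fraction = moe_footprint(35, 3, 4)

print(f"Weights resident in memory: ~{weight_gb:.1f} GB")
print(f"Parameters used per token: ~{active_fraction:.1%}")
```

The point of the split is that quantization helps with the memory side (all 35B parameters), while the MoE routing already keeps the per-token compute side small.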

Qwen3.5-397B-A17B-AWQ-4bit

Qwen3.5-397B-A17B-AWQ-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

340,681 ↓ · 3 ♡

InternVL3-1B-hf

InternVL3-1B-hf is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

338,460 ↓ · 10 ♡

HunyuanOCR

HunyuanOCR is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

337,307 ↓ · 758 ♡

Cosmos-Reason2-8B

Cosmos-Reason2-8B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

334,080 ↓ · 193 ♡

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

332,853 ↓ · 656 ♡

Qwen3-VL-30B-A3B-Instruct-FP8

Qwen3-VL-30B-A3B-Instruct-FP8 is a 30B MoE vision-language model in FP8 precision with 3B active parameters per token, instruction-tuned for multimodal tasks. It combines Qwen3's language capability with vision understanding and is optimized for H100-class GPU serving.

331,510 ↓ · 110 ♡

gemma-3-27b-it-abliterated

An abliterated version of Google's Gemma-3-27B-IT, with safety refusal mechanisms removed by mlabonne using directional activation manipulation. Gemma license applies to the underlying weights. The abliteration removes content restrictions while preserving the model's multimodal instruction-following capability.

324,867 ↓ · 322 ♡

gemma-3n-E4B-it

Gemma 3n E4B Instruct repackaged by Unsloth for efficient local fine-tuning and inference. Gemma 3n is Google's on-device model family designed for mobile and edge hardware; E4B uses per-layer selective parameter activation to run with approximately 4B effective parameters while having a larger total capacity. Unsloth's repackage enables QLoRA fine-tuning of this model on consumer GPUs.

318,961 ↓ · 10 ♡

Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled

Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

318,006 ↓ · 60 ♡

translategemma-4b-it

TranslateGemma-4b-it is Google's Gemma 3-based 4B instruction-tuned model fine-tuned specifically for translation tasks. Unlike generic multilingual LLMs, it was trained with translation as a primary objective, producing more accurate and fluent translations than prompting a general-purpose model. It uses the standard HuggingFace transformers interface for translation inference.

317,340 ↓ · 780 ♡

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

316,109 ↓ · 1,407 ♡

Qianfan-OCR

Qianfan-OCR is Baidu's vision-language model specialized for optical character recognition and document intelligence, supporting multilingual text extraction from images. It combines a vision encoder with a language model for scene text understanding beyond simple character recognition. Apache-2.0 licensed with published benchmark results.

313,490 ↓ · 1,176 ♡

Qwopus3.6-35B-A3B-v1-GGUF

Qwopus3.6-35B-A3B-v1-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

313,403 ↓ · 203 ♡

Qwen3.5-35B-A3B-GGUF

Qwen3.5-35B-A3B-GGUF processes interleaved image-text input and produces free-form text output. Scene understanding, chart reading, and screenshot analysis are within scope.

312,514 ↓ · 843 ♡

gemma-4-31B-it-AWQ

gemma-4-31B-it-AWQ is a vision-language model that takes images and text prompts as input and generates text responses. It handles visual QA, image description, and document parsing.

311,957 ↓ · 11 ♡

Kimi-K2.6-GGUF

Unsloth's GGUF conversion of Moonshot AI's Kimi K2.6 MoE model, enabling local inference via llama.cpp. Kimi K2 is a large MoE model from Moonshot AI notable for its strong reasoning performance at a competitive compute cost.

309,012 ↓ · 157 ♡

gemma-3n-E4B-it-MLX-bf16

gemma-3n-E4B-it-MLX-bf16 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

308,673 ↓ · 3 ♡

Qwen3-VL-235B-A22B-Instruct-FP8

Qwen3-VL-235B-A22B-Instruct-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

308,125 ↓ · 44 ♡

InternVL2_5-8B

InternVL2_5-8B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

308,120 ↓ · 104 ♡

Qwen3.5-27B-AWQ

QuantTrio's AWQ 4-bit quantisation of Qwen3.5-27B, a multimodal image-text model at 27 billion parameters. This variant uses vLLM-compatible AWQ serialisation and targets teams running the 27B model on GPU servers with constrained memory. QuantTrio maintains several AWQ quantisations of Qwen family models with consistent quantisation settings.

307,462 ↓ · 43 ♡

gemma-3n-E4B-it-MLX-8bit

gemma-3n-E4B-it-MLX-8bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

307,139 ↓ · 0 ♡

NVIDIA-Nemotron-Nano-12B-v2-VL-FP8

NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

306,743 ↓ · 50 ♡

google_gemma-4-26B-A4B-it-GGUF

google_gemma-4-26B-A4B-it-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

304,592 ↓ · 113 ♡

Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit

Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit combines a visual encoder with a language decoder to answer questions about images. The model reasons over image patches alongside the text context.

304,064 ↓ · 10 ♡

gemma-3n-E4B-it-MLX-6bit

gemma-3n-E4B-it-MLX-6bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

303,662 ↓ · 0 ♡

Qwen3-VL-2B-Instruct-FP8

Qwen3-VL-2B-Instruct-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

303,557 ↓ · 39 ♡

RolmOCR

RolmOCR is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

302,994 ↓ · 586 ♡

gemma-3-27b-it-AWQ-INT4

gemma-3-27b-it-AWQ-INT4 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

300,315 ↓ · 7 ♡

gemma-3-4b-it-qat-4bit

gemma-3-4b-it-qat-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

300,091 ↓ · 8 ♡

Gemma-4-E2B-Uncensored-HauhauCS-Aggressive

Gemma-4-E2B-Uncensored-HauhauCS-Aggressive is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

299,361 ↓ · 165 ♡

gemma-4-31B-it-MLX-4bit

gemma-4-31B-it-MLX-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

298,419 ↓ · 1 ♡

Qwen3.5-27B-GGUF

Qwen3.5-27B-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

297,706 ↓ · 490 ♡

Qwen3.5-0.8B-GGUF

Unsloth's GGUF conversion of Qwen3.5-0.8B, the smallest model in the Qwen3.5 series. At 0.8B parameters, it targets heavily constrained inference environments such as Raspberry Pi-class single-board computers, or embedding directly in applications.

297,370 ↓ · 178 ♡

Qwen3.5-2B-GGUF

Qwen3.5-2B-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

296,345 ↓ · 100 ♡

Qwen3.5-122B-A10B-GPTQ-Int4

Qwen3.5-122B-A10B-GPTQ-Int4 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

295,370 ↓ · 37 ♡

Qwopus3.5-9B-v3

Qwopus3.5-9B-v3 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

294,775 ↓ · 88 ♡

Nanonets-OCR2-3B

Nanonets-OCR2-3B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

293,380 ↓ · 505 ♡

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

290,793 ↓ · 2,814 ♡

gemma-3n-E4B-it-MLX-4bit

gemma-3n-E4B-it-MLX-4bit is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

289,004 ↓ · 2 ♡

Qwen3.5-9B-FP8

Qwen3.5-9B-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

287,785 ↓ · 10 ♡

google_gemma-4-31B-it-GGUF

google_gemma-4-31B-it-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

285,205 ↓ · 62 ♡

llava-v1.5-7b

LLaVA 1.5 7B is Haotian Liu et al.'s multimodal instruction-following model combining a CLIP vision encoder with a Vicuna-7B language model. At 7B, it was one of the strongest open VLMs at its release and remains a common fine-tuning starting point.

235,049 ↓ · 555 ♡

gemma-4-26B-A4B-it-qat-q4_0-gguf

gemma-4-26B-A4B-it-qat-q4_0-gguf is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

229,514 ↓ · 70 ♡