gemma-4-31B-it vs Qwen3.5-9B

gemma-4-31B-it and Qwen3.5-9B are both open-weight image-text-to-text (vision-language) models; the entries below summarize each.

gemma-4-31B-it

Pipeline: image-text-to-text
Downloads: 8,206,643
Likes: 2,526

Gemma 4-31B-IT is Google DeepMind's 31-billion-parameter instruction-tuned vision-language model from the Gemma 4 family, supporting both image and text inputs. It offers strong multimodal reasoning at open-weight scale, and its Apache 2.0 license makes it directly deployable in commercial applications. It is part of the gemma4 architecture, with improvements over Gemma 2.

Qwen3.5-9B

Pipeline: image-text-to-text
Downloads: 7,745,704
Likes: 1,388

Qwen3.5-9B is a 9-billion-parameter instruction-tuned vision-language model from Alibaba Cloud's Qwen3.5 series, fine-tuned from Qwen3.5-9B-Base for multimodal conversational tasks. It accepts image and text inputs for visual reasoning, document understanding, and grounded question answering, and is released under the Apache 2.0 license.

Key differences

  • Scale and vendor: gemma-4-31B-it is a 31-billion-parameter model from Google DeepMind, while Qwen3.5-9B is a 9-billion-parameter model from Alibaba Cloud, so the Gemma model needs substantially more memory and compute to serve.
  • See the individual model pages for architecture details and use cases.

Common ground

  • Both are instruction-tuned, Apache 2.0-licensed image-text-to-text models with open weights on Hugging Face.

Which should you pick?

If you can serve a 31-billion-parameter model, gemma-4-31B-it's larger scale favors stronger multimodal reasoning; Qwen3.5-9B is the lighter choice for constrained hardware and for tasks such as document understanding and grounded question answering. Both are Apache 2.0 licensed, so commercial deployment is possible either way.
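Because both models share the same pipeline type, trying each against the same prompt is mostly a matter of swapping the model ID. The sketch below builds a chat message in the image-text-to-text format used by Hugging Face transformers multimodal pipelines; the image URL, question, and the exact model IDs in the comments are illustrative assumptions, not values taken from the model pages.

```python
def build_chat(image_url: str, question: str) -> list[dict]:
    """Build one user turn in the image-text-to-text chat format:
    a content list mixing an image part and a text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},   # image input
                {"type": "text", "text": question},    # text input
            ],
        }
    ]

# Hypothetical example inputs (assumptions for illustration):
messages = build_chat(
    "https://example.com/invoice.png",
    "What is the total on this invoice?",
)

# Running either model is then the same call with a different ID.
# Shown as comments because it needs network access and enough
# VRAM for the chosen checkpoint; the IDs are assumed, so check
# the actual repository names on each model page:
#
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="google/gemma-4-31b-it")
# # ...or model="Qwen/Qwen3.5-9B"
# out = pipe(text=messages, max_new_tokens=128)
```

Keeping the message-building step separate from model loading makes a compute-budget comparison between the two checkpoints a one-line change.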