Zero-shot image classification models

26 models · ranked by HuggingFace downloads

clip-vit-base-patch32

OpenAI's CLIP model using a ViT-B/32 image encoder, the smaller of the two widely deployed CLIP variants. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant sacrifices accuracy versus ViT-L/14 for faster inference.

23,240,836 ↓ · 963 ♡

clip-vit-large-patch14

OpenAI's CLIP model using a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns image and text in a shared embedding space, enabling zero-shot image classification by comparing image embeddings against text label embeddings. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.

10,932,423 ↓ · 2,039 ♡
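
The comparison of image embeddings against text label embeddings described above can be sketched as follows. This is a minimal illustration, not the model's actual code: the three-dimensional embeddings are made-up toy values standing in for real encoder outputs, `zero_shot_probs` is a hypothetical helper name, and the `logit_scale` of 100 is an assumed temperature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label.

    logit_scale is an assumed temperature, not a value read from a checkpoint.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in label_embs.values()]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(label_embs, exps)}

# Toy embeddings (hypothetical); a real pipeline would get these from the encoders.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

probs = zero_shot_probs(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best, probs[best])
```

Because the labels are ordinary text prompts, changing the candidate classes only means changing the dictionary keys and re-encoding them; no retraining is involved.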

CLIP-ViT-B-32-laion2B-s34B-b79K

OpenCLIP ViT-B/32 trained by LAION on 2 billion image-text pairs from the LAION-2B dataset. It provides open-source CLIP features comparable to OpenAI's original ViT-B/32 while being trained on a fully public dataset.

3,232,028 ↓ · 141 ♡

fashion-clip

CLIP fine-tuned on a large fashion product dataset to improve image-text alignment for apparel, accessories, and retail imagery. Standard CLIP models underperform on fashion-specific queries due to distribution shift from generic web data.

2,917,993 ↓ · 283 ♡

PickScore_v1

PickScore_v1 is a CLIP-based human preference scorer trained on the Pick-a-Pic dataset of text-image pairs with human preference labels. Given a text prompt and a set of generated images, it predicts which image humans would prefer. It is typically used as a reward model in reinforcement-learning-from-human-feedback (RLHF) pipelines for image generation, not as a standalone image generator.

2,663,393 ↓ · 52 ♡

clip-vit-large-patch14-336

OpenAI CLIP ViT-L/14 at 336×336px input resolution, a higher-resolution variant of the standard ViT-L/14 CLIP model. The patch size stays at 14px, so the larger input is split into more patches (a 24×24 grid instead of 16×16 at 224px), which preserves finer visual detail and improves performance on classification tasks that depend on it. Otherwise shares the same contrastive training on 400M image-text pairs as the base ViT-L/14.

2,314,342 ↓ · 307 ♡
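
The effect of input resolution on sequence length is simple arithmetic, sketched here. The helper name `patch_tokens` is hypothetical, and the count excludes any class token the encoder may prepend.

```python
def patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side

tokens_224 = patch_tokens(224, 14)  # 16 x 16 grid
tokens_336 = patch_tokens(336, 14)  # 24 x 24 grid
print(tokens_224, tokens_336, tokens_336 / tokens_224)
```

The 336px model therefore processes 2.25 times as many patches per image, which is where both the extra detail and the extra compute come from.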

siglip-so400m-patch14-384

SigLIP (Sigmoid Loss for Language-Image Pre-training) SO/400M at 384px resolution is Google's vision-language model using a sigmoid binary cross-entropy loss instead of CLIP's softmax contrastive loss. It achieves stronger zero-shot classification than CLIP ViT-L at comparable scale.

1,544,045 ↓ · 677 ♡
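
The difference between the two losses can be illustrated on a small matrix of image-text logits, where entry `[i][j]` scores image `i` against text `j` and matched pairs sit on the diagonal. This is a sketch under assumptions: the logits are toy values taken as already scaled and biased, and the function names are hypothetical rather than taken from any library.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(logits):
    """Each image-text pair is an independent binary decision (+1 on the diagonal, -1 elsewhere)."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sign = 1.0 if i == j else -1.0
            total += log_sigmoid(sign * logits[i][j])
    return -total / n

def log_softmax_at(values, index):
    """log(softmax(values)[index]) using the log-sum-exp trick."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return values[index] - lse

def softmax_contrastive_loss(logits):
    """CLIP-style loss: softmax over each row and each column, averaged over both directions."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        row = logits[i]
        col = [logits[k][i] for k in range(n)]
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / (2 * n)

# Toy logits (hypothetical): matched pairs score high in `good`, low in `bad`.
good = [[5.0, -5.0], [-5.0, 5.0]]
bad = [[-5.0, 5.0], [5.0, -5.0]]

print(sigmoid_pairwise_loss(good), sigmoid_pairwise_loss(bad))
print(softmax_contrastive_loss(good), softmax_contrastive_loss(bad))
```

In the sigmoid version every term depends on a single logit, whereas each softmax term is normalized over a whole row or column of the batch.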

clip-vit-base-patch16

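OpenAI's CLIP model using a ViT-B/16 image encoder. Its 16px patches produce four times as many image patches as ViT-B/32 at the same 224px input, trading extra compute for finer visual detail. Like the other CLIP variants, it scores candidate text labels against an input image in a joint image-text embedding space, so it can handle label categories it was never explicitly trained on.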

1,381,419 ↓ · 163 ♡

siglip-base-patch16-224

SigLIP base/patch16 at 224px resolution is the lightweight tier of Google's sigmoid-loss vision-language pretraining model. It serves as a vision encoder for multimodal pipelines and as a standalone zero-shot classifier.

1,353,204 ↓ · 83 ♡

siglip2-so400m-patch16-naflex

SigLIP2 SO400M with NaFlex encoding, the larger 400M counterpart of siglip2-base-patch16-naflex. NaFlex processes images at their native aspect ratio with a variable number of patches instead of forcing a resize to a fixed square, which preserves spatial detail. It is among the strongest SigLIP2 variants for both CLIP-style tasks and use as a vision encoder in multimodal LLMs.

1,098,288 ↓ · 74 ♡

marqo-fashionSigLIP

A SigLIP-based model from Marqo tuned for fashion product imagery. It classifies images into arbitrary label sets without task-specific fine-tuning by comparing image embeddings to text descriptions of candidate categories.

1,009,672 ↓ · 81 ♡

siglip2-so400m-patch16-256

The SO400M variant of SigLIP2, Google's second-generation sigmoid-loss vision-language model, using a 16px patch size and 256px input resolution. The sigmoid loss (versus the softmax loss in CLIP) scores each image-text pair as an independent positive or negative, with no normalization across the full batch. Often used as the vision encoder in multimodal LLMs.

841,883 ↓ · 5 ♡

siglip2-base-patch16-naflex

SigLIP2-Base with NaFlex encoding, which preserves each image's native aspect ratio and adjusts the length of the patch sequence rather than resizing to a fixed square size. This improves accuracy on images where spatial details matter. The base variant offers a smaller memory footprint than the so400m variant.

801,020 ↓ · 33 ♡
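
The idea of fitting a variable patch grid to an image's aspect ratio can be sketched as below. This is a simplified illustration and not the actual NaFlex preprocessing: `naflex_like_grid` is a hypothetical helper, and the `max_patches` budget of 256 is an assumed value.

```python
import math

def naflex_like_grid(height, width, patch_size=16, max_patches=256):
    """Pick a (rows, cols) patch grid that roughly keeps the aspect ratio
    while staying within a patch budget. Simplified sketch only."""
    # Scale at which the image would contain exactly max_patches patches.
    scale = math.sqrt(max_patches * patch_size * patch_size / (height * width))
    # Rounding down keeps rows * cols within the budget.
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, min(math.floor(width * scale / patch_size), max_patches // rows))
    return rows, cols

# A 480 x 640 (height x width) landscape image keeps a wide grid
# instead of being squashed into a square one.
rows, cols = naflex_like_grid(480, 640)
print(rows, cols, rows * cols)
```

A fixed-resolution model with the same budget would use a 16×16 grid for every input, distorting any image that is not square.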

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

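Microsoft's BiomedCLIP pairs a PubMedBERT text encoder (256-token context) with a ViT-B/16 image encoder at 224px, trained contrastively on biomedical figure-caption pairs. It uses the resulting joint image-text embedding space to score unseen label categories against input images.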

790,786 ↓ · 410 ♡

siglip2-so400m-patch14-384

SigLIP2 SO400M with a 14px patch size and a fixed 384px input resolution. It performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.

657,655 ↓ · 90 ♡

CLIP-ViT-L-14-laion2B-s32B-b82K

OpenCLIP ViT-L/14 trained on LAION-2B with 32B samples seen during training. It classifies images into arbitrary label sets without task-specific fine-tuning by comparing image embeddings to text descriptions of candidate categories.

614,918 ↓ · 64 ♡

siglip2-giant-opt-patch16-384

siglip2-giant-opt-patch16-384 is Google's SigLIP 2 giant variant, a vision-language encoder with a 16px patch size and 384px input resolution. SigLIP 2 keeps the sigmoid loss that the original SigLIP introduced in place of softmax for cross-modal alignment, and adds further training objectives that improve zero-shot classification accuracy over the original SigLIP. The 'opt' variant uses optimized training recipes and targets state-of-the-art zero-shot classification quality.

584,398 ↓ · 42 ♡

CLIP-convnext_base_w-laion2B-s13B-b82K-augreg

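OpenCLIP model with a wide ConvNeXt-Base image encoder in place of a ViT, trained on LAION-2B with 13B samples seen and additional augmentation and regularization (the "augreg" suffix). It classifies images into arbitrary label sets without task-specific fine-tuning by comparing image embeddings to text descriptions of candidate categories.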

564,440 ↓ · 8 ♡

CLIP-ViT-H-14-laion2B-s32B-b79K

OpenCLIP ViT-H/14 trained on LAION-2B with 32B samples seen during training, a larger image encoder than the ViT-L/14 models. It classifies images into arbitrary label sets without task-specific fine-tuning by comparing image embeddings to text descriptions of candidate categories.

395,714 ↓ · 462 ♡

CLIP-ViT-B-16-laion2B-s34B-b88K

OpenCLIP ViT-B/16 trained on LAION-2B with 34B samples seen during training. The ViT-B/16 architecture processes 16x16 patches at 224px resolution, offering better feature quality than ViT-B/32 at moderate additional cost.

384,315 ↓ · 39 ♡

siglip2-base-patch16-224

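SigLIP2 base model with a 16px patch size and 224px input resolution, the smallest fixed-resolution configuration among the SigLIP2 entries listed here. It performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.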

368,379 ↓ · 108 ♡

PE-Core-L14-336

PE-Core-L14-336 is the large Perception Encoder Core model from Meta, with a ViT-L/14 image encoder at 336px input resolution. It is an open-source zero-shot image classification model available on Hugging Face.

316,732 ↓ · 52 ♡

vit_base_patch16_plus_clip_240.laion400m_e31

vit_base_patch16_plus_clip_240.laion400m_e31 is an OpenCLIP ViT-B/16-plus model at 240px input resolution, trained on LAION-400M for 31 epochs. It is an open-source zero-shot image classification model available on Hugging Face.

314,216 ↓ · 1 ♡

siglip2-base-patch16-512

SigLIP2 base model with a 16px patch size and 512px input resolution, the highest-resolution base variant in this list. The larger input yields a longer patch sequence than the 224px base model, at correspondingly higher compute cost.

294,208 ↓ · 42 ♡

one-align

One-Align is a quality assessment model from the Q-Future group, trained to score perceptual quality and alignment with human aesthetic preferences. A single model covers both image quality assessment (IQA) and video quality assessment (VQA).

267,437 ↓ · 43 ♡

TinyCLIP-ViT-8M-16-Text-3M-YFCC15M

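TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is a compact CLIP model from Microsoft's TinyCLIP family, with an 8M-parameter ViT image encoder (16px patches) and a 3M-parameter text encoder, trained on YFCC15M. It targets zero-shot image classification where model size and inference cost are tightly constrained.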

232,353 ↓ · 12 ♡