OpenAI's CLIP model using a ViT-B/32 image encoder, the smaller of the two widely deployed CLIP variants. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant sacrifices accuracy versus ViT-L/14 for faster inference.
23,240,836 ↓ · 963 ♡
OpenAI's CLIP model using a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns image and text in a shared embedding space, enabling zero-shot image classification by comparing image embeddings against text label embeddings. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.
10,932,423 ↓ · 2,039 ♡
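The zero-shot procedure described for the CLIP entries above (comparing an image embedding against text label embeddings) can be sketched in a few lines. This is a minimal illustration, not the models' actual inference code: the three-dimensional vectors are made up, and `zero_shot_probs` and the `logit_scale` default are hypothetical choices for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one probability per label.

    logit_scale is an assumed temperature; real models learn their own value.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in label_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings invented for illustration, not real CLIP outputs.
labels = ["a photo of a cat", "a photo of a dog"]
label_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, label_embs)
best_label = labels[probs.index(max(probs))]
```

In practice the embeddings would come from the model's image and text encoders; only the comparison step is shown here.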
OpenCLIP ViT-B/32 trained by LAION on 2 billion image-text pairs from the LAION-2B dataset. It provides open-source CLIP features comparable to OpenAI's original ViT-B/32 while being trained on a fully public dataset.
3,232,028 ↓ · 141 ♡
CLIP fine-tuned on a large fashion product dataset to improve image-text alignment for apparel, accessories, and retail imagery. Standard CLIP models underperform on fashion-specific queries due to distribution shift from generic web data.
2,917,993 ↓ · 283 ♡
PickScore_v1 is a CLIP-based human preference scorer trained on the Pick-a-Pic dataset of text-image pairs with human preference labels. Given a text prompt and a set of generated images, it predicts which image humans would prefer. It is typically used as a reward model in reinforcement-learning-from-human-feedback (RLHF) pipelines for image generation, not as a standalone image generator.
2,663,393 ↓ · 52 ♡
OpenAI CLIP ViT-L/14 at 336×336px input resolution, a higher-resolution variant of the standard ViT-L/14 CLIP model. The patch size stays at 14px, so the larger input is split into more patches, which preserves more fine-grained visual detail and improves performance on classification tasks that depend on it. Otherwise shares the same contrastive training on 400M image-text pairs as the base ViT-L/14.
2,314,342 ↓ · 307 ♡
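The relationship between input resolution, patch size, and the number of patch tokens in the ViT entries above is simple arithmetic. The helper below is a hypothetical illustration (the name `num_patches` is not from any library) and ignores extra tokens such as a class token.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side

vit_b32_224 = num_patches(224, 32)  # ViT-B/32 at 224px
vit_b16_224 = num_patches(224, 16)  # ViT-B/16 at 224px
vit_l14_224 = num_patches(224, 14)  # ViT-L/14 at 224px
vit_l14_336 = num_patches(336, 14)  # ViT-L/14 at 336px
```

Smaller patches or larger inputs both lengthen the token sequence, which is where the extra accuracy and the extra compute cost come from.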
SigLIP (Sigmoid Loss for Language-Image Pre-training) SO400M at 384px resolution is a Google vision-language model that uses a sigmoid binary cross-entropy loss instead of CLIP's softmax contrastive loss. It achieves stronger zero-shot classification than CLIP ViT-L at comparable scale.
1,544,045 ↓ · 677 ♡
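The difference between the two loss formulations mentioned in the SigLIP entry above can be sketched on a small similarity matrix, where entry `[i][j]` is the similarity between image `i` and text `j` and matching pairs sit on the diagonal. This is a simplified sketch, not either model's training code; the function names and the `scale` and `bias` values are illustrative assumptions.

```python
import math

def _log_sum_exp(xs):
    peak = max(xs)
    return peak + math.log(sum(math.exp(x - peak) for x in xs))

def softmax_loss(sim, scale=10.0):
    """CLIP-style loss: each row and column is normalized over the batch."""
    n = len(sim)
    logits = [[scale * s for s in row] for row in sim]
    cols = [list(c) for c in zip(*logits)]
    image_to_text = sum(_log_sum_exp(logits[i]) - logits[i][i] for i in range(n)) / n
    text_to_image = sum(_log_sum_exp(cols[i]) - cols[i][i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def sigmoid_loss(sim, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sign = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            x = sign * (scale * sim[i][j] + bias)
            # -log(sigmoid(x)) computed in a numerically stable way
            total += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]     # matching pairs are most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs are most similar
```

Both losses are lower for `aligned` than for `misaligned`; the sigmoid version reaches that result without any normalization across the batch.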
clip-vit-base-patch16 uses a joint image-text embedding space to score unseen label categories against input images.
1,381,419 ↓ · 163 ♡
SigLIP base/patch16 at 224px resolution is the lightweight tier of Google's sigmoid-loss vision-language pretraining model. It serves as a vision encoder for multimodal pipelines and as a standalone zero-shot classifier.
1,353,204 ↓ · 83 ♡
SigLIP2 SO400M with NaFlex (Native Resolution Flexible) encoding, the larger 400M counterpart of siglip2-base-patch16-naflex. NaFlex processes images at native resolution without forced resizing, preserving spatial detail. This is the strongest NaFlex SigLIP2 variant for both CLIP-style tasks and as a vision encoder in multimodal LLMs.
1,098,288 ↓ · 74 ♡
marqo-fashionSigLIP classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
1,009,672 ↓ · 81 ♡
SigLIP2 is Google's second-generation sigmoid-loss vision-language contrastive model at 400M parameters, using a 16px patch size and 256px input resolution. The sigmoid loss formulation (vs softmax in CLIP) scores each image-text pair independently as a binary match, without normalizing over all negatives in the batch. Often used as the vision encoder in multimodal LLMs.
841,883 ↓ · 5 ♡
SigLIP2-Base with NaFlex (Native Resolution Flexible) encoding, which processes images at their native resolution by dynamically adjusting patch sequences rather than resizing to a fixed size. This improves accuracy on images where spatial details matter. The base variant offers a smaller memory footprint than the SO400M variant.
801,020 ↓ · 33 ♡
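The idea of adjusting the patch sequence to the image, as described in the NaFlex entries above, can be illustrated with a hypothetical helper that picks a patch grid under a sequence-length budget. This is not the SigLIP2 preprocessing code: `patch_grid`, its defaults, and the shrinking rule are all assumptions made for the sketch.

```python
import math

def patch_grid(height, width, patch_size=16, max_patches=256):
    """Return (rows, cols) of a patch grid that roughly keeps the aspect ratio.

    If the native grid exceeds max_patches, both sides are shrunk by the same
    factor so that rows * cols stays within the budget.
    """
    rows = max(1, math.ceil(height / patch_size))
    cols = max(1, math.ceil(width / patch_size))
    if rows * cols > max_patches:
        shrink = math.sqrt(max_patches / (rows * cols))
        rows = min(max(1, math.floor(rows * shrink)), max_patches)
        # Clamp cols so the product never exceeds the budget.
        cols = max(1, min(math.floor(cols * shrink), max_patches // rows))
    return rows, cols

square = patch_grid(224, 224)     # fits the budget as-is
wide = patch_grid(512, 1024)      # shrunk, still twice as wide as tall
extreme = patch_grid(16, 16000)   # very elongated image
```

A fixed-size resize would map all three images to the same square grid; a variable grid keeps the wide image wide.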
BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 uses a joint image-text embedding space to score unseen label categories against input images.
790,786 ↓ · 410 ♡
siglip2-so400m-patch14-384 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.
657,655 ↓ · 90 ♡
CLIP-ViT-L-14-laion2B-s32B-b82K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
614,918 ↓ · 64 ♡
siglip2-giant-opt-patch16-384 is Google's SigLIP 2 giant variant, a contrastively trained vision-language encoder with 384px input resolution and 16px patches. SigLIP 2 keeps the sigmoid loss of the original SigLIP (used in place of CLIP's softmax loss for cross-modal alignment) and adds further training objectives, improving zero-shot classification accuracy over the original SigLIP. The 'opt' variant uses optimized training recipes and targets state-of-the-art zero-shot classification quality.
584,398 ↓ · 42 ♡
CLIP-convnext_base_w-laion2B-s13B-b82K-augreg classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
564,440 ↓ · 8 ♡
CLIP-ViT-H-14-laion2B-s32B-b79K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.
395,714 ↓ · 462 ♡
OpenCLIP ViT-B/16 trained on LAION-2B with 34B samples seen during training. The ViT-B/16 architecture processes 16x16 patches at 224px resolution, offering better feature quality than ViT-B/32 at moderate additional cost.
384,315 ↓ · 39 ♡
siglip2-base-patch16-224 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.
368,379 ↓ · 108 ♡
PE-Core-L14-336 is an open-source zero-shot image classification model available on Hugging Face. Its name indicates an L/14 image encoder at 336px input resolution. Details are sourced from the public model registry.
316,732 ↓ · 52 ♡
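vit_base_patch16_plus_clip_240.laion400m_e31 is an open-source zero-shot image classification model available on Hugging Face. Its name indicates a CLIP ViT-Base "plus" image encoder with 16px patches at 240px input resolution, trained on LAION-400M. Details are sourced from the public model registry.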
314,216 ↓ · 1 ♡
siglip2-base-patch16-512 is an open-source zero-shot image classification model available on Hugging Face. Its name indicates the SigLIP2 base model with 16px patches at 512px input resolution. Details are sourced from the public model registry.
294,208 ↓ · 42 ♡
One-Align is a unified image and video quality assessment model from the Q-Future group, trained to score perceptual quality and alignment with human aesthetic preferences. It unifies image quality assessment (IQA) and video quality assessment (VQA) into a single model.
267,437 ↓ · 43 ♡
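TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is an open-source zero-shot image classification model available on Hugging Face. Its name indicates a compact CLIP variant with an 8M-parameter ViT image encoder using 16px patches and a 3M-parameter text encoder, trained on YFCC15M. Details are sourced from the public model registry.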
232,353 ↓ · 12 ♡