clip-vit-large-patch14 vs clip-vit-large-patch14-336

clip-vit-large-patch14 and clip-vit-large-patch14-336 are both OpenAI CLIP zero-shot image classification models; the -336 variant differs only in its input resolution. Specifics for each follow below.

clip-vit-large-patch14

Pipeline: zero-shot-image-classification
Downloads: 25,187,308
Likes: 2,000

OpenAI's CLIP model with a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns images and text in a shared embedding space, enabling zero-shot classification: an image's embedding is compared against the embeddings of candidate text labels, and the closest label wins. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.
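A minimal sketch of that comparison step using the Hugging Face transformers library; the image URL and candidate labels are illustrative placeholders, not anything prescribed by the model.

```python
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Example image (a COCO validation photo); any image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores;
# softmax turns them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the labels are free-form text, swapping in a new class list requires no retraining; that is what makes the classification "zero-shot".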

clip-vit-large-patch14-336

Pipeline: zero-shot-image-classification
Downloads: 14,075,831
Likes: 304

OpenAI CLIP ViT-L/14 at 336×336px input resolution, a higher-resolution variant of the standard ViT-L/14 model, which uses 224×224 input. The patch size is unchanged at 14×14, so the higher resolution yields more image patches (24×24 = 576 vs. 16×16 = 256), preserving finer visual detail at the cost of roughly 2.25× more image-encoder compute. Otherwise it shares the same contrastive training on 400M image-text pairs as the base ViT-L/14.
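Since the two checkpoints share an architecture and training recipe, the -336 variant is a drop-in replacement; only the model name changes. A sketch using the transformers pipeline API, with a hypothetical local image path as the input:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-large-patch14-336",
)
result = classifier(
    "photo.jpg",  # hypothetical local path; a URL or PIL image also works
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
print(result)  # list of {"label": ..., "score": ...} dicts sorted by score
```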

Key differences

  • Input resolution: clip-vit-large-patch14 processes 224×224 images; the -336 variant processes 336×336, capturing finer-grained detail.
  • Compute: the -336 variant encodes roughly 2.25× more patches per image, so inference is slower and uses more memory.
  • Adoption: the base model is more widely used (25.2M downloads and 2,000 likes vs. 14.1M downloads and 304 likes).

Common ground

  • Both are OpenAI CLIP models with a ViT-L/14 image encoder, trained contrastively on the same 400M internet image-text pairs.
  • Both are openly available on HuggingFace and support zero-shot image classification with free-form text labels out of the box.

Which should you pick?

Pick clip-vit-large-patch14 for general-purpose zero-shot classification at lower compute cost. Pick clip-vit-large-patch14-336 when the task hinges on fine-grained visual detail (small objects, fine textures, dense scenes) and you can afford the extra inference cost. Since the two are interchangeable in code, a quick way to sanity-check the trade-off on your own hardware is to time both, as in the sketch below.
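A rough, single-run timing comparison under the assumption that both checkpoints run on the same machine; the image path is a hypothetical placeholder, and the first call for each model includes one-time loading and warm-up overhead.

```python
import time
from transformers import pipeline

labels = ["a photo of a cat", "a photo of a dog"]
image = "photo.jpg"  # hypothetical local path; any image works

for ckpt in ("openai/clip-vit-large-patch14", "openai/clip-vit-large-patch14-336"):
    clf = pipeline("zero-shot-image-classification", model=ckpt)
    start = time.perf_counter()
    result = clf(image, candidate_labels=labels)
    elapsed = time.perf_counter() - start
    print(f"{ckpt}: {elapsed:.2f}s, top label: {result[0]['label']}")
```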