image feature extraction models

18 models · ranked by HuggingFace downloads

dinov2-small

DINOv2 ViT-S is the smallest variant in Meta's DINOv2 series, offering a 21M-parameter self-supervised vision transformer suitable for resource-constrained feature extraction applications.

2,996,173 ↓ · 66 ♡

vit-base-patch16-224-in21k

ViT-B/16 pretrained with supervised training on ImageNet-21K (about 14 million images across 21,843 classes) at 224x224 resolution. A standard vision transformer backbone widely used as a starting point for fine-tuning on downstream classification and feature extraction tasks.

1,999,590 ↓ · 411 ♡

vit_small_patch14_reg4_dinov2.lvd142m

vit_small_patch14_reg4_dinov2.lvd142m encodes images into fixed-dimension feature vectors for downstream visual similarity and classification tasks.

1,965,345 ↓ · 7 ♡
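
The "reg4" in the name indicates four register tokens. As a rough sketch of how a fixed-dimension vector might be derived from such a model's token output, the snippet below splits a token sequence and mean-pools the patch tokens. It assumes the order is class token, then registers, then patches; that order, the helper names `split_tokens` and `mean_pool`, and the toy values are all illustrative assumptions, so check the model card before relying on them.

```python
def split_tokens(tokens, num_registers=4):
    """Split a token sequence into (cls, registers, patches), assuming that order."""
    cls_token = tokens[0]
    registers = tokens[1:1 + num_registers]
    patches = tokens[1 + num_registers:]
    return cls_token, registers, patches


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy sequence: 1 CLS + 4 registers + 2 patch tokens, each 2-dimensional.
# These numbers are made up; real tokens have hundreds of dimensions.
tokens = [
    [1.0, 1.0],                                      # CLS
    [9.0, 9.0], [9.0, 9.0], [9.0, 9.0], [9.0, 9.0],  # registers
    [2.0, 4.0], [4.0, 8.0],                          # patches
]

cls_token, registers, patches = split_tokens(tokens)
print(cls_token)           # [1.0, 1.0]
print(mean_pool(patches))  # [3.0, 6.0]
```

Either the class token or the pooled patch tokens can serve as the image-level feature; register tokens are usually discarded at this stage.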

dinov2-large

DINOv2 ViT-L is Meta's large self-supervised vision transformer, offering stronger visual representations than the base variant at roughly 3.5x the parameter count (about 300M versus 86M). With a linear probe on frozen features, it approaches the ImageNet accuracy of supervised models.

1,435,382 ↓ · 113 ♡

nomic-embed-vision-v1.5

nomic-embed-vision-v1.5 is a vision encoder from Nomic AI that produces embeddings aligned with the nomic-embed-text embedding space. Because images and text land in the same vector space, the model supports cross-modal retrieval when paired with the matching text encoder. The image side is a ViT-style encoder; the BERT-style backbone belongs to the companion text model.

1,285,953 ↓ · 220 ♡
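
To illustrate what retrieval in a shared embedding space looks like, here is a minimal sketch that ranks images against a text query by cosine similarity. The three-dimensional vectors and file names are invented placeholders, not outputs of nomic-embed-vision-v1.5 or any real model, and `rank_images` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image ids sorted from most to least similar to the text query."""
    scores = {
        image_id: cosine_similarity(text_embedding, vec)
        for image_id, vec in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "dog.jpg": [0.7, 0.6, 0.1],
}
text_query = [1.0, 0.2, 0.0]  # pretend embedding of "a photo of a cat"

ranking = rank_images(text_query, image_embeddings)
print(ranking)  # ['cat.jpg', 'dog.jpg', 'car.jpg']
```

In practice the image vectors would come from the vision encoder and the query vector from the text encoder, and a vector index would replace the brute-force loop for large collections.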

dinov2-base

DINOv2 ViT-B is Meta's self-supervised vision transformer trained on 142M curated images using a combination of DINO and iBOT objectives. It produces strong visual features for dense prediction tasks without any labels during pretraining.

1,190,165 ↓ · 181 ♡

vit_small_patch14_dinov2.lvd142m

A ViT-Small backbone pre-trained with DINOv2 self-supervised learning on the curated LVD-142M dataset. DINOv2 models learn dense visual features without labels, producing representations that transfer well to segmentation, depth estimation, and retrieval tasks. The small patch14 variant offers a balance between spatial resolution and inference speed.

833,671 ↓ · 6 ♡
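
The trade-off between spatial resolution and speed comes down to how many patch tokens an image produces: smaller patches give a finer grid but more tokens, and self-attention cost grows roughly with the square of the token count. The arithmetic can be sketched as follows; the input sizes are example values chosen for illustration, and `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size, patch_size):
    """Return (rows, cols, num_patches) for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side, side * side


print(patch_grid(224, 14))  # (16, 16, 256)
print(patch_grid(224, 16))  # (14, 14, 196)
print(patch_grid(518, 14))  # (37, 37, 1369)
```

At the same 224-pixel input, a patch14 model yields 256 patch tokens against 196 for a patch16 model, which is where the extra spatial detail and the extra compute both come from.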

convnext_base.clip_laion2b

convnext_base.clip_laion2b is a ConvNeXt-Base image encoder taken from a CLIP model trained on the LAION-2B dataset and packaged for local or server inference. It encodes images into feature vectors rather than generating text. Check the source model card for specific capability and benchmark details.

574,370 ↓ · 0 ♡

dinov3-vitb16-pretrain-lvd1689m

dinov3-vitb16-pretrain-lvd1689m encodes images into fixed-dimension feature vectors for downstream visual similarity and classification tasks.

554,391 ↓ · 164 ♡

vit_base_patch14_dinov2.lvd142m

vit_base_patch14_dinov2.lvd142m extracts compact visual representations from images, enabling content-based search and fine-tuning on top of frozen features.

552,386 ↓ · 10 ♡

dinov3-vitl16-pretrain-lvd1689m

dinov3-vitl16-pretrain-lvd1689m produces dense visual embeddings from image inputs without a task-specific head. Used for image retrieval, clustering, and transfer learning.

505,370 ↓ · 341 ♡

dino-vitb16

dino-vitb16 produces dense visual embeddings from image inputs without a task-specific head. Used for image retrieval, clustering, and transfer learning.

463,218 ↓ · 112 ♡

vit_small_patch16_dinov3.lvd1689m

vit_small_patch16_dinov3.lvd1689m is a ViT-Small backbone with 16x16 patches, pretrained with DINOv3 self-supervised learning on the LVD-1689M dataset. It is an open-source image feature extraction model available on HuggingFace.

400,963 ↓ · 6 ♡

dinov3-vits16-pretrain-lvd1689m

dinov3-vits16-pretrain-lvd1689m is the ViT-Small/16 checkpoint of the DINOv3 family, pretrained on the LVD-1689M dataset. It is an open-source image feature extraction model available on HuggingFace.

392,097 ↓ · 115 ♡

rad-dino

RAD-DINO is Microsoft's radiology-focused DINOv2 model, trained on chest X-ray images to produce self-supervised visual features suited for medical imaging tasks. Because pretraining uses images alone, it needs no labels or paired reports; the frozen features can then be adapted to downstream tasks with comparatively little labelled data. Microsoft published this model alongside a paper demonstrating its utility for X-ray report generation and pathology classification.

382,232 ↓ · 76 ♡

dinov3-vitl16-pretrain-lvd1689m

dinov2-with-registers-base

vit_base_patch16_clip_224.openai