image classification models

38 models · ranked by HuggingFace downloads

mobilenetv3_small_100.lamb_in1k

MobileNetV3 small model at 100% width multiplier, trained on ImageNet-1k using the LAMB optimizer via the timm library. At under 3M parameters, it targets image classification on mobile and edge hardware where latency and memory are primary constraints. Part of timm's standardized pretrained model zoo with consistent preprocessing and inference APIs.

14,369,555 ↓ · 78 ♡

nsfw_image_detection

Vision Transformer (ViT) fine-tuned for binary NSFW vs. safe image classification. Provides a single classifier for flagging potentially unsafe image content without category-level labeling. Built on ViT-base architecture and fine-tuned on a curated dataset of safe and unsafe images.

8,598,673 ↓ · 1,103 ♡

vit-base-patch16-224

Google's ViT-Base (Vision Transformer base model) with 16×16 pixel patch size trained at 224px resolution on ImageNet-21k and fine-tuned on ImageNet-1k. The paper introducing ViTs demonstrated that pure transformer architectures without convolutional inductive bias can match CNNs on image classification when trained on sufficient data. Widely used as a starting backbone for image classification fine-tuning.

5,553,756 ↓ · 979 ♡
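
To make the "patch16" and "224" parts of the vit-base-patch16-224 name concrete, the sketch below works out how many tokens the encoder receives. The function name is illustrative, and the single extra token assumes the usual [CLS] token of the original ViT design.

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Tokens fed to the encoder: one per patch plus extra tokens such as [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + extra_tokens


# 224 / 16 = 14 patches per side, so 14 * 14 = 196 patches plus one [CLS] token.
tokens_224 = vit_token_count(224, 16)

# Each RGB patch flattens to 16 * 16 * 3 raw values before the linear projection.
values_per_patch = 16 * 16 * 3

print(tokens_224, values_per_patch)  # 197 768
```

The same arithmetic shows why higher-resolution variants cost more: at 384px the sequence grows to 24 * 24 + 1 = 577 tokens.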

fairface_age_image_detection

A ViT-base model fine-tuned on the FairFace dataset for age bracket classification from face images. It assigns each face image to one of nine age groups (0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70+). Built on google/vit-base-patch16-224-in21k and released under the Apache 2.0 license.

4,886,432 ↓ · 74 ♡
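
As a rough illustration of how the nine age groups listed for fairface_age_image_detection could be read off a classifier's output, here is a minimal sketch. It assumes the output scores are ordered the same way as the bracket list above; that ordering is an assumption, and the checkpoint's own label mapping should be treated as authoritative. `predict_bracket` is a hypothetical helper, not part of any library.

```python
AGE_BRACKETS = ["0-2", "3-9", "10-19", "20-29", "30-39",
                "40-49", "50-59", "60-69", "70+"]


def predict_bracket(scores):
    """Return the age bracket with the highest score (argmax over nine classes)."""
    if len(scores) != len(AGE_BRACKETS):
        raise ValueError(f"expected {len(AGE_BRACKETS)} scores, got {len(scores)}")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return AGE_BRACKETS[best_index]


# Made-up scores for one face image; index 3 is the largest.
example_scores = [0.1, 0.3, 1.2, 4.5, 2.0, 0.7, 0.2, 0.1, 0.0]
print(predict_bracket(example_scores))  # 20-29
```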

mobilevit-small

MobileViT-S is Apple's hybrid CNN-transformer vision model designed for mobile deployment. It interleaves depthwise convolutions with lightweight self-attention blocks to achieve stronger global context modeling than pure CNN architectures at comparable parameter counts.

3,424,545 ↓ · 91 ♡

resnet50.a1_in1k

ResNet-50 from the timm library, trained on ImageNet-1k with the A1 recipe from the "ResNet Strikes Back" paper. It outputs class scores over the 1,000 ImageNet-1k classes for each input image.

3,253,366 ↓ · 42 ♡

convnextv2_nano.fcmae_ft_in22k_in1k

ConvNeXtV2-Nano pretrained with FCMAE (fully convolutional masked autoencoder), then fine-tuned on ImageNet-22K and ImageNet-1K. A small (roughly 15M parameter) pure-CNN model that is competitive with older ViT-Small models.

2,503,315 ↓ · 4 ♡

resnet18.a1_in1k

ResNet-18 from the timm library, trained on ImageNet-1k with the A1 recipe from the "ResNet Strikes Back" paper. It outputs class scores over the 1,000 ImageNet-1k classes for each input image.

1,779,186 ↓ · 14 ♡

gender-classification

gender-classification is an image classifier that predicts a gender label for an input image. It outputs class probabilities for each input.

1,440,626 ↓ · 60 ♡

rorshark-vit-base

rorshark-vit-base maps input images to class labels. Built on a ViT vision architecture and fine-tuned on labeled image datasets.

1,300,638 ↓ · 3 ♡

tf_efficientnetv2_s.in21k_ft_in1k

EfficientNetV2-S in timm, with weights ported from the original TensorFlow release (the tf_ prefix). Pretrained on ImageNet-21k and fine-tuned on ImageNet-1k.

1,041,345 ↓ · 2 ♡

vit-base-nsfw-detector

vit-base-nsfw-detector is a ViT-Base image classifier for flagging NSFW content. It outputs class probabilities for each input image.

1,009,538 ↓ · 79 ♡

repvgg_a0.rvgg_in1k

repvgg_a0.rvgg_in1k is a RepVGG-A0 image classification model from the timm library, trained on ImageNet-1K. RepVGG trains with a multi-branch block (3x3 conv, 1x1 conv, and identity shortcut) that is re-parameterized for inference into a single 3x3 conv layer per block, eliminating branch overhead. The A0 variant is the smallest in the RepVGG family, prioritizing speed over top-1 accuracy.

991,388 ↓ · 1 ♡
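
The re-parameterization idea behind repvgg_a0.rvgg_in1k can be sketched in plain Python. This is a simplified illustration, not timm's implementation: it omits the BatchNorm layers and biases that the real blocks fold into the kernels first, and all function names are hypothetical. It merges a 3x3 branch, a 1x1 branch, and an identity branch into one 3x3 kernel, then checks that the fused convolution reproduces the multi-branch output.

```python
import random


def conv3x3(x, w):
    """Naive 3x3 convolution, stride 1, zero padding 1.
    x: [C_in][H][W], w: [C_out][C_in][3][3]."""
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for oc in range(len(w)):
        plane = [[0.0] * wd for _ in range(h)]
        for ic in range(len(x)):
            for i in range(h):
                for j in range(wd):
                    for di in range(3):
                        for dj in range(3):
                            ii, jj = i + di - 1, j + dj - 1
                            if 0 <= ii < h and 0 <= jj < wd:
                                plane[i][j] += w[oc][ic][di][dj] * x[ic][ii][jj]
        out.append(plane)
    return out


def multi_branch(x, w3, w1):
    """Training-time form: 3x3 conv + 1x1 conv + identity shortcut."""
    c, h, wd = len(x), len(x[0]), len(x[0][0])
    y3 = conv3x3(x, w3)
    return [[[y3[o][i][j]
              + sum(w1[o][k] * x[k][i][j] for k in range(c))  # 1x1 branch
              + x[o][i][j]                                      # identity branch
              for j in range(wd)] for i in range(h)] for o in range(c)]


def fuse_branches(w3, w1):
    """Inference-time form: fold all three branches into one 3x3 kernel."""
    c = len(w3)
    fused = [[[row[:] for row in w3[o][i]] for i in range(c)] for o in range(c)]
    for o in range(c):
        for i in range(c):
            fused[o][i][1][1] += w1[o][i]   # a 1x1 kernel is a 3x3 kernel's centre tap
            if o == i:
                fused[o][i][1][1] += 1.0    # identity is a centre tap of 1 on the same channel
    return fused


random.seed(0)
C, H, W = 2, 4, 4
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w3 = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
       for _ in range(C)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C)]

reference = multi_branch(x, w3, w1)
fused_out = conv3x3(x, fuse_branches(w3, w1))
max_diff = max(abs(a - b)
               for pa, pb in zip(reference, fused_out)
               for ra, rb in zip(pa, pb)
               for a, b in zip(ra, rb))
print(max_diff < 1e-9)  # True
```

Because convolution is linear, summing the kernels is equivalent to summing the branch outputs, which is why the fused model needs only one convolution per block at inference time.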

efficientnet_b0.ra_in1k

EfficientNet-B0 trained on ImageNet-1k in timm with a RandAugment recipe (the ra tag). It outputs class scores over the 1,000 ImageNet-1k classes for each input image.

973,260 ↓ · 9 ♡

CommunityForensics-DeepfakeDet-ViT

CommunityForensics-DeepfakeDet-ViT is a ViT-based image classifier for deepfake detection. It outputs class probabilities for each input image.

813,564 ↓ · 13 ♡

vit_small_patch16_224.augreg_in21k_ft_in1k

ViT-Small with 16×16 patches at 224px from timm, pretrained on ImageNet-21k with the AugReg recipe and fine-tuned on ImageNet-1k.

538,332 ↓ · 4 ♡

resnet34.a1_in1k

ResNet-34 from the timm library, trained on ImageNet-1k with the A1 recipe from the "ResNet Strikes Back" paper. It outputs class scores over the 1,000 ImageNet-1k classes for each input image.

520,481 ↓ · 1 ♡

gender_class

gender_class is an image classifier that predicts a gender label for an input image. It outputs class probabilities for each input.

485,575 ↓ · 1 ♡

nsfw_image_detector

Freepik's NSFW image classifier built on a timm-wrapped backbone for binary or multi-class content safety detection. MIT-licensed for integration into content moderation pipelines. Trained by Freepik, a major stock media platform, likely on production-scale labeled data.

451,696 ↓ · 58 ♡

vit_base_patch16_224.augreg_in21k

ViT-Base with 16×16 patches, pre-trained on ImageNet-21k using AugReg (augmentation and regularisation), which yields a strong transfer learning backbone. This checkpoint has not been fine-tuned on ImageNet-1k, so it is typically fine-tuned downstream rather than used directly for ImageNet-1k classification. The AugReg recipe significantly improves transfer performance over vanilla ViT-B/16 pre-training.

442,613 ↓ · 11 ♡

swinv2-tiny-patch4-window16-256

Swin Transformer V2 Tiny with a 4px patch size and 16×16-patch attention windows for 256px input images. Swin V2 improves over V1 with log-spaced continuous position bias and cosine attention for better scale transfer. Apache-2.0 licensed and available via the standard Transformers image-classification pipeline.

408,407 ↓ · 13 ♡
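
The numbers in the swinv2-tiny-patch4-window16-256 name determine how the first stage is split into attention windows. The sketch below works through that arithmetic for the first stage only; the function name is illustrative, and counting query-key pairs is used here as a rough proxy for attention cost, not an exact FLOP count.

```python
def first_stage_layout(image_size: int, patch_size: int, window_size: int) -> dict:
    """Token grid and window partition for the first Swin stage."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    tokens_per_side = image_size // patch_size
    if tokens_per_side % window_size != 0:
        raise ValueError("token grid must be divisible by window_size")
    windows_per_side = tokens_per_side // window_size
    return {
        "tokens_per_side": tokens_per_side,
        "num_windows": windows_per_side ** 2,
        "tokens_per_window": window_size ** 2,
    }


layout = first_stage_layout(image_size=256, patch_size=4, window_size=16)
print(layout)  # {'tokens_per_side': 64, 'num_windows': 16, 'tokens_per_window': 256}

# Query-key pairs if every token attended to every other token...
total_tokens = layout["tokens_per_side"] ** 2
global_pairs = total_tokens ** 2
# ...versus attention restricted to each window.
windowed_pairs = layout["num_windows"] * layout["tokens_per_window"] ** 2
print(global_pairs // windowed_pairs)  # 16
```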

vit-base-violence-detection

A ViT-Base image classifier fine-tuned to detect violent visual content. It produces binary or multi-class predictions distinguishing violent from non-violent imagery, intended for content moderation pipelines. The fine-tune uses a curated violence detection dataset atop the standard ViT-base-patch16-224 checkpoint.

397,759 ↓ · 10 ♡

deit-base-distilled-patch16-384

DeiT-Base Distilled (patch16, 384px) is Facebook's Data-efficient Image Transformer trained with knowledge distillation from a RegNet teacher. The 384px input resolution variant produces finer-grained features than the 224px version, improving accuracy on tasks requiring spatial detail. Knowledge distillation allows DeiT to match CNN performance without large-scale extra data.

395,475 ↓ · 8 ♡

convnext_tiny.in12k_ft_in1k

ConvNeXt-Tiny pre-trained on ImageNet-12k and fine-tuned on ImageNet-1k, offered via timm. This training recipe — large-scale pre-training followed by supervised fine-tuning — significantly boosts classification accuracy compared to ImageNet-1k-only training. ConvNeXt-Tiny brings modern training techniques to a traditional convolutional architecture, making it highly deployable with standard CNN inference stacks.

390,776 ↓ · 5 ♡

mobilenetv3_large_100.ra_in1k

MobileNetV3-Large trained on ImageNet-1k with RandAugment augmentation, packaged in the timm model zoo. A practical image classification backbone balancing accuracy and inference speed, commonly used as a feature extractor in mobile computer vision pipelines.

374,590 ↓ · 39 ♡

edgenext_small.usi_in1k

EdgeNeXt-Small is a lightweight CNN-transformer hybrid architecture optimized for mobile and edge inference, trained on ImageNet-1K with the USI scheme (from "Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results"), which relies on knowledge distillation. MIT-licensed and available via timm's model registry.

372,378 ↓ · 6 ♡

vit_base_patch16_224.augreg2_in21k_ft_in1k

ViT-Base with 16×16 patches at 224px from timm, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k. The augreg2 tag marks a re-fine-tune with additional augmentation and regularization relative to the original AugReg checkpoints.

363,194 ↓ · 13 ♡

wide_resnet50_2.racm_in1k

Wide ResNet-50-2, a ResNet-50 variant whose bottleneck layers are twice as wide, trained on ImageNet-1k in timm (racm recipe tag).

344,696 ↓ · 2 ♡

resnet-50

resnet-50 is a ResNet-50 convolutional network for image classification, distributed through the HuggingFace model hub.

339,305 ↓ · 495 ♡

convnext_femto.d1_in1k

ConvNeXt-Femto is one of the smallest variants in the ConvNeXt family, trained on ImageNet-1K with a timm recipe tagged d1. At femto scale it is designed for extreme compute efficiency at some cost in accuracy. Apache-2.0 licensed and available via timm's standard model registry.

338,683 ↓ · 1 ♡

vit_tiny_r_s16_p8_224.augreg_in21k

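A tiny hybrid Vision Transformer: the r in the name marks a ResNet-style convolutional stem ahead of a ViT-Tiny encoder, while s16 and p8 encode its stride and patch-size settings. Pretrained on ImageNet-21k with the AugReg recipe. Designed for scenarios requiring a minimal ViT-style model with attention-based feature extraction at low compute cost.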

333,719 ↓ · 0 ♡

resnet18.fb_swsl_ig1b_ft_in1k

ResNet-18 from Facebook's semi-weakly supervised learning (SWSL) work, pretrained on roughly one billion Instagram images (IG-1B) and fine-tuned on ImageNet-1k. It outputs class scores over the 1,000 ImageNet-1k classes for each input image.

332,415 ↓ · 0 ♡

resnet-18

resnet-18 maps input images to class labels. Built on the ResNet-18 convolutional residual architecture and trained on labeled image datasets.

315,683 ↓ · 67 ♡

vit_tiny_patch16_224.augreg_in21k_ft_in1k

ViT-Tiny with 16×16 patches at 224px from timm, pretrained on ImageNet-21k with the AugReg recipe and fine-tuned on ImageNet-1k.

311,652 ↓ · 3 ♡

nsfw-image-detection-384

A fine-tuned image classifier from Marqo that flags adult or explicit content in images at 384px input resolution. It outputs probability scores for NSFW versus safe content and is commonly used as a pre-filter in content moderation pipelines before storing or serving user uploads. Apache 2.0 licensed for commercial deployment.

308,960 ↓ · 53 ♡
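
The pre-filter use described for nsfw-image-detection-384 usually comes down to converting raw class scores to probabilities and comparing one of them to a threshold. A minimal sketch follows; the two-class layout, the `nsfw_index` position, and the 0.5 default threshold are all assumptions chosen for illustration, and `should_block` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_block(logits, nsfw_index=0, threshold=0.5):
    """Flag an upload when the NSFW probability reaches the threshold."""
    return softmax(logits)[nsfw_index] >= threshold


# Made-up scores in the assumed order [nsfw, safe].
print(should_block([2.0, 0.0]))                 # True  (NSFW probability is about 0.88)
print(should_block([0.0, 2.0]))                 # False
print(should_block([2.0, 0.0], threshold=0.9))  # False (stricter threshold)
```

Raising the threshold trades fewer false positives for more missed detections, so the value is typically tuned on a labeled sample of the platform's own uploads.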
