AI Tools.

Updated daily from HuggingFace

Open-source AI models, compared at a glance.

1406 models · 45 pipelines · 2,465,758,828 total downloads tracked. Use cases, pros, cons, and alternatives for each.

Browse by pipeline

45 categories of AI models

text generation · 298 models · Top: Qwen3-0.6B
image text to text · 199 models · Top: gemma-4-26B-A4B-it
automatic speech recognition · 111 models · Top: speaker-diarization-3.1
sentence similarity · 88 models · Top: all-MiniLM-L6-v2
feature extraction · 81 models · Top: bge-small-en-v1.5
fill mask · 53 models · Top: bert-base-uncased
text classification · 47 models · Top: bge-reranker-v2-m3
image classification · 38 models · Top: mobilenetv3_small_100.lamb_in1k
time series forecasting · 31 models · Top: chronos-2
zero shot image classification · 26 models · Top: clip-vit-base-patch32
text ranking · 23 models · Top: ms-marco-MiniLM-L6-v2
translation · 22 models · Top: t5-small
any to any · 22 models · Top: gemma-4-E4B-it
text to image · 21 models · Top: stable-diffusion-v1-5
token classification · 20 models · Top: indonesian-roberta-base-posp-tagger
image feature extraction · 18 models · Top: dinov2-small
text to speech · 16 models · Top: Kokoro-82M
audio classification · 13 models · Top: clap-htsat-fused
image to text · 13 models · Top: GLM-OCR
object detection · 9 models · Top: table-transformer-detection
image segmentation · 9 models · Top: clipseg-rd64-refined
image to video · 7 models · Top: LTX-2.3
zero shot classification · 6 models · Top: bart-large-mnli
depth estimation · 6 models · Top: Depth-Anything-V2-Small-hf
question answering · 5 models · Top: electra_large_discriminator_squad2_512
image to image · 5 models · Top: Qwen-Image-Edit-2509
zero shot object detection · 4 models · Top: grounding-dino-base
mask generation · 4 models · Top: sam3
summarization · 4 models · Top: bart-large-cnn
audio to audio · 4 models · Top: bigvgan_v2_22khz_80band_256x
audio text to text · 4 models · Top: ultravox-v0_5-llama-3_2-1b
image to 3d · 3 models · Top: TRELLIS-image-large
video classification · 3 models · Top: videomae-small-finetuned-kinetics-xd-violence-binary
voice activity detection · 2 models · Top: segmentation-3.0
visual document retrieval · 2 models · Top: jina-embeddings-v4
keypoint detection · 1 model · Top: vitpose-plus-base
robotics · 1 model · Top: openvla-7b
text to audio · 1 model · Top: musicgen-medium
text to video · 1 model · Top: Sulphur-2-base
table question answering · 1 model · Top: tapex-base-finetuned-wikisql
other · 1 model · Top: KVzap-mlp-Qwen3-8B
tabular regression · 1 model · Top: mitra-regressor
tabular classification · 1 model · Top: mitra-classifier
visual question answering · 1 model · Top: blip-vqa-base
image text to image · 1 model · Top: Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Top by downloads

Most popular models across all pipelines

all-MiniLM-L6-v2

sentence-similarity

Distilled BERT model that encodes sentences into 384-dimensional vectors for measuring semantic similarity. Trained on over a billion sentence pairs spanning scientific papers, web QA, NLI datasets, and community forums. At 22M parameters and 6 transformer layers, it is fast enough for CPU inference while remaining competitive on standard sentence similarity benchmarks.

243,930,327 4,980
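
As a rough illustration of how sentence vectors like these are compared, the sketch below computes cosine similarity in plain Python. The three short vectors are made-up stand-ins rather than real model output (the model itself emits 384 dimensions), and `cosine_similarity` is a helper defined here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional "embeddings" (assumed values, for illustration only).
query = [0.9, 0.1, 0.0, 0.2]
paraphrase = [0.8, 0.2, 0.1, 0.3]
unrelated = [0.0, 0.9, 0.8, 0.0]

sim_close = cosine_similarity(query, paraphrase)
sim_far = cosine_similarity(query, unrelated)
print(f"paraphrase: {sim_close:.3f}  unrelated: {sim_far:.3f}")
```

With real embeddings the same comparison applies unchanged; only the vector length differs.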

ms-marco-MiniLM-L6-v2

text-ranking

Cross-encoder reranker trained on the MS MARCO passage retrieval dataset, designed to score query-document pairs jointly rather than encoding them independently. Distilled from a 12-layer cross-encoder into 6 layers to reduce latency while retaining re-ranking accuracy. Used as a second-stage ranker on top of fast first-stage retrieval (BM25 or bi-encoder).

78,976,309 267
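
The two-stage pattern described above can be sketched as follows. Both scoring functions are deliberately simple stand-ins invented for this example: `first_stage_score` plays the role of BM25 or a bi-encoder, and `joint_score` plays the role of the cross-encoder by looking at the query and document together. Neither reflects how the real model computes relevance.

```python
def first_stage_score(query, doc):
    """Cheap stand-in for first-stage retrieval: count shared terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def joint_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: it reads the pair jointly
    and rewards query word pairs that appear adjacent in the document."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)


def retrieve_then_rerank(query, corpus, first_k=3, final_k=1):
    # Stage 1: score every document cheaply and keep a short candidate list.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:first_k]
    # Stage 2: apply the more expensive joint scorer to the candidates only.
    return sorted(
        candidates, key=lambda doc: joint_score(query, doc), reverse=True
    )[:final_k]


corpus = [
    "packages can install a python runtime",
    "install python packages with pip",
    "how to install a ceiling fan",
    "python is a snake found in asia",
]
query = "install python packages"

first_stage_top = max(corpus, key=lambda doc: first_stage_score(query, doc))
reranked_top = retrieve_then_rerank(query, corpus)
print(first_stage_top)
print(reranked_top[0])
```

The first stage ties the top two documents on term overlap; the joint scorer separates them because it sees word order, which is the kind of signal independent encoding discards.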

bge-small-en-v1.5

feature-extraction

Small English dense embedding model from BAAI's BGE (BAAI General Embedding) series, producing 384-dimensional vectors and released under the MIT license. Optimized for MTEB retrieval benchmarks through a retrieval-focused training strategy, it achieves competitive scores relative to its parameter count. Suited for embedding workflows where throughput and cost matter more than peak accuracy.

60,148,419 493

bert-base-uncased

fill-mask

Google's original BERT base model in uncased form, pre-trained on BookCorpus and English Wikipedia via masked language modeling. Tokens are lowercased before processing, making it insensitive to capitalization. It remains a standard fine-tuning base for classification, NER, and extractive QA, though newer encoders outperform it on most benchmarks.

57,757,042 2,686

Multilingual sentence embedding model covering 50+ languages, built on a 12-layer distilled MiniLM architecture. Produces 384-dimensional vectors designed for semantic similarity and paraphrase detection across language boundaries. Trained on multilingual paraphrase data to align semantically equivalent sentences even when expressed in different languages.

51,516,901 1,278

ELECTRA base discriminator from Google, pre-trained using replaced token detection rather than masked language modeling. A small generator produces candidate replacements, and this model learns to identify which tokens were swapped. Because the task draws training signal from every token rather than only the masked ones, pre-training is more compute-efficient than BERT's. Intended as a fine-tuning base for classification and token-level tasks.

41,397,308 127
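
A minimal sketch of how replaced-token-detection targets can be derived, assuming the generator's output is already available as a token list. The sentences and the `rtd_labels` helper are invented for illustration; the real pre-training pipeline operates on subword IDs and samples replacements from a learned generator.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token was replaced, 0 if it is original.

    A position where the generator happened to reproduce the original
    token counts as original, since the two tokens are identical.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must be aligned token by token")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator swapped one token

labels = rtd_labels(original, corrupted)

# Every position carries a label, so every token contributes to the loss.
supervised_positions_rtd = len(labels)
# Under masked language modeling only the masked position would be predicted.
supervised_positions_mlm = sum(labels)
print(labels, supervised_positions_rtd, supervised_positions_mlm)
```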

all-mpnet-base-v2

sentence-similarity

Sentence embedding model based on the MPNet architecture, producing 768-dimensional vectors. Trained on over a billion sentence pairs from MS MARCO, NLI datasets, and community QA forums, it is a common choice among English embedding models when accuracy matters more than inference speed. The MPNet backbone is pre-trained with masked and permuted language modeling, which aims for stronger representations than masked language modeling alone.

34,593,691 1,311

bge-m3

sentence-similarity

BAAI's BGE-M3 embedding model supporting over 100 languages with a unified architecture capable of dense, sparse (lexical), and late-interaction (ColBERT-style) retrieval modes from a single checkpoint. Built on XLM-RoBERTa with large-scale multilingual training, it targets multi-lingual and cross-lingual retrieval where a single model must handle diverse language inputs.

31,091,007 3,131

Qwen3-0.6B

text-generation

Qwen3-0.6B is the 0.6-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen3 series, fine-tuned from the Qwen3-0.6B-Base for conversational and task-following use. It targets deployment in environments where even a 1B model is too large — edge hardware, mobile devices, or ultra-low-latency services. Apache 2.0 licensed.

27,358,157 1,351

clip-vit-base-patch32

zero-shot-image-classification

OpenAI's CLIP model using a ViT-B/32 image encoder, the smaller of the two widely deployed CLIP variants. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant sacrifices accuracy versus ViT-L/14 for faster inference.

23,240,836 963
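
The zero-shot step can be sketched as a softmax over scaled cosine similarities between one image vector and one text vector per candidate label. All vectors below are toy values, and the `scale` of 100 is an assumed stand-in for the model's learned temperature; the real encoders are not involved here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_vec, label_vecs, scale=100.0):
    """Map each label prompt to a probability for the given image vector."""
    img = normalize(image_vec)
    sims = [
        sum(a * b for a, b in zip(img, normalize(text_vec)))
        for text_vec in label_vecs.values()
    ]
    return dict(zip(label_vecs.keys(), softmax([scale * s for s in sims])))


# Toy text embeddings for three label prompts (assumed values).
label_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.1, 0.1, 0.9],
}
image_vec = [0.8, 0.3, 0.1]  # toy image embedding (assumed values)

probs = zero_shot_probs(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best)
```

No label-specific training happens anywhere in this flow; changing the candidate classes only means changing the text prompts.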

xlm-roberta-base

fill-mask

XLM-RoBERTa base from Facebook AI, pre-trained on 2.5TB of filtered CommonCrawl text across 100 languages using the RoBERTa training procedure. It enables cross-lingual transfer: a model fine-tuned on labeled English data can be applied to other languages without parallel annotations. A standard starting point for multilingual classification and token-level tasks.

20,744,002 852

nomic-embed-text-v1.5

sentence-similarity

Nomic Embed Text v1.5 is a matryoshka-capable English embedding model from Nomic AI, built on a custom nomic-BERT architecture trained with contrastive learning on large-scale text pairs. Matryoshka Representation Learning allows truncating embeddings to shorter dimensions (e.g. 64, 128, 256) without retraining, enabling flexible precision-cost tradeoffs. The model is transformers.js-compatible for browser-side inference.

18,375,459 852
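
A minimal sketch of Matryoshka-style truncation: keep the leading dimensions and rescale to unit length so cosine or dot-product comparisons stay meaningful. The 8-value vector is a made-up example, `truncate_embedding` is a helper written for this sketch, and a real pipeline may apply further normalization steps described in the model card.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` values of an embedding and re-normalize them."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector is all zeros")
    return [x / norm for x in head]


# Toy 8-dimensional embedding (assumed values, for illustration only).
full = [0.5, -0.3, 0.4, 0.1, -0.2, 0.05, 0.02, -0.01]

short = truncate_embedding(full, 4)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(full), "->", len(short))
```

Storing the shorter vector cuts index size proportionally, which is the precision-cost tradeoff the description refers to.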

Kokoro-82M

text-to-speech

Kokoro-82M is a compact 82-million-parameter text-to-speech model fine-tuned from StyleTTS2, targeting natural-sounding English speech synthesis at a size runnable on CPU or modest GPU. Released under Apache 2.0 with a HuggingFace DOI, it gained attention as a high-quality open TTS model at significantly smaller scale than most alternatives. It supports multiple English voice styles.

16,925,704 6,372

clap-htsat-fused

audio-classification

LAION's CLAP (Contrastive Language-Audio Pretraining) model using the HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder, paired with a text encoder to align audio and text in a shared embedding space. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval using natural language descriptions without task-specific labeled audio data.

16,636,514 106

Qwen3-4B

text-generation

Qwen3-4B is Alibaba's 4B parameter model from the Qwen3 series, which introduced a hybrid thinking mode allowing the model to switch between fast direct answering and extended chain-of-thought reasoning. It is a compact model capable of running on consumer hardware while outperforming many 7B predecessors on reasoning benchmarks. Apache 2.0 licensed.

16,075,125 640

bge-reranker-v2-m3

text-classification

BGE-Reranker-v2-M3 is BAAI's multilingual cross-encoder reranker built on XLM-RoBERTa, designed for re-ranking retrieved passages in multilingual RAG or search pipelines. It jointly encodes query-passage pairs to produce relevance scores, providing higher accuracy than bi-encoder similarity for the same candidate set. Apache 2.0 licensed with text-embeddings-inference support.

15,789,545 1,047

ColBERTv2 is a late-interaction retrieval model from Stanford that encodes queries and passages as per-token embeddings rather than a single vector, allowing MaxSim matching at retrieval time. This token-level interaction yields higher accuracy than bi-encoders on many retrieval benchmarks while remaining more efficient than cross-encoders. The model is MIT licensed and implemented in PyTorch with ONNX support.

15,023,380 362
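
The MaxSim operation mentioned above can be sketched in a few lines: for each query token, take its best match among the document tokens, then sum those maxima. The 2-dimensional token vectors are toy values chosen so the arithmetic is easy to follow; real per-token embeddings are much longer.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the maximum
    similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (assumed values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
doc_b = [[0.7, 0.7], [-1.0, 0.0]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because document token vectors can be computed ahead of time, only this cheap matching step runs per query, which is where the efficiency advantage over a cross-encoder comes from.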

chronos-2

time-series-forecasting

Chronos-2 is Amazon's second-generation pretrained foundation model for zero-shot time-series forecasting. Where the original Chronos framed forecasting as language modeling over quantized time-series tokens with a T5 encoder-decoder, Chronos-2 uses an encoder-only architecture and handles univariate, multivariate, and covariate-informed forecasting across diverse domains without per-dataset training. Released under Apache 2.0.

15,009,050 333

bge-large-en-v1.5

feature-extraction

BGE-Large-EN-v1.5 is BAAI's highest-capacity English embedding model in the v1.5 series, producing 1024-dimensional vectors. It achieves top MTEB retrieval scores among its generation of English-only embedding models, at the cost of higher compute and storage than BGE-small or BGE-base. MIT licensed with ONNX export support.

14,928,106 688

mobilenetv3_small_100.lamb_in1k

image-classification

MobileNetV3 small model at 100% width multiplier, trained on ImageNet-1k using the LAMB optimizer via the timm library. At under 3M parameters, it targets image classification on mobile and edge hardware where latency and memory are primary constraints. Part of timm's standardized pretrained model zoo with consistent preprocessing and inference APIs.

14,369,555 78

chronos-bolt-small

time-series-forecasting

Chronos-Bolt-Small is a small time-series foundation model from AutoGluon, using a T5-based encoder-decoder architecture for zero-shot forecasting. The 'Bolt' variant improves over original Chronos through training and architectural refinements for better speed and accuracy. Apache 2.0 licensed and part of the AutoGluon time-series forecasting ecosystem.

13,532,584 44

roberta-base

fill-mask

RoBERTa base from Facebook AI, which keeps the BERT base architecture but trains on significantly more data with longer schedules, larger batch sizes, and dynamic masking. The pre-training corpus combines BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories. MIT licensed with multi-framework support.

13,342,794 616
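
Dynamic masking can be sketched as drawing a fresh mask every time a sequence is served, instead of fixing the mask once during preprocessing. This is a simplified illustration: `dynamic_mask` is a hypothetical helper, the 30% rate is chosen only to make masks visible on a short sentence, and the usual refinements (such as sometimes keeping or randomly replacing a selected token) are left out.

```python
import random


def dynamic_mask(tokens, rng, mask_prob=0.15, mask_token="<mask>"):
    """Return a copy of `tokens` with each position masked independently."""
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)  # seeded so the example is repeatable

# Each pass over the data sees a newly sampled mask pattern.
epoch_views = [dynamic_mask(tokens, rng, mask_prob=0.3) for _ in range(3)]
for view in epoch_views:
    print(" ".join(view))
```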

gpt2

text-generation

OpenAI's original GPT-2 at 124M parameters, an autoregressive language model trained on WebText (over 8 million web documents filtered from Reddit outlinks). It continues an English prompt by next-token prediction and was trained without any instruction tuning or RLHF. MIT licensed and runnable on commodity CPU hardware.

13,231,213 3,306
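
The autoregressive loop behind next-token prediction can be sketched with a toy lookup table in place of the network. The table, its probabilities, and the `generate` helper are all invented for this example; a real model conditions on the whole preceding sequence rather than only the last token, and typically samples instead of always taking the most likely token.

```python
def generate(prompt_tokens, next_token_probs, max_new_tokens=5, eos="<eos>"):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = next_token_probs.get(tokens[-1])
        if not distribution:
            break  # the toy table has no continuation for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens


# Toy next-token table (assumed values standing in for model output).
toy_model = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 0.9, "down": 0.1},
}

output = generate(["the"], toy_model)
print(" ".join(output))
```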

ADetailer is a collection of Ultralytics YOLO-based face, body, and hand detection models distributed for use with the Stable Diffusion WebUI's ADetailer extension. The models detect regions of interest in generated images (faces, hands) to trigger targeted inpainting passes for quality improvement. Trained on WIDER FACE and anime segmentation datasets, covering both photorealistic and anime styles.

12,867,952 728