sentence similarity models

88 models · ranked by HuggingFace downloads

all-MiniLM-L6-v2

Distilled BERT model that encodes sentences into 384-dimensional vectors for measuring semantic similarity. Trained on over a billion sentence pairs spanning scientific papers, web QA, NLI datasets, and community forums. At 22M parameters and 6 transformer layers, it is fast enough for CPU inference while remaining competitive on standard sentence similarity benchmarks.

243,930,327 ↓ · 4,980 ♡

paraphrase-multilingual-MiniLM-L12-v2

Multilingual sentence embedding model covering 50+ languages, built on a 12-layer distilled MiniLM architecture. Produces 384-dimensional vectors designed for semantic similarity and paraphrase detection across language boundaries. Trained on multilingual paraphrase data to align semantically equivalent sentences even when expressed in different languages.

51,516,901 ↓ · 1,278 ♡

all-mpnet-base-v2

Sentence embedding model based on the MPNet architecture, producing 768-dimensional vectors. Trained on over a billion sentence pairs from MS MARCO, NLI datasets, and community QA forums, it is frequently chosen among English embedding models when accuracy matters more than inference speed. The MPNet backbone was pre-trained with a combination of masked and permuted language modeling, which is intended to yield stronger representations.

34,593,691 ↓ · 1,311 ♡

bge-m3

BAAI's BGE-M3 embedding model supporting over 100 languages with a unified architecture capable of dense, sparse (lexical), and late-interaction (ColBERT-style) retrieval modes from a single checkpoint. Built on XLM-RoBERTa with large-scale multilingual training, it targets multilingual and cross-lingual retrieval where a single model must handle diverse language inputs.

31,091,007 ↓ · 3,131 ♡
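
The three retrieval modes named in the bge-m3 entry differ mainly in how a query is scored against a document. The sketch below is a toy illustration in plain Python, not BGE-M3's actual implementation: the function names, example vectors, and token weights are all made up, and summing (rather than averaging) the per-token maxima in the late-interaction score is an assumption.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def dense_score(query_vec, doc_vec):
    """Dense mode: one vector per text, compared with a dot product."""
    return dot(query_vec, doc_vec)

def sparse_score(query_weights, doc_weights):
    """Sparse (lexical) mode: sum weight products over tokens both texts share."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

def late_interaction_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style MaxSim: each query token keeps its best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_token_vecs) for q in query_token_vecs)

# Hypothetical toy inputs (not real model output).
q_tokens = [[1.0, 0.0], [0.0, 1.0]]
d_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]]
q_sparse = {"solar": 0.8, "panel": 0.5}
d_sparse = {"solar": 0.6, "roof": 0.3}

maxsim = late_interaction_score(q_tokens, d_tokens)   # 1.0 + 0.5
lexical = sparse_score(q_sparse, d_sparse)            # only "solar" is shared
```

In a hybrid setup the three scores are typically combined with weights chosen per application; that combination step is left out here.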

nomic-embed-text-v1.5

Nomic Embed Text v1.5 is a matryoshka-capable English embedding model from Nomic AI, built on a custom nomic-BERT architecture trained with contrastive learning on large-scale text pairs. Matryoshka Representation Learning allows truncating embeddings to shorter dimensions (e.g. 64, 128, 256) without retraining, enabling flexible precision-cost tradeoffs. The model is transformers.js-compatible for browser-side inference.

18,375,459 ↓ · 852 ♡
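
The Matryoshka truncation described for nomic-embed-text-v1.5 can be sketched as follows. This is a minimal illustration under stated assumptions: `truncate_embedding` is a hypothetical helper, the input vector is made up, and re-normalizing after truncation is a common convention rather than something this listing specifies. The model card may prescribe additional steps.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 1.0, 2.0]
short = truncate_embedding(full, 2)  # keeps [3.0, 4.0], rescaled to length 1
```

Shorter vectors reduce storage and comparison cost at some loss of precision, which is the tradeoff the entry refers to.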

multilingual-e5-small

Multilingual-E5-Small is a compact multilingual embedding model from Microsoft Research supporting 100+ languages on a BERT-based backbone, smaller and faster than the E5-large variant. It uses the same instruction-prefix training approach as E5-large ('query:'/'passage:') for asymmetric retrieval. MIT licensed with ONNX and OpenVINO export.

9,827,894 ↓ · 340 ♡
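
The 'query:'/'passage:' convention mentioned for the E5 family amounts to tagging each input with its role before encoding. A minimal sketch, assuming a hypothetical helper `with_e5_prefix` and made-up example texts; the actual encoding call is omitted because it depends on the library used.

```python
def with_e5_prefix(text, role):
    """Prepend the role prefix that E5-style models expect on their inputs."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return f"{role}: {text}"

# Search queries get one prefix, the documents being searched get the other.
queries = [with_e5_prefix("how do solar panels work", "query")]
passages = [
    with_e5_prefix(p, "passage")
    for p in (
        "Solar panels convert sunlight into electricity.",
        "Wind turbines convert moving air into electricity.",
    )
]
```

The prefixed strings are what would be passed to the encoder in place of the raw text.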

paraphrase-multilingual-mpnet-base-v2

Multilingual MPNet embedding model from the sentence-transformers library, producing 768-dimensional vectors across 50+ languages. Uses an MPNet backbone extended to multilingual training for higher-quality multilingual embeddings than the lighter MiniLM multilingual variant. Suitable when the 384-dim paraphrase-multilingual-MiniLM-L12-v2 is insufficient in accuracy.

6,695,748 ↓ · 465 ♡

multilingual-e5-base

multilingual-e5-base is a multilingual text embedding model from Microsoft using an XLM-RoBERTa backbone, trained with E5's text-pair ranking objective across 94 languages. It produces 768-dimensional sentence embeddings for semantic search, clustering, and cross-lingual retrieval. The base variant balances embedding quality and inference cost between the small and large tiers.

6,268,627 ↓ · 367 ♡

all-MiniLM-L12-v2

A 12-layer sentence encoder producing 384-dimensional embeddings, offering a quality step up from all-MiniLM-L6-v2 at roughly 2x the inference cost. Fine-tuned on a billion sentence pairs using contrastive objectives for semantic similarity and retrieval.

5,931,298 ↓ · 317 ♡

nomic-embed-text-v1

Nomic Embed Text v1 is the original version of Nomic AI's English text embedding model based on nomic-BERT, preceding the v1.5 matryoshka update. It produces 768-dimensional embeddings via contrastive learning and is fully open: model weights, training code, and data are publicly available. Apache 2.0 licensed.

4,679,292 ↓ · 574 ♡

paraphrase-MiniLM-L6-v2

A lightweight 22M-parameter sentence encoder fine-tuned for paraphrase detection and semantic similarity, producing 384-dimensional embeddings. One of the earliest widely adopted sentence-transformers models, optimized for speed over state-of-the-art accuracy.

4,423,000 ↓ · 148 ♡

e5-large-v2

e5-large-v2 is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

3,699,915 ↓ · 279 ♡

multi-qa-mpnet-base-dot-v1

MPNet-base fine-tuned on 215M question-answer pairs for asymmetric dense retrieval using dot-product similarity. Designed specifically for the query-document retrieval case rather than symmetric sentence similarity.

2,693,925 ↓ · 193 ♡
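
The choice of dot-product similarity noted for multi-qa-mpnet-base-dot-v1 matters because dot product is sensitive to vector length while cosine similarity is not. The toy example below, with made-up two-dimensional vectors and hypothetical document names, shows the two measures ranking the same documents differently.

```python
import math

def dot(u, v):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    """Cosine similarity: depends only on the angle between the vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def rank(query, docs, score):
    """Return document names sorted from best to worst under `score`."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)

query = [1.0, 0.0]
docs = {
    "short_on_topic": [0.9, 0.1],  # well aligned with the query, small norm
    "long_off_topic": [2.0, 2.0],  # less aligned, large norm
}

by_dot = rank(query, docs, dot)
by_cosine = rank(query, docs, cosine)
```

A model should generally be queried with the similarity function it was trained for, since its vector norms may or may not carry information.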

all-distilroberta-v1

DistilRoBERTa fine-tuned as a sentence encoder on over 1 billion sentence pairs, producing 768-dimensional embeddings. Offers a balance between the speed of DistilBERT and the richer representations of full RoBERTa.

2,455,218 ↓ · 43 ♡

e5-base-v2

e5-base-v2 is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

2,187,355 ↓ · 156 ♡

Qwen3-VL-Embedding-8B

Qwen3-VL-Embedding-8B is the 8B sibling of Qwen3-VL-Embedding-2B: a Qwen-based multimodal embedding model that encodes both images and text into a shared vector space, making it suited to retrieval over mixed visual and textual corpora.

1,762,980 ↓ · 439 ♡

gte-large-en-v1.5

gte-large-en-v1.5 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

1,649,726 ↓ · 238 ♡

paraphrase-mpnet-base-v2

paraphrase-mpnet-base-v2 encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

1,635,538 ↓ · 49 ♡
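
Cosine distance, as used in several entries in this list, is one minus the cosine similarity of two vectors. A minimal sketch in plain Python; the three-dimensional vectors and their names are made up for illustration and are not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)

# Toy "embeddings" for three texts.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_invoice = [0.0, 0.1, 0.9]

near = cosine_distance(v_cat, v_kitten)   # small: vectors point the same way
far = cosine_distance(v_cat, v_invoice)   # close to 1: nearly orthogonal
```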

text2vec-base-chinese

text2vec-base-chinese is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

1,584,539 ↓ · 796 ♡

embeddinggemma-300m

embeddinggemma-300m is a Gemma-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

1,538,496 ↓ · 1,732 ♡

multi-qa-MiniLM-L6-cos-v1

multi-qa-MiniLM-L6-cos-v1 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

1,314,409 ↓ · 137 ♡

stsb-bert-tiny-safetensors

stsb-bert-tiny-safetensors encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

1,283,851 ↓ · 4 ♡

gte-multilingual-base

GTE-multilingual-base is Alibaba's 305M-parameter embedding model covering 70+ languages, designed for multilingual dense retrieval and semantic similarity. It uses a modified transformer backbone with improved positional encoding for cross-lingual transfer.

1,259,686 ↓ · 365 ♡

Qwen3-VL-Embedding-2B

Qwen3-VL-Embedding-2B is a 2B multimodal embedding model that encodes both images and text into a shared vector space. Designed for multimodal retrieval tasks where visual and textual queries need to be compared against mixed corpora.

1,175,761 ↓ · 418 ♡

all-MiniLM-L6-v2-onnx

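all-MiniLM-L6-v2-onnx is, as its name suggests, an ONNX export of all-MiniLM-L6-v2. It maps sentences to fixed-length vectors for measuring semantic similarity; the underlying model was trained with contrastive objectives on text-pair datasets so that cosine similarity between vectors tracks semantic similarity.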

1,108,770 ↓ · 7 ♡

distiluse-base-multilingual-cased-v1

distiluse-base-multilingual-cased-v1 is a DistilBERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

1,099,334 ↓ · 131 ♡

LaBSE

LaBSE is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

1,056,249 ↓ · 343 ♡

distiluse-base-multilingual-cased-v2

distiluse-base-multilingual-cased-v2 is a DistilBERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

1,014,460 ↓ · 209 ♡

ko-sroberta-multitask

ko-sroberta-multitask encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

948,387 ↓ · 149 ♡

snowflake-arctic-embed-l-v2.0

snowflake-arctic-embed-l-v2.0 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

906,727 ↓ · 248 ♡

bge-small-en-v1.5-onnx-Q

bge-small-en-v1.5-onnx-Q maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

866,814 ↓ · 2 ♡

e5-base

E5-base is a 109M-parameter English text embedding model from Microsoft, trained with weakly supervised contrastive learning on large-scale web text pairs followed by supervised fine-tuning on labeled data. It requires prepending 'query: ' or 'passage: ' prefixes to inputs for optimal retrieval performance. E5-base sits between the small and large variants in the series, balancing embedding quality and inference speed.

793,957 ↓ · 25 ♡

paraphrase-MiniLM-L3-v2

paraphrase-MiniLM-L3-v2 encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

773,395 ↓ · 30 ♡

gte-large

gte-large is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

762,126 ↓ · 304 ♡

nomic-embed-text-v2-moe

nomic-embed-text-v2-moe maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

723,523 ↓ · 482 ♡

bm25

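bm25 packages the classic BM25 lexical ranking function rather than a dense neural encoder. BM25 ranks documents by weighted term overlap with the query, so scores depend on exact token matches rather than learned semantic similarity; a higher score means a closer match.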

674,858 ↓ · 32 ♡
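
BM25 itself is a classic lexical ranking function rather than a neural encoder; how this particular repository packages it is not stated in this listing. For orientation, here is a minimal sketch of standard BM25 scoring in plain Python. The function name, the whitespace tokenization, the toy corpus, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens with BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents each term appears in.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "stock prices fell sharply today".split(),
]
scores = bm25_scores("cat mat".split(), corpus)
```

Only the first document contains the exact tokens "cat" and "mat", so it is the only one with a non-zero score; "cats" in the second document does not match, which illustrates the exact-match nature of lexical retrieval.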

bge-micro-v2

BGE-Micro-v2 is a heavily distilled BERT embedding model targeting near-zero latency sentence encoding with acceptable MTEB scores. Its extremely small footprint allows embedding generation in CPU-only or mobile environments. MIT-licensed with ONNX and transformers.js support.

652,070 ↓ · 63 ♡

gte-base-en-v1.5

gte-base-en-v1.5 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

647,390 ↓ · 71 ♡

e5-large

E5-large is a 335M-parameter embedding model fine-tuned with contrastive learning on a mixture of web-scale text pairs. It has reported strong MTEB results on English text retrieval and similarity tasks.

632,198 ↓ · 80 ♡

e5-small-v2

e5-small-v2 is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

615,650 ↓ · 119 ♡

all-roberta-large-v1

all-roberta-large-v1 encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

611,856 ↓ · 66 ♡

msmarco-bert-base-dot-v5

msmarco-bert-base-dot-v5 encodes variable-length text into fixed-size vectors. As the "dot" in its name indicates, it is tuned for dot-product scoring rather than cosine distance: a higher dot product between query and passage vectors means a closer match.

596,321 ↓ · 21 ♡

BGE-m3-ko

BGE-m3-ko is a Korean-specialized fine-tune of BAAI's BGE-M3 multilingual embedding model, trained with additional Korean-Korean and Korean-English parallel data to improve retrieval performance in Korean. It retains the XLM-RoBERTa backbone and supports up to 8192 tokens, making it suitable for long Korean document retrieval and cross-lingual search.

589,667 ↓ · 76 ♡

finance-embeddings-investopedia

Sentence embeddings fine-tuned on Investopedia financial content, intended to improve semantic similarity for financial terminology and concepts compared to general-purpose embedding models.

579,932 ↓ · 65 ♡

ruri-v3-310m

Ruri v3 (310M) is Nagoya University's Japanese text embedding model built on the ModernBERT architecture, optimised for semantic similarity and retrieval in Japanese. It is part of the Ruri series, which targets Japanese-specific sentence embedding quality. The v3 310M variant balances embedding dimension, retrieval quality, and inference speed for production Japanese NLP pipelines.

551,844 ↓ · 79 ♡

vietnamese-bi-encoder

Vietnamese Bi-Encoder is BKAI's Vietnamese-language sentence embedding model based on PhoBERT/RoBERTa, trained with sentence-transformers for semantic similarity and retrieval in Vietnamese. Apache-2.0 licensed, it fills a gap in Vietnamese NLP tooling.

490,208 ↓ · 75 ♡

nomic-embed-code

Nomic Embed Code is Nomic AI's code-specialized embedding model built on a Qwen2 backbone, designed for code retrieval, documentation search, and code similarity tasks. Apache-2.0 licensed with text-embeddings-inference compatibility.

488,235 ↓ · 121 ♡

snowflake-arctic-embed-m

snowflake-arctic-embed-m encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

481,679 ↓ · 165 ♡

USER-bge-m3

USER-bge-m3 is DeepVK's Russian-enhanced version of BGE-M3, fine-tuned to improve text embedding quality on Russian-language documents and search tasks. It inherits BGE-M3's hybrid retrieval capabilities (dense + sparse + ColBERT) while boosting Slavic text representation.

468,773 ↓ · 79 ♡

BioLORD-2023

BioLORD-2023 is a sentence embedding model trained for biomedical concept representation, using a knowledge-grounded contrastive approach that anchors concept embeddings to formal ontology definitions. It produces embeddings where semantically related biomedical terms (e.g., synonymous disease names across different coding systems) cluster tightly. The model is designed for medical NLP tasks where concept normalisation and synonym matching are important.

462,753 ↓ · 53 ♡

rubert-tiny2

rubert-tiny2 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

454,490 ↓ · 171 ♡

telugu-sentence-bert-nli

telugu-sentence-bert-nli is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

443,204 ↓ · 1 ♡

sup-SimCSE-VietNamese-phobert-base

sup-SimCSE-VietNamese-phobert-base is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

441,742 ↓ · 30 ♡

gte-base

GTE-base (General Text Embeddings) is Alibaba's 110M-parameter BERT-based embedding model trained on a large multi-task text similarity dataset. It became a popular baseline embedding model due to its strong MTEB scores relative to its size before larger models like GTE-large and e5-mistral gained traction.

430,552 ↓ · 131 ♡

pubmedbert-base-embeddings

pubmedbert-base-embeddings maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

407,146 ↓ · 190 ♡

snowflake-arctic-embed-m-v1.5

Snowflake Arctic Embed M v1.5 is Snowflake's medium-scale English embedding model, optimized for retrieval tasks with MTEB benchmark focus. Available in ONNX, GGUF, and safetensors formats with transformers.js compatibility, making it unusually portable across inference environments. Apache-2.0 licensed.

407,022 ↓ · 72 ♡

all-indo-e5-small-v4

all-indo-e5-small is LazarusNLP's Indonesian fine-tune of a small e5 embedding model, designed to improve semantic search and sentence similarity quality on Bahasa Indonesia text. v4 reflects iterative improvements over previous Indonesian embedding baselines.

395,314 ↓ · 13 ♡

klue-sroberta-base-continue-learning-by-mnr

A Korean sentence embedding model built on KLUE-RoBERTa-base, fine-tuned with Multiple Negatives Ranking (MNR) loss for continued learning after the initial sentence-transformers training. It is designed for Korean semantic similarity and retrieval tasks, extending the KLUE benchmark-trained base with better sentence-level representations. Bespin Global targets Korean enterprise NLP applications with this checkpoint.

392,771 ↓ · 31 ♡
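
Multiple Negatives Ranking loss, named in this entry, treats each anchor's paired text as the correct answer and every other pair's text in the batch as a negative. A minimal sketch in plain Python, assuming unit-length vectors so that the dot product equals cosine similarity; the function name, the scale factor of 20, and the toy batch are assumptions for illustration, not the training setup of any model listed here.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and all other positives in the batch act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # each positive matches the wrong anchor

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_shuffled = multiple_negatives_ranking_loss(anchors, shuffled)
```

The loss is near zero when every anchor is closest to its own positive and grows when another item in the batch scores higher, which is why larger batches supply more negatives.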

SecureBERT2.0-biencoder

SecureBERT 2.0 biencoder is a ModernBERT-based dense retrieval model trained on cybersecurity corpora for semantic search over security documents. It uses MultipleNegativesRankingLoss fine-tuning on ~35k pairs, making it well-suited for threat intelligence retrieval.

390,839 ↓ · 5 ♡

multilingual-e5-large-onnx

multilingual-e5-large-onnx is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

387,460 ↓ · 3 ♡

gte-modernbert-base

GTE-ModernBERT-base is Alibaba's text embedding model built on the ModernBERT architecture, which extends the classic BERT design with rotary position encodings and improved attention kernels for better long-context handling. It achieves strong scores on MTEB benchmarks at the 149M-parameter base scale. The Transformers.js export makes it deployable in browser environments alongside Python serving.

383,239 ↓ · 197 ♡

gte-Qwen2-1.5B-instruct

GTE-Qwen2-1.5B-instruct is Alibaba's embedding model built on a 1.5B Qwen2 decoder backbone with instruction fine-tuning for text retrieval. It significantly outperforms encoder-only models of its size on MTEB by leveraging the Qwen2 language model's broader world knowledge.

382,530 ↓ · 235 ♡

msmarco-MiniLM-L12-cos-v5

msmarco-MiniLM-L12-cos-v5 maps sentences to fixed-length vectors for measuring semantic similarity. Trained with contrastive objectives on text-pair datasets, it optimizes for cosine-distance accuracy.

381,899 ↓ · 10 ♡

paraphrase-albert-small-v2

paraphrase-albert-small-v2 is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

368,745 ↓ · 11 ♡

french-bge-m3

french-bge-m3 is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

367,535 ↓ · 0 ♡

KR-SBERT-V40K-klueNLI-augSTS

KR-SBERT-V40K-klueNLI-augSTS encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

352,551 ↓ · 83 ♡

snowflake-arctic-embed-xs

snowflake-arctic-embed-xs is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

351,259 ↓ · 41 ♡

SecureBERT2.0-cross_encoder

The cross-encoder companion to SecureBERT2.0-biencoder, designed for reranking in cybersecurity retrieval pipelines. Cross-encoders jointly encode query and document pairs, making them more accurate but slower than biencoder retrieval for re-scoring top candidates.

350,197 ↓ · 3 ♡
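
The division of labor between a biencoder and a cross-encoder described in this entry is often arranged as a two-stage pipeline: cheap vector scoring over the whole collection, then pairwise re-scoring of a short list. The sketch below uses made-up vectors and a word-overlap function as a stand-in for a real cross-encoder; all names and data are hypothetical and unrelated to the SecureBERT models themselves.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def retrieve_then_rerank(query_text, query_vec, docs, pair_scorer, top_k=2):
    """docs is a list of (text, vector) pairs.
    Stage 1 ranks all docs by vector dot product and keeps top_k.
    Stage 2 re-orders only those survivors with pair_scorer(query, text)."""
    shortlist = sorted(docs, key=lambda d: dot(query_vec, d[1]), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda d: pair_scorer(query_text, d[0]), reverse=True)

def toy_pair_scorer(query, text):
    """Stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

docs = [
    ("phishing email with malicious attachment", [0.9, 0.1]),
    ("ransomware encrypts files and demands payment", [0.8, 0.3]),
    ("quarterly sales report", [0.1, 0.9]),
]
results = retrieve_then_rerank(
    "ransomware payment demands", [1.0, 0.0], docs, toy_pair_scorer
)
```

Here the vector stage places the phishing document first, and the pairwise stage promotes the ransomware document, while the third document never reaches the more expensive second stage.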

gte-small

gte-small is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

347,021 ↓ · 188 ♡

S-PubMedBert-MS-MARCO

S-PubMedBert-MS-MARCO encodes variable-length text into fixed-size vectors. The cosine distance between two outputs reflects their semantic relatedness: closer to 0 means more similar.

343,659 ↓ · 43 ♡

bengali-sentence-similarity-sbert

An SBERT-style Bengali sentence embedding model from L3Cube Pune for semantic similarity tasks on Bengali text. Part of L3Cube's series of Indian language NLP models, targeting a language with limited NLP tooling.

340,343 ↓ · 6 ♡

embedic-base

embedic-base is a RoBERTa-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

330,069 ↓ · 2 ♡

LLM2Vec-Meta-Llama-3-8B-Instruct-mntp

LLM2Vec converts LLaMA 3 8B Instruct into a text embedding model using masked next-token prediction (MNTP) fine-tuning, enabling decoder-only LLMs to produce high-quality pooled sentence embeddings. From McGill NLP, this approach demonstrates that decoder LLMs can match or exceed encoder embedding models.

328,072 ↓ · 22 ♡

stella_en_400M_v5

stella_en_400M_v5 is a transformer-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

326,526 ↓ · 233 ♡

S-PubMedBert-MedQuAD

S-PubMedBert-MedQuAD is a sentence-transformers fine-tune of PubMedBERT trained on the MedQuAD question-answer dataset. It produces embeddings specialised for matching consumer-style medical questions to relevant answers, making it useful for FAQ retrieval in health information systems. The underlying PubMedBERT base already incorporates biomedical vocabulary, giving it an advantage over general-purpose sentence transformers on clinical text.

318,330 ↓ · 8 ♡

paraphrase-MiniLM-L12-v2

paraphrase-MiniLM-L12-v2 is a BERT-based sentence encoder. It projects text into a dense embedding space where similar sentences cluster together, making it well-suited for retrieval and deduplication.

312,771 ↓ · 7 ♡

serafim-335m-portuguese-pt-sentence-encoder-ir

serafim-335m-portuguese-pt-sentence-encoder-ir is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

301,574 ↓ · 0 ♡

gte-base-en-v1.5

gte-base-en-v1.5 is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

298,070 ↓ · 0 ♡

stsb-roberta-base

stsb-roberta-base is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

296,697 ↓ · 1 ♡

all_miniLM_L6_v2_with_attentions

all_miniLM_L6_v2_with_attentions is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

295,936 ↓ · 14 ♡

gte-Qwen2-7B-instruct

gte-Qwen2-7B-instruct is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

295,540 ↓ · 482 ♡

Vietnamese_Embedding

Vietnamese_Embedding is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

294,221 ↓ · 61 ♡

langcache-embed-v1

langcache-embed-v1 is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

293,682 ↓ · 14 ♡

distilbert-multilingual-nli-stsb-quora-ranking

distilbert-multilingual-nli-stsb-quora-ranking is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

288,656 ↓ · 10 ♡

instructor-large

instructor-large is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

286,640 ↓ · 524 ♡

msmarco-MiniLM-L6-v3

msmarco-MiniLM-L6-v3 is a compact 6-layer MiniLM sentence embedding model fine-tuned on the MS MARCO passage retrieval dataset. It produces query and passage embeddings optimised for asymmetric retrieval, that is, finding relevant web passages for short natural language queries. The model is well-suited for latency-sensitive applications where a full BERT encoder is too slow.

237,614 ↓ · 15 ♡

bge-m3-korean

bge-m3-korean is an open-source sentence-similarity model available on HuggingFace. Details are sourced from the public model registry.

232,861 ↓ · 64 ♡

GIST-Embedding-v0

GIST-Embedding-v0 (Guided In-sample Selection of Training Negatives) is a BERT-based sentence embedding model trained with guided negative sampling to improve contrastive learning quality. It targets MTEB retrieval and similarity tasks for English. MIT-licensed and compatible with sentence-transformers and text-embeddings-inference.

229,434 ↓ · 30 ♡