feature extraction models

81 models · ranked by HuggingFace downloads

bge-small-en-v1.5

Small English dense embedding model from BAAI's BGE (BAAI General Embedding) series, producing 384-dimensional vectors. It is trained with a retrieval-focused strategy and achieves competitive MTEB retrieval scores relative to its parameter count. Suited for embedding workflows where throughput and cost matter more than peak accuracy. MIT licensed.

60,148,419 ↓ · 493 ♡

bge-large-en-v1.5

BGE-Large-EN-v1.5 is BAAI's highest-capacity English embedding model in the v1.5 series, producing 1024-dimensional vectors. It achieves top MTEB retrieval scores among its generation of English-only embedding models, at the cost of higher compute and storage than BGE-small or BGE-base. MIT licensed with ONNX export support.

14,928,106 ↓ · 688 ♡

Qwen3-Embedding-0.6B

Qwen3-Embedding-0.6B is Alibaba Cloud's compact embedding model from the Qwen3 series, fine-tuned from Qwen3-0.6B-Base for text embedding tasks. At 0.6B parameters it provides instruction-following embedding capability at a size deployable without dedicated GPU infrastructure. Apache 2.0 licensed.

10,265,556 ↓ · 1,075 ♡

bge-base-en-v1.5

BGE-Base-EN-v1.5 is BAAI's mid-tier English embedding model in the v1.5 series, producing 768-dimensional vectors. It balances accuracy and compute cost between the small (384d) and large (1024d) variants, making it a practical default for English retrieval tasks where storage and inference overhead of the large model are undesirable. MIT licensed with ONNX export.

8,709,060 ↓ · 439 ♡
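The storage trade-off between the small (384d), base (768d), and large (1024d) variants can be estimated with simple arithmetic. The sketch below assumes vectors are stored as uncompressed float32 values and ignores any index overhead; the function and table names are illustrative, not part of any BGE tooling.

```python
# Rough vector-storage estimate for the three BGE v1.5 English output widths.
# Assumption: each value is an uncompressed float32 (4 bytes); index overhead is ignored.
BYTES_PER_FLOAT32 = 4

# Output dimensions as stated in the model descriptions above.
DIMS = {
    "bge-small-en-v1.5": 384,
    "bge-base-en-v1.5": 768,
    "bge-large-en-v1.5": 1024,
}


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw bytes needed to store num_vectors embeddings of width dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def size_table(num_vectors: int) -> dict[str, float]:
    """Map each model name to its raw vector storage in gigabytes (10**9 bytes)."""
    return {name: index_size_bytes(num_vectors, d) / 1e9 for name, d in DIMS.items()}


if __name__ == "__main__":
    for name, gb in size_table(1_000_000).items():
        print(f"{name}: {gb:.3f} GB per million vectors")
```

Under these assumptions the base model needs twice the raw storage of the small model, and the large model needs roughly 2.7 times as much.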

multilingual-e5-large

Multilingual-E5-Large is a 560-million-parameter multilingual embedding model from Microsoft Research, supporting 100+ languages via an XLM-RoBERTa backbone. It follows the E5 input convention of prepending a 'query:' or 'passage:' prefix to each text to mark its role, and it achieves strong MTEB multilingual retrieval scores. MIT licensed with ONNX and OpenVINO export.

7,651,319 ↓ · 1,207 ♡
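The 'query:' and 'passage:' prefixes mentioned for multilingual-e5-large are applied to the raw text before it is embedded. A minimal sketch of that preprocessing step follows; the helper names are hypothetical, and only the two prefixes themselves come from the description above.

```python
# Hypothetical helpers that add E5-style role prefixes before embedding.
# Only the 'query: ' and 'passage: ' prefixes are taken from the model description;
# the function names and the whitespace stripping are illustrative choices.

def as_query(text: str) -> str:
    """Mark a text as a search query."""
    return f"query: {text.strip()}"


def as_passage(text: str) -> str:
    """Mark a text as a document passage to be retrieved."""
    return f"passage: {text.strip()}"


def prepare_batch(queries: list[str], passages: list[str]) -> list[str]:
    """Return all inputs with their role prefixes, queries first."""
    return [as_query(q) for q in queries] + [as_passage(p) for p in passages]


if __name__ == "__main__":
    batch = prepare_batch(
        ["how tall is Mount Everest?"],
        ["Mount Everest is the highest mountain above sea level."],
    )
    for line in batch:
        print(line)
```

The prefixed strings would then be passed to whichever embedding runtime is in use.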

mxbai-embed-large-v1

mxbai-embed-large-v1 is Mixedbread AI's English embedding model producing 1024-dimensional vectors, trained for retrieval and ranking tasks using angle-optimized contrastive learning (AnglE). It achieves strong MTEB retrieval scores among English embedding models. Apache 2.0 licensed.

5,892,037 ↓ · 810 ♡

bge-small-zh-v1.5

BGE-small-zh-v1.5 is a compact Chinese text embedding model from BAAI, producing 512-dimensional sentence vectors optimized for Chinese semantic search and retrieval tasks. Part of the BGE series that also covers multilingual and English variants.

4,740,416 ↓ · 118 ♡

w2v-bert-2.0

Meta's wav2vec-BERT 2.0 is a self-supervised speech encoder that combines contrastive learning with masked language modeling objectives. It serves as the backbone for Seamless and other Meta speech recognition and translation systems.

4,563,231 ↓ · 218 ♡

jina-embeddings-v3

Jina Embeddings v3 is a 570M-parameter text embedding model supporting 89 languages with an 8192-token context window. It uses LoRA adapters to switch between task-specific embedding modes (retrieval, similarity, classification) without separate models.

3,271,435 ↓ · 1,147 ♡

all-MiniLM-L6-v2

ONNX-converted port of sentence-transformers/all-MiniLM-L6-v2, optimized for Transformers.js to run embedding inference directly in a browser or Node.js without a Python backend. Produces 384-dimensional sentence embeddings.

3,086,259 ↓ · 125 ♡

Qwen3-Embedding-8B

Qwen3-Embedding-8B generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

2,409,502 ↓ · 712 ♡

UAE-Large-V1

UAE-Large-V1 is a BERT encoder with English support. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

2,250,060 ↓ · 237 ♡

granite-embedding-small-english-r2

granite-embedding-small-english-r2 outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

2,190,481 ↓ · 71 ♡

Qwen3-Embedding-4B

Qwen3-Embedding-4B outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

2,168,532 ↓ · 288 ♡

bge-base-en-v1.5

bge-base-en-v1.5 generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

1,812,424 ↓ · 9 ♡

bge-reranker-large

bge-reranker-large is a cross-encoder reranker from BAAI rather than an embedding model. It takes a query and a passage together and outputs a relevance score, which is typically used to reorder candidates returned by a first-stage embedding retriever.

1,811,696 ↓ · 465 ♡

SapBERT-from-PubMedBERT-fulltext

SapBERT-from-PubMedBERT-fulltext is a BERT encoder with English support. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,706,600 ↓ · 71 ♡

multilingual-e5-small

Transformers.js-compatible ONNX conversion of multilingual-e5-small, enabling browser and Node.js inference of a 118M-parameter multilingual embedding model covering 100+ languages.

1,637,615 ↓ · 11 ♡

multilingual-e5-large-instruct

multilingual-e5-large-instruct outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

1,597,458 ↓ · 626 ♡

bge-multilingual-gemma2

bge-multilingual-gemma2 is a Gemma encoder. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,396,292 ↓ · 202 ♡

bge-large-zh-v1.5

bge-large-zh-v1.5 generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

1,379,253 ↓ · 635 ♡

conv-bert-base

conv-bert-base is a BERT encoder. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,319,890 ↓ · 10 ♡

wavlm-large

wavlm-large is a self-supervised speech model from Microsoft. It produces frame-level representations from audio waveforms, not text, which can be pooled or passed to downstream heads for speech tasks.

1,304,332 ↓ · 110 ♡

1

1 outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

1,277,200 ↓ · 1 ♡

mimi

mimi is Kyutai's neural audio codec. It encodes audio waveforms, not text, into compact latent representations that can be decoded back to audio or used as features for downstream speech models.

1,231,706 ↓ · 307 ♡

jina-embeddings-v2-small-en

jina-embeddings-v2-small-en outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

1,148,421 ↓ · 141 ♡

repeat

repeat generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

1,073,261 ↓ · 0 ♡

bge-base-zh-v1.5

BGE-Base-ZH-v1.5 is BAAI's Chinese sentence embedding model in the BGE family, trained for Chinese semantic similarity and retrieval tasks. MIT-licensed and compatible with sentence-transformers and text-embeddings-inference. Optimized for Chinese-language RAG and search.

953,718 ↓ · 107 ♡

SFR-Embedding-2_R

SFR-Embedding-2_R is Salesforce's Mistral-7B-based text embedding model, trained for retrieval-centric tasks on the MTEB benchmark suite. The '_R' suffix indicates retrieval optimization. It achieves strong performance on passage retrieval, semantic search, and reranking when used as a bi-encoder, with full 4096-token context support.

857,807 ↓ · 94 ♡

wavlm-base-plus

wavlm-base-plus is a self-supervised speech model from Microsoft. It outputs dense contextual representations from audio waveforms, not text, without a task-specific head. The representations are used downstream for speech tasks or fine-tuning.

781,309 ↓ · 40 ♡

llama-nemotron-embed-1b-v2

llama-nemotron-embed-1b-v2 is a Llama encoder with multilingual coverage. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

660,013 ↓ · 57 ♡

jina-embeddings-v5-text-nano

jina-embeddings-v5-text-nano is Jina AI's smallest text embedding model in the v5 family, built on EuroBERT-210m with multimodal capability for both text and image feature extraction. Despite the 'nano' designation, it supports multilingual inputs and is optimized for edge and latency-sensitive retrieval scenarios where model size matters more than peak accuracy.

645,233 ↓ · 80 ♡

canine-c

canine-c is a character-level pre-trained encoder from Google (CANINE-C) that operates directly on Unicode codepoints without any tokenization step. Unlike wordpiece or BPE models, it accepts raw text character sequences, making it robust to spelling variation, morphological richness, and unseen vocabularies. It supports over 100 languages by design, with no language-specific tokenizer required.

605,842 ↓ · 35 ♡
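The idea of operating on Unicode codepoints instead of a learned vocabulary, as described for canine-c, can be illustrated with Python's built-in `ord` and `chr`. This is a sketch of the text-to-codepoint step only; it is not the model's actual preprocessing, and the function names are illustrative.

```python
# Illustration of tokenization-free input: text becomes a sequence of Unicode codepoints.
# This shows only the character-to-integer mapping, not CANINE's own preprocessing.

def to_codepoints(text: str) -> list[int]:
    """Map each character to its Unicode codepoint."""
    return [ord(ch) for ch in text]


def from_codepoints(codepoints: list[int]) -> str:
    """Invert to_codepoints, recovering the original string."""
    return "".join(chr(cp) for cp in codepoints)


if __name__ == "__main__":
    for sample in ["hi", "héllo", "日本語"]:
        ids = to_codepoints(sample)
        # Any string round-trips, since no vocabulary lookup can fail.
        print(sample, ids, from_codepoints(ids) == sample)
```

Because every character has a codepoint, misspellings and unseen words never fall outside the input space, which is the robustness property the description refers to.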

opensearch-neural-sparse-encoding-doc-v2-distill

opensearch-neural-sparse-encoding-doc-v2-distill is a distilled neural sparse encoding model from the OpenSearch project. It generates sparse token-weight vectors rather than dense embeddings; the 'doc' variant is intended for encoding documents for learned sparse retrieval.

593,222 ↓ · 19 ♡

e5-base-sts-en-de

e5-base-sts-en-de outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

569,332 ↓ · 17 ♡

specter2_base

specter2_base generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

550,262 ↓ · 46 ♡

paraphrase-albert-small-v2

paraphrase-albert-small-v2 is an ALBERT-small-v2 model fine-tuned for paraphrase detection and sentence similarity, distributed by GPTCache as a lightweight semantic cache key encoder. It encodes queries into sentence embeddings for detecting semantically equivalent user inputs, enabling cache hits in LLM serving pipelines. At ALBERT-small scale it is significantly faster than BERT-base alternatives.

541,680 ↓ · 2 ♡
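The semantic-cache pattern described for paraphrase-albert-small-v2 amounts to a nearest-neighbor lookup with a similarity threshold. The sketch below uses hand-written toy vectors in place of real model output; the `SemanticCache` class and the 0.9 threshold are assumptions for illustration and do not reflect GPTCache's actual API.

```python
import math

# Toy semantic cache: return a stored response when a new query embedding is
# close enough to a cached one. Class name and threshold are illustrative only.


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        """Store a response under the embedding of the query that produced it."""
        self.entries.append((embedding, response))

    def get(self, embedding: list[float]):
        """Return the best-matching response, or None on a cache miss."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put([1.0, 0.0], "cached answer")
    print(cache.get([0.99, 0.05]))  # near-duplicate query
    print(cache.get([0.0, 1.0]))    # unrelated query
```

In a real pipeline the vectors would come from the sentence encoder, and the threshold would need tuning against observed false-hit rates.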

signal-jepa_without-chans

signal-jepa_without-chans is a self-supervised EEG foundation model from the braindecode project, using a joint-embedding predictive architecture (JEPA) trained on unlabeled EEG recordings. It generates channel-agnostic temporal representations suitable for downstream BCI or clinical EEG classification tasks. The 'without-chans' variant drops channel position encoding, making it compatible with variable electrode montages.

530,648 ↓ · 0 ♡

ru-en-RoSBERTa

RoSBERTa is a bilingual Russian-English sentence embedding model from ai-forever, built on RoBERTa with MTEB-style training for semantic similarity. It targets retrieval and semantic search use cases in Russian-language NLP pipelines. MIT-licensed and available with text-embeddings-inference compatibility.

511,825 ↓ · 82 ♡

lambda

A LLaMA-architecture model packaged by Unsloth for feature extraction, likely used internally as a base for fine-tuning experiments. The safetensors format and Unsloth branding suggest it serves as a reference checkpoint rather than a production embedding model.

507,999 ↓ · 0 ♡

Qwen3-Embedding-4B-W4A16-G128

Qwen3-Embedding-4B-W4A16-G128 is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

504,851 ↓ · 5 ♡

TinyBERT_L-4_H-312_v2

TinyBERT_L-4_H-312_v2 outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

503,958 ↓ · 1 ♡

vram-16

vram-16 generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

503,939 ↓ · 0 ♡

bart-base

bart-base generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

501,526 ↓ · 205 ♡

clap-htsat-unfused

clap-htsat-unfused is a CLAP (Contrastive Language-Audio Pretraining) checkpoint with an HTSAT audio encoder. It embeds audio and text into a shared space, and the representations are used downstream for audio-text retrieval, zero-shot audio classification, or fine-tuning.

501,229 ↓ · 75 ♡

MoLFormer-XL-both-10pct

MoLFormer-XL-both-10pct is a checkpoint of IBM Research's MoLFormer-XL, a BERT-style molecular language model that uses linear attention and rotary embeddings. It produces molecular fingerprint-like embeddings from SMILES notation for property prediction tasks. The full MoLFormer-XL corpus is 1.1B SMILES strings from PubChem and ZINC; the 'both-10pct' variant was trained on 10% of that corpus mixture.

474,975 ↓ · 35 ♡

granite-embedding-311m-multilingual-r2

granite-embedding-311m-multilingual-r2 is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

471,029 ↓ · 103 ♡

other

other generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

450,025 ↓ · 0 ♡

indobert-base-p1

indobert-base-p1 outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

444,614 ↓ · 50 ♡

rubert-base-cased

RuBERT-base-cased is DeepPavlov's BERT base model pre-trained on Russian text from Wikipedia and news corpora, with a case-sensitive vocabulary. It provides Russian-specific contextualized representations for downstream NLP tasks. PyTorch and JAX checkpoints are available.

434,091 ↓ · 129 ♡

deepset-mxbai-embed-de-large-v1

deepset-mxbai-embed-de-large-v1 is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

433,877 ↓ · 60 ♡

Qwen3-VL-Embedding-2B-AWQ-4bit

Qwen3-VL-Embedding-2B-AWQ-4bit is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

408,148 ↓ · 1 ♡

MedCPT-Query-Encoder

MedCPT is NCBI's biomedical retrieval model trained on PubMed citation data using a contrastive learning objective. The query encoder maps clinical and biomedical questions into a shared embedding space with MedCPT's article encoder for dense biomedical literature retrieval.

406,401 ↓ · 62 ♡

splade-cocondenser-ensembledistil

splade-cocondenser-ensembledistil is a SPLADE learned sparse retrieval model. Rather than dense embeddings, it produces sparse term-weight vectors over the vocabulary, which makes it suited to retrieval over an inverted index.

401,614 ↓ · 62 ♡

bge-base-en

bge-base-en is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

399,249 ↓ · 61 ♡

SapBERT-from-PubMedBERT-fulltext-mean-token

SapBERT-from-PubMedBERT-fulltext-mean-token generates embedding vectors from text inputs. These features can be pooled or passed directly to downstream classifiers, making it a versatile backbone for NLP pipelines.

398,224 ↓ · 2 ♡

jina-embeddings-v5-text-small

jina-embeddings-v5-text-small is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

395,242 ↓ · 179 ♡

jina-embeddings-v2-base-code

jina-embeddings-v2-base-code generates dense embeddings for mixed code-text inputs, supporting 8192-token context windows. It was trained to handle docstrings, function bodies, and natural language queries together, making it well-suited for semantic code search. The model ships ONNX and Transformers.js versions alongside the standard PyTorch weights.

387,016 ↓ · 139 ♡

e5-mistral-7b-instruct

E5-Mistral-7B-Instruct is an embedding model built on Mistral 7B that derives text embeddings from the decoder-only LLM's representations. It uses instruction prompts at inference time to orient embeddings for retrieval, clustering, or classification tasks. At release it achieved state-of-the-art MTEB scores for dense retrieval, outperforming BERT-family embedding models by a significant margin on hard retrieval tasks.

377,505 ↓ · 564 ♡

bge-small-en

BGE-Small-EN is a 33M-parameter English text embedding model from BAAI, the smallest in the BGE (BAAI General Embedding) series. Despite its size it achieves competitive MTEB scores for retrieval tasks relative to larger BERT-based models. It is designed for high-throughput, memory-efficient embedding generation where larger models are too slow or expensive.

369,651 ↓ · 93 ♡

OTel-Embedding-33M

A 33M-parameter text embedding model from farbodtavakkoli specialized for OpenTelemetry (OTel) log and trace data. Designed to embed observability signals (log lines, span names, error messages) for semantic search and anomaly clustering in monitoring pipelines.

364,194 ↓ · 0 ♡

opensearch-neural-sparse-encoding-v2-distill

A distilled neural sparse encoding model from the OpenSearch project, designed for SPLADE-style learned sparse retrieval. It generates sparse token weight vectors from text, enabling neural relevance ranking within inverted index infrastructure without dense vector ANN.

342,318 ↓ · 10 ♡
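Learned sparse retrieval of the kind described above scores a document by summing the products of query and document weights over shared tokens, which is why it fits an inverted index. The sketch below uses hand-written toy weights instead of real model output; the function names and data are hypothetical and do not represent OpenSearch's API.

```python
from collections import defaultdict

# Toy inverted-index search over sparse token-weight vectors.
# Weights here are hand-written; a sparse encoder would normally produce them.


def build_inverted_index(doc_vectors: dict[str, dict[str, float]]):
    """Map each token to the list of (doc_id, weight) postings that contain it."""
    index = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for token, weight in weights.items():
            index[token].append((doc_id, weight))
    return index


def search(index, query_vector: dict[str, float], top_k: int = 3):
    """Score documents by the sparse dot product with the query, highest first."""
    scores = defaultdict(float)
    for token, query_weight in query_vector.items():
        # Only postings for tokens present in the query are ever touched.
        for doc_id, doc_weight in index.get(token, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    docs = {
        "a": {"error": 1.5, "timeout": 2.0},
        "b": {"error": 0.5, "disk": 1.0},
    }
    index = build_inverted_index(docs)
    print(search(index, {"timeout": 1.0, "error": 1.0}))
```

No approximate nearest-neighbor structure is needed, since documents that share no tokens with the query are never scored.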

bart-large

BART-large is Meta's denoising autoencoder, a sequence-to-sequence model pretrained to reconstruct corrupted text. The large (400M) variant is the strongest checkpoint in the original BART family and is typically fine-tuned for abstractive summarization, translation, and text generation.

339,176 ↓ · 201 ♡

jina-embeddings-v2-base-de

jina-embeddings-v2-base-de is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

335,046 ↓ · 83 ♡

OTel-Embedding-34M

A 34M-parameter OTel-domain text embedding model from farbodtavakkoli, nearly identical in scope to the 33M variant but potentially a slightly different architecture or training iteration. Designed for embedding OpenTelemetry observability signals.

333,330 ↓ · 0 ♡

Solon-embeddings-large-0.1

Solon-embeddings-large-0.1 is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

333,035 ↓ · 53 ♡

codebert-base

codebert-base is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

332,290 ↓ · 288 ♡

OTel-Embedding-109M

The largest of farbodtavakkoli's OTel embedding series at 109M parameters, offering the best embedding quality among the OTel-Embedding models for OpenTelemetry log, span, and metric text.

330,494 ↓ · 1 ♡

biobert-v1.1

biobert-v1.1 outputs dense contextual embeddings from input text without a task-specific classification head. The representations are used downstream for clustering, retrieval, or fine-tuning.

319,163 ↓ · 112 ♡

sentence-bert-base-ja-mean-tokens

sentence-bert-base-ja-mean-tokens is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

317,956 ↓ · 11 ♡

OTel-Embedding-300M

OTel-Embedding-300M is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

317,491 ↓ · 0 ♡

distilhubert

distilhubert is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

316,274 ↓ · 38 ♡

distilbert-base-nli-mean-tokens

distilbert-base-nli-mean-tokens is an open-source feature-extraction model available on HuggingFace. Details are sourced from the public model registry.

313,194 ↓ · 13 ♡

bge-small-en-v1.5

Xenova's transformers.js ONNX conversion of BGE-Small-EN-v1.5 for browser and Node.js inference. BGE-Small-EN-v1.5 is BAAI's small English embedding model; this version targets client-side semantic search without server infrastructure. The ONNX format enables cross-platform deployment.

312,243 ↓ · 16 ♡