fill mask models

53 models · ranked by HuggingFace downloads

bert-base-uncased

Google's original BERT base model in uncased form, pre-trained on BookCorpus and English Wikipedia via masked language modeling. Tokens are lowercased before processing, making it insensitive to capitalization. It remains a standard fine-tuning base for classification, NER, and extractive QA, though newer encoders outperform it on most benchmarks.

57,757,042 ↓ · 2,686 ♡
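
As a rough illustration of what masked token prediction returns, the sketch below turns a model's raw scores for one masked position into ranked candidates. The four-word vocabulary and the scores are hypothetical, and `top_k_fill` is an illustrative helper rather than part of any library.

```python
import math

def top_k_fill(logits, vocab, k=3):
    """Return the k most probable tokens for one masked position."""
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for "the capital of france is [MASK]."
vocab = ["paris", "london", "banana", "lyon"]
logits = [6.0, 2.5, -1.0, 3.0]
predictions = top_k_fill(logits, vocab, k=2)
```

A real checkpoint scores its whole vocabulary (tens of thousands of entries) at each masked position; the ranking step is the same.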

xlm-roberta-base

XLM-RoBERTa base from Facebook AI, pre-trained on 2.5TB of filtered CommonCrawl text across 100 languages using the RoBERTa training procedure. It enables cross-lingual transfer: models fine-tuned on labeled English data can be applied to other languages without labeled data in those languages. The standard starting point for multilingual classification and token-level tasks.

20,744,002 ↓ · 852 ♡

roberta-base

RoBERTa base from Facebook AI uses the same architecture as BERT base but is trained on significantly more data, with longer training schedules, larger batch sizes, and dynamic masking. Its pre-training corpus combines BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories, roughly 160GB of text. MIT licensed with multi-framework support.

13,342,794 ↓ · 616 ♡

roberta-large

RoBERTa large is the 355M-parameter version of Facebook AI's robustly optimized BERT variant, with twice the layers (24 vs. 12), a wider hidden size (1,024 vs. 768), and more attention heads (16 vs. 12) than RoBERTa base. It provides stronger NLU accuracy at roughly 4x the inference compute cost of the base variant. It is used where accuracy on complex English language understanding outweighs latency constraints.

10,911,018 ↓ · 301 ♡
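
The size and compute gap between RoBERTa base and large can be approximated with back-of-the-envelope arithmetic. The sketch below assumes the published shapes (12 layers with hidden size 768 for base, 24 layers with hidden size 1,024 for large, a 50,265-token vocabulary) and the common 12·H² per-layer approximation; `encoder_params` is an illustrative helper, and the totals are estimates rather than exact counts.

```python
def encoder_params(layers, hidden, vocab_size):
    """Rough parameter count, split into transformer layers and token embeddings."""
    # Per layer: 4*H^2 for attention (Q, K, V, output projection) plus
    # 8*H^2 for the feed-forward block with a 4*H inner size.
    # Biases, LayerNorm, and position embeddings are ignored.
    per_layer = 12 * hidden * hidden
    embeddings = vocab_size * hidden
    return layers * per_layer, embeddings

base_layers, base_emb = encoder_params(12, 768, 50265)
large_layers, large_emb = encoder_params(24, 1024, 50265)

base_total = base_layers + base_emb        # about 124M
large_total = large_layers + large_emb     # about 353M
layer_ratio = large_layers / base_layers   # about 3.6
```

The layer parameters dominate per-token compute, so the ratio of about 3.6 is in line with the "roughly 4x" figure quoted for the large variant.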

distilbert-base-uncased

DistilBERT-base-uncased is a distilled version of BERT-base-uncased, 40% smaller and 60% faster while retaining approximately 97% of BERT's language understanding performance on the GLUE benchmark. Trained via knowledge distillation from BERT using BookCorpus and Wikipedia. Commonly used when BERT's performance is needed but inference speed or resource constraints are limiting factors.

8,940,200 ↓ · 900 ♡
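
Knowledge distillation trains the smaller student to match the teacher's output distribution, not just the correct label. The sketch below shows only the soft-target term, a KL divergence between temperature-softened distributions. The logits and the temperature of 2.0 are illustrative values, and DistilBERT's full training objective combines this term with others.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]
matched = soft_target_loss([4.0, 1.0, 0.5], teacher)     # student agrees with teacher
mismatched = soft_target_loss([0.5, 1.0, 4.0], teacher)  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.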

ModernBERT-base

ModernBERT-base performs masked token prediction with English support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

8,281,489 ↓ · 1,058 ♡

xlm-roberta-large

XLM-RoBERTa Large, the 560-million-parameter multilingual encoder from Facebook AI, trained on 2.5TB of CommonCrawl data across 100 languages. It offers stronger multilingual language understanding than the base variant across classification, NER, and cross-lingual tasks, at roughly 4x the compute cost. MIT licensed with multi-framework support.

6,964,988 ↓ · 519 ♡

bert-large-uncased

bert-large-uncased performs masked token prediction with English support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

5,898,810 ↓ · 147 ♡

Bio_ClinicalBERT

Bio_ClinicalBERT is a BERT-base model initialized from BioBERT, which was pre-trained on biomedical literature (PubMed), and then further pre-trained on MIMIC-III clinical notes. It produces contextual representations tuned for both biomedical and clinical language.

4,421,241 ↓ · 432 ♡

bert-base-multilingual-uncased

BERT-base-multilingual-uncased is Google's multilingual BERT trained on Wikipedia text from 102 languages with all text lowercased before tokenization. Lowercasing simplifies processing but removes capitalization signals that help named entity recognition. It produces 768-dimensional embeddings shared across all supported languages.

4,113,767 ↓ · 157 ♡

esm2_t33_650M_UR50D

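esm2_t33_650M_UR50D is the 33-layer, 650M-parameter member of Meta's ESM-2 family of protein language models, trained with masked language modeling on UniRef50 protein sequences. It predicts masked amino acid residues from bidirectional context, and its encoder representations are widely used as starting points for fine-tuning on protein tasks.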

3,677,815 ↓ · 81 ♡

bert-base-cased

Google's BERT base model in cased form, pre-trained on BookCorpus and English Wikipedia with original case preserved. Unlike bert-base-uncased, this model maintains distinctions between 'bert' and 'BERT' — essential for tasks where capitalization carries semantic information, such as named entity recognition. Same architecture as bert-base-uncased but with case-sensitive tokenization.

3,552,602 ↓ · 361 ♡
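
The difference between cased and uncased checkpoints comes down to text normalization before tokenization. The sketch below approximates the uncased path (lowercasing plus accent stripping) using only the standard library. The function names are illustrative, and a real tokenizer applies further steps such as WordPiece splitting.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase and strip accents, approximating uncased BERT preprocessing."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category Mn) left by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def cased_normalize(text):
    """Cased checkpoints keep the text as written."""
    return text

pairs = [("BERT", "bert"), ("Apple", "apple"), ("Café", "cafe")]
collapsed = [uncased_normalize(a) == uncased_normalize(b) for a, b in pairs]
preserved = [cased_normalize(a) != cased_normalize(b) for a, b in pairs]
```

Every pair collapses to one form under uncased normalization and stays distinct under the cased path, which is why cased models can tell "Apple" the company from "apple" the fruit.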

bert-base-multilingual-cased

BERT-base-multilingual-cased is Google's multilingual BERT trained on Wikipedia data from 104 languages with case preserved, making it better suited than the uncased variant for named entity recognition and tasks where capitalization carries semantic meaning. It uses the same 12-layer Transformer architecture and 768-dimensional hidden size as BERT-base, with a larger vocabulary shared across languages. Despite its age, it remains a common transfer learning starting point for multilingual tasks.

3,514,834 ↓ · 593 ♡

deberta-v3-base

DeBERTa-v3-base uses disentangled attention and ELECTRA-style replaced-token-detection pre-training on English text, achieving state-of-the-art NLU results for a BERT-base-scale model at the time of release. It outperforms RoBERTa-base on GLUE tasks.

2,790,881 ↓ · 429 ♡

mdeberta-v3-base

mdeberta-v3-base fills in [MASK] positions in a sentence by attending to both left and right context. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

2,102,238 ↓ · 225 ♡

distilroberta-base

distilroberta-base is a distilled version of RoBERTa-base that fills in <mask> positions (the RoBERTa mask token, rather than BERT's [MASK]) in a sentence by attending to both left and right context. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

1,832,174 ↓ · 177 ♡

esm2_t12_35M_UR50D

ESM2-t12-35M is Meta's 35M parameter protein language model from the ESM2 family, pre-trained on the UniRef50 database of protein sequences. It generates protein residue embeddings for downstream structure prediction, function annotation, and variant effect prediction tasks. MIT-licensed.

1,616,072 ↓ · 23 ♡

bert-large-portuguese-cased

BERTimbau-large is a Portuguese BERT-large model pretrained from scratch on a 2.7B-word Portuguese corpus. It provides strong contextual representations for Brazilian and European Portuguese NLP tasks.

1,591,238 ↓ · 73 ♡

camembert-base

camembert-base is a French masked language model based on the RoBERTa architecture that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

1,110,644 ↓ · 102 ♡

ESMC-6B

ESMC-6B is EvolutionaryScale's 6B-parameter protein language model, pre-trained on diverse protein sequences with a masked language modeling objective. It generates high-quality residue-level embeddings suitable for variant effect prediction, protein engineering, and transfer learning to downstream structure or function tasks. The ESM C architecture focuses on sequence understanding rather than structure prediction.

1,048,286 ↓ · 17 ♡

japanese-roberta-base

japanese-roberta-base is Rinna's Japanese RoBERTa-base, pre-trained on Japanese Common Crawl and Wikipedia using the masked language modeling objective. Unlike multilingual models, it uses a SentencePiece tokenizer whose vocabulary was trained on Japanese text, improving token efficiency on Japanese input. It is intended as a foundation for fine-tuning on Japanese NLP classification and NER tasks.

1,007,749 ↓ · 39 ♡

BiomedNLP-BiomedBERT-base-uncased-abstract

BiomedNLP-BiomedBERT-base-uncased-abstract is a BERT masked language model that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

953,746 ↓ · 94 ♡

legal-bert-base-cased-ptbr

legal-bert-base-cased-ptbr is a BERT-base model pre-trained on Brazilian Portuguese legal text — legislation, court decisions, and official government publications. It addresses the gap in Brazilian legal NLP where standard Portuguese BERT models (BERTimbau) lack the specialised legal vocabulary of the Brazilian judiciary. Downstream tasks require fine-tuning on labelled Brazilian legal datasets.

849,604 ↓ · 15 ♡

deberta-v3-large

deberta-v3-large is a DeBERTa masked language model that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

848,941 ↓ · 281 ♡

esm2_t6_8M_UR50D

esm2_t6_8M_UR50D is the smallest ESM-2 protein language model, with 6 layers and about 8M parameters, and performs masked prediction of amino acid residues. Its embeddings suit protein tasks such as function annotation and variant effect prediction after fine-tuning, rather than natural-language tasks.

828,355 ↓ · 35 ♡

graphcodebert-base

graphcodebert-base is Microsoft Research's code-aware BERT variant that incorporates data-flow graphs from source code alongside token sequences during pre-training. Unlike CodeBERT, which treats code as flat text, GraphCodeBERT explicitly models data-flow dependencies between variables, improving performance on code search and clone detection tasks. It supports six programming languages from the CodeSearchNet benchmark.

765,341 ↓ · 90 ♡

ModernBERT-large

ModernBERT-large is a 395M-parameter encoder-only model from Answer.AI that updates BERT's architecture with Flash Attention, rotary position embeddings, and an extended 8,192-token context. It aims to be a drop-in improvement over BERT-large for masked language modeling and downstream encoder tasks. Apache-2.0 licensed.

695,823 ↓ · 472 ♡

deberta-v3-small

deberta-v3-small is a DeBERTa masked language model that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

671,613 ↓ · 77 ♡

esm2_t30_150M_UR50D

ESM-2 is Meta's protein language model trained on UniRef50, treating amino acid sequences analogously to text tokens. The t30_150M variant has 30 transformer layers at 150M total parameters, offering a practical balance between representation quality and inference speed. ESM-2 embeddings are widely used as features for protein function prediction, structure-adjacent tasks, and zero-shot fitness scoring.

652,353 ↓ · 10 ♡
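
Zero-shot fitness scoring with a protein masked language model is often done by masking one position and comparing the probability assigned to the mutant residue against the wild-type residue. The sketch below shows that log-odds calculation. The probabilities are hypothetical stand-ins for a model's output, and `mutation_score` is an illustrative helper rather than part of the ESM codebase.

```python
import math

def mutation_score(log_probs, wild_type, mutant):
    """Log-odds of the mutant vs. wild-type residue at one masked position."""
    return log_probs[mutant] - log_probs[wild_type]

# Hypothetical model output for a masked position whose wild-type residue is "L".
probs = {"L": 0.60, "I": 0.25, "V": 0.10, "P": 0.05}
log_probs = {aa: math.log(p) for aa, p in probs.items()}

conservative = mutation_score(log_probs, "L", "I")  # leucine -> isoleucine
disruptive = mutation_score(log_probs, "L", "P")    # leucine -> proline
```

A score near zero means the model finds the substitution about as plausible as the original residue; strongly negative scores flag substitutions the model considers unlikely.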

bert-base-arabertv02

bert-base-arabertv02 fills in [MASK] positions in a sentence by attending to both left and right context. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

611,979 ↓ · 45 ♡

bert-base-chinese

bert-base-chinese performs masked token prediction with Chinese support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

603,579 ↓ · 1,437 ♡

esm2_t36_3B_UR50D

esm2_t36_3B_UR50D is the 36-layer, 3B-parameter member of Meta's ESM-2 family of protein language models. It predicts masked amino acid residues using bidirectional context, and its encoder representations are widely used as starting points for fine-tuning on protein tasks.

591,847 ↓ · 32 ♡

distilbert-base-multilingual-cased

distilbert-base-multilingual-cased performs masked token prediction with multilingual coverage. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

582,693 ↓ · 244 ♡

Clinical-Longformer

Clinical-Longformer is a Longformer model adapted to clinical text that fills in <mask> positions by attending to both left and right context over long documents. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

526,744 ↓ · 69 ♡

bert-base-german-cased

bert-base-german-cased performs masked token prediction with German support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

502,205 ↓ · 82 ♡

Bio_Discharge_Summary_BERT

Bio_Discharge_Summary_BERT is a BERT model pre-trained on clinical discharge summaries from MIMIC-III, providing biomedical domain adaptation specifically for clinical documentation language. It captures the informal, fragmented style of clinical notes better than PubMedBERT trained on abstracts. MIT-licensed.

456,666 ↓ · 38 ♡

bert-base-spanish-wwm-uncased

bert-base-spanish-wwm-uncased performs masked token prediction with Spanish support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

436,707 ↓ · 75 ♡

albert-base-v2

albert-base-v2 performs masked token prediction with English support. The trained encoder captures deep contextual representations suitable for named entity recognition, text classification, and similarity tasks after fine-tuning.

436,096 ↓ · 142 ♡

juribert-base

JuriBERT-base is a BERT-base model pre-trained from scratch on French legal text, making it the primary French-language masked LM for legal NLP tasks. Standard French BERT models trained on general web text perform poorly on legal vocabulary and sentence structures; JuriBERT addresses this by training exclusively on French legal corpora including legislation, jurisprudence, and legal commentary.

420,147 ↓ · 0 ♡

bert-base-japanese

bert-base-japanese is Tohoku NLP Lab's Japanese BERT-base, pre-trained on Japanese Wikipedia. Text is first segmented into words with MeCab and then split into WordPiece subwords.

392,439 ↓ · 41 ♡

bert-base-japanese-whole-word-masking

Tohoku NLP Lab's Japanese BERT-base trained with whole-word masking on Japanese Wikipedia. Whole-word masking masks all subword pieces of a word together instead of masking pieces independently, so the model cannot recover a masked piece from the visible remainder of the same word.

386,282 ↓ · 76 ♡
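
Whole-word masking can be sketched in a few lines. The example below assumes WordPiece-style tokens where a leading "##" marks a continuation piece, and makes one masking decision per word. It is a simplified illustration: the helper names are hypothetical, the 0.5 masking rate is chosen only to make the effect visible, and real BERT pre-training also sometimes substitutes random or unchanged tokens.

```python
import random

def group_word_pieces(tokens):
    """Group WordPiece token indices into words; '##' marks a continuation."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words

def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: every piece of a chosen word is replaced together."""
    rng = random.Random(seed)
    masked = list(tokens)
    for word in group_word_pieces(tokens):
        # One draw per word: either every piece is masked or none is.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

tokens = ["token", "##ization", "is", "un", "##avoid", "##able"]
masked = whole_word_mask(tokens, mask_prob=0.5, seed=1)
```

For Japanese, the word boundaries would come from a morphological analyzer rather than whitespace, but the grouping logic is the same.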

PetBERT

PetBERT is an open-source fill-mask model available on HuggingFace. Details are sourced from the public model registry.

381,689 ↓ · 5 ♡

dummy-unknown

dummy-unknown is a RoBERTa masked language model that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

371,013 ↓ · 1 ♡

roberta-base

roberta-base is a RoBERTa masked language model that predicts missing tokens using bidirectional context. Its encoder representations are widely used as starting points for fine-tuning.

363,527 ↓ · 48 ♡

distilbert-base-german-cased

distilbert-base-german-cased is an open-source fill-mask model available on HuggingFace. Details are sourced from the public model registry.

351,457 ↓ · 25 ♡

BiomedVLP-CXR-BERT-specialized

BiomedVLP-CXR-BERT-specialized is an open-source fill-mask model available on HuggingFace. Details are sourced from the public model registry.

346,156 ↓ · 36 ♡

ChemBERTa-77M-MLM

ChemBERTa-77M-MLM is an open-source fill-mask model available on HuggingFace. Details are sourced from the public model registry.

342,965 ↓ · 26 ♡

mmBERT-base

mmBERT-base is an open-source fill-mask model available on HuggingFace. Details are sourced from the public model registry.

341,653 ↓ · 214 ♡

prot_bert

prot_bert is Rostlab's ProtBert, a BERT model pre-trained with masked language modeling on protein sequences, where each amino acid is treated as a token.

333,974 ↓ · 134 ♡

twitter-xlm-roberta-base

twitter-xlm-roberta-base is XLM-RoBERTa-base with continued pre-training on multilingual Twitter data by Cardiff NLP. It serves as the base for sentiment, topic, and other social-media classification tasks, with follow-on task-specific checkpoints available in the Cardiff NLP organization.

301,734 ↓ · 19 ♡

bert-base-portuguese-cased

bert-base-portuguese-cased is BERTimbau Base, the base-size counterpart of BERTimbau-large, pre-trained on Portuguese text with case preserved.

300,876 ↓ · 229 ♡

chinese-bert-wwm-ext

chinese-bert-wwm-ext is a Chinese BERT-base from HFL pre-trained with whole-word masking on an extended corpus that goes beyond Chinese Wikipedia.

298,910 ↓ · 193 ♡

kcbert-base

KcBERT-base is a BERT-base model pre-trained on Korean news comments (Naver댓글, Daum댓글) collected from 2019-2020, giving it strong coverage of informal Korean internet language, slang, and emoticons. Unlike KoBERT trained on formal Korean text, KcBERT targets social media and user-generated content NLP tasks where colloquial Korean is predominant.

244,434 ↓ · 31 ♡