DFN5B-CLIP-ViT-H-14-378

Apple's DFN5B CLIP model: a ViT-H/14 image encoder trained at 378×378px input, pretrained on DFN-5B, a set of five billion image-text pairs selected from noisy web data with Apple's Data Filtering Networks (DFN) methodology. DFN filtering scores candidate pairs with a trained filtering network and keeps only the best-aligned ones, improving model quality per unit of training compute compared to unfiltered pretraining. Released under the Apple AMLR license.
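As a rough illustration of the filtering idea (not Apple's actual pipeline), a filtering network scores how well each caption matches its image, and only the top-scoring fraction of the noisy pool is kept for pretraining. The sketch below stands in an off-the-shelf OpenCLIP model as the scorer; the model choice, file names, and threshold are illustrative assumptions:

    import torch
    import open_clip
    from PIL import Image

    # Hypothetical sketch of DFN-style filtering: score each web-crawled
    # image-text pair with a filtering network, keep only well-aligned pairs.
    # An off-the-shelf OpenCLIP model stands in for the filtering network;
    # Apple trains dedicated filtering networks for this scoring task.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    @torch.no_grad()
    def alignment_score(image_path: str, caption: str) -> float:
        """Cosine similarity between the image and caption embeddings."""
        image = preprocess(Image.open(image_path)).unsqueeze(0)
        text = tokenizer([caption])
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item()

    # Keep pairs whose score clears a threshold chosen so that the desired
    # fraction of the pool survives (the threshold here is illustrative).
    pool = [("img_001.jpg", "a dog playing fetch"), ("img_002.jpg", "click to win $$$")]
    filtered = [(p, c) for p, c in pool if alignment_score(p, c) > 0.25]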

Use cases

  • High-quality image embedding for visual similarity and retrieval tasks (see the retrieval sketch after this list)
  • Zero-shot image classification with improved accuracy over standard CLIP variants
  • Research into data filtering effects on CLIP pretraining quality
  • Multimodal embedding extraction as a backbone for downstream tasks
  • Benchmarking data-filtered vs. standard web-scraped CLIP pretraining
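
A minimal retrieval sketch using open_clip, assuming the Hub repo id apple/DFN5B-CLIP-ViT-H-14-378 (inferred from the model name) and illustrative local image files:

    import torch
    import torch.nn.functional as F
    import open_clip
    from PIL import Image

    # Load the DFN5B CLIP model and its preprocessing transform from the Hub.
    # Repo id assumed from the model name; confirm it on the model card.
    model, preprocess = open_clip.create_model_from_pretrained(
        "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378"
    )
    model.eval()

    @torch.no_grad()
    def embed_images(paths):
        """Unit-normalized image embeddings, so dot product = cosine similarity."""
        batch = torch.stack([preprocess(Image.open(p)) for p in paths])
        return F.normalize(model.encode_image(batch), dim=-1)

    # Rank a small gallery against a query image by cosine similarity.
    gallery = embed_images(["a.jpg", "b.jpg", "c.jpg"])
    query = embed_images(["query.jpg"])
    scores = query @ gallery.T           # shape (1, 3)
    best = scores.argmax(dim=-1).item()  # index of the closest gallery image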

Pros

  • Data Filtering Networks pretraining improves quality per unit of compute over unfiltered CLIP
  • ViT-H/14 at 378×378px input provides strong visual representations
  • OpenCLIP (open_clip) compatibility for standard inference pipelines
  • PyTorch weights available; the arXiv paper (2309.17425) documents the DFN methodology

Cons

  • Apple AMLR license, not Apache/MIT; requires legal review before commercial use
  • No pipeline_tag on the Hub; inference requires open_clip or custom PyTorch code
  • ViT-H/14 scale requires significant GPU memory for inference (see the half-precision sketch after this list)
  • No native HuggingFace Transformers pipeline integration
  • Smaller community adoption than OpenAI CLIP variants
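
If GPU memory is tight, one common mitigation is loading the weights in half precision. A minimal sketch using open_clip's precision argument; the repo id is assumed from the model name:

    import open_clip

    # Sketch: load the ViT-H/14 weights in fp16 on the GPU to roughly halve
    # the memory footprint versus fp32. Repo id assumed from the model name.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378",
        precision="fp16",
        device="cuda",
    )
    model.eval()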

FAQ

What is DFN5B-CLIP-ViT-H-14-378 used for?

Primarily high-quality image embedding for visual similarity and retrieval, and zero-shot image classification with improved accuracy over standard CLIP variants. It also supports research into how data filtering affects CLIP pretraining quality, serves as a multimodal embedding backbone for downstream tasks, and provides a reference point for benchmarking data-filtered against standard web-scraped CLIP pretraining.

Is DFN5B-CLIP-ViT-H-14-378 free to use?

The weights are free to download from HuggingFace, but the model is released under the Apple AMLR license rather than a permissive license such as Apache-2.0 or MIT. Review the license terms on the model card before any commercial use.

How do I run DFN5B-CLIP-ViT-H-14-378 locally?

This model has no native HuggingFace Transformers pipeline; load it with the open_clip (OpenCLIP) library instead. A ViT-H/14 encoder at 378×378px input is large, so plan for a GPU with substantial memory. See the model card for exact hardware requirements; a minimal inference sketch follows.
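
A minimal zero-shot classification sketch using open_clip, assuming the Hub repo id apple/DFN5B-CLIP-ViT-H-14-378 (inferred from the model name) and an illustrative local image and label set:

    import torch
    import torch.nn.functional as F
    import open_clip
    from PIL import Image

    repo = "hf-hub:apple/DFN5B-CLIP-ViT-H-14-378"  # assumed repo id
    model, preprocess = open_clip.create_model_from_pretrained(repo)
    tokenizer = open_clip.get_tokenizer(repo)
    model.eval()

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    text = tokenizer(labels)

    with torch.no_grad():
        img = F.normalize(model.encode_image(image), dim=-1)
        txt = F.normalize(model.encode_text(text), dim=-1)
        # Scaled cosine similarities softmaxed into label probabilities.
        probs = (100.0 * img @ txt.T).softmax(dim=-1)

    print(dict(zip(labels, probs.squeeze(0).tolist())))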

Tags

open_clip · pytorch · clip · arxiv:2309.17425 · license:apple-amlr · region:us