image to text models

13 models · ranked by Hugging Face downloads

GLM-OCR

GLM-OCR is a multilingual OCR and document understanding model from ZhipuAI, built on the GLM architecture and supporting text recognition across Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. It treats OCR as a sequence generation task, enabling structured text extraction from document images and screenshots. MIT licensed.

3,227,776 ↓ · 1,844 ♡

blip-image-captioning-base

BLIP (Bootstrapping Language-Image Pre-training) base model for image captioning from Salesforce, using a vision encoder connected to a text decoder via cross-attention. It introduced a bootstrapping approach in which a captioner generates synthetic captions and a filter removes noisy web-crawled image-text pairs during training.

2,139,357 ↓ · 861 ♡
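The filtering idea in the BLIP entry above can be illustrated with a minimal sketch. BLIP's actual filter is a learned image-text matching model; the `toy_match_score` function, the `TOY_TAGS` table, and the 0.5 threshold below are hypothetical stand-ins chosen only to show the keep-or-drop step.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)

# Hypothetical stand-in for image content: a few tags per image.
TOY_TAGS = {
    "img1": {"dog", "grass"},
    "img2": {"city", "night"},
}


def toy_match_score(image_id: str, caption: str) -> float:
    """Toy score: fraction of an image's tags that appear in the caption.

    Assumption: a real system would use a learned image-text matching model here.
    """
    tags = TOY_TAGS.get(image_id, set())
    if not tags:
        return 0.0
    words = set(caption.lower().split())
    return len(words & tags) / len(tags)


def filter_pairs(
    pairs: Iterable[Pair],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep only the pairs whose match score reaches the threshold."""
    return [(img, cap) for img, cap in pairs if match_score(img, cap) >= threshold]


web_pairs = [
    ("img1", "a dog on grass"),
    ("img1", "click here to buy"),      # noisy alt-text, unrelated to the image
    ("img2", "city skyline at night"),
]
clean_pairs = filter_pairs(web_pairs, toy_match_score)
```

Only pairs whose caption plausibly describes the image survive, which is the effect the bootstrapping step aims for on web-crawled data.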

PP-OCRv5_server_det

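PP-OCRv5_server_det is the server-side text detection model in PaddleOCR's PP-OCRv5 pipeline. It locates text regions in an image and outputs their bounding boxes; it does not transcribe the text itself, which is handled by a separate recognition model.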

672,092 ↓ · 69 ♡

blip-image-captioning-large

blip-image-captioning-large is the larger BLIP image captioning checkpoint from Salesforce, using a ViT-L vision backbone. It generates natural-language captions for images, with or without a text prompt to condition on, and is not designed for OCR.

661,400 ↓ · 1,475 ♡

UVDoc

UVDoc is a document image unwarping model distributed by PaddlePaddle as part of the PaddleOCR toolchain, designed to correct perspective distortions and page curling in document images before OCR. It uses a PaddlePaddle backend rather than PyTorch or JAX. It supports Chinese and English documents. Apache-2.0 licensed.

520,343 ↓ · 11 ♡

PP-LCNet_x1_0_doc_ori

PP-LCNet_x1_0_doc_ori is a lightweight document orientation classifier from PaddleOCR that determines whether a scanned document page is upright, rotated 90°, 180°, or 270°. It is a pre-processing component in PaddleOCR's document digitalisation pipeline, ensuring OCR models receive correctly oriented input. The x1.0 scale balances classification speed and accuracy for batch document processing.

462,957 ↓ · 16 ♡
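To illustrate how an orientation prediction like the one described for PP-LCNet_x1_0_doc_ori might be consumed downstream, here is a minimal sketch that rotates a page back to upright. It works on a plain nested list standing in for pixels, and it assumes the predicted label means "the page is rotated clockwise by this many degrees"; the model's real label convention should be checked against its documentation. The function names are hypothetical.

```python
from typing import List

Grid = List[List[int]]  # row-major stand-in for an image


def rotate_cw(grid: Grid) -> Grid:
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def correct_orientation(grid: Grid, predicted_degrees: int) -> Grid:
    """Undo the rotation reported by an orientation classifier.

    Assumption: predicted_degrees is the clockwise rotation of the page
    relative to upright, one of 0, 90, 180 or 270.
    """
    if predicted_degrees not in (0, 90, 180, 270):
        raise ValueError(f"unsupported orientation: {predicted_degrees}")
    # A clockwise rotation by d is undone by a further clockwise rotation of 360 - d.
    turns = ((360 - predicted_degrees) % 360) // 90
    for _ in range(turns):
        grid = rotate_cw(grid)
    return grid


upright = [[1, 2], [3, 4]]
scanned_sideways = rotate_cw(upright)  # simulate a page scanned at 90 degrees
restored = correct_orientation(scanned_sideways, 90)
```

In a real pipeline the same correction would be applied to the image array before it is passed to text detection and recognition.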

manga-ocr-base

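manga-ocr-base is an open-source OCR model for Japanese text with a focus on manga, built on the Transformers vision encoder-decoder framework. It targets vertical and horizontal text, text with furigana, and text overlaid on images.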

432,344 ↓ · 176 ♡

trocr-base-printed

trocr-base-printed is Microsoft's base-size TrOCR model, fine-tuned on the SROIE dataset for printed text. It pairs an image Transformer encoder with an autoregressive text Transformer decoder and expects single text-line images as input.

415,165 ↓ · 208 ♡

trocr-small-handwritten

trocr-small-handwritten is Microsoft's small-size TrOCR model, fine-tuned on the IAM dataset for handwritten text. Like the other TrOCR checkpoints, it expects single text-line images as input.

370,496 ↓ · 63 ♡

en_PP-OCRv5_mobile_rec

PP-OCRv5 mobile recognition model from Baidu PaddlePaddle for English text recognition in OCR pipelines. Optimized for mobile deployment with a lightweight backbone while targeting competitive text recognition accuracy on printed and scene text.

346,891 ↓ · 2 ♡

nougat-base

Nougat is Meta's document understanding model that converts scientific PDFs (including LaTeX equations, tables, and figures) into structured Markdown text. It uses a vision encoder to process PDF page images and a text decoder to produce formatted output.

313,087 ↓ · 189 ♡

blip2-opt-2.7b-coco

blip2-opt-2.7b-coco is a BLIP-2 checkpoint from Salesforce that connects a frozen image encoder to the OPT-2.7b language model through a Querying Transformer (Q-Former). This variant is fine-tuned on COCO for image captioning.

310,532 ↓ · 11 ♡

pix2text-mfr

pix2text-mfr is the mathematical formula recognition (MFR) model from the open-source Pix2Text project. It converts images of mathematical formulas into LaTeX source.

297,733 ↓ · 54 ♡