GLM-OCR is a multilingual OCR and document understanding model from ZhipuAI, built on the GLM architecture and supporting text recognition across Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. It treats OCR as a sequence generation task, enabling structured text extraction from document images and screenshots. MIT licensed.
3,227,776 ↓ · 1,844 ♡
BLIP (Bootstrapped Language-Image Pretraining) base model for image captioning, using a vision encoder connected to a decoder via cross-attention. It introduced a bootstrapping approach that filters noisy web-crawled image-text pairs during training.
2,139,357 ↓ · 861 ♡
PP-OCRv5_server_det is the server-grade text detection model in PaddleOCR's PP-OCRv5 series. It locates text regions in an image and outputs their bounding polygons for a separate recognition model to transcribe; it does not generate text itself.
672,092 ↓ · 69 ♡
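blip-image-captioning-large is the large variant of Salesforce's BLIP image captioning model. It generates natural-language captions for images, either unconditionally or as a continuation of a supplied text prompt; it is a captioning model rather than an OCR system.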
661,400 ↓ · 1,475 ♡
UVDoc is a document image unwarping model packaged in Baidu's PaddleOCR, designed to correct perspective distortions and page curling in scanned documents before OCR. It uses a PaddlePaddle backend rather than PyTorch or JAX. Supports Chinese and English documents. Apache-2.0 licensed.
520,343 ↓ · 11 ♡
PP-LCNet_x1_0_doc_ori is a lightweight document orientation classifier from PaddleOCR that determines whether a scanned document page is upright, rotated 90°, 180°, or 270°. It is a pre-processing component in PaddleOCR's document digitalisation pipeline, ensuring OCR models receive correctly oriented input. The x1.0 scale balances classification speed and accuracy for batch document processing.
462,957 ↓ · 16 ♡
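manga-ocr-base is an open-source OCR model for Japanese text, with a main focus on manga. It is built on the Transformers Vision Encoder Decoder framework and handles vertical and horizontal text, text with furigana, and text overlaid on images.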
432,344 ↓ · 176 ♡
trocr-base-printed is Microsoft's base-sized TrOCR model, fine-tuned on the SROIE receipt dataset for printed text. It pairs an image Transformer encoder with an autoregressive text Transformer decoder and expects single text-line images as input.
415,165 ↓ · 208 ♡
trocr-small-handwritten is Microsoft's small-sized TrOCR model, fine-tuned on the IAM handwriting dataset. It pairs an image Transformer encoder with an autoregressive text Transformer decoder and expects single text-line images as input.
370,496 ↓ · 63 ♡
This is the PP-OCRv5 mobile recognition model from Baidu's PaddlePaddle, for English text recognition in OCR pipelines. It is optimized for mobile deployment with a lightweight backbone while targeting competitive recognition accuracy on printed and scene text.
346,891 ↓ · 2 ♡
Nougat is Meta's document understanding model that converts scientific PDFs (including LaTeX equations, tables, and figures) into structured Markdown text. It uses a vision encoder to process PDF page images and a text decoder to produce formatted output.
313,087 ↓ · 189 ♡
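blip2-opt-2.7b-coco is Salesforce's BLIP-2 model paired with the OPT-2.7b language model and fine-tuned on COCO. A Querying Transformer (Q-Former) bridges a frozen image encoder and the frozen language model, supporting image captioning and prompted visual question answering.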
310,532 ↓ · 11 ♡
pix2text-mfr is the open-source mathematical formula recognition (MFR) model from the Pix2Text project. It converts images of mathematical formulas into LaTeX.
297,733 ↓ · 54 ♡