Tarsier2-Recap-7b

Tarsier2-Recap-7B is a 7B video-language model from ByteDance Research specialized in generating dense, temporally grounded captions of video content. It extends Tarsier2's visual backbone with a recaptioning training objective, producing longer and more detailed video descriptions than general-purpose VLMs. Intended for video dataset annotation and video-to-text retrieval pipelines.

Use cases

  • Automated video description for training data curation pipelines
  • Dense captioning for video accessibility and descriptive audio tracks
  • Generating text representations of video for downstream retrieval systems
  • Synthetic training data generation for video-language model research

Pros

  • Produces temporally detailed captions beyond simple scene-level labels
  • Apache 2.0 license enables commercial use in data processing pipelines
  • 7B scale runs on a single 24GB GPU for batch captioning workloads

Cons

  • Lacks an official pipeline_tag, so HuggingFace tooling cannot auto-route it to a standard task
  • Inference requires custom loading code rather than a standard transformers pipeline
  • Not suitable for real-time or streaming video captioning at low latency

FAQ

What is Tarsier2-Recap-7b used for?

Tarsier2-Recap-7b is used for automated video description in training-data curation pipelines, dense captioning for video accessibility and descriptive audio tracks, generating text representations of video for downstream retrieval systems, and synthetic training-data generation for video-language model research.

Is Tarsier2-Recap-7b free to use?

Yes. Tarsier2-Recap-7b is an open-source model published on HuggingFace under the Apache 2.0 license, which permits commercial use. Check the model card for the full license terms.

How do I run Tarsier2-Recap-7b locally?

Tarsier2-Recap-7b does not ship a standard transformers pipeline, so it must be loaded with the custom code described on its model card. See the model card for framework-specific instructions and hardware requirements; at 7B parameters, fp16 inference fits on a single 24 GB GPU for batch captioning.
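As a rough sketch of what that custom loading typically looks like: the repo id, the Auto* loading classes, and the frame-sampling helper below are assumptions based on common HuggingFace conventions, not the model's official API. Consult the model card for the authoritative code.

```python
# Hedged sketch: setting up Tarsier2-Recap-7b for offline batch captioning.
# The repo id and loading classes are assumptions; the model card's own
# custom code (enabled via trust_remote_code) is the authoritative path.

def sample_frame_indices(num_frames: int, num_samples: int = 16) -> list[int]:
    """Uniformly sample frame indices from a decoded video.

    Video-language models typically consume a fixed number of frames,
    so long clips are subsampled evenly before preprocessing.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def load_tarsier(model_id: str = "omni-research/Tarsier2-Recap-7b"):
    """Load model and processor using the repo's custom code.

    fp16 weights of a 7B model fit on a single 24 GB GPU.
    (model_id is an assumed HuggingFace repo id; verify on the Hub.)
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    return model, processor

if __name__ == "__main__":
    # Example: pick 16 evenly spaced frames from a 300-frame clip.
    print(sample_frame_indices(300, 16))
```

Because there is no official pipeline_tag, `pipeline(...)` auto-routing will not work; `trust_remote_code=True` is what pulls in the repo's custom model and processor classes instead.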

Tags

safetensors · video LLM · arxiv:2501.07888 · license:apache-2.0 · region:us