Ai2 Releases Molmo2: Open-Source Vision-Language Model Capable of Video Understanding and Object Tracking

Mar 04, 2026

Summary

Ai2 has released Molmo2, a state-of-the-art open-source vision-language model capable of video understanding, object tracking, and pointing across single-image, multi-image, and video tasks. Released checkpoints range from 4B to 8B parameters, with fast inference supported via Hugging Face Transformers and vLLM.
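For video inputs, long clips are typically subsampled to a fixed frame budget before being handed to the model's processor (the long-context stage above supports up to 384 frames). Below is a minimal sketch of such uniform frame subsampling; the helper name and default are illustrative assumptions, not taken from the Molmo2 codebase:

```python
def sample_frame_indices(num_frames: int, max_frames: int = 384) -> list[int]:
    """Uniformly subsample up to `max_frames` frame indices from a clip.

    `max_frames=384` mirrors the frame budget mentioned for Molmo2's
    long-context training stage; the function itself is a generic sketch.
    """
    if num_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(num_frames))
    # Evenly spaced indices across the full clip.
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]
```

The selected indices would then be used to decode only those frames and pass them, along with the text prompt, to the model's processor.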

Key Points

  • Molmo2 is a state-of-the-art open-source vision-language model from Ai2, excelling at video understanding, pointing, and tracking across single-image, multi-image, and video tasks.
  • Training follows three stages — pre-training on image captioning and pointing, supervised fine-tuning on a full multitask mixture, and long-context SFT for extended video sequences up to 384 frames — with released checkpoints available at 4B, 7B, and 8B model sizes.
  • The codebase supports fast inference via Hugging Face Transformers and vLLM, includes scripts for downloading datasets and pretrained models, and features an efficient data pipeline with message trees, sequence packing, and context parallelism for long-sequence training.
