Tencent AI Lab Launches Penguin-VL: A Compact Vision-Language Model That Ditches Traditional Visual Encoders for LLM-Based Architecture
Summary
Tencent AI Lab has launched Penguin-VL, a compact vision-language model family that replaces traditional visual encoders with an architecture initialized from a text-only LLM, delivering stronger fine-grained visual understanding and efficient long-video processing. Two model variants are now live on Hugging Face.
Key Points
- Penguin-VL is a compact vision-language model family from Tencent AI Lab that improves multimodal efficiency by replacing traditional CLIP/SigLIP vision encoders with a novel encoder initialized directly from a text-only LLM, yielding tighter alignment with the language model's representations and stronger fine-grained visual understanding.
- The system introduces three key techniques: LLM-to-vision-encoder initialization with bidirectional attention and 2D-RoPE; mixed-supervision pretraining that combines reconstruction and distillation losses; and Temporal Redundancy-Aware token compression for efficient long-video processing.
- Two model variants, Penguin-VL-2B and Penguin-VL-8B, are now live on Hugging Face alongside the Penguin Vision Encoder, with inference code, a vLLM plugin, and a Gradio demo publicly available; the training data and training code have not yet been released.
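To make the 2D-RoPE point above concrete, here is a minimal illustrative sketch (not Penguin-VL's actual implementation, whose details are not given in this write-up): rotary position embeddings are applied along two axes of the patch grid, with half of each token's feature channels rotated by its row index and the other half by its column index.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate channel pairs of x by
    position-dependent angles. x: (num_tokens, dim), dim even; pos: (num_tokens,)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = pos[:, None] * inv_freq[None, :]                # (num_tokens, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D-RoPE sketch: first half of channels encodes the row position,
    second half encodes the column position."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:, :half], rows),
                           rope_1d(x[:, half:], cols)], axis=-1)

# A 2x3 patch grid with 8-dim features per patch token.
rng = np.random.default_rng(0)
h, w, dim = 2, 3, 8
x = rng.standard_normal((h * w, dim))
rows = np.repeat(np.arange(h), w)   # row index of each patch token
cols = np.tile(np.arange(w), h)     # column index of each patch token
y = rope_2d(x, rows, cols)
# Rotations are norm-preserving, so per-token norms are unchanged.
print(np.allclose(np.linalg.norm(y, axis=-1), np.linalg.norm(x, axis=-1)))
```

Because each channel pair is only rotated, the embedding injects position information without changing token magnitudes, which is the usual appeal of RoPE over additive position embeddings.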
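The announcement does not spell out how Temporal Redundancy-Aware token compression works; one plausible reading, sketched below purely as an illustration (the function, threshold, and similarity criterion are all assumptions, not Penguin-VL's published algorithm), is to drop video patch tokens that are nearly identical to the token at the same spatial position in the previous frame.

```python
import numpy as np

def compress_video_tokens(frames, threshold=0.95):
    """Toy temporal-redundancy compression (illustrative assumption only):
    keep all tokens of the first frame, then for each later frame drop
    tokens whose cosine similarity with the same spatial position in the
    previous frame exceeds `threshold`.
    frames: (T, N, D) array of patch tokens. Returns kept (frame, token) indices."""
    T, N, D = frames.shape
    kept = [(0, n) for n in range(N)]  # first frame is always kept in full
    for t in range(1, T):
        prev, cur = frames[t - 1], frames[t]
        sims = np.sum(prev * cur, axis=-1) / (
            np.linalg.norm(prev, axis=-1) * np.linalg.norm(cur, axis=-1) + 1e-8)
        for n in np.nonzero(sims < threshold)[0]:
            kept.append((t, int(n)))
    return kept

# 3 frames of 4 tokens x 6 dims; frame 1 exactly repeats frame 0.
rng = np.random.default_rng(1)
f0 = rng.standard_normal((4, 6))
f2 = rng.standard_normal((4, 6))
frames = np.stack([f0, f0, f2])
kept = compress_video_tokens(frames)
# Frame 0 is kept whole; frame 1's tokens are all redundant and dropped.
print(sorted({t for t, n in kept}))
```

Static scenes then cost far fewer tokens than their raw frame count, which is the kind of saving that makes long-video processing tractable.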