Tencent AI Lab Launches Penguin-VL: A Compact Vision-Language Model That Ditches Traditional Visual Encoders for LLM-Based Architecture

Mar 10, 2026

Summary

Tencent AI Lab launches Penguin-VL, a compact vision-language model that ditches traditional visual encoders in favor of a text-LLM-initialized architecture, delivering stronger fine-grained visual understanding and efficient long-video processing, with two model variants now live on Hugging Face.

Key Points

  • Penguin-VL is a compact vision-language model family from Tencent AI Lab that improves multimodal efficiency by replacing the traditional CLIP/SigLIP vision encoder with one initialized directly from a text-only LLM, yielding closer alignment with the language model's representations and stronger fine-grained visual understanding.
  • The system introduces three key techniques: LLM-to-vision-encoder initialization with bidirectional attention and 2D-RoPE, mixed-supervision pretraining combining reconstruction and distillation losses, and Temporal Redundancy-Aware token compression for efficient long-video processing.
  • Two model variants, Penguin-VL-2B and Penguin-VL-8B, are now live on Hugging Face alongside the Penguin Vision Encoder, with inference code, a vLLM plugin, and a Gradio demo all publicly available, while training data and training code are still pending release.
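The article names 2D-RoPE but does not specify its construction. A common variant (a minimal sketch, not necessarily Penguin-VL's exact scheme) splits each patch token's channels in half and applies a standard 1D rotary embedding to one half using the patch's row index and to the other half using its column index:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding on x of shape (n, d) at integer positions pos."""
    n, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = np.outer(pos, freqs)                      # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D-RoPE sketch: rotate one channel half by row index, the other by column index."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[:, : d // 2], rows), rope_1d(x[:, d // 2 :], cols)], axis=-1
    )

# Example: 4 patch tokens from a 2x2 grid, 8-dim features.
x = np.random.randn(4, 8)
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 1])
out = rope_2d(x, rows, cols)
```

Because rotary embeddings only rotate channel pairs, each token's norm is preserved, and the patch at position (0, 0) is left unchanged.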
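The mixed-supervision pretraining objective is described only as combining reconstruction and distillation losses. One plausible combination (all names, the MSE choices, and the weighting term `lam` are assumptions, since the article gives no formula) is:

```python
import numpy as np

def mixed_supervision_loss(student_feats, teacher_feats, recon, pixels, lam=1.0):
    """Hypothetical combined objective: pixel-reconstruction MSE plus
    feature-distillation MSE against a frozen teacher encoder."""
    l_recon = np.mean((recon - pixels) ** 2)            # reconstruction term
    l_distill = np.mean((student_feats - teacher_feats) ** 2)  # distillation term
    return l_recon + lam * l_distill
```

Under this reading, the reconstruction term grounds the LLM-initialized encoder in raw pixels while the distillation term transfers knowledge from a pretrained vision teacher.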
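Temporal Redundancy-Aware token compression is likewise only named. A minimal sketch of the general idea (the function name, the cosine-similarity criterion, and the threshold are all assumptions) keeps the first frame's tokens and drops any later patch token that is nearly identical to the same patch in the previous frame:

```python
import numpy as np

def prune_redundant_tokens(frames, threshold=0.95):
    """Drop patch tokens whose cosine similarity to the same patch in the
    previous frame exceeds `threshold`. frames: (T, P, D) array of T frames,
    P patch tokens each, D dims. Returns kept tokens and the boolean keep mask."""
    T, P, D = frames.shape
    keep = np.ones((T, P), dtype=bool)   # first frame is always kept in full
    for t in range(1, T):
        a, b = frames[t], frames[t - 1]
        sim = (a * b).sum(-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        )
        keep[t] = sim < threshold
    return frames[keep], keep

# Example: frame 1 duplicates frame 0, so its tokens are pruned; frame 2 differs.
frames = np.zeros((3, 2, 4))
frames[0] = [[1, 0, 0, 0], [0, 1, 0, 0]]
frames[1] = frames[0]
frames[2] = [[0, 0, 1, 0], [0, 0, 0, 1]]
kept, mask = prune_redundant_tokens(frames)
```

Pruning near-duplicate tokens this way shrinks the token sequence roughly in proportion to how static the video is, which is what makes long-video processing cheaper.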
