Facebook Research Unveils Tuna-2: A Unified Multimodal AI Model That Ditches Traditional Vision Encoders for Direct Pixel Processing
Summary
Facebook Research has unveiled Tuna-2, a groundbreaking multimodal AI model that replaces traditional vision encoders with direct pixel-patch processing. The model outperforms its predecessors across diverse benchmarks, supports both image understanding and generation tasks, and is available in 7B and 2B parameter sizes.
Key Points
- Facebook Research releases Tuna-2, a unified multimodal model that replaces traditional vision encoders with direct pixel patch embeddings, outperforming its predecessors, Tuna and Tuna-R, across diverse multimodal benchmarks (a sketch of the patch-embedding idea follows this list).
- Tuna-2 supports both image understanding and generation tasks, and a single unified inference script serves multiple model variants and resolutions, with 7B and 2B parameter sizes available.
- Due to organizational policy constraints, the fully production-trained model weights cannot be released; instead, the team plans a foundation checkpoint with a small number of layers removed, along with a complete video generation codebase that researchers can use to train their own models.
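To make the encoder-free design concrete, the sketch below shows how direct pixel patch embedding typically works in ViT-style models: the image is sliced into fixed-size patches and each patch is linearly projected into the language model's token space, with no pretrained vision encoder in between. The announcement does not describe Tuna-2's internals, so the class name, patch size, and embedding dimension here are illustrative assumptions, not details of the actual model.

```python
# Minimal sketch of encoder-free pixel patch embedding; all names and
# dimensions are hypothetical, not taken from the Tuna-2 release.
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Maps raw pixels straight to transformer tokens, replacing a
    separate pretrained vision encoder."""

    def __init__(self, patch_size: int = 16, in_channels: int = 3,
                 embed_dim: int = 2048):
        super().__init__()
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and applying one shared linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, channels, height, width)
        x = self.proj(pixels)             # (batch, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)
        return x  # token sequence, ready to interleave with text embeddings

# Example: a 224x224 image becomes 196 patch tokens the language model
# can consume directly alongside text tokens.
tokens = PixelPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 2048])
```

One practical consequence of this design is that image resolution only changes the number of tokens, not the architecture, which is consistent with a single inference script serving multiple resolutions.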