Tsinghua University and Tencent Hunyuan Release Spatial-TTT, a Streaming Spatial Intelligence Framework Achieving State-of-the-Art Video Benchmark Results
Summary
Tsinghua University and Tencent Hunyuan have unveiled Spatial-TTT, a streaming spatial intelligence framework that uses Test-Time Training (TTT) to continuously update spatial memory from live video streams. The framework achieves state-of-the-art results on video spatial benchmarks such as VSI-Bench, and the code, a 97k-sample dataset, and a lightweight model are now publicly available.
Key Points
- Spatial-TTT is a newly released framework from Tsinghua University and Tencent Hunyuan that enables streaming vision-based spatial intelligence, using Test-Time Training to continuously update spatial memory from long-horizon video streams.
- The system features a hybrid architecture combining TTT layers with self-attention anchor layers, large-chunk sliding-window attention, and a spatial-predictive mechanism using 3D convolutions to capture geometric and temporal structure across video frames.
- Training and evaluation code, a 97k-sample spatial dataset, and a lightweight model called Spatial-TTT-nano are now publicly available on GitHub and Hugging Face, with the framework achieving state-of-the-art performance on video spatial benchmarks like VSI-Bench.
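The core idea behind Test-Time Training layers is that a small "fast-weight" memory is updated by gradient steps on a self-supervised loss as each chunk of the video stream arrives, rather than being frozen after training. The sketch below illustrates only that general mechanism: the linear memory, reconstruction loss, chunk size, and learning rate are all assumptions for illustration, not the released Spatial-TTT implementation.

```python
import numpy as np

class TTTLinearMemory:
    """Toy linear fast-weight memory updated at test time.

    Illustrative sketch of the Test-Time Training idea only; the
    objective and hyperparameters here are assumptions, not the
    Spatial-TTT architecture.
    """

    def __init__(self, dim, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((dim, dim))  # spatial memory
        self.lr = lr

    def update(self, chunk):
        # Self-supervised objective: reconstruct each token through W
        # (identity-reconstruction loss), one gradient step per chunk.
        pred = chunk @ self.W
        err = pred - chunk                      # (T, dim) residual
        grad = chunk.T @ err / len(chunk)       # dL/dW for 0.5*||err||^2
        self.W -= self.lr * grad
        return float(0.5 * np.mean(err ** 2))   # chunk loss before the step

    def read(self, query):
        # Query the continuously updated memory.
        return query @ self.W

rng = np.random.default_rng(1)
mem = TTTLinearMemory(dim=8)
# Stand-in for a long-horizon stream: 20 chunks of 16 frame tokens each.
stream = [rng.standard_normal((16, 8)) for _ in range(20)]
losses = [mem.update(chunk) for chunk in stream]
# The loss on later chunks trends downward as the memory adapts online.
```

In a hybrid design like the one described above, such TTT layers would be interleaved with ordinary self-attention "anchor" layers, so the fast-updating memory handles long-horizon state while attention preserves local fidelity within each chunk.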