Tsinghua University and Tencent Hunyuan Release Spatial-TTT, a Streaming Spatial Intelligence Framework Achieving State-of-the-Art Video Benchmark Results
Summary
Tsinghua University and Tencent Hunyuan have unveiled Spatial-TTT, a streaming spatial intelligence framework that uses Test-Time Training (TTT) to continuously update spatial memory from live video streams. The framework achieves state-of-the-art results on video spatial benchmarks such as VSI-Bench, and the code, a 97k-sample dataset, and a lightweight model are now publicly available.
Key Points
- Spatial-TTT is a newly released framework from Tsinghua University and Tencent Hunyuan that enables streaming vision-based spatial intelligence, using Test-Time Training to continuously update spatial memory from long-horizon video streams.
- The system features a hybrid architecture combining TTT layers with self-attention anchor layers, large-chunk sliding-window attention, and a spatial-predictive mechanism using 3D convolutions to capture geometric and temporal structure across video frames.
- Training and evaluation code, a 97k-sample spatial dataset, and a lightweight model called Spatial-TTT-nano are now publicly available on GitHub and Hugging Face, with the framework achieving state-of-the-art performance on video spatial benchmarks like VSI-Bench.
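The core idea behind Test-Time Training layers is that a small "fast-weight" memory is updated by gradient steps on a self-supervised loss as each chunk of the video stream arrives, rather than being frozen after training. The sketch below illustrates only that general mechanism: the linear memory, reconstruction loss, chunk size, and learning rate are all assumptions for illustration, not the released Spatial-TTT implementation.

```python
import numpy as np

class TTTLinearMemory:
    """Toy linear fast-weight memory updated at test time.

    Illustrative sketch of the Test-Time Training idea only; the
    objective and hyperparameters here are assumptions, not the
    Spatial-TTT architecture.
    """

    def __init__(self, dim, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((dim, dim))  # spatial memory
        self.lr = lr

    def update(self, chunk):
        # Self-supervised objective: reconstruct each token through W
        # (identity-reconstruction loss), one gradient step per chunk.
        pred = chunk @ self.W
        err = pred - chunk                      # (T, dim) residual
        grad = chunk.T @ err / len(chunk)       # dL/dW for 0.5*||err||^2
        self.W -= self.lr * grad
        return float(0.5 * np.mean(err ** 2))   # chunk loss before the step

    def read(self, query):
        # Query the continuously updated memory.
        return query @ self.W

rng = np.random.default_rng(1)
mem = TTTLinearMemory(dim=8)
# Stand-in for a long-horizon stream: 20 chunks of 16 frame tokens each.
stream = [rng.standard_normal((16, 8)) for _ in range(20)]
losses = [mem.update(chunk) for chunk in stream]
# The loss on later chunks trends downward as the memory adapts online.
```

In a hybrid design like the one described above, such TTT layers would be interleaved with ordinary self-attention "anchor" layers, so the fast-updating memory handles long-horizon state while attention preserves local fidelity within each chunk.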