New 'Wall Attention' Variant Delivers Per-Channel Forgetting Rates and Efficient Autoregressive Decoding With Full GQA Support
Summary
Wall Attention is a new attention variant that applies a per-channel, per-timestep multiplicative decay, giving each channel its own forgetting rate. It ships with optimized Triton kernels for training, prefill, and efficient autoregressive decoding, and supports GQA, attention sinks, sliding windows, and sequence packing, with numerical accuracy verified against a reference implementation.
Key Points
- Wall Attention is a new attention variant that applies a per-channel, per-timestep multiplicative decay to the QK inner product, giving each query channel an independent, content-dependent forgetting rate. This generalizes scalar gating and RoPE-style decays.
- Two specialized Triton kernels are provided: a fused forward-and-backward kernel for training and prefill, with analytic gradients, and a single-step decode kernel that uses a pre-rescaled KV cache so that autoregressive generation does not need to recompute the full prefix.
- The release supports Grouped Query Attention (GQA), optional attention sinks, sliding windows, and variable-length sequence packing, with all kernel paths tested for numerical correctness against an eager PyTorch reference and gradient accuracy verified via finite differences.
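To make the decay and the pre-rescaled cache concrete, here is a minimal single-head NumPy sketch. It is not the release's code. The exact score formula is an assumption: the score between query i and key j is taken to be the sum over channels c of q_i[c] * k_j[c] * exp(G_i[c] - G_j[c]), where G is the running sum of per-step log-decays. The function names `decay_attention_reference` and `decode_step` and the cache layout are hypothetical. Under that assumption the decay factorizes, so a key can be stored once as k_j * exp(-G_j) and each decode step only needs to scale the new query by exp(G_i).

```python
import numpy as np

def decay_attention_reference(q, k, v, log_g):
    """Eager causal attention with an assumed per-channel, per-timestep decay.

    q, k, log_g: (T, D); v: (T, Dv). log_g[t, c] <= 0 is the log of the
    decay applied to channel c when moving from step t-1 to step t.
    """
    T, D = q.shape
    G = np.cumsum(log_g, axis=0)  # G[t] = sum of log_g[0..t]
    # decay[i, j, c] = exp(G[i, c] - G[j, c]) = product of g[s, c], s = j+1..i
    decay = np.exp(G[:, None, :] - G[None, :, :])
    scores = np.einsum("ic,jc,ijc->ij", q, k, decay) / np.sqrt(D)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

def decode_step(q_t, k_t, v_t, log_g_t, cache):
    """One autoregressive step using a pre-rescaled key cache (hypothetical layout)."""
    cache["G"] = cache["G"] + log_g_t             # running cumulative log-decay
    cache["k"].append(k_t * np.exp(-cache["G"]))  # key stored already rescaled
    cache["v"].append(v_t)
    K = np.stack(cache["k"])
    V = np.stack(cache["v"])
    # Only the new query is rescaled; cached keys are never touched again.
    scores = K @ (q_t * np.exp(cache["G"])) / np.sqrt(q_t.shape[0])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# Step-by-step decoding should reproduce the full-sequence reference.
rng = np.random.default_rng(0)
T, D = 6, 4
q, k, v = (rng.standard_normal((T, D)) for _ in range(3))
log_g = -rng.uniform(0.0, 0.5, size=(T, D))

full = decay_attention_reference(q, k, v, log_g)
cache = {"G": np.zeros(D), "k": [], "v": []}
stepwise = np.stack([decode_step(q[t], k[t], v[t], log_g[t], cache) for t in range(T)])
assert np.allclose(full, stepwise)
```

Setting `log_g` to zero recovers ordinary causal softmax attention. In this naive form exp(-G) grows without bound as the sequence lengthens, so a production kernel would presumably need some renormalization. GQA, attention sinks, sliding windows, and sequence packing are left out of the sketch.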