New 'Wall Attention' Variant Delivers Per-Channel Forgetting Rates and Efficient Autoregressive Decoding With Full GQA Support
Wall Attention, a new attention variant, applies a per-channel, per-timestep multiplicative decay so that each channel can forget at its own rate. It launches with optimized Triton kernels that support efficient autoregressive decoding, full grouped-query attention (GQA), attention sinks, sliding windows, and sequence packing, and its numerical accuracy has been verified.
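To make "per-channel, per-timestep multiplicative decay" concrete, here is a minimal reference sketch in plain Python. It is one plausible reading of the description, not the project's actual formulation or its Triton kernels. It assumes the decay is supplied as a non-positive log value per timestep and channel (the hypothetical `log_decay` argument), that the contribution of key `s` to query `t` in each channel is scaled by the product of the decays over steps `s+1..t`, and that a standard causal softmax follows. The function name `decayed_attention` is likewise made up for illustration.

```python
import math


def decayed_attention(q, k, v, log_decay):
    """Naive O(T^2) causal attention with an assumed per-channel decay.

    q, k:       T x D nested lists (queries and keys)
    v:          T x Dv nested list (values)
    log_decay:  T x D nested list of values <= 0 (assumed input format);
                exp(log_decay[t][ch]) is the decay applied to channel ch at step t
    """
    T, D = len(q), len(q[0])
    Dv = len(v[0])

    # c[t][ch] = sum of log_decay[r][ch] for r <= t (running sum per channel)
    c = []
    running = [0.0] * D
    for t in range(T):
        running = [running[ch] + log_decay[t][ch] for ch in range(D)]
        c.append(running)

    out = []
    for t in range(T):
        scores = []
        for s in range(t + 1):  # causal: only keys at or before position t
            # exp(c[t] - c[s]) equals the product of decays over steps s+1..t
            dot = sum(
                q[t][ch] * k[s][ch] * math.exp(c[t][ch] - c[s][ch])
                for ch in range(D)
            )
            scores.append(dot / math.sqrt(D))

        # Numerically stable softmax over the visible keys
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        z = sum(weights)
        out.append(
            [sum(weights[s] * v[s][j] for s in range(t + 1)) / z for j in range(Dv)]
        )
    return out
```

With `log_decay` set to all zeros, every decay factor is 1 and the sketch reduces to ordinary causal softmax attention. Making one channel's entry strongly negative at a given step suppresses that channel's contribution from all earlier keys while leaving the other channels untouched, which is the independent forgetting rate the announcement describes.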