New Transformer Models Tackle Quadratic Complexity, Enabling Longer Sequences
Summary
New Transformer models employ techniques such as sparse attention, low-rank factorization, and kernel approximations to overcome quadratic complexity, enabling efficient processing of longer sequences while preserving the ability to model long-range dependencies.
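As a rough illustration of the kernel-approximation route (a sketch under stated assumptions, not the implementation of any particular model), linearized attention replaces the softmax with a feature map so the N × N score matrix is never materialized. The feature map elu(x) + 1 and the function name below are illustrative choices:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernel-approximated attention: roughly O(N * d^2) instead of O(N^2 * d).

    q, k, v: tensors of shape (batch, seq_len, dim).
    Uses the illustrative feature map phi(x) = elu(x) + 1 so that
    softmax(QK^T)V is approximated by phi(Q) (phi(K)^T V)
    without ever forming the N x N score matrix.
    """
    phi_q = F.elu(q) + 1  # (B, N, d)
    phi_k = F.elu(k) + 1  # (B, N, d)
    # Summarize keys and values into a d x d matrix, independent of N.
    kv = torch.einsum("bnd,bne->bde", phi_k, v)  # (B, d, d)
    # Per-query normalizer, replacing the softmax denominator.
    z = 1.0 / (torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", phi_q, kv, z)  # (B, N, d)
```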
Key Points
- The standard attention mechanism has quadratic computational and memory complexity with respect to input sequence length, limiting its scalability.
- The quadratic memory requirement for storing attention scores and gradients often poses a more severe bottleneck than the computational cost (see the sketch after this list).
- Truncating or segmenting long inputs to circumvent quadratic scaling undermines the Transformer's ability to capture long-range dependencies.
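To make the quadratic term concrete, here is a minimal sketch of vanilla scaled dot-product attention in PyTorch; the function name and shapes are illustrative, not drawn from any specific codebase:

```python
import torch

def standard_attention(q, k, v):
    """Vanilla scaled dot-product attention.

    q, k, v: (batch, seq_len, dim). The scores tensor below has shape
    (batch, seq_len, seq_len), so both compute and memory grow as O(N^2)
    in the sequence length N; doubling N quadruples the score matrix.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5  # (B, N, N) -- the quadratic term
    weights = scores.softmax(dim=-1)           # same (B, N, N) footprint, plus
                                               # gradients of the same size during training
    return weights @ v                         # (B, N, d)

# For example, at N = 16384 tokens a single fp32 score matrix per head already
# takes 16384 * 16384 * 4 bytes = 1 GiB, before gradients or multiple heads.
```

The memory cost is often the first thing to break in practice: the (B, N, N) scores and their gradients must be held simultaneously during backpropagation, which is why the efficient variants above avoid materializing that matrix at all.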