Full GPT Transformer Runs Entirely in Verilog on FPGA, Hits 69,200 Tokens Per Second After Overcoming Critical Synthesis Bugs

Jun 16, 2026
GitHub

Summary

A full GPT transformer implemented entirely in RTL Verilog now runs on a Xilinx Virtex-5 FPGA, generating up to about 69,200 tokens per second at 80 MHz. Getting there meant working around two synthesis bugs, one that silently zeroed ROM arrays and one that constant-folded live registers. Both caused the board to hang even though the design passed simulation.
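
To put the headline figure in perspective, the two quoted numbers imply a rough cycle budget per generated token. The short calculation below uses only the 80 MHz clock and the 69,200 tokens-per-second peak from the summary. The variable names are illustrative, and the result is a simple average that ignores any startup or I/O overhead the real design may have.

```python
# Figures quoted in the summary; everything else is derived arithmetic.
CLOCK_HZ = 80_000_000        # 80 MHz clock
TOKENS_PER_SECOND = 69_200   # reported peak throughput

# Average clock cycles spent per generated token.
cycles_per_token = CLOCK_HZ / TOKENS_PER_SECOND

# Average wall-clock time per token, in microseconds.
microseconds_per_token = 1_000_000 / TOKENS_PER_SECOND

print(f"{cycles_per_token:.0f} cycles per token")
print(f"{microseconds_per_token:.2f} us per token")
```

That works out to roughly 1,156 clock cycles, or about 14.45 microseconds, for each character-level token.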

Key Points

  • A full transformer (microGPT) is implemented entirely in RTL Verilog and deployed on a Xilinx Virtex-5 FPGA, generating character-level names on an LCD display at up to ~69,200 tokens/second at 80 MHz.
  • The design uses a microcode-ROM sequencer that drives modular datapath actuators and keeps a persistent KV cache. Optimizations including parallel MAC tiles, radix-4 division, and a dual-port BRAM scratchpad delivered a 28x throughput improvement over the initial version.
  • Two critical XST 14.7 synthesis bugs caused the board to hang despite passing simulation: silent zeroing of ROM arrays loaded with $readmemh, and constant-folding of live registers. The fixes were combinational case functions for ROM initialization and keep constraints on base registers.
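
The ROM workaround in the last key point can be illustrated with a small generator script. This is a minimal sketch, not the project's actual tooling: it assumes the ROM contents are available as plain whitespace-separated hex words (the simplest $readmemh layout, with no comments or @address directives), and the names rom_case_function, parse_readmemh_text, and weight_rom are hypothetical. It emits a Verilog function whose body is a case statement, so the ROM contents are spelled out as constants in the source rather than loaded from a file during synthesis.

```python
def parse_readmemh_text(text):
    """Parse whitespace-separated hex words (assumed simple layout only)."""
    return [int(token, 16) for token in text.split()]


def rom_case_function(name, words, addr_bits, data_bits):
    """Return Verilog source for a function mapping an address to a ROM word."""
    if addr_bits < 1 or data_bits < 1:
        raise ValueError("address and data widths must be at least 1 bit")
    if len(words) > (1 << addr_bits):
        raise ValueError("more words than the address width can index")
    hex_digits = (data_bits + 3) // 4  # hex digits needed per data word
    lines = [
        f"function [{data_bits - 1}:0] {name};",
        f"    input [{addr_bits - 1}:0] addr;",
        "    begin",
        "        case (addr)",
    ]
    for index, word in enumerate(words):
        if not 0 <= word < (1 << data_bits):
            raise ValueError(f"word {index} does not fit in {data_bits} bits")
        lines.append(
            f"            {addr_bits}'d{index}: "
            f"{name} = {data_bits}'h{word:0{hex_digits}X};"
        )
    lines += [
        # Unlisted addresses read as zero in this sketch.
        f"            default: {name} = {data_bits}'h{0:0{hex_digits}X};",
        "        endcase",
        "    end",
        "endfunction",
    ]
    return "\n".join(lines)


# Hypothetical three-word, 8-bit ROM addressed with 2 bits.
verilog = rom_case_function("weight_rom", parse_readmemh_text("0A FF 3C"), 2, 8)
print(verilog)
```

The generated function can then be called from an assign statement or a clocked block in place of an array read. Whether the original design generated its case functions with a script or wrote them by hand is not stated in the summary.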
