Exploring the Physics of Nanosecond-Latency Network Processing
In the world of high-frequency trading, the speed of light becomes a tangible constraint. At 299,792,458 meters per second, light travels approximately 30 centimeters in one nanosecond. This means that by the time a photon traverses a single meter of fiber optic cable (~5ns), market opportunities may have already evaporated. The question becomes: How fast can we process incoming market data before physics itself becomes the bottleneck?
Traditional software-based network stacks—even those optimized with kernel bypass (DPDK), zero-copy techniques (io_uring), and RDMA—still operate in the realm of microseconds. A well-tuned C++ application with SR-IOV and CPU pinning might achieve 1-2 microseconds of processing latency. But this is three orders of magnitude slower than what's physically possible.
The shift from software to hardware represents the ultimate frontier of optimization. By moving packet processing logic into reconfigurable fabric—specifically, Field-Programmable Gate Arrays (FPGAs)—we can achieve deterministic, sub-100ns latency. This is not about incremental improvement; this is a paradigm shift.
Goal: Design a deterministic, wire-speed network stack on the Xilinx Versal VCK190 ACAP (Adaptive Compute Acceleration Platform) capable of filtering, parsing, and forwarding UDP packets in <100ns with zero CPU intervention.
This project is a personal R&D initiative to master the art of heterogeneous computing—bridging the gap between software algorithms and hardware implementation. The goal is not just to build a faster system, but to understand the engineering trade-offs involved in nanosecond-latency design: clock domain crossings, metastability, pipeline hazards, and the cruel reality of propagation delay.
The current prototype implements a receive (RX) path for processing inbound Ethernet frames. The architecture is modular, pipelined, and designed for scalability—while the current implementation uses 1G/2.5G Ethernet, the design can be extended to 40G/100G using the Versal MRMAC (Multirate Ethernet MAC) hard IP.
Xilinx IP core that handles the MAC layer (Media Access Control) and presents received frames via AXI4-Stream.
The interface runs at 125 MHz with an 8-bit data bus (TDATA[7:0]), providing 1 Gbps
of throughput (125 MHz × 8 bits = 1000 Mbps).
A byte-by-byte state machine that processes incoming Ethernet frames and performs the following operations:
- EtherType = 0x0800 (IPv4) at byte 12
- IP Protocol = 0x11 (UDP) at byte 23

Latency: The filter makes a go/no-go decision by byte 37. At 125 MHz (8 ns per byte), this is ~296 ns from frame start to decision. However, the first payload byte is forwarded immediately after byte 42, resulting in only ~40 ns of added processing latency (5 clock cycles).
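As a cross-check on those byte offsets, here is a minimal software model of the same go/no-go decision. It is a sketch, not the RTL: it assumes an untagged frame with a 20-byte IP header, and the port value 1234 used elsewhere in this project.

```python
def udp_filter(frame: bytes, port: int = 1234):
    """Software mirror of the hardware filter: return the UDP payload, or None to drop."""
    if len(frame) < 42:
        return None                                   # too short for Eth+IP+UDP headers
    if frame[12:14] != b"\x08\x00":                   # EtherType at byte 12: IPv4
        return None
    if frame[23] != 0x11:                             # IP Protocol at byte 23: UDP
        return None
    if int.from_bytes(frame[36:38], "big") != port:   # UDP destination port at bytes 36-37
        return None
    return frame[42:]                                 # strip the 14 + 20 + 8 header bytes
```

The same offsets (12, 23, 36) become FSM byte counters in hardware; the return value corresponds to asserting `tvalid` on the stripped payload stream.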
A metastability-safe FIFO implementing the Cummings 2-flop synchronizer technique. This module bridges two asynchronous clock domains:
The design uses Gray code for pointer synchronization—a critical technique to avoid race conditions when crossing clock boundaries. Gray code ensures that only one bit changes between consecutive values, minimizing the risk of metastability during clock domain crossing.
CDC Hazard: Violating setup/hold times on a flip-flop during clock domain crossing can result in metastability—a state where the output oscillates unpredictably. The 2-flop synchronizer gives the signal two full clock cycles to resolve, reducing the probability of metastability to negligible levels (MTBF > 10^15 years).
A finite state machine that decodes application-layer protocols. The current implementation supports a custom framing protocol
(0xAA55 sync pattern + length + payload + checksum), but is designed to be extended for industry-standard protocols
like MoldUDP64 and NASDAQ TotalView-ITCH 5.0.
To validate the design, a Python/Scapy script running on a laptop injects raw Ethernet frames directly into the VCK190's SFP+ port. This simulates a realistic market data feed and allows for controlled testing of edge cases (fragmented packets, out-of-order delivery, checksum errors).
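For reference, frames like this can also be assembled without Scapy using only `struct`. The sketch below is illustrative (the `build_frame` helper and its addresses are hypothetical): it builds an untagged IPv4/UDP frame, computes the IP header checksum, and leaves the UDP checksum at zero, which IPv4 permits.

```python
import struct

def ip_checksum(header: bytes) -> int:
    """Standard ones'-complement sum over 16-bit words of the IP header."""
    total = sum(struct.unpack(">%dH" % (len(header) // 2), header))
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)      # fold carries back in
    return ~total & 0xFFFF

def build_frame(dst_mac: bytes, src_mac: bytes, src_ip: bytes, dst_ip: bytes,
                sport: int, dport: int, payload: bytes) -> bytes:
    eth = dst_mac + src_mac + b"\x08\x00"             # EtherType: IPv4
    udp_len = 8 + len(payload)
    ip = struct.pack(">BBHHHBBH4s4s",
                     0x45, 0, 20 + udp_len,           # version/IHL, TOS, total length
                     0, 0,                            # identification, flags/fragment
                     64, 17, 0,                       # TTL, protocol=UDP, checksum=0
                     src_ip, dst_ip)
    ip = ip[:10] + struct.pack(">H", ip_checksum(ip)) + ip[12:]
    udp = struct.pack(">HHHH", sport, dport, udp_len, 0)  # UDP checksum 0 = unused
    return eth + ip + udp + payload
```

On Linux, the resulting bytes can be written to the wire with an `AF_PACKET` raw socket, or equivalently with Scapy's `sendp`.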
| Component | Specification | Details |
|---|---|---|
| Device | Xilinx Versal AI Core | xcvc1902-vsva2197-2MP-e-S |
| Fabric | 1,039,104 Logic Cells | ~899K LUTs, 1.8M Flip-Flops |
| Memory | 42.7 Mb on-chip RAM | Block RAM (36 Kb blocks) + UltraRAM (288 Kb blocks) |
| DSP | 1,968 DSP58 Slices | INT8/FP32 MAC engines |
| AI Engines | 400 AI Engine Tiles | Up to 8 TFLOPS (FP32 vector) |
| Ethernet | 4x 100GbE MRMAC | Multirate: 1G/10G/25G/100G |
| PCIe | Gen5 x16 | 64 GB/s bidirectional |
Status: Phase 1 Complete: Physical Link & Parser Active. The VCK190 1000BASE-X SFP+ link is operational, and the UDP packet-parsing logic has been validated on hardware. Currently developing the downstream trading components: Book Builder, Order Manager, and Strategy Creator.
While the current implementation focuses on the packet processing pipeline (Sections 02 and 03), the broader vision is to build a complete end-to-end HFT system that spans from simulated market data generation to strategy execution. This demonstrates not just hardware expertise, but a comprehensive understanding of the full trading stack.
Xilinx Versal VCK190 Evaluation Board: 1M+ logic cells, 400 AI engines, 4x 100GbE MRMAC
A Python script running on a standard laptop generates synthetic market data packets conforming to the MoldUDP64 protocol. This simulates the NASDAQ ITCH 5.0 feed by:
Tool Stack: Python 3.11 • socket library for UDP transmission • struct for binary packing • NumPy for synthetic data generation
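A minimal sketch of that generator, assuming the MoldUDP64 layout described later in this document (10-byte space-padded session, 8-byte big-endian sequence number, 2-byte message count, then length-prefixed messages). The loopback address and port are placeholders, not the real feed target.

```python
import socket
import struct

def build_moldudp64(session: str, seq: int, messages: list) -> bytes:
    """Pack a MoldUDP64 datagram: 20-byte header followed by [len][msg] blocks."""
    header = struct.pack(">10sQH", session.encode().ljust(10), seq, len(messages))
    blocks = b"".join(struct.pack(">H", len(m)) + m for m in messages)
    return header + blocks

def send_feed(datagram: bytes, addr=("127.0.0.1", 1234)) -> None:
    """Fire the datagram at the FPGA's IP/port (placeholder address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, addr)
```

NumPy enters the picture when generating realistic price paths for the payloads; the framing itself is just `struct.pack`.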
These modules are currently functional and documented in Sections 02-04:
The Order Book Reconstruction module maintains a real-time view of market depth by:
Challenge: Managing 10,000+ active orders in BRAM (limited to ~35 Mb on Versal). Requires efficient data structures: hash tables for O(1) lookup, binary heap for price priority sorting.
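Those two data structures compose naturally; here is a toy software model of one side of the book (illustrative only — the BRAM implementation would use fixed-size tables, and the lazy-deletion trick shown here would become explicit entry invalidation in hardware):

```python
import heapq

class SideBook:
    """One side of the order book: hash map for O(1) order lookup,
    lazy-deleted heap for best-price retrieval (max-heap for bids via negation)."""

    def __init__(self, is_bid: bool):
        self.is_bid = is_bid
        self.orders = {}        # order_id -> (price, qty)   ~ BRAM hash table
        self.heap = []          # price priority              ~ binary heap

    def add(self, oid: int, price: int, qty: int) -> None:
        self.orders[oid] = (price, qty)
        heapq.heappush(self.heap, -price if self.is_bid else price)

    def cancel(self, oid: int) -> None:
        self.orders.pop(oid, None)      # heap entry is removed lazily in best()

    def best(self):
        """Pop stale heap entries until the top price has a live order."""
        live = {p for p, _ in self.orders.values()}
        while self.heap:
            p = -self.heap[0] if self.is_bid else self.heap[0]
            if p in live:
                return p
            heapq.heappop(self.heap)
        return None
```

The O(1) lookup path (`orders`) handles Order Executed / Cancel messages keyed by order ID; the heap answers the best-bid / best-ask query that feeds the BBO output.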
Tracks positions, P&L, and risk metrics in real-time:
The alpha generation engine—where market microstructure patterns are exploited:
Note: The strategy logic will be parameterizable via PCIe register writes, allowing dynamic reconfiguration without FPGA reprogramming.
This end-to-end implementation demonstrates:
Next Milestones:
NASDAQ TotalView-ITCH 5.0 is the de facto standard for low-latency market data feeds in U.S. equities. Broadcast from the NASDAQ data center in Carteret, New Jersey, this feed delivers:
The protocol uses UDP multicast for one-to-many delivery, with MoldUDP64 as the session-layer framing protocol. This design eliminates TCP's handshake overhead and provides deterministic latency—critical for time-sensitive strategies like market making and statistical arbitrage.
Scaling Note: While the current prototype uses 1G Ethernet, the architecture is designed to scale to 40G/100G using the Versal MRMAC hard IP. The MoldUDP64 parser and ITCH decoder can be parallelized across multiple lanes (e.g., 4x 25G links) to achieve line-rate processing at 100 Gbps.
MoldUDP64 is NASDAQ's proprietary session-layer protocol. Each UDP datagram contains:
| Field | Size (Bytes) | Description | Example |
|---|---|---|---|
| Session | 10 | ASCII session identifier (space-padded) | "NASDAQ " |
| Sequence Number | 8 | 64-bit packet sequence (big-endian) | 0x0000000000001234 |
| Message Count | 2 | Number of ITCH messages in packet (1-255) | 0x0005 (5 messages) |
| Message Block | Variable | Concatenated ITCH 5.0 messages | [Len][Msg1][Len][Msg2]... |
Each message in the block is prefixed with a 2-byte length field (big-endian) followed by the ITCH message itself. The parser must read each length prefix, extract exactly that many bytes as one ITCH message, advance past them, and repeat until the header's message count is exhausted.
| Message Type | Code | Length (Bytes) | Purpose |
|---|---|---|---|
| System Event | 'S' | 12 | Start of day, end of day, trading halt |
| Add Order | 'A' | 36 | New limit order added to book |
| Order Executed | 'E' | 31 | Partial or full execution of order |
| Order Cancel | 'X' | 23 | Order canceled (full or partial) |
| Trade (Non-Cross) | 'P' | 44 | Matched trade execution |
Hardware Challenge: The variable-length message format means the parser cannot use a fixed pipeline depth. The state machine must dynamically switch between message types based on the first byte (message type code). This requires careful FSM design to avoid pipeline stalls.
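The dispatch-on-type-byte approach is worth prototyping in software before committing FSM states. A sketch, assuming the MoldUDP64 layout from the table above; the fixed message lengths serve only as a sanity check against the length prefixes:

```python
import struct

# Fixed lengths from the ITCH message table above
ITCH_LENGTHS = {"S": 12, "A": 36, "E": 31, "X": 23, "P": 44}

def walk_moldudp64(datagram: bytes):
    """Yield (type_code, message_bytes) for each ITCH message in a datagram."""
    session, seq, count = struct.unpack_from(">10sQH", datagram, 0)
    offset = 20
    for _ in range(count):
        (length,) = struct.unpack_from(">H", datagram, offset)   # 2-byte prefix
        offset += 2
        msg = datagram[offset:offset + length]
        offset += length
        code = chr(msg[0])                  # first byte selects the FSM branch
        assert ITCH_LENGTHS.get(code, length) == length, "length mismatch"
        yield code, msg
```

In hardware, the `for` loop becomes a message counter, the `offset` arithmetic becomes a byte counter reloaded from each length prefix, and the `code` lookup becomes the state-select mux the note above warns about.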
NASDAQ's primary data center is located in Carteret, New Jersey, approximately 10 miles from New York City. Co-location providers (e.g., Equinix NY4, NY5) offer rack space within the same facility, reducing fiber latency to the exchange to the microsecond scale or below. For HFT strategies, proximity is critical:
With 72ns of FPGA processing latency and ~25ns of fiber delay, the total system latency can approach 100ns—faster than a single memory access on a modern CPU.
When data crosses from one clock domain to another, there's a risk of metastability—a state where a flip-flop's output oscillates unpredictably because the setup/hold timing was violated. This can propagate through the design, causing functional failures or even system crashes.
The 2-flop synchronizer (also called a "double synchronizer") is the industry-standard solution. By passing the signal through two flip-flops in the destination clock domain, we give it two full clock cycles to resolve to a stable state. The probability of metastability persisting through both stages is astronomically low (MTBF > 10^15 years for typical FPGA process nodes).
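That MTBF figure can be sanity-checked with the standard synchronizer model, MTBF = e^(t_r/τ) / (T₀ · f_clk · f_data), where t_r is the settling time the second flop grants the first. The constants below are illustrative placeholders, not values from any datasheet:

```python
import math

# Illustrative constants (placeholders, NOT from a device datasheet):
TAU    = 25e-12   # flip-flop regeneration time constant (s)
T0     = 50e-12   # metastability capture window (s)
F_CLK  = 100e6    # destination-domain clock (Hz)
F_DATA = 10e6     # toggle rate of the crossing signal (Hz)

# Resolution time: one full destination clock period between the two flops,
# minus ~1 ns budgeted for routing and setup.
t_resolve = 1.0 / F_CLK - 1e-9

mtbf_seconds = math.exp(t_resolve / TAU) / (T0 * F_CLK * F_DATA)
mtbf_years = mtbf_seconds / (365.25 * 24 * 3600)
print(f"MTBF ~ 1e{math.log10(mtbf_years):.0f} years")
```

The exponential dependence on t_resolve/τ is the whole story: with these placeholder numbers the exponent is in the hundreds, so even generous error bars on τ and T₀ leave the MTBF astronomically beyond 10^15 years.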
Gray code is a binary numeral system where two successive values differ in only one bit. This is critical for CDC because if multiple bits changed simultaneously during a clock edge, the synchronizer might capture an invalid intermediate state. For example:
```
Binary: 0011 (3) → 0100 (4)   [3 bits change]
Gray:   0010 (3) → 0110 (4)   [1 bit changes]
```

If the synchronizer captures the binary transition at the wrong moment, it might see any mixture of old and new bits—0000, 0001, 0101, or 0111—none of which is valid. With Gray code, the only possible captured values are 0010 (old value) or 0110 (new value), both correct.
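The conversion is a one-liner in either direction, which makes the single-bit-change property easy to verify exhaustively (a quick model for checking the encoding, not the RTL):

```python
def bin_to_gray(n: int) -> int:
    """Gray code: XOR the value with itself shifted right by one."""
    return n ^ (n >> 1)

def gray_to_bin(g: int) -> int:
    """Inverse: fold the Gray bits back down with cumulative XOR."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Consecutive values differ in exactly one bit:
for i in range(15):
    assert bin(bin_to_gray(i) ^ bin_to_gray(i + 1)).count("1") == 1
```

`bin_to_gray(3)` gives `0b0010` and `bin_to_gray(4)` gives `0b0110`, matching the example above; in the FIFO, this conversion sits between the binary pointer counters and the synchronizer flops.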
The async FIFO uses separate logic for detecting full and empty conditions:
Insight: The FIFO depth must be a power of 2 for Gray code pointer arithmetic to work correctly. Additionally, the pointers must be 1 bit wider than the address width to distinguish between "full" (write caught up to read after wraparound) and "empty" (write and read at same address).
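The role of that extra pointer bit is easiest to see in a software model. Binary pointers are shown here for clarity; the actual RTL compares the Gray-coded equivalents, where "full" means the top two bits are inverted and the remaining bits match.

```python
DEPTH = 8                          # must be a power of 2
PTR_BITS = DEPTH.bit_length()      # address bits + 1 wrap bit (= 4 here)
PTR_MASK = (1 << PTR_BITS) - 1

def is_empty(wptr: int, rptr: int) -> bool:
    return wptr == rptr            # identical, including the wrap bit

def is_full(wptr: int, rptr: int) -> bool:
    return (wptr ^ rptr) == DEPTH  # same address, opposite wrap bit

# Fill the FIFO completely, then drain it:
wptr = rptr = 0
for _ in range(DEPTH):
    assert not is_full(wptr, rptr)
    wptr = (wptr + 1) & PTR_MASK
assert is_full(wptr, rptr) and not is_empty(wptr, rptr)
for _ in range(DEPTH):
    rptr = (rptr + 1) & PTR_MASK
assert is_empty(wptr, rptr)
```

Without the wrap bit, "write pointer equals read pointer" would be ambiguous between completely full and completely empty; the extra MSB disambiguates the two cases at zero storage cost.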
This project builds upon decades of research in asynchronous design, network protocols, and quantitative finance. The following texts are essential reading for anyone serious about low-latency systems engineering:
The definitive paper on clock domain crossing (CDC) for FIFOs. Cummings' work on Gray code synchronizers is cited in nearly every FPGA vendor's design guide. This paper explains why naive CDC techniques (e.g., single-flop synchronizers) fail and provides Verilog implementations of production-grade async FIFOs. A must-read for understanding metastability, MTBF calculations, and proper synchronizer design.
The authoritative reference for NASDAQ's market data feed protocol. Covers all 23 message types, byte-level encoding, multicast group assignments, and retransmission request (MITCH) procedures. Essential for implementing ITCH decoders in hardware or software. Available from NASDAQ's developer portal.
A comprehensive guide to building HFT systems from first principles. Covers market microstructure, order book dynamics, low-latency networking (kernel bypass, RDMA), and C++ optimization techniques. While focused on software, the system architecture principles apply equally to FPGA-based designs. Excellent for understanding the why behind nanosecond-latency requirements.
A legendary (and hard-to-find) PDF that circulated in quantitative trading circles circa 2010. Dama's writing demystifies the engineering challenges of automated trading systems: feed handlers, risk checks, order routing, and latency measurement. The section on "The Cost of a Microsecond" is particularly relevant for understanding the economic incentives driving FPGA adoption in HFT.
While not exclusively about latency, Chan's book provides context on when speed matters. Statistical arbitrage, mean reversion, and momentum strategies all have different latency sensitivities. Understanding the alpha decay curve helps justify the engineering effort (and cost) of sub-microsecond systems. Essential for aligning technical capabilities with trading strategy requirements.
Xilinx's official guide to clock management on 7-series, UltraScale, and Versal devices. Covers MMCMs (Mixed-Mode Clock Managers), PLLs, clock domain crossing constraints, and timing closure techniques. Critical for understanding how to achieve the 125 MHz and 250 MHz clocks used in this design without introducing jitter or skew.
Additional Resources: The Xilinx Developer Zone hosts excellent application notes on high-speed Ethernet (XAPP1082), PCIe DMA (PG195), and Versal AI Engine programming (UG1076). For hardware timestamping and PTP (Precision Time Protocol), see IEEE 1588-2019.
Achievement Unlocked: VCK190 1000BASE-X SFP+ link operational. UDP packet parser validated on hardware. The FPGA is now successfully receiving, parsing, and processing real network traffic at line rate.
The complete packet processing pipeline spans six modules, transforming raw Ethernet frames at 125 MHz into structured market tick data ready for the order book builder. This architecture demonstrates the full journey from physical layer (PHY) to application layer—every byte accounted for, every clock cycle optimized.
| Module | Input Format | Output Format | Key Function | Latency |
|---|---|---|---|---|
| `udp_filter` | 32-bit words (full Ethernet frame) | 32-bit words (UDP payload only) | • Validates EtherType = 0x0800 (IPv4)<br>• Checks IP Protocol = 0x11 (UDP)<br>• Filters UDP Port = 1234<br>• Strips 42-byte header (14+20+8) | ~40ns (5 cycles) |
| `axis_data_fifo` | 32-bit words @ 125 MHz | 32-bit words @ 100 MHz | • Clock domain crossing (125→100 MHz)<br>• 2048-word buffering<br>• Gray code pointer sync<br>• Backpressure handling | ~24ns (3 cycles) |
| `packet_parser` | 32-bit words (parallel data) | 8-bit bytes (serial stream) | • 32→8 bit serialization<br>• Finds [AA 55] sync header<br>• Reads length byte<br>• Validates XOR checksum | ~72ns (9 cycles for 9 bytes) |
| `parser_shim` | 8-bit byte stream (sequential bytes) | Structured tick (parallel fields) | • Assembles 9-byte payload: 4B price (big-endian), 4B quantity, 1B side ('B'/'S')<br>• Converts to parallel output | ~8ns (1 cycle) |
| `book_builder` | Tick data (price, qty, side) | Best Bid/Offer (BBO) | • Maintains best_bid (highest buy)<br>• Maintains best_ask (lowest sell)<br>• Generates bbo_updated pulse | ~8ns (1 cycle) |
Performance Metrics: Total pipeline latency from Ethernet MAC to BBO update is approximately
152ns (19 clock cycles @ 125 MHz). The serialization stage (packet_parser) accounts
for nearly half of this delay—a deliberate trade-off to simplify downstream byte-oriented protocol parsing. For
ultra-low latency applications, the parser could be re-architected to operate on 32-bit chunks directly, reducing
latency to <50ns at the cost of significantly more complex state machine logic.
The current implementation uses a lightweight custom framing protocol optimized for simplicity and determinism:
```
[Byte 0-1]    Header:   0xAA55 (sync pattern for frame alignment)
[Byte 2]      Length:   N (payload size in bytes, excludes header/checksum)
[Byte 3..N+2] Payload:  N bytes of application data
[Byte N+3]    Checksum: XOR of all payload bytes (simple error detection)
```

Example: Market Tick (9-byte payload)

```
AA 55 09 | 00 00 27 10 | 00 00 03 E8 | 42 | 9E
^Header ^Len ^Price=$10000 ^Qty=1000    ^Buy ^XOR
```
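The framing and checksum rules are easy to pin down with a reference encoder (a software model for generating test vectors, not the RTL):

```python
def encode_tick(price: int, qty: int, side: str) -> bytes:
    """Wrap a 9-byte market tick in the custom framing: AA 55 | len | payload | XOR."""
    payload = price.to_bytes(4, "big") + qty.to_bytes(4, "big") + side.encode()
    checksum = 0
    for b in payload:
        checksum ^= b                 # XOR of payload bytes only, per the spec above
    return b"\xAA\x55" + bytes([len(payload)]) + payload + bytes([checksum])
```

For the example tick (price 10000, quantity 1000, side 'B'), the payload bytes XOR down to 0x9E. Running every test vector through an encoder like this keeps the hardware parser's checksum logic honest.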
This format provides frame alignment (the 0xAA55 sync pattern lets the parser resynchronize mid-stream), deterministic parsing (an explicit length field instead of delimiter scanning), and lightweight error detection (a single-byte XOR checksum).
Production Note: For deployment with real NASDAQ ITCH 5.0 feeds, this custom framing layer would
be replaced by MoldUDP64 decoding (session ID + sequence number + message count), followed by
ITCH message type parsing. The serialization architecture remains unchanged—only the state machine logic in
packet_parser would be updated to handle the 23 ITCH message types.
Hardware bring-up is where theory meets silicon—and where assumptions are ruthlessly tested. The initial attempt to establish the 1000BASE-X link failed spectacularly: the link stayed stubbornly down despite correct clock frequencies, proper transceiver configuration, and verified cable connections.
The root cause was a classic Reset Polarity mismatch. The Xilinx 1G/2.5G Ethernet Subsystem IP expects
pma_reset (Physical Medium Attachment reset) to be driven by peripheral_reset from the Processor
System Reset IP. However, the initial design erroneously connected it to interconnect_aresetn, which has
inverted polarity. The transceiver was perpetually held in reset, preventing link negotiation.
Adding to the chaos: a wiring conflict in the block design where tvalid (active-high data valid signal) and aresetn (active-low reset) had been cross-connected. This created a paradoxical state where asserting reset would inadvertently signal "data valid," corrupting the AXI-Stream handshake. Isolating and rewiring these signals restored proper reset behavior.
Standard Ethernet communication requires an ARP handshake to map IP addresses to MAC addresses before packets can be exchanged. However, in a pure FPGA-to-PC test environment (no OS network stack on the FPGA side), implementing a full ARP responder would be overkill for initial validation.
The solution: Enable Promiscuous Mode on the Xilinx Ethernet MAC. In this mode, the FPGA accepts
all incoming packets regardless of destination MAC address—no ARP required. This allowed the PC to blast UDP packets
directly to the FPGA's physical port, bypassing Layer 2 address resolution entirely. Think of it as the hardware equivalent
of tcpdump mode: listen to everything, filter in software (or hardware, in this case).
Here's where things got interesting. The initial assumption was that the AXI-Stream interface from the Ethernet MAC would deliver data as a simple byte stream: 0x55, 0xBB, 0xCC, 0xDD, etc. Architecturally clean, easy to parse.
Reality check: The interface runs in 32-bit Little Endian mode at 125 MHz. A single AXI-Stream transaction
presents four bytes simultaneously on tdata[31:0], with byte ordering reversed. For example, the byte
sequence 0x55 0xBB 0xCC 0xDD appears on the bus as:
```
tdata[31:0] = 0xDDCCBB55   // Little Endian: LSB (0x55) in tdata[7:0]
```
This is not a bug—it's the standard AXI-Stream convention for maximizing throughput. A 32-bit bus at 125 MHz can carry 4 Gbps, so at the 1 Gbps line rate the MAC presents a valid word roughly once every four cycles, batching bytes into 32-bit chunks. However, the downstream UDP parser state machine expects sequential 8-bit bytes for header field extraction (IP addresses, port numbers, checksums).
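This byte-lane mapping is exactly what `struct` calls little-endian, which makes it trivial to confirm from the PC side:

```python
import struct

wire_bytes = bytes([0x55, 0xBB, 0xCC, 0xDD])   # order the bytes hit the wire
(word,) = struct.unpack("<I", wire_bytes)       # 32-bit little-endian view

assert word == 0xDDCCBB55      # what the ILA shows on tdata[31:0]
assert word & 0xFF == 0x55     # first wire byte sits in tdata[7:0]
```

The same relationship read in reverse is the serializer's job: peel bytes off `tdata[7:0]` upward to recover wire order.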
The disconnect is profound: You can't simply wire tdata[7:0] to the parser's byte input. You'd only see the
first byte of every 4-byte word, skipping 75% of the packet. The parser would interpret every fourth byte as
consecutive, producing gibberish.
The fix required embedding a word-to-byte unpacking mechanism directly into the packet parser module. This "internal serializer" buffers incoming 32-bit words and processes them one byte at a time, effectively converting the parallel 32-bit AXI-Stream interface into a sequential byte stream for the parser state machine.
```verilog
// Internal Serializer: 32-bit Word → 8-bit Byte Stream
reg  [31:0] data_buffer;    // Holds the 4 bytes we just received
reg  [1:0]  byte_index;     // Tracks which byte (0-3) we are processing
reg         buffer_valid;   // Do we have data in the buffer?
reg         s_axis_tready;  // Must be a reg: driven from the always block
wire [7:0]  current_byte;   // The specific byte we are looking at

// Extract the current byte (LSB of buffer)
assign current_byte = data_buffer[7:0];

always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s_axis_tready <= 1'b0;
        buffer_valid  <= 1'b0;
        byte_index    <= 2'd0;
        data_buffer   <= 32'd0;
    end
    else begin
        // ---------------------------------------------------------
        // SERIALIZER: Ingest 32-bit Words → Output 8-bit Bytes
        // ---------------------------------------------------------
        // If buffer is empty, try to grab new data from UDP Filter
        if (!buffer_valid) begin
            s_axis_tready <= 1'b1;                 // "Give me data"
            if (s_axis_tvalid && s_axis_tready) begin
                data_buffer   <= s_axis_tdata;     // Capture 32 bits
                buffer_valid  <= 1'b1;
                byte_index    <= 2'd0;             // Start at Byte 0
                s_axis_tready <= 1'b0;             // Hold off on next word
            end
        end
        // If buffer has data, process 1 byte per clock
        else if (m_axis_tready) begin
            // Parser state machine consumes current_byte here...
            if (byte_index == 2'd3) begin
                buffer_valid <= 1'b0;              // Done with this 32-bit word
            end else begin
                data_buffer <= data_buffer >> 8;   // Expose next byte at [7:0]
                byte_index  <= byte_index + 2'd1;
            end
        end
    end
end
```
This implementation uses a right-shift register approach: each clock cycle, the buffer shifts right by 8 bits,
exposing the next byte at data_buffer[7:0]. The byte_index counter tracks progress through the 4-byte
word, and buffer_valid signals when the serializer needs to fetch the next 32-bit chunk from the MAC.
Critically, the serializer respects AXI-Stream backpressure: it only advances when m_axis_tready
is asserted (downstream is ready to accept data). This prevents data loss during clock domain crossings, FIFO congestion, or
when the parser state machine is busy processing headers. The handshake mechanism ensures lossless, deterministic
packet processing at line rate.
Performance Analysis: The serializer adds zero latency to the first byte of each word (it's immediately available in data_buffer[7:0] after capture). Subsequent bytes within the same word are exposed at 1 byte per clock cycle (8ns @ 125 MHz). For a 1500-byte packet, serialization spans ~12 microseconds—acceptable for the sub-100ns per-packet decision target, because the parser operates in parallel with packet reception and the upstream 2048-word FIFO absorbs any transient backlog.
Claiming "it works" requires evidence. Three tools provided irrefutable validation:
On the PC side, Wireshark captured outgoing UDP packets on the Ethernet interface connected to the VCK190. The capture confirmed:
Wireshark Capture: UDP packets successfully transmitted from PC to FPGA over 1000BASE-X SFP+ link
Xilinx's Integrated Bit Error Rate Tester (IBERT) confirmed the 1.25 Gbps physical link (1000BASE-X uses 8b/10b encoding: 1 Gbps data rate + 25% overhead = 1.25 Gbps line rate). Key metrics:
Vivado IBERT: 1.25 Gbps physical link validated with clean eye diagram and zero bit errors
The Integrated Logic Analyzer (ILA)—Xilinx's on-chip oscilloscope—provided the smoking gun. By instrumenting the AXI-Stream bus inside the FPGA fabric, the ILA captured live packet data at the parser's input. The critical waveform showed:
```
tdata[31:0] = 0xDDCCBB55
tvalid      = 1
tready      = 1
```
Translation: The FPGA was actively receiving 32-bit words from the Ethernet MAC with valid data present. This confirmed end-to-end signal integrity from the PC's NIC → SFP+ fiber → VCK190 transceiver → AXI-Stream fabric. The packet wasn't just "arriving"—it was parsable and actionable.
This single ILA screenshot represents hundreds of hours of debugging, schematic review, and constraint tweaking. It's the
hardware equivalent of a successful printf("Hello, World!")—except at 125 MHz and with <72ns latency.
With the physical link validated and the parser receiving clean byte streams, the foundation is set for:
Next Steps:
The hardest part of any hardware project isn't the algorithm—it's getting that first LED to blink (or in this case, that first packet to parse). With the physical layer proven, the real fun begins. 🚀