⚠️ WORK IN PROGRESS • Active Development

72ns Gateway: A Versal ACAP HFT Accelerator

Exploring the Physics of Nanosecond-Latency Network Processing

01. The Motivation: Chasing Zero Latency

In the world of high-frequency trading, the speed of light becomes a tangible constraint. At 299,792,458 meters per second, light travels approximately 30 centimeters in one nanosecond. This means that by the time a photon traverses a single meter of fiber optic cable (~5ns), market opportunities may have already evaporated. The question becomes: How fast can we process incoming market data before physics itself becomes the bottleneck?

Traditional software-based network stacks—even those optimized with kernel bypass (DPDK), asynchronous I/O (io_uring), and RDMA—still operate in the realm of microseconds. A well-tuned C++ application with SR-IOV and CPU pinning might achieve 1-2 microseconds of processing latency. But this is still an order of magnitude or more slower than what's physically possible.

The shift from software to hardware represents the ultimate frontier of optimization. By moving packet processing logic into reconfigurable fabric—specifically, Field-Programmable Gate Arrays (FPGAs)—we can achieve deterministic, sub-100ns latency. This is not about incremental improvement; this is a paradigm shift.

Goal: Design a deterministic, wire-speed network stack on the Xilinx Versal VCK190 ACAP (Adaptive Compute Acceleration Platform) capable of filtering, parsing, and forwarding UDP packets in <100ns with zero CPU intervention.

This project is a personal R&D initiative to master the art of heterogeneous computing—bridging the gap between software algorithms and hardware implementation. The goal is not just to build a faster system, but to understand the engineering trade-offs involved in nanosecond-latency design: clock domain crossings, metastability, pipeline hazards, and the cruel reality of propagation delay.

Target Latency: 72ns
Clock Frequency: 125 MHz
Throughput: 1 Gbps
Platform: Versal ACAP

02. The Architecture: From Wire to Application

System Overview

The current prototype implements a receive (RX) path for processing inbound Ethernet frames. The architecture is modular, pipelined, and designed for scalability—while the current implementation uses 1G/2.5G Ethernet, the design can be extended to 40G/100G using the Versal MRMAC (Multirate Ethernet MAC) hard IP.

graph LR
    A[SFP+ Cage<br/>1000BASE-X] -->|SGMII| B[AXI 1G/2.5G<br/>Ethernet Subsystem]
    B -->|AXI-Stream<br/>8-bit @ 125MHz| C[UDP Filter<br/>RTL Module]
    C -->|Payload Only<br/>Header Stripped| D[Async FIFO<br/>Gray Code CDC]
    D -->|250MHz → 100MHz<br/>Clock Domain Cross| E[Packet Parser<br/>FSM Decoder]
    E -->|Decoded Messages| F[Application Logic<br/>Order Book Update]
    style A fill:#1f2937,stroke:#00d9ff,stroke-width:2px
    style B fill:#1f2937,stroke:#00d9ff,stroke-width:2px
    style C fill:#1f2937,stroke:#00ff9f,stroke-width:3px
    style D fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style E fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style F fill:#1f2937,stroke:#00d9ff,stroke-width:2px

Component Breakdown

1. AXI 1G/2.5G Ethernet Subsystem

Xilinx IP core that handles the MAC layer (Media Access Control) and presents received frames via AXI4-Stream. The interface runs at 125 MHz with an 8-bit data bus (TDATA[7:0]), providing 1 Gbps of throughput (125 MHz × 8 bits = 1000 Mbps).

2. UDP Filter (Custom RTL)

A byte-by-byte state machine that processes incoming Ethernet frames and performs the following operations:

  • Validates the EtherType field (0x0800 = IPv4)
  • Checks the IP protocol field (0x11 = UDP)
  • Filters on the UDP destination port (1234)
  • Strips the 42-byte Ethernet/IP/UDP header (14 + 20 + 8) and forwards only the payload

Latency: The filter makes its go/no-go decision by byte 37, the last byte of the UDP destination port. At 125 MHz (8ns per byte), that point is ~296ns after the start of the frame, but nearly all of that time is spent waiting for bytes to arrive on the wire. The first payload byte is forwarded immediately after byte 42 is received, so the filter itself adds only ~40ns of processing latency (5 clock cycles).

3. Async FIFO (Clock Domain Crossing)

A metastability-safe FIFO implementing the Cummings 2-flop synchronizer technique. This module bridges two asynchronous clock domains: the write side, clocked by the Ethernet MAC, and the read side, clocked by the downstream processing logic.

The design uses Gray code for pointer synchronization—a critical technique to avoid race conditions when crossing clock boundaries. Gray code ensures that only one bit changes between consecutive values, minimizing the risk of metastability during clock domain crossing.

CDC Hazard: Violating setup/hold times on a flip-flop during clock domain crossing can result in metastability—a state where the output oscillates unpredictably. The 2-flop synchronizer gives the signal two full clock cycles to resolve, reducing the probability of metastability to negligible levels (MTBF > 10^15 years).

4. Packet Parser (FSM Decoder)

A finite state machine that decodes application-layer protocols. The current implementation supports a custom framing protocol (0xAA55 sync pattern + length + payload + checksum), but is designed to be extended for industry-standard protocols like MoldUDP64 and NASDAQ TotalView-ITCH 5.0.

5. Testbench: Laptop as "Exchange Venue"

To validate the design, a Python/Scapy script running on a laptop injects raw Ethernet frames directly into the VCK190's SFP+ port. This simulates a realistic market data feed and allows for controlled testing of edge cases (fragmented packets, out-of-order delivery, checksum errors).

Hardware Specifications

Component Specification Details
Device Xilinx Versal AI Core xcvc1902-vsva2197-2MP-e-S
Fabric 1,039,104 Logic Cells ~899K LUTs, 1.8M Flip-Flops
Memory 34 Mb Block RAM 967 Block RAM blocks (36 Kb each), plus 463 UltraRAM blocks (288 Kb each)
DSP 1,968 DSP58 Slices INT8/FP32 MAC engines
AI Engines 400 AI Engine Tiles 8 TFLOPS @ FP32
Ethernet 4x 100GbE MRMAC Multirate: 1G/10G/25G/100G
PCIe Gen5 x16 64 GB/s bidirectional

02.5 The Complete Vision: From Market Feed to Strategy Execution

Status: Phase 1 Complete: Physical Link & Parser Active. The VCK190 1000BASE-X SFP+ link is operational, and the UDP packet parsing logic has been validated on hardware. Currently developing the downstream trading components: Book Builder, Order Manager, and Strategy Creator.

System Architecture Overview

While the current implementation focuses on the packet processing pipeline (Sections 02 and 03), the broader vision is to build a complete end-to-end HFT system that spans from simulated market data generation to strategy execution. This demonstrates not just hardware expertise, but a comprehensive understanding of the full trading stack.


Xilinx Versal VCK190 Evaluation Board: 1M+ logic cells, 400 AI engines, 4x 100GbE MRMAC

graph TB
    subgraph Laptop["🖥️ Laptop: Market Data Simulator"]
        A[Python Script<br/>MoldUDP64 Generator]
        A2[Synthetic Order Book<br/>ITCH 5.0 Messages]
    end
    subgraph Network["🌐 Network Layer"]
        B[1G Ethernet<br/>UDP/IP Stack]
    end
    subgraph VCK190["⚡ Xilinx Versal VCK190 FPGA"]
        C[1. UDP Filter<br/>Header Stripping]
        D[2. Async FIFO<br/>Clock Domain Crossing]
        E[3. MoldUDP64 Parser<br/>Packet Decoder]
        F[4. ITCH Message Parser<br/>Message Type FSM]
        subgraph Trading["Trading Logic (In Development)"]
            G[📖 Book Builder<br/>Order Book Reconstruction]
            H[📋 Order Manager<br/>Position Tracking]
            I[🎯 Strategy Creator<br/>Signal Generation]
        end
        J[Strategy Output<br/>Trade Signals]
    end
    subgraph Output["📊 Output / Monitoring"]
        K[ChipScope ILA<br/>Debug Probes]
        L[PCIe DMA<br/>to Host Software]
    end
    A --> A2
    A2 --> B
    B -->|Ethernet Frames| C
    C -->|Payload Only| D
    D -->|Synchronized Data| E
    E -->|Unpacked Messages| F
    F -->|Add Order<br/>Execute<br/>Cancel| G
    G -->|Best Bid/Ask<br/>Depth Levels| H
    H -->|Position State| I
    I -->|Buy/Sell Signals| J
    J --> K
    J --> L
    style Laptop fill:#1f2937,stroke:#f093fb,stroke-width:3px
    style Network fill:#1f2937,stroke:#4cc9f0,stroke-width:2px
    style VCK190 fill:#0a0e14,stroke:#00d9ff,stroke-width:4px
    style Trading fill:#151b24,stroke:#ff9500,stroke-width:3px,stroke-dasharray: 5 5
    style Output fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style C fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style D fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style E fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style F fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style G fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style H fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style I fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style J fill:#1f2937,stroke:#00d9ff,stroke-width:2px

Component Details

📡 Stage 1: Market Data Simulation (Laptop)

A Python script running on a standard laptop generates synthetic market data packets conforming to the MoldUDP64 protocol. It simulates the NASDAQ ITCH 5.0 feed by wrapping synthetic order book messages in MoldUDP64 framing and transmitting them over UDP.

Tool Stack: Python 3.11 • socket library for UDP transmission • struct for binary packing • NumPy for synthetic data generation
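A minimal sketch of what such a packer might look like, using struct for binary packing as listed in the tool stack. The field layout is taken from the MoldUDP64 frame-structure table in Section 03; the function name and signature are illustrative, not the actual script.

```python
import struct

# Hypothetical MoldUDP64 packer: 10-byte space-padded session,
# 8-byte big-endian sequence number, 2-byte message count,
# then a block of [2-byte length][message] entries.
def pack_moldudp64(session: str, seq: int, messages: list[bytes]) -> bytes:
    header = struct.pack(">10sQH",
                         session.encode().ljust(10),  # space-padded ASCII
                         seq,
                         len(messages))
    blocks = b"".join(struct.pack(">H", len(m)) + m for m in messages)
    return header + blocks
```

The resulting datagram can be handed directly to a UDP socket's sendto(), which is all the laptop-side simulator needs to do per packet.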

⚡ Stage 2-4: Packet Processing Pipeline (FPGA - Implemented)

These modules, covering the UDP filter, async FIFO, and protocol parsers, are currently functional and documented in Sections 02-04.

📖 Stage 5: Book Builder (In Development)

The Order Book Reconstruction module maintains a real-time view of market depth by applying Add Order, Order Executed, and Order Cancel messages to per-price levels, tracking the best bid and ask as orders arrive.

Challenge: Managing 10,000+ active orders in BRAM (limited to ~35 Mb on Versal). Requires efficient data structures: hash tables for O(1) lookup, binary heap for price priority sorting.
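Before committing a data structure to BRAM, the hash-table-plus-heap idea can be prototyped in software. The sketch below is exploratory only (the hardware module is still in development); the Book class and its method names are inventions for illustration, using a dict for O(1) order lookup and heaps for price priority with lazy deletion.

```python
import heapq

class Book:
    """Software sketch of the Book Builder idea (not the RTL design)."""
    def __init__(self):
        self.orders = {}                 # order_id -> (side, price, qty)
        self.bids = []                   # max-heap via negated prices
        self.asks = []                   # min-heap

    def add(self, oid, side, price, qty):
        self.orders[oid] = (side, price, qty)
        if side == "B":
            heapq.heappush(self.bids, (-price, oid))
        else:
            heapq.heappush(self.asks, (price, oid))

    def cancel(self, oid):
        self.orders.pop(oid, None)       # heap entry is removed lazily

    def best_bid(self):
        # Discard stale heap entries whose orders were canceled.
        while self.bids and self.bids[0][1] not in self.orders:
            heapq.heappop(self.bids)
        return -self.bids[0][0] if self.bids else None
```

Lazy deletion sidesteps in-place heap updates, which maps poorly to BRAM anyway; the hardware version will need a different cancellation strategy.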

📋 Stage 6: Order Manager (In Development)

Tracks positions, P&L, and risk metrics in real time.

🎯 Stage 7: Strategy Creator (In Development)

The alpha generation engine—where market microstructure patterns are exploited to produce buy/sell signals.

Note: The strategy logic will be parameterizable via PCIe register writes, allowing dynamic reconfiguration without FPGA reprogramming.

Why This Matters

This end-to-end implementation demonstrates the complete trading stack, from feed handling and protocol decoding through order book construction to signal generation, rather than hardware expertise in isolation.

Next Milestones:

  • Complete Book Builder BRAM implementation (ETA: Jan 2025)
  • Integrate Order Manager with position tracking (ETA: Feb 2025)
  • Deploy first strategy: Market making with 2-tick spread (ETA: Mar 2025)
  • Backtest using historical ITCH data feeds (ETA: Apr 2025)

03. The Protocol: NASDAQ TotalView-ITCH 5.0

Why NASDAQ ITCH?

NASDAQ TotalView-ITCH 5.0 is the de facto standard for low-latency market data feeds in U.S. equities. Broadcast from the NASDAQ data center in Carteret, New Jersey, this feed delivers full order-by-order depth of book: every displayed add, execution, cancel, and trade.

The protocol uses UDP multicast for one-to-many delivery, with MoldUDP64 as the session-layer framing protocol. This design eliminates TCP's handshake overhead and provides deterministic latency—critical for time-sensitive strategies like market making and statistical arbitrage.

Scaling Note: While the current prototype uses 1G Ethernet, the architecture is designed to scale to 40G/100G using the Versal MRMAC hard IP. The MoldUDP64 parser and ITCH decoder can be parallelized across multiple lanes (e.g., 4x 25G links) to achieve line-rate processing at 100 Gbps.

MoldUDP64 Frame Structure

MoldUDP64 is NASDAQ's proprietary session-layer protocol. Each UDP datagram contains:

Field Size (Bytes) Description Example
Session 10 ASCII session identifier (space-padded) "NASDAQ "
Sequence Number 8 64-bit packet sequence (big-endian) 0x0000000000001234
Message Count 2 Number of ITCH messages in packet (1-255) 0x0005 (5 messages)
Message Block Variable Concatenated ITCH 5.0 messages [Len][Msg1][Len][Msg2]...

Each message in the block is prefixed with a 2-byte length field (big-endian) followed by the ITCH message itself. The parser must:

  1. Verify the session ID matches the expected value
  2. Check sequence numbers for gaps (indicating packet loss)
  3. Iterate through the message block, parsing each ITCH message type
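The three steps above can be sketched as a software reference parser. This is illustrative only, assuming the frame layout from the table; the sequence-gap handling is reduced to an assertion where a real feed handler would issue a retransmission request.

```python
import struct

def parse_moldudp64(dgram: bytes, expect_session: bytes, expect_seq: int):
    """Reference MoldUDP64 unpacker (sketch, not production code)."""
    # Header: 10-byte session, 8-byte big-endian sequence, 2-byte count.
    session, seq, count = struct.unpack(">10sQH", dgram[:20])
    assert session == expect_session, "session mismatch"
    assert seq == expect_seq, "sequence gap: packet loss detected"
    msgs, off = [], 20
    for _ in range(count):
        # Each message is prefixed by a 2-byte big-endian length.
        (length,) = struct.unpack(">H", dgram[off:off + 2])
        msgs.append(dgram[off + 2:off + 2 + length])
        off += 2 + length
    return msgs
```

In hardware, the same iteration becomes an FSM that counts down the length field byte by byte before re-arming on the next length prefix.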

ITCH 5.0 Message Types (Sample)

Message Type Code Length (Bytes) Purpose
System Event 'S' 12 Start of day, end of day, trading halt
Add Order 'A' 36 New limit order added to book
Order Executed 'E' 31 Partial or full execution of order
Order Cancel 'X' 23 Order canceled (full or partial)
Trade (Non-Cross) 'P' 44 Matched trade execution

Hardware Challenge: The variable-length message format means the parser cannot use a fixed pipeline depth. The state machine must dynamically switch between message types based on the first byte (message type code). This requires careful FSM design to avoid pipeline stalls.
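The length-dispatch idea behind that FSM can be shown in a few lines of Python, using the message lengths from the sample table above (a subset of the full ITCH 5.0 message set; the helper name is my own).

```python
# Message lengths keyed by type code, from the sample table above.
ITCH_LEN = {b"S": 12, b"A": 36, b"E": 31, b"X": 23, b"P": 44}

def split_itch(stream: bytes) -> list[bytes]:
    """Split a concatenated ITCH byte stream using the type-code byte."""
    out, off = [], 0
    while off < len(stream):
        n = ITCH_LEN[stream[off:off + 1]]   # first byte selects the length
        out.append(stream[off:off + n])
        off += n
    return out
```

The hardware FSM does the same lookup, but the table becomes a case statement loading a byte-countdown register, so the pipeline never has to buffer a whole message to find its end.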

Target Deployment: Carteret, NJ

NASDAQ's primary data center is located in Carteret, New Jersey, approximately 10 miles from New York City. Co-location providers (e.g., Equinix NY4, NY5) offer rack space within the same facility, providing sub-millisecond fiber latency to the exchange. For HFT strategies, proximity is critical:

With 72ns of FPGA processing latency and ~25ns of fiber delay, the total system latency can approach 100ns—faster than a single memory access on a modern CPU.

04. The Engineering: Clock Domain Crossing & Metastability

Cummings 2-Flop Synchronizer: The Gold Standard for CDC

When data crosses from one clock domain to another, there's a risk of metastability—a state where a flip-flop's output oscillates unpredictably because the setup/hold timing was violated. This can propagate through the design, causing functional failures or even system crashes.

The 2-flop synchronizer (also called a "double synchronizer") is the industry-standard solution. By passing the signal through two flip-flops in the destination clock domain, we give it two full clock cycles to resolve to a stable state. The probability of metastability persisting through both stages is astronomically low (MTBF > 10^15 years for typical FPGA process nodes).

sequenceDiagram
    participant WD as Write Domain<br/>(250 MHz)
    participant GC as Binary→Gray<br/>Converter
    participant S1 as Sync Stage 1<br/>(FF @ 100 MHz)
    participant S2 as Sync Stage 2<br/>(FF @ 100 MHz)
    participant RD as Read Domain<br/>(100 MHz)
    WD->>GC: Write Pointer (Binary)
    GC->>GC: Convert to Gray Code
    Note over GC: Only 1 bit changes<br/>per increment
    GC->>S1: Gray Code Pointer
    Note over S1: Metastability Zone<br/>Setup/Hold Risk
    S1->>S2: Synchronized (Stage 1)
    Note over S2: Second Sync Stage<br/>Resolves Metastability
    S2->>RD: Stable Gray Pointer
    RD->>RD: Convert Gray→Binary
    RD->>RD: Compare with Read Ptr<br/>Generate FIFO Empty

Why Gray Code?

Gray code is a binary numeral system where two successive values differ in only one bit. This is critical for CDC because if multiple bits changed simultaneously during a clock edge, the synchronizer might capture an invalid intermediate state. For example:

Binary:  0011 (3) → 0100 (4)  [3 bits change]
Gray:    0010 (3) → 0110 (4)  [1 bit changes]

If the synchronizer captures the binary transition at the wrong moment, it might see 0000, 0001, 0101, or 0111—none of which are valid. With Gray code, the only possible captured values are 0010 (old value) or 0110 (new value)—both correct.
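The binary-to-Gray mapping is a single XOR, which makes the one-bit-change property easy to verify exhaustively in software (a quick reference sketch; these helper names match common usage, not any module in this project):

```python
def bin2gray(b: int) -> int:
    """Binary to Gray: each Gray bit is the XOR of adjacent binary bits."""
    return b ^ (b >> 1)

def gray2bin(g: int) -> int:
    """Gray to binary: cumulative XOR from the MSB downward."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

This reproduces the example above (3 → 0010, 4 → 0110) and lets a testbench assert that every pointer increment flips exactly one bit.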

FIFO Full/Empty Logic

The async FIFO uses separate logic for detecting full and empty conditions: empty is computed in the read domain by comparing the read pointer against the synchronized write pointer, while full is computed in the write domain by comparing the write pointer against the synchronized read pointer with the two Gray-code MSBs inverted (the wraparound case).

Insight: The FIFO depth must be a power of 2 for Gray code pointer arithmetic to work correctly. Additionally, the pointers must be 1 bit wider than the address width to distinguish between "full" (write caught up to read after wraparound) and "empty" (write and read at same address).
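That pointer-comparison rule can be modeled in a few lines, under the stated assumptions: power-of-2 depth and pointers one bit wider than the address. The function names are illustrative; the comparison itself follows the standard Cummings formulation.

```python
ADDR_W = 4                       # FIFO depth = 2**ADDR_W = 16 entries
                                 # pointers are (ADDR_W + 1)-bit values

def bin2gray(b: int) -> int:
    return b ^ (b >> 1)

def is_empty(rd_ptr: int, wr_ptr_sync: int) -> bool:
    # Empty: read pointer has caught up with the synchronized write pointer.
    return bin2gray(rd_ptr) == bin2gray(wr_ptr_sync)

def is_full(wr_ptr: int, rd_ptr_sync: int) -> bool:
    # Full: pointers differ only by one full wrap. In Gray code that means
    # the two MSBs are inverted and all lower bits match.
    g_w, g_r = bin2gray(wr_ptr), bin2gray(rd_ptr_sync)
    low_mask = (1 << (ADDR_W - 1)) - 1
    return (g_w & low_mask) == (g_r & low_mask) and \
           (g_w >> (ADDR_W - 1)) == (~g_r >> (ADDR_W - 1)) & 0b11
```

Walking the write pointer 16 ahead of the read pointer trips is_full; equal pointers trip is_empty, which is exactly why the extra wrap bit is needed.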

System Data Flow (Detailed)

flowchart TD
    A[SFP+ Transceiver<br/>1000BASE-X Physical Layer] -->|SGMII Serial| B[PCS/PMA Sublayer<br/>8B/10B Decode]
    B -->|GMII Parallel 8-bit| C[Ethernet MAC<br/>CRC Check, Preamble Strip]
    C -->|AXI-Stream<br/>TDATA TVALID TREADY TLAST| D{UDP Filter FSM}
    D -->|EtherType ≠ 0x0800| E[Drop Packet<br/>Non-IPv4]
    D -->|IP Proto ≠ 0x11| F[Drop Packet<br/>Non-UDP]
    D -->|Port ≠ 1234| G[Drop Packet<br/>Wrong Port]
    D -->|Valid UDP Payload| H[Strip 42-byte Header<br/>Forward Payload Only]
    H -->|AXI-Stream 8-bit| I[Async FIFO Write<br/>@ 250 MHz]
    I -->|Gray Code Sync| J[Async FIFO Read<br/>@ 100 MHz]
    J -->|AXI-Stream 8-bit| K[Packet Parser FSM<br/>0xAA55 Framing]
    K -->|Decoded Message| L[Application Logic<br/>Market Data Handler]
    style D fill:#00ff9f,stroke:#00ff9f,color:#0a0e14,stroke-width:3px
    style I fill:#ff9500,stroke:#ff9500,color:#0a0e14,stroke-width:3px
    style J fill:#ff9500,stroke:#ff9500,color:#0a0e14,stroke-width:3px
    style K fill:#00ff9f,stroke:#00ff9f,color:#0a0e14,stroke-width:3px

05. The Library: Recommended Reading

This project builds upon decades of research in asynchronous design, network protocols, and quantitative finance. The following texts are essential reading for anyone serious about low-latency systems engineering:

Simulation and Synthesis Techniques for Asynchronous FIFO Design
Clifford E. Cummings, Sunburst Design, Inc. | SNUG 2002

The definitive paper on clock domain crossing (CDC) for FIFOs. Cummings' work on Gray code synchronizers is cited in nearly every FPGA vendor's design guide. This paper explains why naive CDC techniques (e.g., single-flop synchronizers) fail and provides Verilog implementations of production-grade async FIFOs. A must-read for understanding metastability, MTBF calculations, and proper synchronizer design.

NASDAQ TotalView-ITCH 5.0 Specification
NASDAQ OMX Group | Official Protocol Specification

The authoritative reference for NASDAQ's market data feed protocol. Covers all the message types, byte-level encoding, multicast group assignments, and retransmission request procedures. Essential for implementing ITCH decoders in hardware or software. Available from NASDAQ's developer portal.

Developing High-Frequency Trading Systems: Learn How to Implement High-Frequency Trading from Scratch with C++ or Java Basics
Sebastien Donadio, Sourav Ghosh, Romain Vernois | Packt Publishing, 2022

A comprehensive guide to building HFT systems from first principles. Covers market microstructure, order book dynamics, low-latency networking (kernel bypass, RDMA), and C++ optimization techniques. While focused on software, the system architecture principles apply equally to FPGA-based designs. Excellent for understanding the why behind nanosecond-latency requirements.

Max Dama on Automated Trading
Max Dama | Self-Published White Paper (Archived)

A legendary (and hard-to-find) PDF that circulated in quantitative trading circles circa 2010. Dama's writing demystifies the engineering challenges of automated trading systems: feed handlers, risk checks, order routing, and latency measurement. The section on "The Cost of a Microsecond" is particularly relevant for understanding the economic incentives driving FPGA adoption in HFT.

Algorithmic Trading: Winning Strategies and Their Rationale
Ernest P. Chan | Wiley Trading, 2013

While not exclusively about latency, Chan's book provides context on when speed matters. Statistical arbitrage, mean reversion, and momentum strategies all have different latency sensitivities. Understanding the alpha decay curve helps justify the engineering effort (and cost) of sub-microsecond systems. Essential for aligning technical capabilities with trading strategy requirements.

UltraScale Architecture Clocking Resources (UG572)
Xilinx (AMD) | User Guide v1.11, 2023

Xilinx's official guide to clock management on UltraScale architecture devices (companion guides cover the 7-series and Versal families). Covers MMCMs (Mixed-Mode Clock Managers), PLLs, clock domain crossing constraints, and timing closure techniques. Critical for understanding how to achieve the 125 MHz and 250 MHz clocks used in this design without introducing jitter or skew.

Additional Resources: The Xilinx Developer Zone hosts excellent application notes on high-speed Ethernet (XAPP1082), PCIe DMA (PG195), and Versal AI Engine programming (UG1076). For hardware timestamping and PTP (Precision Time Protocol), see IEEE 1588-2019.

06. The Milestone: First Packet & The 32-bit Trap

Achievement Unlocked: VCK190 1000BASE-X SFP+ link operational. UDP packet parser validated on hardware. The FPGA is now successfully receiving, parsing, and processing real network traffic at line rate.

Phase 1 Complete: End-to-End Pipeline Architecture

The complete packet processing pipeline spans six modules, transforming raw Ethernet frames at 125 MHz into structured market tick data ready for the order book builder. This architecture demonstrates the full journey from physical layer (PHY) to application layer—every byte accounted for, every clock cycle optimized.

graph TD
    subgraph PHY["Physical Layer - 1000BASE-X @ 125 MHz"]
        A[Ethernet MAC<br/>AXI-Stream Output]
    end
    subgraph FILTER["Layer 3/4 Filtering - 32-bit Words"]
        B[udp_filter Module<br/>Header Validation and Stripping]
    end
    subgraph CDC["Clock Domain Crossing - Buffering"]
        C[axis_data_fifo<br/>2048 Words Deep]
    end
    subgraph PARSE["Protocol Decoding - Serialization Bottleneck"]
        D[packet_parser Module<br/>32-bit to 8-bit Serializer]
    end
    subgraph SHIM["Data Assembly - Tick Structuring"]
        E[parser_shim Module<br/>Byte Stream to Tick Data]
    end
    subgraph BOOK["Order Book Management"]
        F[book_builder Module<br/>Best Bid/Offer Tracker]
    end
    A -->|"32-bit TDATA<br/>Full Ethernet Frame<br/>1518 bytes max"| B
    B -->|"32-bit Payload Only<br/>Headers Stripped<br/>Starts at Byte 42"| C
    C -->|"32-bit Buffered<br/>Backpressure Handling<br/>TVALID/TREADY"| D
    D -->|"8-bit Byte Stream<br/>AA 55 Len Payload CS<br/>9 clocks for 9 bytes"| E
    E -->|"Structured Tick<br/>Price 32b + Qty 32b + Side 1b<br/>Valid Pulse"| F
    F -->|"Best Bid/Ask<br/>BBO Updated<br/>bbo_updated Pulse"| G[Trading Strategy<br/>Signal Generation]
    style A fill:#1f2937,stroke:#00d9ff,stroke-width:2px
    style B fill:#1f2937,stroke:#00ff9f,stroke-width:3px
    style C fill:#1f2937,stroke:#ff9500,stroke-width:2px
    style D fill:#1f2937,stroke:#ff4757,stroke-width:3px
    style E fill:#1f2937,stroke:#00ff9f,stroke-width:2px
    style F fill:#1f2937,stroke:#00d9ff,stroke-width:3px
    style G fill:#1f2937,stroke:#f093fb,stroke-width:2px,stroke-dasharray: 5 5

Pipeline Stage Breakdown

Module: udp_filter
  Input:    32-bit words (full Ethernet frame)
  Output:   32-bit words (UDP payload only)
  Function: • Validates EtherType = 0x0800 (IPv4)
            • Checks IP Protocol = 0x11 (UDP)
            • Filters UDP Port = 1234
            • Strips 42-byte header (14+20+8)
  Latency:  ~40ns (5 cycles)

Module: axis_data_fifo
  Input:    32-bit words @ 125 MHz
  Output:   32-bit words @ 100 MHz
  Function: • Clock domain crossing (125→100 MHz)
            • 2048-word buffering
            • Gray code pointer sync
            • Backpressure handling
  Latency:  ~24ns (3 cycles)

Module: packet_parser
  Input:    32-bit words (parallel data)
  Output:   8-bit bytes (serial stream)
  Function: • 32→8 bit serialization
            • Finds [AA 55] sync header
            • Reads length byte
            • Validates XOR checksum
  Latency:  ~72ns (9 cycles for 9 bytes)

Module: parser_shim
  Input:    8-bit byte stream (sequential bytes)
  Output:   Structured tick (parallel fields)
  Function: • Assembles 9-byte payload: 4B price (big-endian), 4B quantity, 1B side ('B'/'S')
            • Converts to parallel output
  Latency:  ~8ns (1 cycle)

Module: book_builder
  Input:    Tick data (price, qty, side)
  Output:   Best Bid/Offer (BBO)
  Function: • Maintains best_bid (highest buy)
            • Maintains best_ask (lowest sell)
            • Generates bbo_updated pulse
  Latency:  ~8ns (1 cycle)

Performance Metrics: Total pipeline latency from Ethernet MAC to BBO update is approximately 152ns (19 clock cycles @ 125 MHz). The serialization stage (packet_parser) accounts for nearly half of this delay—a deliberate trade-off to simplify downstream byte-oriented protocol parsing. For ultra-low latency applications, the parser could be re-architected to operate on 32-bit chunks directly, reducing latency to <50ns at the cost of significantly more complex state machine logic.

Protocol Format: Custom Framing

The current implementation uses a lightweight custom framing protocol optimized for simplicity and determinism:

[Byte 0-1]   Header:   0xAA55 (sync pattern for frame alignment)
[Byte 2]     Length:   N (payload size in bytes, excludes header/checksum)
[Byte 3..N+2] Payload:  N bytes of application data
[Byte N+3]   Checksum: XOR of all payload bytes (simple error detection)

Example: Market Tick (9-byte payload)
AA 55 09 | 00 00 27 10 | 00 00 03 E8 | 42 | 9E
^Header  ^Len ^Price=$10000 ^Qty=1000   ^Buy ^XOR

This format provides self-synchronizing frame alignment via the 0xAA55 sync pattern, explicit length framing for variable payloads, and lightweight error detection through the XOR checksum.

Production Note: For deployment with real NASDAQ ITCH 5.0 feeds, this custom framing layer would be replaced by MoldUDP64 decoding (session ID + sequence number + message count), followed by ITCH message type parsing. The serialization architecture remains unchanged—only the state machine logic in packet_parser would be updated to handle the 23 ITCH message types.
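A small helper makes the framing concrete and doubles as a test-vector generator for the parser (the function name is my own; the checksum is the XOR of the payload bytes only, as specified above):

```python
def frame(payload: bytes) -> bytes:
    """Wrap a payload in the custom framing: AA 55 | len | payload | XOR."""
    csum = 0
    for b in payload:
        csum ^= b                       # XOR of payload bytes only
    return b"\xAA\x55" + bytes([len(payload)]) + payload + bytes([csum])
```

Feeding the framed bytes to the parser's testbench lets each field (sync, length, checksum) be checked against a known-good software reference.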

The Bring-up Challenge: When Reset Meets Reality

Hardware bring-up is where theory meets silicon—and where assumptions are ruthlessly tested. The initial attempt to establish the 1000BASE-X link failed spectacularly: the link stayed stubbornly down despite correct clock frequencies, proper transceiver configuration, and verified cable connections.

The root cause was a classic Reset Polarity mismatch. The Xilinx 1G/2.5G Ethernet Subsystem IP expects pma_reset (Physical Medium Attachment reset) to be driven by peripheral_reset from the Processor System Reset IP. However, the initial design erroneously connected it to interconnect_aresetn, which has inverted polarity. The transceiver was perpetually held in reset, preventing link negotiation.

Adding to the chaos: a wiring conflict where tvalid (active-high data valid signal) and aresetn (active-low reset) were shorted together on the PCB. This created a paradoxical state where asserting reset would inadvertently signal "data valid," corrupting the AXI-Stream handshake. Isolating and rewiring these signals restored proper reset behavior.

MAC Configuration: The Promiscuous Mode Hack

Standard Ethernet communication requires an ARP handshake to map IP addresses to MAC addresses before packets can be exchanged. However, in a pure FPGA-to-PC test environment (no OS network stack on the FPGA side), implementing a full ARP responder would be overkill for initial validation.

The solution: Enable Promiscuous Mode on the Xilinx Ethernet MAC. In this mode, the FPGA accepts all incoming packets regardless of destination MAC address—no ARP required. This allowed the PC to blast UDP packets directly to the FPGA's physical port, bypassing Layer 2 address resolution entirely. Think of it as the hardware equivalent of tcpdump mode: listen to everything, filter in software (or hardware, in this case).

The "32-bit Trap": When Bytes Aren't Bytes

Here's where things got interesting. The initial assumption was that the AXI-Stream interface from the Ethernet MAC would deliver data as a simple byte stream: 0x55, 0xBB, 0xCC, 0xDD, etc. Architecturally clean, easy to parse.

Reality check: The interface runs in 32-bit Little Endian mode at 125 MHz. A single AXI-Stream transaction presents four bytes simultaneously on tdata[31:0], with byte ordering reversed. For example, the byte sequence 0x55 0xBB 0xCC 0xDD appears on the bus as:

tdata[31:0] = 0xDDCCBB55  // Little Endian: LSB (0x55) in tdata[7:0]
            

This is not a bug—it's the standard AXI-Stream convention for maximizing throughput. At 1 Gbps (125 MHz × 8 bits), the MAC naturally emits data in 32-bit chunks to match the fabric clock rate. However, the downstream UDP parser state machine expects sequential 8-bit bytes for header field extraction (IP addresses, port numbers, checksums).
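The endianness trap is easy to reproduce on the host side with Python's struct module, which is a handy way to generate expected bus values for ILA comparison:

```python
import struct

# Four wire bytes, in arrival order, captured as one little-endian
# 32-bit word: the first byte on the wire lands in the low byte.
word, = struct.unpack("<I", bytes([0x55, 0xBB, 0xCC, 0xDD]))
assert word == 0xDDCCBB55   # LSB (0x55) sits in tdata[7:0]
```

The same one-liner, run over a captured packet, predicts every tdata word the ILA should display.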

The disconnect is profound: You can't simply wire tdata[7:0] to the parser's byte input. You'd only see the first byte of every 4-byte word, skipping 75% of the packet. The parser would interpret every fourth byte as consecutive, producing gibberish.

The Solution: An "Internal Serializer" in Verilog

The fix required embedding a word-to-byte unpacking mechanism directly into the packet parser module. This "internal serializer" buffers incoming 32-bit words and processes them one byte at a time, effectively converting the parallel 32-bit AXI-Stream interface into a sequential byte stream for the parser state machine.

// Internal Serializer: 32-bit Word → 8-bit Byte Stream
reg [31:0] data_buffer;      // Holds the 4 bytes we just received
reg [1:0]  byte_index;       // Tracks which byte (0-3) we are processing
reg        buffer_valid;     // Do we have data in the buffer?
wire [7:0] current_byte;     // The specific byte we are looking at

// Extract the current byte (LSB of buffer)
assign current_byte = data_buffer[7:0];

// Drive the downstream AXI-Stream interface: a byte is valid
// whenever the buffer holds unconsumed data
assign m_axis_tdata  = current_byte;
assign m_axis_tvalid = buffer_valid;

always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
        s_axis_tready <= 0;
        buffer_valid  <= 0;
        byte_index    <= 0;
        data_buffer   <= 0;
    end 
    else begin
        // ---------------------------------------------------------
        // SERIALIZER: Ingest 32-bit Words → Output 8-bit Bytes
        // ---------------------------------------------------------
        
        // If buffer is empty, try to grab new data from UDP Filter
        if (!buffer_valid) begin
            s_axis_tready <= 1; // "Give me data"
            if (s_axis_tvalid && s_axis_tready) begin
                data_buffer   <= s_axis_tdata; // Capture 32 bits
                buffer_valid  <= 1;
                byte_index    <= 0; // Start at Byte 0
                s_axis_tready <= 0; // Hold off on next word
            end
        end 
        
        // If buffer has data, process 1 byte per clock
        else if (buffer_valid && m_axis_tready) begin
            // Process current_byte with parser state machine...
            // (Parser logic operates on current_byte)
            
            // Shift buffer to next byte
            if (byte_index == 3) begin
                buffer_valid <= 0; // Done with this 32-bit word
            end else begin
                data_buffer <= data_buffer >> 8; // Shift right by 8 bits
                byte_index  <= byte_index + 1;
            end
        end
    end
end
            

This implementation uses a right-shift register approach: each clock cycle, the buffer shifts right by 8 bits, exposing the next byte at data_buffer[7:0]. The byte_index counter tracks progress through the 4-byte word, and buffer_valid signals when the serializer needs to fetch the next 32-bit chunk from the MAC.

Critically, the serializer respects AXI-Stream backpressure: it only advances when m_axis_tready is asserted (downstream is ready to accept data). This prevents data loss during clock domain crossings, FIFO congestion, or when the parser state machine is busy processing headers. The handshake mechanism ensures lossless, deterministic packet processing at line rate.

Performance Analysis: The serializer adds zero latency to the first byte of each word (it's immediately available in data_buffer[7:0] after capture). Subsequent bytes are exposed at 1 byte per clock cycle (8ns @ 125 MHz), which exactly matches the rate at which bytes arrive on the 1 Gbps wire. For a 1500-byte packet, serialization spans ~12 microseconds, but it overlaps packet reception rather than following it, so it keeps pace with the link and does not erode the sub-100ns per-packet processing target.
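The serializer's byte ordering can be checked against a compact software model (a hypothetical reference, not the RTL; it mirrors the right-shift behavior, emitting each 32-bit word LSB-first):

```python
def serialize_words(words: list[int]) -> list[int]:
    """Model of the internal serializer: one byte per 'clock', LSB first."""
    out = []
    for w in words:
        for _ in range(4):
            out.append(w & 0xFF)   # expose data_buffer[7:0]
            w >>= 8                # shift right by one byte
    return out
```

Running it on 0xDDCCBB55 recovers the original wire order 0x55, 0xBB, 0xCC, 0xDD, which is exactly the property the ILA capture was used to confirm.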

Validation: Three Layers of Proof

Claiming "it works" requires evidence. Three tools provided irrefutable validation:

1. Wireshark: Packet Egress Confirmation

On the PC side, Wireshark captured outgoing UDP packets on the Ethernet interface connected to the VCK190. The capture confirmed that the frames left the PC exactly as constructed, byte for byte, ruling out the host as a source of corruption.


Wireshark Capture: UDP packets successfully transmitted from PC to FPGA over 1000BASE-X SFP+ link

2. IBERT: Physical Layer Verification

Xilinx's Integrated Bit Error Rate Tester (IBERT) confirmed the 1.25 Gbps physical link (1000BASE-X uses 8b/10b encoding: 1 Gbps data rate + 25% overhead = 1.25 Gbps line rate). Key metrics: a stable link lock, an open eye diagram, and zero bit errors over the test run.


Vivado IBERT: 1.25 Gbps physical link validated with clean eye diagram and zero bit errors

3. Vivado ILA: The "Money Shot"

The Integrated Logic Analyzer (ILA)—Xilinx's on-chip oscilloscope—provided the smoking gun. By instrumenting the AXI-Stream bus inside the FPGA fabric, the ILA captured live packet data at the parser's input. The critical waveform showed:

tdata[31:0] = 0xDDCCBB55
tvalid      = 1
tready      = 1
            

Translation: The FPGA was actively receiving 32-bit words from the Ethernet MAC with valid data present. This confirmed end-to-end signal integrity from the PC's NIC → SFP+ fiber → VCK190 transceiver → AXI-Stream fabric. The packet wasn't just "arriving"—it was parsable and actionable.

This single ILA screenshot represents hundreds of hours of debugging, schematic review, and constraint tweaking. It's the hardware equivalent of a successful printf("Hello, World!")—except at 125 MHz and with <72ns latency.

What This Milestone Unlocks

With the physical link validated and the parser receiving clean byte streams, the foundation is set for the protocol and trading layers that follow.

Next Steps:

  • Phase 2: Implementing the MoldUDP64 and NASDAQ ITCH 5.0 protocol handlers
  • Phase 3: Building the Limit Order Book in BRAM with efficient price-level indexing
  • Phase 4: Developing the strategy execution engine and order management system

The hardest part of any hardware project isn't the algorithm—it's getting that first LED to blink (or in this case, that first packet to parse). With the physical layer proven, the real fun begins. 🚀