Ethernet Packet Analysis Engine, Latency Optimized
Virtex-6 HXT FPGA PCIe card
Infiniband and Triple 10 GbE Interfaces
- 3 separate 10GbE LAN/WAN using SFP+ modules
- Customized IP for packet analysis with minimum latency
- Hosted in an 8-lane GEN1 or GEN2 PCIe slot
- Stand-alone operations supported with external ATX power supply
- Xilinx Virtex-6 HXT FPGA (FF1923):
- HX565T-2,-1 (fastest to slowest)
- 2M ASIC gates (ASIC measure) when stuffed with Virtex-6 HX565T
- 354k flip-flop/6-input LUTs (708k total FFs)
- About 32.8 Mb (roughly 4 MB) total FPGA block memory (1824, 18 kbit blocks)
- 864, 25×18 multipliers
- Bulk memory: DDR3 DIMM
- 72-bit data width (64-bit with 8-bit ECC)
- 533 MHz operation, PC3-10600
- Addressing/power to support 4GB
- DDR3 Verilog/VHDL reference design provided (no charge)
- Optimized DDR3 controller for lowest latency bulk memory access
- RLDRAM Option for ultra-low latency
- 3 independent QDR II SRAM memory channels
- Two 4M x 36 (144Mb) channels
- One 4M x 72 (288Mb) channel
- Separate 36-bit read and write ports
- 350 MHz bus operation, DDR (double data rate)
- Fast enough to be clocked at 312.50 MHz
- Eliminates clock synchronization delays between memory and Ethernet clock
- Full support for embedded logic analyzers via JTAG interface
- ChipScope and other third-party debug solutions
- The OPTIONAL FIX board support package (DN_FBSP) for the DNPCIe_10G_HXT_LL is a functioning reference design with the following components:
- 10-Gigabit Ethernet MAC
- OPTIONAL TCP/IP Offload Engine (TOE)
- FIX protocol parser
- Tick Filter (optional)
- PCIe Interface (8-lane, GEN2)
- QDR II+ Controller
- DDR3 Controller
- Status FPGA-controlled LEDs
- Enough light to cause severe migraine headaches
Small Rackmount Servers the board will work with:
- SuperMicro X8DTG-D
- DELL R720 2U
- HP DL380 Gen8 2U
- It is possible, although with some quirks, to use HP SL390 with the SL6500 chassis or IBM BladeCenter HS12 or HS22 with the blade extension.
The DNPCIe_10G_HXT_LL will need to have the 10G connections re-routed.
- Any HP, IBM, or DELL server should work
The DNPCIe_10G_HXT_LL is a PCIe-based FPGA board designed to minimize input to output processing latency on 10Gb Ethernet packets. The primary application is for ultra low latency, high throughput trading without CPU intervention. Every possible variable that affects input to output latency has been analyzed and minimized. Raw 10 GbE Ethernet packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform has the ability to achieve the theoretical minimum Ethernet packet processing latency.
The FPGA – Xilinx Virtex-6 HXT
We use a single FPGA from the HXT sub-family of Xilinx Virtex-6 in the FFG1923 package. This package supports 720 I/O, the majority of which are utilized. Most are dedicated to a variety of off-chip memory peripherals, including QDR II+ for low-latency, high-speed look-up and DDR3 for performance-oriented bulk storage. The HXT FPGAs contain high-speed transceiver PHYs of two different types. GTX transceivers are capable of handling data rates of 150 Mb/s to 6.5 Gb/s, making these useful for lower speed Ethernet and GEN1/GEN2 PCI Express. The GTH transceivers are tuned higher, 2.488 to 11 Gb/s, making them applicable to 10 gigabit Ethernet (10 GbE). Eight of the GTX transceivers are used for GEN2-capable PCIe. Four of the GTH transceivers are connected to 10 GbE SFP+ sockets. Another 8 GTX transceivers are connected to our standard GTX expansion connector, allowing for peripheral expansion but most applicable to in-chassis, board-to-board data daisy chaining.
Two possible FPGAs can be stuffed: the HX380T or the HX565T. The HX380T comes in three speed grades, with -3 being the fastest. The larger HX565T is limited to the -2 speed grade. This means the smaller device can be clocked at a higher frequency at the cost of slightly fewer FPGA logic resources. Table 1 depicts the resources of the two FPGAs with the Xilinx marketing exaggerations removed. These are both large FPGAs. The HX565T is capable of handling >4M ASIC gates of logic and is among the largest of the FPGAs shipping from any vendor in 2011. Features of the Virtex-6 HXT FPGAs include the efficient, dual-register 6-input look-up table (LUT) logic, 18 Kb (2 x 9 Kb) block RAMs, and second generation DSP48E1 slices (which include 25 x 18 multipliers). Floating point functions can be implemented using these DSP slices.
To give you an idea of how large these FPGAs are, consider MicroBlaze, the Xilinx embedded processor IP. This processor is implemented in FPGA logic gates. Fifty (50!) or more of these MicroBlaze processors can be stuffed into an HX565T with room to spare. Somewhat fewer fit if you incorporate IEEE 754 floating point.
Three Channels of 10 GbE
The HXT FPGAs have transceivers capable of 10 GbE. The physical interface is handled using SFP+ modules. This allows you to bypass a MAC if necessary and process raw Ethernet packets. The DNPCIe_10G_HXT_LL has three 10 GbE channels.
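To put the latency budget in perspective, the sketch below estimates how long a frame occupies a 10 Gb/s link and how many cycles of the 156.25 MHz internal Ethernet clock it spans. It assumes a 64-bit internal datapath (64 bits x 156.25 MHz = 10 Gb/s) and the standard Ethernet overhead of an 8-byte preamble/SFD plus a 12-byte inter-frame gap. The function names are illustrative and not part of any shipped reference design.

```python
from fractions import Fraction

LINE_RATE_BPS = 10_000_000_000   # 10 GbE line rate
ETH_CLOCK_HZ = 156_250_000       # internal Ethernet clock quoted for this board
PREAMBLE_SFD_BYTES = 8           # standard preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12        # standard minimum inter-frame gap


def wire_time_ns(frame_bytes, include_overhead=True):
    """Time a frame occupies the link, in nanoseconds (exact fraction)."""
    total = frame_bytes
    if include_overhead:
        total += PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
    return Fraction(total * 8 * 10**9, LINE_RATE_BPS)


def clock_cycles(frame_bytes, datapath_bits=64):
    """Cycles needed to move a frame across the assumed 64-bit datapath."""
    bits = frame_bytes * 8
    return -(-bits // datapath_bits)  # ceiling division


# Sanity check on the assumption: 64 bits per cycle matches the line rate.
assert ETH_CLOCK_HZ * 64 == LINE_RATE_BPS

for size in (64, 1518):
    print(size, "bytes:", float(wire_time_ns(size)), "ns on the wire,",
          clock_cycles(size), "cycles at 156.25 MHz")
```

Under these assumptions a minimum-size 64-byte frame takes 8 clock cycles to cross the datapath, so every cycle of added pipeline delay is a measurable fraction of the packet time.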
QDR II+ SSRAM
We use 4 individual quad data rate static RAMs (QDR II+ SSRAM) in the 2M x 36 configuration. This style of memory has separate input and output data paths, enabling maximum read/write data bandwidth with minimum latency. These four separate memories can be controlled individually, but any two (2M x 72), three (2M x 108), or four (2M x 144) of the QDR II+ SRAMs can be treated as a single memory. The maximum tested frequency of this memory is 400 MHz. To minimize processing latency, we suspect it will be best to clock these QDR II+ SRAMs at 312.50 MHz, exactly twice the internal Ethernet controller frequency of 156.25 MHz. The HXT FPGAs are capable of generating internal 2x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the QDR II+ SSRAM can be exploited, including concurrent read and write operations and four-tick bursts. The only real limitation is the amount of time and effort spent in customizing the individual memory controllers.
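The clock choice above can be checked with a few lines of arithmetic. The sketch below lists the integer multiples of the 156.25 MHz Ethernet clock that stay within the 400 MHz tested limit, and estimates the raw per-port bandwidth, assuming each 36-bit port transfers data on both clock edges and counting all 36 bits. The helper names are illustrative only.

```python
ETH_CLOCK_MHZ = 156.25        # internal Ethernet controller frequency
QDR_MAX_TESTED_MHZ = 400.0    # maximum tested QDR II+ frequency
QDR_PORT_WIDTH_BITS = 36      # width of each read or write port


def synchronous_multiples(base_mhz, limit_mhz):
    """Integer multiples of the base clock that do not exceed the limit."""
    multiples = []
    n = 1
    while n * base_mhz <= limit_mhz:
        multiples.append((n, n * base_mhz))
        n += 1
    return multiples


def port_bandwidth_gbps(clock_mhz, width_bits=QDR_PORT_WIDTH_BITS, devices=1):
    """Raw bandwidth of one port direction: two transfers per clock cycle."""
    return clock_mhz * 2 * width_bits * devices / 1000.0


print(synchronous_multiples(ETH_CLOCK_MHZ, QDR_MAX_TESTED_MHZ))
print(port_bandwidth_gbps(312.5), "Gb/s per port, one device")
print(port_bandwidth_gbps(312.5, devices=4), "Gb/s per port, four devices ganged")
```

The 2x multiple (312.50 MHz) is the highest phase-synchronous option below the tested limit; the 3x multiple of 468.75 MHz would exceed it.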
DDR3 DIMM
A single DDR3 DIMM socket enables up to 4GB of memory for bulk storage and lookup. Assuming a 4GB DIMM, the memory configuration is 512M x 72. Assuming a -2 or -3 speed grade FPGA, this interface is tested at the maximum FPGA I/O frequency: 533 MHz (1066 Mb/s per pin with DDR). You are welcome to use this memory as 64 bits of data with 8 bits of error correction (ECC), or as a 72-bit memory without correction.
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR3 interface at a 3x multiple of the base Ethernet frequency of 156.25 MHz, which is 468.75 MHz. A 3x phase synchronous clock can be easily generated internal to the FPGA, allowing zero latency synchronous data transfers between the Ethernet packet receiving logic and the DDR3 memory controller. The DDR3 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR3 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR3 memory utilized. As with the QDR II+ SRAM, the only real limitation is the amount of time and effort spent customizing the DDR3 memory controller to your needs.
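As a rough guide to tailoring the timing variables, the sketch below converts a datasheet timing value into whole clock cycles at a chosen DDR3 clock and estimates the peak data bandwidth of the 64-bit data path. The 13.125 ns figure used in the example is a placeholder, not a value from any particular DIMM datasheet, and the function names are hypothetical.

```python
from fractions import Fraction

ETH_CLOCK_HZ = 156_250_000   # base Ethernet frequency


def ddr3_clock_hz(multiple=3):
    """DDR3 clock as an integer multiple of the Ethernet clock."""
    return ETH_CLOCK_HZ * multiple


def cycles_for(timing_ps, clock_hz):
    """Smallest whole number of clock cycles covering a timing spec in picoseconds."""
    return -(-(timing_ps * clock_hz) // 10**12)  # ceiling division


def peak_bandwidth_gbytes(clock_hz, data_bits=64):
    """Peak data bandwidth in GB/s: two transfers per clock, ECC bits excluded."""
    return Fraction(clock_hz * 2 * data_bits, 8 * 10**9)


clk = ddr3_clock_hz()                    # 468.75 MHz
example_timing_ps = 13_125               # placeholder value, replace with your DIMM's spec
print(clk / 1e6, "MHz")
print(cycles_for(example_timing_ps, clk), "cycles for the example timing value")
print(float(peak_bandwidth_gbytes(clk)), "GB/s peak")
```

Running the same functions at 533 MHz shows the trade-off: the synchronous 3x clock gives up some peak bandwidth in exchange for removing the clock-domain crossing.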
PCIe – Customizable 8-lane, GEN2 PCI Express
PCIe is connected directly to the FPGA via 8 lanes of GTX transceivers. The interface is fully GEN2 capable. We ship PCIe IP that is a full-function, fixed, 8-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. We can help configure this IP to your needs, including BAR sizes. Additionally, we can add or remove DMA engines, scratchpad memories, interrupts, and other host-related functions to maximize performance while using the minimum FPGA resources. Driver source in ‘C’ for several operating systems is included at no charge. Partial reconfiguration of the FPGA is supported via the PCIe interface.
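When sizing DMA transfers to the host, it helps to know the ceiling of the link. The sketch below estimates raw per-direction bandwidth from the standard PCIe signaling rates (2.5 GT/s per lane for GEN1, 5 GT/s for GEN2, both using 8b/10b encoding). It ignores packet headers and flow-control overhead, so real throughput will be lower. The names are illustrative.

```python
# Standard per-lane signaling rates in transfers per second.
SIGNALING_RATE = {1: 2_500_000_000, 2: 5_000_000_000}


def pcie_raw_bytes_per_s(gen, lanes=8):
    """Raw per-direction bandwidth after 8b/10b encoding, in bytes per second."""
    payload_bits = SIGNALING_RATE[gen] * lanes * 8 // 10  # 8 data bits per 10 line bits
    return payload_bits // 8


for gen in (1, 2):
    print("GEN%d x8:" % gen, pcie_raw_bytes_per_s(gen) / 1e9, "GB/s per direction")
```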
Board to Board Daisy Chaining and Expansion
These boards can be stacked in a PCIe system using the GTX Expansion Header. We connect 8 lanes of the GTX transceivers to a high-speed connector. This enables high-speed board-to-board communication at a rate of 10 GB/s.
How Everything Works …
With direct data feeds such as NASDAQ ITCH and OUCH, or Financial Information Exchange (FIX), the DNPCIe_10G_HXT_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. The MAC, operating system, et al. can be bypassed. There are no interrupts. No operating system. Not a single clock cycle is wasted here, enabling a near theoretical minimum in-to-out response time. For algorithms requiring processing, FPGA resources can be hard coded to perform the task, including real-time Monte Carlo analysis and floating point. This makes the DNPCIe_10G_HXT_LL particularly suitable for compliance checking, high frequency trading, low latency trading, derivative pricing and risk management.
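For readers unfamiliar with FIX, the following is a small software model of the tag=value parsing and checksum validation that a FIX parser has to perform. It is only a sketch for prototyping logic on a host: it says nothing about how the DN_FBSP reference design implements its parser, and a hardware version would process the byte stream as it arrives rather than splitting a complete buffer. All function names here are hypothetical, and the sample message is deliberately minimal rather than a complete session message.

```python
SOH = b"\x01"  # FIX field delimiter


def parse_fix(message):
    """Split a FIX message into a list of (tag, value) pairs."""
    fields = []
    for raw in message.split(SOH):
        if not raw:
            continue  # skip the empty piece after the trailing delimiter
        tag, _, value = raw.partition(b"=")
        fields.append((int(tag), value))
    return fields


def fix_checksum(body):
    """FIX checksum: sum of all bytes before the 10= field, modulo 256."""
    return sum(body) % 256


def with_checksum(body):
    """Append the trailing 10=NNN field to a message body."""
    return body + b"10=%03d" % fix_checksum(body) + SOH


def checksum_ok(message):
    """Validate the trailing checksum field of a complete message."""
    idx = message.rfind(SOH + b"10=")
    if idx < 0:
        return False
    body = message[: idx + 1]                  # everything up to and including the delimiter
    declared = message[idx + 4:].rstrip(SOH)   # digits following "10="
    return declared.isdigit() and int(declared) == fix_checksum(body)


# Minimal illustrative message: BeginString, BodyLength, MsgType only.
sample = with_checksum(b"8=FIX.4.2" + SOH + b"9=5" + SOH + b"35=0" + SOH)
print(parse_fix(sample))
print(checksum_ok(sample))
```

A model like this can be used to generate test vectors for a hardware parser, since the checksum and field boundaries are computed rather than typed in by hand.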
Specs of FPGAs Available on the DNPCIe_10G_HXT_LL