Daughter of Godzilla’s Bad Hair Day
Ethernet Packet Analysis Engine, Latency Optimized
Kintex-7 FPGA PCIe card with QSFP+
Four 10 GbE or single 40 GbE interface
- QSFP+ socket
- 4 ports 10GbE LAN/WAN using SFP+ modules OR
- 1 port 40 GbE
- Hosted in a 4-lane GEN1/GEN2/GEN3 PCIe slot
- 16-lane mechanical
- Low profile, short length form factor
- GEN1/GEN2 PCIe bridge provided
- GEN3 supplied by user
- Fully compatible with our TCP Offload Engine (TOE)
- FIX board support package (DN_FBSP). Functioning reference design with:
- 10-Gigabit Ethernet MAC and 40 GbE MAC
- TCP/IP Offload Engine (TOE)
- FIX protocol parser
- Tick Filter (optional)
- PCIe Interface (4-lane, GEN2)
- QDR II+ Controller
- DDR3 Controller
- Xilinx Kintex-7 FPGA (FFG676) :
- 7K410T-3,-2,-2L (fastest to slowest)
- 3M ASIC gates (ASIC measure) when stuffed with Kintex-7 7K410T
- 254k flip-flop/6-input LUTs (708k total FFs)
- 3,578 Kbytes total FPGA block memory (1590, 18 kbit blocks)
- 1540, 25×18 multipliers
- Bulk memory: DDR3 VLP Mini-uDIMM
- 72-bit data width (64-bit with 8-bit ECC)
- 666.5MHz operation, PC3-10600 (single rank)
- Addressing/power to support 4GB
- DDR3 interface compatible with Vivado MIG
- Optimized DDR3 controller for lowest latency bulk memory access
- Optional RLDRAM Mini-UDIMM instead of DDR3 for ultra low latency
- QDRII+ SRAM memory: 4M x 18 (72Mb)
- Separate 18-bit read and write ports
- 500 MHz bus operation, DDR (double data rate)
- Fast enough to be clocked at 312.50 MHz
- Eliminates clock synchronization delays between memory and Ethernet clock
- SMBus-based thermal management
- GPS input for precise message time stamping and tracking
- Full support for embedded logic analyzers via JTAG interface
- ChipScope and other third-party debug solutions:
- Tektronix Certus
- Eight FPGA-controlled LEDs
- Enough light to make your houseplants happy.
Small Rackmount Servers the board will work with:
- SuperMicro X8DTG-D
- DELL R720 2U
- HP DL380 Gen8 2U
- It is possible, although with some quirks, to use HP SL390 with the SL6500 chassis or IBM BladeCenter HS12 or HS22 with the blade extension.
The DNPCIe_10G_K7_LL will probably need to have the 10G connections re-routed.
- Any HP, IBM, DELL should work
The DNPCIe_10G_K7_LL_QSFP is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10 GbE or 40 GbE packets. The primary application is ultra-low-latency, high-throughput trading without CPU intervention. Every possible variable that affects input-to-output latency has been analyzed and minimized. Raw 10 GbE or 40 GbE packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform has the ability to achieve the theoretical minimum Ethernet packet processing latency.
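For a sense of scale, the time a frame spends on the wire sets a floor under any latency figure measured from first bit in to last bit out. The sketch below computes that serialization time at 10 and 40 GbE. The function name and the decision to count the 8-byte preamble are illustrative assumptions, not part of the board's software.

```python
# Illustrative only: serialization (wire) time of one Ethernet frame.
# PREAMBLE_BYTES covers the 7-byte preamble plus the 1-byte start-of-frame delimiter.
PREAMBLE_BYTES = 8


def wire_time_ns(frame_bytes: int, line_rate_gbps: float) -> float:
    """Nanoseconds needed to serialize one frame, preamble included."""
    bits = (frame_bytes + PREAMBLE_BYTES) * 8
    return bits / line_rate_gbps  # bits divided by Gbit/s gives nanoseconds


if __name__ == "__main__":
    for rate in (10.0, 40.0):
        for size in (64, 1518):
            t = wire_time_ns(size, rate)
            print(f"{size:5d}-byte frame at {rate:2.0f} GbE: {t:7.1f} ns")
```

A design that starts acting on header fields before the whole frame has arrived can respond in less than the full wire time, which is why the numbers above are a reference point rather than a hard limit.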
The FPGA – Xilinx Kintex-7
We use a single FPGA from the Xilinx Kintex-7 family in the FFG676 package. This package supports 400 I/O, the majority of which are utilized. Most are dedicated to off-chip memory peripherals, including a single QDR II+ for low-latency, high-speed look-up, and a DDR3 Mini-uDIMM for performance-oriented bulk storage. The Kintex-7 FPGA contains high-speed transceivers capable of 10 GbE without need for an external PHY. Four of these transceivers are used for four lanes of GEN2/GEN3-capable PCIe. Four more are connected to a single QSFP+ socket.
Two possible FPGAs can be stuffed: the 7K410T or the 7K325T. Both FPGAs come in a variety of speed grades (-1, -2/-2L, -3), with -3 being the fastest. The -1 speed grade is not rated for 10 GbE transceiver operation, so it isn’t usable on this board. Table 1 lists the resources of the two FPGAs with the Xilinx marketing exaggerations ruthlessly amputated. These are both large, but low-cost, FPGAs. The 7K410T is capable of handling ~3M ASIC gates of logic, and the 7K325T ~2.3M gates. Features of the Kintex-7 FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 18 Kb (2 x 9 Kb) block RAMs, and second-generation DSP48E1 slices (which include 25 x 18 multipliers). Floating-point functions can be implemented using these DSP slices.
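As a quick sanity check on the block RAM figure, the total on-chip memory of the 7K410T follows directly from the 1590 18-Kbit blocks quoted in the feature list. The sketch below assumes "18 Kbit" means 18 x 1024 bits; the function name is illustrative.

```python
# Block count from the feature list (7K410T): 1590 blocks of 18 Kbit each.
# Assumption: 1 Kbit = 1024 bits, so each block holds 18 * 1024 bits.
BLOCKS_7K410T = 1590
BITS_PER_BLOCK = 18 * 1024


def block_ram_kbytes(blocks: int, bits_per_block: int = BITS_PER_BLOCK) -> float:
    """Total block RAM in Kbytes (1 Kbyte = 1024 bytes)."""
    return blocks * bits_per_block / 8 / 1024


if __name__ == "__main__":
    kb = block_ram_kbytes(BLOCKS_7K410T)
    print(f"7K410T block RAM: {kb:.1f} Kbytes ({kb / 1024:.2f} Mbytes)")
```

The result is 3577.5 Kbytes, or roughly 3.5 Mbytes, which is the budget available for on-chip look-up tables before spilling to the QDR II+ or DDR3.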
Four Channels of 10 GbE or a single channel of 40 GbE
The Kintex-7 FPGAs have transceivers capable of 10 GbE. The physical interface is handled using a single QSFP+ module. This allows you to bypass a MAC if necessary and process raw Ethernet packets.
QDR II+ SSRAM – Memory with the lowest latency
We use a single quad data rate static RAM (QDR II+ SSRAM) in the 4M x 18 size (72 Mbit). This type of memory has separate input and output data paths, enabling maximum read/write data bandwidth with minimum latency. The maximum tested frequency of this memory is 400 MHz. To minimize processing latency, we suspect it will be best to clock this QDR II+ SSRAM at 312.50 MHz, exactly twice the internal Ethernet controller frequency of 156.25 MHz. The Kintex-7 FPGAs are capable of generating internal 2x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the QDR II+ SSRAM can be exploited, including concurrent read and write operations and four-tick bursts. The only real limitation is the amount of time and effort spent in customizing the individual memory controllers.
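The clock relationship and the resulting per-port bandwidth can be worked out in a few lines. This is a back-of-the-envelope sketch: the constant and function names are illustrative, and the bandwidth is the raw pin rate with no allowance for controller overhead.

```python
ETH_CORE_MHZ = 156.25  # internal Ethernet controller clock (from the text)
QDR_PORT_BITS = 18     # width of each separate read and write port


def qdr_clock_mhz(multiple: int = 2) -> float:
    """QDR II+ clock as a phase-synchronous multiple of the Ethernet clock."""
    return ETH_CORE_MHZ * multiple


def qdr_port_gbps(clock_mhz: float) -> float:
    """Raw bandwidth of one port: two transfers per clock, 18 bits each."""
    return clock_mhz * 2 * QDR_PORT_BITS / 1000.0


if __name__ == "__main__":
    clk = qdr_clock_mhz()
    per_port = qdr_port_gbps(clk)
    print(f"QDR II+ clock: {clk} MHz")
    print(f"Per port: {per_port} Gb/s, read plus write: {2 * per_port} Gb/s")
```

At 312.50 MHz each port moves 11.25 Gb/s, and because reads and writes use separate ports, both directions can run concurrently for 22.5 Gb/s in total.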
DDR3 DRAM – A large amount of local, bulk memory
A single PC3-10600 DDR3 VLP Mini-uDIMM socket enables up to 4GB of memory for bulk storage and lookup. Assuming a 4GB DIMM, the memory configuration is 512M x 72. Using a -2 or -3 speed grade FPGA, this interface is tested at the maximum FPGA I/O frequency: 666.5 MHz (1333 Mb/s with DDR). You are welcome to use this memory as 64-bits with 8 bits of error correction (ECC), or as a 72-bit memory without correction.
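The capacity and peak transfer rate quoted above can be checked with simple arithmetic. The sketch below is illustrative (names are not from any shipped software) and reports the raw pin rate, ignoring refresh, bank turnaround, and controller overhead.

```python
DEPTH_WORDS = 512 * 2**20  # "512M" locations, assuming a 4GB DIMM
DATA_BITS = 64             # data width when 8 bits are reserved for ECC
FULL_BITS = 72             # full width when ECC is not used


def capacity_gib(width_bits: int) -> float:
    """Capacity in GiB for the given usable width."""
    return DEPTH_WORDS * width_bits / 8 / 2**30


def peak_mbytes_per_s(clock_mhz: float, width_bits: int = DATA_BITS) -> float:
    """Raw peak rate in Mbytes/s: two transfers per clock (DDR)."""
    return clock_mhz * 2 * width_bits / 8


if __name__ == "__main__":
    print(f"64-bit + ECC: {capacity_gib(DATA_BITS)} GiB of data")
    print(f"72-bit, no ECC: {capacity_gib(FULL_BITS)} GiB")
    print(f"Peak at 666.5 MHz: {peak_mbytes_per_s(666.5)} Mbytes/s")
```

The 64-bit configuration yields the nominal 4 GiB, and the peak of about 10,664 Mbytes/s at 666.5 MHz matches the PC3-10600 rating of the module.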
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR3 interface at a 3x multiple of the base Ethernet frequency of 156.25 MHz, which is 468.75 MHz. A 3x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR3 memory controller. The DDR3 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR3 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR3 memory utilized. As with the QDR II+ SSRAM, the only real limitation is the amount of time and effort spent customizing the DDR3 memory controller to your needs.
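Running the DDR3 at 3x the Ethernet clock trades some peak bandwidth for a synchronous interface. The sketch below puts numbers on that trade, assuming the 64-bit data width; all names are illustrative and the figures are raw pin rates.

```python
ETH_CORE_MHZ = 156.25  # base Ethernet frequency (from the text)
MAX_DDR3_MHZ = 666.5   # maximum tested DDR3 clock (from the text)
DATA_BITS = 64         # data width with ECC enabled


def ddr3_sync_clock_mhz(multiple: int = 3) -> float:
    """DDR3 clock as a phase-synchronous multiple of the Ethernet clock."""
    return ETH_CORE_MHZ * multiple


def bits_per_eth_cycle(multiple: int = 3) -> int:
    """Data bits the DDR3 can move during one Ethernet clock cycle."""
    return multiple * 2 * DATA_BITS  # two transfers per DDR3 clock


if __name__ == "__main__":
    clk = ddr3_sync_clock_mhz()
    print(f"Synchronous DDR3 clock: {clk} MHz")
    print(f"Raw peak: {clk * 2 * DATA_BITS / 8} Mbytes/s")
    print(f"Fraction of maximum clock: {clk / MAX_DDR3_MHZ:.3f}")
    print(f"Bits per Ethernet cycle: {bits_per_eth_cycle()}")
```

At 468.75 MHz the interface runs at about 70% of the maximum tested clock and moves up to 384 data bits per Ethernet clock cycle, with no clock-domain crossing between the packet logic and the memory controller.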
PCIe – Customizable 4-lane, GEN2 PCI Express
PCIe is connected directly to the FPGA via four lanes of GTX transceivers. Note that the board has a 16-lane mechanical finger for stability. The interface is fully GEN2 and GEN3 capable. We ship GEN2 PCIe IP that is a full-function, fixed, 4-lane master/target. If you want GEN3, you will have to supply the IP. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that gives the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers with ‘C’ source for several operating systems are included at no charge.
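To compare the host link against the Ethernet side, the sketch below estimates 4-lane PCIe bandwidth per generation from the standard per-lane signalling rates and line-code efficiencies. The table and function names are illustrative, and the figures are before packet (TLP/DLLP) overhead, so achievable DMA throughput will be lower.

```python
# Per-lane signalling rate in GT/s and line-code efficiency per generation:
# GEN1 and GEN2 use 8b/10b encoding, GEN3 uses 128b/130b.
PCIE_GEN = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
}


def pcie_gbps(gen: int, lanes: int = 4) -> float:
    """Link bandwidth per direction in Gb/s after line coding."""
    rate_gt, efficiency = PCIE_GEN[gen]
    return rate_gt * efficiency * lanes


if __name__ == "__main__":
    for gen in sorted(PCIE_GEN):
        print(f"GEN{gen} x4: {pcie_gbps(gen):5.1f} Gb/s per direction")
```

A GEN2 x4 link carries about 16 Gb/s per direction, well under the 40 Gb/s the QSFP+ can deliver, which is consistent with a design where packets are handled in the FPGA and only selected data crosses to the host.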
How Everything Works …
With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIe_10G_K7_LL_QSFP+ contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance, not by a single clock cycle. For algorithms requiring processing, FPGA resources can be hard coded to perform the task. This includes real-time Monte Carlo analysis and floating-point arithmetic.
Specs of FPGAs Available on the DNPCIe_10G_K7_LL_QSFP+
- Product Brief (PDF)
- Cooling Requirements for DNPCIE_10G_K7_LL
- EEPROM AT24C256C
- DDR3 miniDIMM MT9JBF25672AKZ
- QDR CY7C25632KV18
- BPIFlash PC28F00AG18FE