      Increase Speed and Save Resources with Simple Coding Style Changes

      Our Mission: If It Is Packets, We Make It Go Faster!

And by packets we mean: Networking using TCP/UDP/IP over 10G/25G/50G/100G Ethernet; PCI Express (PCIe), CXL, OpenCAPI; data storage using SATA, SAS, USB, NVMe; video image processing using HDMI, DisplayPort, SDI, FPD-III.

      Over the last decade, we have become experts in accelerating software-rich system stacks via offloading CPUs using so-called Domain-Specific Architectures for computing. For implementation, we make heavy use of heterogeneous processing devices such as FPGAs which we program using C++/C/SystemC as well as VHDL and Verilog HDL.

      ASIC vs. FPGA in Process Acceleration

      Compared to ASICs, FPGAs are a much more versatile option when it comes to accelerating processes with hardware, as an FPGA can be reconfigured as often as needed.

However, one large benefit of ASICs is the maximum clock speed that can be reached: because an ASIC's circuit is optimized for its specific function, it has a smaller footprint, resulting in a faster maximum clock speed.

So accelerating a process with FPGAs is not just about redesigning that process in hardware and hoping for faster results; it is about redesigning it smartly, using as little hardware as possible, which in turn allows a higher maximum clock speed.

As engineers, we know there is always room for improvement, so we at Missing Link Electronics strive to continuously improve our existing product lineup.

MLE Smart Process Redesign to Save Resources and Increase Speed

      In one of these development cycles, we encountered a simple, yet very effective improvement of the synchronous reset logic hardware descriptions in our TCP/UDP/IP Network Protocol Accelerator (NPAP).

      Up to that point, the reset logic in pretty much all of our modules was described like this:

At the rising clock edge, we handled the reset inside an if-statement, whereas all other logic was treated in the else branch.
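The original post shows this as Code 1. As a minimal VHDL sketch of the pattern (the data_b_in/data_b_out names come from the discussion further down, while the data_a_* signals, the 8-bit widths and the entity wrapper are our own additions for illustration), it looks roughly like this:

library ieee;
use ieee.std_logic_1164.all;

entity reset_style_code1 is
  port (
    clk        : in  std_logic;
    rst        : in  std_logic;
    data_a_in  : in  std_logic_vector(7 downto 0);
    data_b_in  : in  std_logic_vector(7 downto 0);
    data_a_out : out std_logic_vector(7 downto 0);
    data_b_out : out std_logic_vector(7 downto 0)
  );
end entity reset_style_code1;

architecture rtl of reset_style_code1 is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        -- only data_a_out gets a reset value ...
        data_a_out <= (others => '0');
        -- ... data_b_out is not assigned here, so it must hold its old value
      else
        data_a_out <= data_a_in;
        data_b_out <= data_b_in;
      end if;
    end if;
  end process;
end architecture rtl;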

      That specific code is synthesized by Vivado 2020.2 to the schematic shown in Figure 1.
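Code 2, sketched under the same assumptions, differs only in that the unconditional assignments move in front of the reset:

-- Code 2: same entity and ports as above, only the clocked process changes.
process (clk)
begin
  if rising_edge(clk) then
    -- all "other logic" first: every register is assigned every cycle
    data_a_out <= data_a_in;
    data_b_out <= data_b_in;
    -- the reset then only overrides the registers that actually need resetting
    if rst = '1' then
      data_a_out <= (others => '0');
    end if;
  end if;
end process;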

Now looking at Code 2, semantically speaking the code does almost the same. The reset is still handled inside the if-statement; however, all other logic is handled before the reset case. Looking at the synthesized result of that code in Figure 2, we see that by just removing the else statement, we also removed a whole LUT from the design! And considering that we only used two LUTs in that example, we just shaved off 50% of our LUT usage! Isn’t that amazing?

And this works not only in Vivado, but also in Libero SoC and Quartus Prime – we tested it!

      How can this work? As seen in the code, the data_b_out register is not affected by the reset at all. The difference between the code samples in case of a reset is that in Code 2, the data_b_out register is still set to the data_b_in register, whereas in Code 1, the data_b_out register is not touched at all, so it stays the same. And it is this “staying the same” which is the reason for the extra LUT, as the synthesizer creates a feedback loop from data_b_out to be used as an input for the data_b_out register in case of a reset.
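Put differently, the D-input of the data_b_out flip-flop effectively becomes the following (our own illustration; data_b_ff_d is a hypothetical name for that D-input):

-- Code 1: to keep data_b_out unchanged during reset, the tool builds a
-- feedback multiplexer in front of the flip-flop; that mux is the extra LUT.
data_b_ff_d <= data_b_out when rst = '1' else data_b_in;

-- Code 2: the flip-flop input is driven directly, no feedback path needed.
data_b_ff_d <= data_b_in;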

      So depending on how many of these structures you have, you might be able to reduce your hardware footprint and increase your maximum frequency by quite a lot.

In our specific use case with NPAP, we were able to go from using 127,096 LUTs to using 122,228 LUTs, which is a decrease in LUT usage of almost 4%.

      So in the future, keep in mind that coding style not only affects readability, but can also affect hardware usage and speed of the synthesized design.

      Molex Mini-Fit Jr – Know your Power, and Don’t Get Confused – “PCIe” and “Xilinx Not PCIe” Power Connector

      Our Mission: If It Is Packets, We Make It Go Faster – today more on PCIe, this time, Powering PCIe Cards with FPGAs…

And by packets we mean: Networking using TCP/UDP/IP over 10G/25G/50G/100G Ethernet; PCI Express (PCIe), CXL, OpenCAPI; data storage using SATA, SAS, USB, NVMe; video image processing using HDMI, DisplayPort, SDI, FPD-III.

      To move packets you need power, but how much can you draw from an interface?

PCI Express Card Power Consumption Limits

PCIe has been around for a long time, since 2003, and many people know the maximum power draw of 75 W per PCIe slot. But is it that simple? The short answer is no, so let's have a look at the spec. It states that all cards can consume up to 3 A on the 3.3 V power rail, with the following restrictions applying:

• x1 cards: 0.5 A on 12 V, but the overall consumption limit is 10 W
• x4 – x16 cards: 2.1 A on 12 V, but the overall consumption limit is 25 W

But where does the 75 W come from? Well, exceptions apply:

High-power devices can draw more power after initialization and configuration:

• x1 cards can consume 25 W
• x16 cards can consume 75 W

      Using Additional Connectors for Higher PCIe Card Consumption

With additional connectors (6-pin and 8-pin), cards can consume up to 300 W: a 6-pin connector adds 75 W, while an 8-pin connector adds 150 W (75 W from the slot + 75 W + 150 W = 300 W).

But be careful: in the FPGA world there are two pinout versions of the 6/8-pin Molex Mini-Fit Jr connectors, the Xilinx option and the PCIe option.

The 6-pin option has two pins (3 and 4) swapped. If you mix up the connectors, 12 V is connected directly to ground, which can destroy your device.

6-Pin Power Connector
Pin | PCIe Spec | AMD/Xilinx Dev Boards
1   | +12 V     | +12 V
2   | +12 V     | +12 V
3   | +12 V     | Ground
4   | Ground    | +12 V
5   | Sense     | Sense
6   | Ground    | Ground

For the 8-pin option the changes are less subtle: every pin has a different function, and mixing up the connectors again causes an electrical short.

8-Pin Power Connector
Pin | PCIe Spec | AMD/Xilinx Alveo Cards
1   | +12 V     | Ground
2   | +12 V     | Ground
3   | +12 V     | Ground
4   | Sense1    | Ground
5   | Ground    | +12 V
6   | Sense0    | +12 V
7   | Ground    | +12 V
8   | Ground    | +12 V

      A Deep Dive into AMD/Xilinx AXI Bridge for PCI Express (AMD/Xilinx PG194) and Why We Tweaked C_M_AXI_NUM_READQ

      Executive Summary

AMD/Xilinx’ AXI Bridge for PCI Express (PG194) implements a bi-directional communication channel between FPGA-internal memory-mapped AXI4 masters and slaves and external PCIe-connected memory-mapped devices, with the FPGA operating as PCIe endpoint or root port.

In many scenarios the performance of forwarding communication between the two protocols, AXI4 and PCIe, is sufficient and the AMD/Xilinx IP core can be used as-is. However, in certain cases tweaking is necessary to achieve the expected throughput.

Depending on the amount of extra performance required, the modification ranges from simply tuning a hidden parameter to patching the IP's HDL sources. In the example project used for this description, the PCIe peer-to-peer (P2P) write performance from an FPGA to a RAID0 of 12 NVMe SSDs increased from 2,700 MiB/s to 4,900 MiB/s and finally to 8,600 MiB/s.

      AMD/Xilinx AXI Bridge for PCI Express Overview

The AMD/Xilinx AXI Bridge for PCI Express is implemented differently for different AMD/Xilinx FPGA families. This description focuses on the “AMD/Xilinx DMA/Bridge Subsystem for PCI Express in AXI Bridge mode” implementation as found in AMD/Xilinx UltraScale+ devices.

The bridge IP core comprises the actual bridging logic, converting PCIe TLPs to AXI4 transactions and vice versa, and a PCIe IP core implementing the physical PCIe interface and forwarding PCIe TLPs via multiple AXI4-Stream interfaces. The PCIe IP core for UltraScale+ is described in PG213. The bridging logic has two main and distinct blocks, the so-called Slave Bridge and the so-called Master Bridge. The Slave Bridge receives AXI4 memory-mapped transactions from FPGA-internal AXI4 master IP cores and converts them to PCIe request TLPs, operating as a PCIe bus master or “Requester”. The Master Bridge on the other hand receives PCIe request TLPs as a “Completer” and converts them to AXI4 memory-mapped transactions targeting FPGA-internal AXI4 slave IP cores.

For this description we will only look at the Master Bridge, and specifically only at PCIe read request TLPs resulting in AXI4 reads initiated by the AMD/Xilinx AXI Bridge for PCI Express. The AMD/Xilinx PG213 PCI Express IP interfaces involved are the two AXI4-Stream interfaces CQ (Completer reQuest) and CC (Completer Completion). Remember, for these kinds of transfers the FPGA is the Completer of PCIe traffic: it first receives a PCIe Read Request TLP on CQ, and then responds with a PCIe Completion TLP on CC.

      The CQ interface of the AMD/Xilinx PG213 PCIe IP core has some sideband signals independent of the AXI4-Stream interface to handle the flow of TLPs. This is necessary to cope with some of the PCIe ordering rules. Specifically PG213 allows connected logic, the AMD/Xilinx PG194 AXI bridge logic in this case, to hold off any PCIe Read Request TLPs (called non-posted requests or np) and instead receive PCIe Write Request TLPs that arrived at the PCIe core later but can now skip the line and move forward in the receive queue. The AMD/Xilinx PG213 PCIe IP core implements a flow control mechanism at its AXI4-Stream CQ interface to achieve this. This flow control is not to be confused with PCIe flow control, and the credits mentioned in this context are not PCIe flow control credits, although the concept is very similar.

The AXI4-Stream CQ slave, the Master Bridge logic in this case, can provide up to 32 credits for non-posted requests to the AMD/Xilinx PG213 PCIe IP Core. Each non-posted request consumes one credit when sent over the CQ interface. And each assertion of the pcie_cq_np_req two-bit signal grants one or two credits depending on the signal's value. The AMD/Xilinx PG213 PCIe IP Core maintains an internal counter named pcie_cq_np_req_count to track the number of available credits. As mentioned, each non-posted request decrements the counter and pcie_cq_np_req assertions increment it. The counter can assume values from 0 to 32. If the counter hits 0, no non-posted requests are sent over the CQ interface anymore and the AMD/Xilinx PG213 PCIe IP Core instead provides posted requests, if available.
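As a conceptual illustration of this credit mechanism, here is a simplified VHDL model of the counter behavior described above. This is not the PG213 implementation: the clk, rst, np_req_delivered and np_req_allowed names are our own, and the exact encoding of pcie_cq_np_req is an assumption (the description only states that each assertion grants one or two credits).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity cq_np_credit_model is
  port (
    clk              : in  std_logic;
    rst              : in  std_logic;
    -- credits granted by the user logic this cycle (assumed encoding: value = number of credits)
    pcie_cq_np_req   : in  std_logic_vector(1 downto 0);
    -- one non-posted request TLP was delivered on the CQ interface this cycle
    np_req_delivered : in  std_logic;
    -- as long as credits are available, further non-posted requests may be delivered
    np_req_allowed   : out std_logic
  );
end entity cq_np_credit_model;

architecture model of cq_np_credit_model is
  signal pcie_cq_np_req_count : integer range 0 to 32 := 0;
begin
  process (clk)
    variable next_count : integer;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        pcie_cq_np_req_count <= 0;
      else
        -- grants increment the counter, each delivered non-posted request decrements it
        next_count := pcie_cq_np_req_count + to_integer(unsigned(pcie_cq_np_req));
        if np_req_delivered = '1' then
          next_count := next_count - 1;
        end if;
        -- the counter saturates at 32 and never goes below 0
        if next_count > 32 then
          next_count := 32;
        elsif next_count < 0 then
          next_count := 0;
        end if;
        pcie_cq_np_req_count <= next_count;
      end if;
    end if;
  end process;

  -- once the counter hits 0, the core stops delivering non-posted requests
  -- and instead forwards posted requests, if available
  np_req_allowed <= '1' when pcie_cq_np_req_count > 0 else '0';
end architecture model;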

      Performance Limitations in certain Scenarios

The AMD/Xilinx PG194 Master Bridge by default only provides a maximum of 2 credits for non-posted requests to the AMD/Xilinx PG213 PCIe IP Core. These two credits are initially granted as a baseline, and then it seems that one credit is granted following each non-posted request reception. Since there is latency in the loop from request reception, to granting the credit, to the AMD/Xilinx PG213 core incrementing the credit counter, and finally to providing another non-posted request based on the updated counter, this effectively implements a non-posted request rate limit. In combination with small non-posted requests this can significantly limit the achievable bandwidth. In one example, a RAID0 of 12 NVMe SSDs reading from the FPGA was limited to 2,700 MiB/s on a PCIe Gen3 x16 link, which is far less than expected in this setup.

In the screenshot of the Vivado ILA debugger below, it can be seen that two read transactions follow each other closely, are then followed by a pause, and then another two closely spaced read transactions. This pattern continues, and obviously the possible AXI4 bandwidth (matching the PCIe bandwidth) is not fully utilized.

Looking at the simulation of this same setup, the root cause can be seen. The AMD/Xilinx PG194 Master Bridge tries to maintain a pcie_cq_np_req credit level of 2, as mentioned above. These two credits are rapidly consumed by two consecutive PCIe non-posted read requests. After receiving each of them, the AMD/Xilinx PG194 Master Bridge grants a new credit soon thereafter by asserting pcie_cq_np_req, but the latency of the path described above, until the next request is provided, is just too high to prevent a long pause. The pcie_cq_np_req_count drops to 0 and transfers are throttled.

The pcie_cq_np_req credit level initially granted and maintained by the AMD/Xilinx PG194 Master Bridge is controlled by the hidden AMD/Xilinx PG194 configuration parameter C_M_AXI_NUM_READQ (not to be confused with C_M_AXI_NUM_READ). As mentioned, it defaults to 2. This is also the only value officially supported by AMD/Xilinx (which is why the parameter is hidden). However, one other value, 4, can be set via TCL. Doing so bumps the NVMe SSD RAID0 bandwidth to 4,900 MiB/s.

      set_property CONFIG.C_M_AXI_NUM_READQ 4 [get_ips xdma_0]

A further increment is not possible via TCL. Instead, the HDL has to be patched to use a value outside the range allowed by TCL. An increment to 8 increased the bandwidth of our example setup to 8,600 MiB/s, which is very close to the expected maximum, as the pattern generator providing the SSDs' DMA data is limited to 9,000 MiB/s.

set hdl_synth_top \
    [get_files -all -used_in synthesis \
        -compile_order sources \
        -of [get_ips xdma_0] xdma_0.sv]

package require fileutil

::fileutil::updateInPlace $hdl_synth_top \
    [list string map \
        [list {.C_M_AXI_NUM_READQ(2),} \
            ".C_M_AXI_NUM_READQ(8),"]]

      Learn more about our capabilities in PCI Express (PCIe) Connectivity to achieve extra performance.

      Stop Wobbling Around – A Salute From The MLE Embedded Lab

      Our Mission: If It Is Packets, We Make It Go Faster – today, how to properly mount those large FPGA-based PCIe Cards … theory and practice!

Everyone working with evaluation hardware knows the struggle: On its own, a board works fine, but when you assemble an entire setup you end up with many loosely or weakly connected parts. This can easily damage a part or, much worse, cost you an entire day of debugging.

      To overcome this issue, MLE sometimes gets creative and designs additional carriers to securely mount extension cards or additional hardware. 

As a small example, the following images show a 3D model of how we securely mount additional hardware to the Xilinx ZCU102:

      • SD-Mux, which we use to swap SD-Card images
      • PCIe Card

      Here’s the theory:

The stabilizing mounts for the PCIe card and the SD-Mux are 3D-printed by our engineers (yes, they have some neat toys at home). This helps MLE keep a clean and stable lab setup.

      In this case we tested a SAS 12G storage controller on the Xilinx ZCU102.  

      And the first prototype looks like this:

If you want to enhance your lab as well, we give you a kickstart: you can find our printable 3D models on Thingiverse:

       What do you do to make your work easier/better in the lab?

        Learn more about our PCIe IP core offerings.

      Picking The Right Granularity When Buffering PCIe/NVMe Data

      You know our Mission: If It Is Packets, We Make It Go Faster – today the many flavors of memory for buffering data in FPGAs.

Non-Volatile Memory Express (NVMe) is an interface specification often used with PCIe. Its goal is to leverage the parallelism and low latency of modern SSDs. A typical PCIe payload data transfer happens in data chunks of either 128 bytes or 256 bytes.

SSDs deploy several tricks (wear leveling, SLC-to-TLC conversion) to enhance their read and write speeds as well as their lifespan. One downside is that their read and write speed is not constant over a long write/read period, which might result in backpressure.

Some applications do not support backpressure, which can lead to an erroneous state if one employs a standard SSD system.

One possible mitigation strategy is to place an elastic buffer between the SSD and the data source. On an FPGA, there are different possibilities for implementing such an elastic buffer. At MLE, we investigated BlockRAM (BRAM), UltraRAM (URAM), Dynamic RAM (DRAM) and the second generation of High Bandwidth Memory (HBM2). Each memory technology has its advantages and disadvantages regarding its capability to handle different data chunk sizes. We will present our findings below.

      BlockRAM (BRAM)

BRAM is a RAM module which can be found on every FPGA. It has two ports, meaning that in each cycle it is possible to access two different locations. BRAM can be configured as an 18 Kb or 36 Kb FIFO.

      BRAM is a viable option for small data chunks but might be too precious for large chunks.

      The ZCU106 evaluation board has a total of 11 Mb of BRAM. 

      UltraRAM (URAM)

Similar to BRAM, URAM is a dual-ported RAM. In contrast to BRAM ports, each of the two URAM ports can only perform one operation, read or write, per clock cycle. This is because both ports internally operate on a single memory cell, and the operation of port A is performed before that of port B in the same clock cycle.

      The AMD/Xilinx ZCU106 evaluation board has a total of 27 Mb of URAM. This is ~2.5 times more than the available BRAM. 

URAM is a good middle ground between DRAM and BRAM: more URAM is available than BRAM, yet in contrast to DRAM it is still on-chip memory and works well with smaller data chunks.
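Which of the two on-chip memories such an elastic buffer ends up in can often be steered without restructuring the RTL. The following minimal VHDL sketch (entity, generic and port names are our own illustration, not from the article) infers a simple dual-port memory and selects the target via Vivado's ram_style synthesis attribute: "block" maps it to BRAM, "ultra" maps it to URAM on UltraScale+ devices, and "auto" lets the tool decide.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity elastic_buffer_mem is
  generic (
    DATA_WIDTH : natural := 64;  -- width and depth chosen arbitrarily for illustration
    ADDR_WIDTH : natural := 12
  );
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  unsigned(ADDR_WIDTH - 1 downto 0);
    wdata : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    raddr : in  unsigned(ADDR_WIDTH - 1 downto 0);
    rdata : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity elastic_buffer_mem;

architecture rtl of elastic_buffer_mem is
  type mem_t is array (0 to 2 ** ADDR_WIDTH - 1) of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal mem : mem_t;

  -- "block" targets BRAM, "ultra" targets URAM (UltraScale+), "auto" lets Vivado decide
  attribute ram_style : string;
  attribute ram_style of mem : signal is "ultra";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        mem(to_integer(waddr)) <= wdata;
      end if;
      -- synchronous (registered) read, as required for BRAM/URAM inference
      rdata <= mem(to_integer(raddr));
    end if;
  end process;
end architecture rtl;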

      DDRx DRAM

BRAM and URAM are on-chip memory. DDR3/DDR4/DDR5 DRAM on the other hand is off-chip memory, which means that some form of interconnect (a memory controller) has to sit between the PL or PS and the DRAM. While BRAM and URAM can store data in the Mb range, with DRAM it is possible to store multiple GiB of data.

      DRAM is useful for large data chunks as its efficiency goes down for small chunk sizes.

      High Bandwidth Memory (HBM2)

HBM2 is similar to DRAM in that it also needs some form of interconnect between the PL or PS and the memory. HBM2 is not available on all devices and is expensive, but, as the name suggests, it offers high bandwidth.

Similar to DRAM, HBM2 works best with large data chunks. In contrast to DRAM, higher bandwidths can be achieved, so HBM2 is best suited for systems which require more bandwidth than DRAM can provide.

      Conclusion

In this article, an overview of different memory technologies and their suitability for different data chunk sizes has been presented. Small chunk sizes are difficult for DRAM but work well in URAM or BRAM. For large data chunks, BRAM and URAM are also viable options but might be too precious. HBM2 is a good option if the bandwidth of DRAM is not sufficient.

      In one of our next posts we will discuss how to combine different types of memory (BRAM, URAM and DRAM, for example) to have a hybrid memory subsystem for a high speed NVMe Storage system.

      Learn more about our IP core offerings in NVMe Streaming.