FPGA Demo Shows Efficiency Gains Compared to x86 Chip

Sphery Vs Shapes
(Image credit: Victor Suarez Rovere and Julian Kemmerer)

An FPGA - that’s a field-programmable gate array, a sort of reconfigurable microchip - has been shown to run a 3D, ray-traced game written in C at the same resolution and frame rate as an x86 CPU while drawing roughly a fiftieth of the power, perhaps pointing the way to future gains in programming efficiency. The claims are made in a white paper [PDF] by Victor Suarez Rovere, a developer from Argentina, and Julian Kemmerer, a systems engineer from Pennsylvania, and brought to our attention by CNX Software.

The FPGA in question is the Xilinx Artix-7 100T on an Arty A7 development board, which sells for around $280. It features 101,440 logic cells (an FPGA’s logic cells each contain a look-up table that can implement any small logic function, giving the chip its programmability), is built on a 28-nanometer process, and pulls less than a watt of power. The CPU it was pitted against (without, it must be said, troubling the chip’s iGPU) was a Ryzen 7 4800H, an eight-core, 16-thread laptop processor built on a 7 nm process with a default TDP of 45 W. That's a laptop chip that's not available on its own, but the similar Ryzen 7 4700G is currently available for about $240.

The game compiled to run on the two very different platforms is "Sphery Vs Shapes." It doesn’t appear to contain much in the way of plot, characters, or actual gameplay, but it does have lots of ray tracing: a shiny metallic ball bounces its way across a chessboard-like environment, which is reflected in the ball's mirrored surface.

Both platforms rendered the game at 1080p and 60 frames per second without a problem, but the FPGA did it using 660 mW while the Ryzen needed 35 W, a difference of roughly 53x. The authors speculate that, were the FPGA built on the same 7 nm process as the CPU, the advantage could be six times greater still.

The keys to the whole thing are PipelineC, an invention of Kemmerer’s, and CflexHDL from Suarez; you can find both on GitHub. "The game’s pixel rendering and animation logic is based on floating point and vector math operations. All of the game code is expressed using a clean syntax that translates directly to a digital circuit. The current target of this design is a FPGA board with Full HD digital video output, and the workflow also allows running the game in realtime on a regular PC using the unmodified source," they write in their paper. "This allows for much faster development-test iterations than with traditional hardware design tools. For the same workload, the computing efficiency resulted in more than 50X better than using a modern CPU, in a chip an order of magnitude smaller."
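To get a flavor of what that single-source style means in practice, a per-pixel function might look something like the following minimal sketch (illustrative only, with made-up names, not the project's actual code): plain C float math with no pointers or heap, which a tool like PipelineC/CflexHDL can turn into a pipelined circuit while the very same function compiles and runs on a PC.

// Illustrative sketch, not from the Sphery Vs Shapes source: a pure
// per-pixel function, the kind of C that can map to a hardware pipeline.
typedef struct { float r, g, b; } color_t;

static color_t shade_pixel(float u, float v, float t)
{
    color_t c;
    c.r = 0.5f + 0.5f * u;   // red ramps across the screen
    c.g = 0.5f + 0.5f * v;   // green ramps down the screen
    c.b = 0.5f + 0.5f * t;   // blue animates over time
    return c;
}

Because the function is pure combinational math, each operation can become a pipeline stage producing one pixel per clock, which is how a sub-watt chip can keep up with 1080p at 60 Hz.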

"Sphery Vs Shapes" stands up pretty well as a graphics demo, but what it means for the future of programming is more interesting - especially as FPGAs are going to start appearing in AMD chips. There are plans to port the whole thing to RISC-V, and to design an open-source ASIC (application-specific integrated circuit) that supports the pipeline, and there are possibilities for the world of microcontrollers too." The code can be translated to a logic circuit, run on a[n] off-the-shelf CPU, or on a microcontroller to develop hardware/software peripherals without changes to the code,” Suarez and Kemmerer write in their conclusion. "The results we obtained are readily reproducible, as materials are easy to obtain and not expensive."

Ian Evenden
Freelance News Writer

Ian Evenden is a UK-based news writer for Tom’s Hardware US. He’ll write about anything, but stories about Raspberry Pi and DIY robots seem to find their way to him.

  • Geef
I think the big hurdle here won't be the hardware; it will be programmers needing to rewrite their software for the new setup.
  • TJ Hooker
    From what I can tell, the main point of the whitepaper isn't that the FPGA can run the program more efficiently. After all, once the FPGA has been programmed it's essentially an ASIC, and saying that an ASIC executes some function far faster/more efficiently than a general purpose CPU isn't a revelation.

It seems the important thing they achieved is being able to use the same C source code to compile and run in software, generate a hardware model and simulate performance, and/or generate the FPGA image (directly from C code rather than an HDL). This improves ease and speed of development, maybe to the point where it starts making sense to use FPGAs in heretofore impractical use cases.
  • suarezvictor
Hi TJ, author here. I'm glad to know you're aware of the gist of the project: improving development time.
Doing a project like this demo would be impossible using traditional hardware design tools, since they're really slow compared with a modern compiler. The game required many development-test iterations that would have been next to impossible if each iteration weren't fast. As shown in the video, we can compile the project in just a second and, once we're satisfied with the results, move to better simulations and ultimately to the hardware, where it runs as simulated. The code base is exactly the same, with not a single line changed.
  • bit_user
    TJ Hooker said:
    From what I can tell, the main point of the whitepaper isn't that the FPGA can run the program more efficiently.
    Well, they do seem to place a lot of emphasis on power-efficiency.

    A few things I noticed:
They don't use the same data types between the CPU and FPGA versions. In some cases, fixed-point could still be more power-efficient on the CPU, depending on how it's used.
Their code doesn't appear to be SIMD. It uses vector data types, but I don't see any intrinsic instructions and it's not written in a proper data-parallel SIMD fashion (see the sketch below). The paper mentions using SIMD, but I'm not really sure what they mean by it.
    I can't see how they do threading or load-balancing, but this could cause lots of overhead if not done well.
I see other potential optimizations in their C++ code. They don't use any restrict-qualified pointers. They have some unnecessary fdivs and sqrts. I also see places where they do bit-banging on floats (usually not a win on modern CPUs), but I can't easily tell how much of the fast-path code is doing it.
    The data model of the game is trivial. Here's where I expected they might be getting a big win from on-chip scratchpad memory of the FPGA, but the game is almost too simple for it to help much.
Similar to the above point, they kept the game logic within the complexity sweet spot of the FPGA.

    If I had a couple weeks to burn, I could probably speed up the CPU version by something like 2-5x. Possibly more, depending on some of the unknowns I mentioned above. Granted, that wouldn't entirely close the efficiency gap vs. the FPGA, but again the game does play into the FPGA's strengths.
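    To be concrete about what I mean by a data-parallel SIMD fashion, here's a sketch (my code, not theirs; AVX, untested): one register per color component, eight pixels per instruction, instead of a vector type per pixel.

    #include <immintrin.h>

    // Structure-of-arrays shading: 8 pixels per instruction.
    // Hypothetical example, not the project's code.
    static void shade_row_red(const float *u, float *out_r, int n, float t)
    {
        const __m256 half = _mm256_set1_ps(0.5f);
        const __m256 tt   = _mm256_set1_ps(t);
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 uu = _mm256_loadu_ps(u + i);
            // r = 0.5 + 0.5 * u * t for 8 pixels at once
            __m256 r = _mm256_add_ps(half,
                       _mm256_mul_ps(half, _mm256_mul_ps(uu, tt)));
            _mm256_storeu_ps(out_r + i, r);
        }
    }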

A better experiment might be to take an app that's already quite well-optimized for the CPU and port it to the FPGA.

    TJ Hooker said:
    once the FPGA has been programmed it's essentially an ASIC,
    Not at all. A true ASIC is much faster, primarily due to having much better-optimized layout and probably lots of other bespoke customizations.

For a long time, people have used FPGAs to validate ASIC RTL before tapeout. As an example of just how much more efficient ASICs are, you need multiple giant FPGAs just to emulate a modest-sized ASIC.

    TJ Hooker said:
    and saying that an ASIC executes some function far faster/more efficiently than a general purpose CPU isn't a revelation.
    Depends on whether they're using fixed-function arithmetic or whether they have to synthesize it. I think FPGAs have been on a trend of incorporating more fixed-function units for a while. If they have to synthesize things like multipliers, then that puts the FPGA at a potential disadvantage relative to a CPU, which has far more optimized implementations of these building blocks.

    TJ Hooker said:
It seems the important thing they achieved is being able to use the same C source code to compile and run in software, generate a hardware model and simulate performance, and/or generate the FPGA image (directly from C code rather than an HDL).
    People have been using tools like OpenCL to target FPGAs for more than a decade.

    I think there are also tools for doing this that are closer to C. I assumed that was what SystemC was all about, but it seems to be more of a simulation and modeling tool. However, that article linked to SpecC, which sounds more production-oriented.

    https://en.wikipedia.org/wiki/SpecC
    TJ Hooker said:
This improves ease and speed of development, maybe to the point where it starts making sense to use FPGAs in heretofore impractical use cases.
The thing that jumps out at me is that we know hardware is better for ray tracing. People have been doing it in ASICs for about 20 years, and in GPUs for almost five. We've also been getting more purpose-built ASICs for other high-profile applications of FPGAs, like crypto-mining and AI. That leaves the CPU to do what it does best: be the generalist and the traffic cop. The main places I see where FPGAs are still relevant are things like DSP in ultra-low-power, low-volume embedded devices, and anywhere you need the absolute minimum latency, for things like software-defined networking and high-frequency trading.

    The other point I'd make about your conclusion is the trivial complexity of their demo. As soon as you start to run out of gates, you'll have to do a lot of restructuring and indeed architecting of the code to cater to the FPGA's limits.
  • suarezvictor
    bit_user: thanks for your extensive review. I feel you may be missing the objectives of the project, so I'll try to explain.

The main one is that this project is not at all about optimizing a ray-traced game to run on a CPU, but about being able to implement in hardware (with no need for software) algorithms that are usually complex. It's a hardware design tool, not a software platform of any kind.

Contrary to your analysis, we do use multiple threads based on OpenMP (see simulator_main.cpp#L247); indeed, the CPU shows all cores running at close to 100% usage (power is measured only while the CPU cores are doing the calculations). And we do use intrinsic vector types (see c_compat.h#L51).
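    By intrinsic vector types we mean compiler-level vector types; as a rough illustration of the pattern (not the literal contents of c_compat.h):

    // Element-wise vector math via clang/GCC vector extensions (sketch).
    typedef float vec3 __attribute__((vector_size(16))); // 4 floats, 3 used

    static vec3 vec_add(vec3 a, vec3 b)
    {
        return a + b; // element-wise; the compiler is free to emit SIMD
    }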

Since it's a hardware design tool, data types are deliberately no larger than needed; we don't waste hardware resources when not strictly required. On a CPU you have to choose among a limited set of widths/precisions, so we used the ones that fit best. Running the design on a CPU is just a convenience for fast hardware development, and we do it fast enough to run the design in realtime on the CPU. This project doesn't aim to replace CPUs at all; indeed, you can design a CPU in C using this tool (an ongoing project) and have it conveniently simulated on another CPU, quickly and with no code changes. And we show that for some algorithms, including ones heavy in math calculations, you can save power in a significant way.

I don't see where we use "unnecessary fdivs and sqrt". In any case, those operations are executed both on the CPU and in the FPGA; if you optimize them, the power ratio may stay similar.
    Is the game logic simple? Yes, on purpose. We haven't yet seen any FPGA game of more complexity that isn't based on a CPU and stored program instructions. The only objective of this project is to ease hardware design and empower designers.

About OpenCL: you can run it on an FPGA, but we think it's too high-level and not convenient for general hardware design. If you read the article, you'll see that we designed a UART communications core that's translated to a circuit on the FPGA, or can be run as a bit-banged peripheral on a microcontroller, with no code changes. I bet that's not possible with OpenCL or, if you can get close to it, I wonder how many resources it would unnecessarily spend. Regarding other "C to FPGA" tools, they may not offer the conveniences we do, like easy work with vector types in a clean syntax. If you know of a project as complex as ours that you can also run on a CPU, please let us know.

In regards to the many ASICs the world has, including crypto cores, GPUs and the like, we propose that the languages they're designed in are too complex (Verilog, VHDL, etc.), much less widely known than C, and definitely slower to simulate than using an advanced compiler as we do (in this case, clang).

About our demo being "trivial": as said, we consider that a feature, but we'll be eager to know if there exists a more complex one capable of running on a CPU or targeting an FPGA with no code changes.
  • bit_user
    suarezvictor said:
Contrary to your analysis, we do use multiple threads based on OpenMP (see simulator_main.cpp#L247); indeed, the CPU shows all cores running at close to 100% usage (power is measured only while the CPU cores are doing the calculations). And we do use intrinsic vector types (see c_compat.h#L51).
    I didn't say you weren't using threads. I just said I didn't see where/how it was done, and that it could be a bottleneck depending on that.

I'm glad you mentioned you're using OpenMP, because I have some experience as a user of OpenMP applications, and I can tell you that libgomp (one of the most common implementations) does something extremely naughty that could be blowing your performance/efficiency comparison out of the water: by default, it uses spinlocks, which is absolutely atrocious in this day and age.

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71781

    To disable them, export the environment variable:

    export OMP_WAIT_POLICY=passive

    For more info, see: https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html#OMP_005fWAIT_005fPOLICY
We got a significant performance boost when we set this! Some Linux distros set it by default, but others don't.
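    For illustration, consider a hypothetical frame loop like this (not your simulator_main.cpp, just the general shape):

    #include <omp.h>

    // With the default OMP_WAIT_POLICY=active, worker threads spin-wait
    // between these short parallel regions, burning power and skewing an
    // efficiency comparison; "passive" lets them sleep instead.
    static void render_frame(float *fb, int w, int h)
    {
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                fb[y * w + x] = 0.0f; // per-pixel work goes here
    }

    At 60 fps there are 60 join points per second, so the spin time between frames adds up.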


    suarezvictor said:
    I don't see where we use "unnecessary fdivs and sqrt".
I'm not sure which source files I should be looking at, but your float_shift() function asks for a true division rather than a multiplication by the reciprocal. Whether or not it compiles to a real divide depends on your compiler and options. I saw where you use -ffast-math in the Makefile, but I don't know the details of how clang implements it; check the generated assembly to know for sure.
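    What I mean is the difference between these two forms (hypothetical code, not your float_shift):

    // For a power-of-two divisor the reciprocal is exact, so a compiler
    // may fold the divide into a multiply on its own; for arbitrary
    // constants it generally only does so under -ffast-math.
    static float shift_div(float x) { return x / 256.0f; }          // asks for fdiv
    static float shift_mul(float x) { return x * (1.0f / 256.0f); } // always a fmul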

As for the sqrt(), your sphere intersection test is using it. You can eliminate it by squaring both sides; that's practically the oldest trick in the book. Depending on which C standard you're compiling for, I think it might even be doing a double-precision sqrt(). Traditionally, if you want the float version, you have to call sqrtf().
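    For a pure hit/miss test, the trick looks like this (a sketch, not your code; the sqrt is only needed once you want the actual hit distance):

    typedef struct { float x, y, z; } vec3f;

    static float dot3(vec3f a, vec3f b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // Ray origin o, unit direction d; sphere center c, radius rad.
    // The sign of the discriminant answers hit/miss with no sqrt.
    static int hits_sphere(vec3f o, vec3f d, vec3f c, float rad)
    {
        vec3f oc = { o.x - c.x, o.y - c.y, o.z - c.z };
        float b  = dot3(oc, d);
        float cc = dot3(oc, oc) - rad * rad;
        return b * b - cc >= 0.0f; // discriminant test, sqrt-free
    }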

    suarezvictor said:
About OpenCL: you can run it on an FPGA, but we think it's too high-level and not convenient for general hardware design.
    Yes, it's not for general hardware design. It's mainly for using FPGAs to accelerate compute-intensive algorithms. A lot of FPGAs will have general-purpose, hardwired CPU cores (usually ARM A53 or similar) which can do the housekeeping and orchestrate the compute-intensive parts running on the gate array.

    suarezvictor said:
About our demo being "trivial": as said, we consider that a feature,
Yeah, it wasn't meant as an insult. Small is always a good place to start, but you ought to have a plan for how to scale up. My point was really that it's not clear to me how you handle the situation where you start to run out of gates. There are/were companies that would take source code and compile part of it into a hardware design and part into executable code. I'm thinking that's where you probably want to go. However, like I said, it's not an untrodden path: I know of two examples, and I'm not even a hardware engineer/researcher.
https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/tensilica-xtensa-controllers-and-extensible-processors.html
    https://www.sciencedirect.com/science/article/pii/S1877050912001421
    Stretch, Inc. (the second link) claimed to have tools that could take C code, analyze it for hot spots, and then synthesize the hotspots into one or more custom configurations for something like an FPGA. Not only that, but their hardware was much faster to reprogram than conventional FPGAs. So much so that their H.264 encoder supposedly reconfigured the hardware multiple times per frame and could sustain a throughput of several hundred FPS. That might not sound too impressive, but it was back in 2010 or so.

    I should add that you definitely deserve kudos for turning your vision into reality. I'm sure you learned a lot by doing it, and you have something interesting to show people. That's worth a lot, regardless of where the project ends up.
  • suarezvictor
    bit_user
We welcome good criticism, but we suggest you review the code in full to spot the real issues with the project. For example, we don't use the stock sqrt but an optimized one, so the possible pitfalls you mention aren't really there: we selected a fast algorithm that gives enough precision and is implemented using just 4 multiplications, and we don't use the double type anywhere in the implementation, either. Additionally, the "float_shift" function is always called with a constant as its second argument, since that's almost free in hardware, and in software the compiler optimizes it nicely. As for replacing sqrt by squaring both sides: we know it's an easy and old trick, but we need the distance measurement in a linear domain, so we don't do just a "test" for intersection.
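    To illustrate the general shape of such a routine (a sketch of the well-known bit-trick plus one Newton-Raphson step, not our exact code):

    #include <stdint.h>
    #include <string.h>

    // Approximate sqrt(x) for x > 0: a bit-level initial guess of
    // 1/sqrt(x) refined by one Newton-Raphson step. Only a handful of
    // multiplies, no divide, no double. Illustrative, not our routine.
    static float approx_sqrt(float x)
    {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);         // initial 1/sqrt estimate
        float y;
        memcpy(&y, &i, sizeof y);
        y = y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson step
        return x * y;                      // sqrt(x) = x * (1/sqrt(x))
    }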
About not using OpenCL: we see that you now realize it doesn't fit the requirements. We'll check whether OMP_WAIT_POLICY is enabled or not, thanks.
    About running "out of gates" with larger design, well, we are not wasting resources. The objective of this is not to implement a top of the line raytracer but just to show how you can easily implement hardware design that runs as software as-is. By no means we want to translate a random C design to FPGA, this is jost a hardware design tool that may take some C code as is, and is able to run it fast in a CPU for simulation purposes. We don't believe a tool that tries to map any C code is anything good, like the mentioned tools that you says works as that. If they're able to reconfigure the chip fast, we think this is a good feature, everyone wants that.
About resource usage: when you run out of resources, it's obvious you need to look for optimizations or change the chip. This is natural, and our tool doesn't interfere with that at all.
    Thanks for your kudos! And let's keep learning.
  • bit_user
    suarezvictor said:
we suggest you review the code in full to spot the real issues with the project.
    It did become apparent there was more going on in your project's C/C++ code than I was fully able to review. However, I have limited time and provided what feedback I could. At least, for now.

    suarezvictor said:
For example, we don't use the stock sqrt but an optimized one, so the possible pitfalls you mention aren't really there: we selected a fast algorithm that gives enough precision and is implemented using just 4 multiplications.
I would just point out that the VSQRTPS instruction can operate on 8x fp32 elements with a throughput of about one per 6 cycles on Skylake cores. I haven't had much luck finding instruction throughput & latency data on AMD CPUs, but perhaps Zen 2 is similar?

    For even higher throughput, Intel recommends VRSQRTPS + one Newton-Raphson iteration.

    I think this highlights some of the benefits a pure SIMD implementation can provide.
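    Concretely, that recipe looks something like this (a sketch, untested; note that x = 0 needs special-casing, since the estimate is infinite there):

    #include <immintrin.h>

    // sqrt(x) for 8 floats: VRSQRTPS estimate of 1/sqrt(x), one
    // Newton-Raphson refinement, then multiply by x.
    static __m256 fast_sqrt_ps(__m256 x)
    {
        __m256 r     = _mm256_rsqrt_ps(x);  // ~12-bit 1/sqrt estimate
        __m256 half  = _mm256_set1_ps(0.5f);
        __m256 three = _mm256_set1_ps(3.0f);
        // r = 0.5 * r * (3 - x*r*r), one Newton-Raphson iteration
        __m256 xrr = _mm256_mul_ps(x, _mm256_mul_ps(r, r));
        r = _mm256_mul_ps(_mm256_mul_ps(half, r), _mm256_sub_ps(three, xrr));
        return _mm256_mul_ps(x, r);         // sqrt(x) = x * 1/sqrt(x)
    }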


BTW, I'd have loved to have a project like this about 15 years ago. I was very interested in video processing algorithms, like deinterlacing and motion interpolation. A decent FPGA seemed just the thing for doing this stuff in realtime. Of course, nowadays, I'd just use a GPU... ¯\_(ツ)_/¯
  • suarezvictor
Indeed, we took care to write this code to be optimized enough. In regards to the SIMD sqrt: if we needed to hand-code a SIMD version by reviewing instruction-set manuals or writing assembly, the whole point of "easy to write" would be lost. A hand-coded Verilog implementation of the hardware could be optimized too, at the cost of much higher development time. So we rely on current compiler optimizations, and I think that's fair enough for a comparison. In any case, even assuming a hand-coded SIMD version would double performance, we are still comparing a 28nm FPGA against a 7nm CPU, so there's plenty of headroom to remain much lower power. And we're not counting other options, like FPGAs with floating-point hard cores.

    The main power benefit comes from the architecture: the long pipeline of hardwired logic, running at a much lower clock rate and avoiding instruction decoding and the like, brings a lot of advantages. If you translated that to an ASIC, the gains could be hard to believe... And we provide an easy way to access all those benefits from clean C code that you can run so fast on a CPU that you can play a full-HD game smoothly. That is no easy task, if possible at all, with current hardware design tools.