
DARPA Wants To Build a 1,000x More Efficient Graph Analytics Processor With Intel, Qualcomm's Help

The Defense Advanced Research Projects Agency (DARPA) announced that it has selected five participants for its Hierarchical Identify Verify Exploit (HIVE) program to develop a high-performance data handling platform. Intel and Qualcomm will be among the five participants that will help the agency build the new graph analytics platform.

“The HIVE program is an exemplary prototype for how to engage the U.S. commercial industry, leverage their design expertise, and enhance U.S. competitiveness, while also enhancing national security,” said William Chappell, director of DARPA's Microsystems Technology Office (MTO), in the release announcing the selections. “By forming a team with members in both the commercial and defense sectors, we hope to forge new R&D pathways that can deliver unprecedented levels of hardware specialization,” he added.

A main objective of the HIVE program is to create a graph analytics processor that can more efficiently find and represent links between data elements and categories. These could include person-to-person interactions as well as seemingly disparate links, such as those between geography and changes in doctor-visit trends, or between social media and regional strife.

Unlike traditional analytic tools that study one-to-one or one-to-many relationships, graph analytics can use algorithms to process and interpret data in “many to many” relationships. An example of this would be all the Amazon users and all the products they’ve bought on the site.
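As a rough illustration of a many-to-many relationship, the sketch below indexes a handful of purchases in both directions and then walks two hops through the graph (user to products to other users). The purchase records and the function names `build_bipartite` and `co_purchasers` are made up for this example; it is not how any HIVE hardware or Amazon system works.

```python
from collections import defaultdict

# Hypothetical purchase records: (user, product) pairs.
purchases = [
    ("alice", "kettle"),
    ("alice", "novel"),
    ("bob", "kettle"),
    ("bob", "headphones"),
    ("carol", "novel"),
]

def build_bipartite(pairs):
    """Index a many-to-many relation in both directions."""
    bought_by_user = defaultdict(set)
    buyers_of_product = defaultdict(set)
    for user, product in pairs:
        bought_by_user[user].add(product)
        buyers_of_product[product].add(user)
    return bought_by_user, buyers_of_product

def co_purchasers(user, bought_by_user, buyers_of_product):
    """Users sharing at least one product with `user` (a two-hop traversal)."""
    related = set()
    for product in bought_by_user.get(user, ()):
        related |= buyers_of_product[product]
    related.discard(user)  # a user is not their own co-purchaser
    return related

bought, buyers = build_bipartite(purchases)
print(co_purchasers("alice", bought, buyers))
```

Each user links to many products and each product links back to many users, which is why a single lookup table keyed on one side is not enough to answer questions that cross the relationship twice.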

In combination with other machine learning techniques that can categorize raw data elements and update the elements as new data arrives, the graph processors should be able to discern hidden causal relationships among the data elements.

DARPA believes that such a graph processor could achieve a “thousandfold improvement in processing efficiency” over today’s best processors. That should enable the real-time identification of strategically important relationships as they unfold in the field, rather than after the fact in data centers.

“By mid-2021, the goal of HIVE is to provide a 16-node demonstration platform showcasing 1,000x performance-per-watt improvement over today’s best-in-class hardware and software for graph analytics workloads,” said Dhiraj Mallick, vice president of the Data Center Group and general manager of the Innovation Pathfinding and Architecture Group at Intel. “Intel’s interest and focus in the area may lead to earlier commercial products featuring components of this pathfinding technology much sooner,” he noted.

  • bit_user
    I know this is graph processing - not graphics - but I'd be surprised if they could build hardware 1000x the perf/W of today's GPUs w/ HBM2.

    https://devblogs.nvidia.com/parallelforall/gpus-graph-predictive-analytics/
  • SockPuppet
    This is custom-purpose hardware. It will be many times faster than a GPU at the job it was designed for.
  • gc9
    GPUs have higher memory bandwidth available than CPUs, and tolerate longer memory latency, but they don't address the problem that in graph processing much of the bandwidth is wasted on data that is compared and discarded. One idea is smarter memory chips that can compare and discard internally, reducing the time and energy shuttling results back to the processors.

    A motivating example seems to be that the full bandwidth can be used to fetch useful values when traversing dense arrays, but not sparse matrices.

    Consider a dense 2D matrix that contains elements of a row in adjacent memory locations (a row-major representation). Processors can automatically predict the address of the next few elements and prefetch them, and memory can send a series of adjacent values. When striding (such as following down a column), the processor can still predict the next address and prefetch it, although only one or two values in the cache line are needed, wasting bandwidth.

    In a sparse matrix, the elements may be represented as rows that alternate column number and value (or as a row of column numbers and a row of values). The processor cannot easily predict the address of the desired values. It has to scan down the row (or do a binary search) to find whether an element exists, and where in the row it exists. During this scan or binary search, most of the results might be discarded. Also, in the binary search case, additional time is lost waiting for each round trip to memory and back.

    Graph networks might be represented similarly to a sparse matrix. Each node has a list of its neighbors labeled with a relationship id and neighbor node id. So again the processor frequently scans down a list of neighbors until it finds the relationship id. A hash table could be used, but hash tables may not fill cache memory densely, and they also might contain lists when there are multiple neighbors for the same relationship.

    To make memory smarter, it should return only matching values, not values that will be immediately discarded and never used. So one approach is to put a filter on the memory output buffer to offload some of the associative matching and discarding. Another approach would be to add simple processing to do a scan or binary search. Another approach is to build memories that are addressed associatively, so a whole row could be queried and only values with a matching id or column number are queued in the output buffer.

    To make processors better for graphs, they might have shorter cache lines, or at least the ability not to fill the whole cache line if it is not wanted (multiple presence and dirty bits). Also they need to communicate with the smarter memory.

    On the other hand, the applications appear to be for analyzing large networks, so these are specialized hardware devices for governments and organizations with financially deep pockets. Unless someone can come up with a consumer application as intensive as video games and cryptomining have been for GPUs...
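The sparse-row lookup that gc9 describes above can be sketched in Python. This is only an illustration under assumptions: the row uses a CSR-like layout of two parallel, column-sorted lists, the data values are invented, and `memory_side_filter` is a hypothetical software stand-in for the "smart memory" idea, not a real device interface. The counters show how many entries are touched to return a single useful value.

```python
# One sparse row stored as two parallel, column-sorted lists
# (an assumed CSR-like layout; the data is made up).
cols = [3, 17, 42, 256, 1024, 4096]
vals = [0.5, 1.25, 2.0, 0.75, 3.5, 9.0]

def lookup_scan(cols, vals, col):
    """Linear scan: returns (value or None, number of entries read)."""
    reads = 0
    for c, v in zip(cols, vals):
        reads += 1
        if c == col:
            return v, reads
        if c > col:  # columns are sorted, so the element is absent
            break
    return None, reads

def lookup_bisect(cols, vals, col):
    """Binary search: returns (value or None, number of probes).

    Each probe depends on the previous one, so on real hardware each
    would be a separate round trip to memory.
    """
    lo, hi, probes = 0, len(cols), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if cols[mid] < col:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(cols) and cols[lo] == col:
        return vals[lo], probes
    return None, probes

def memory_side_filter(cols, vals, wanted):
    """Hypothetical 'smart memory': only matching entries leave the row."""
    wanted = set(wanted)
    return [(c, v) for c, v in zip(cols, vals) if c in wanted]

print(lookup_scan(cols, vals, 1024))    # value plus entries read
print(lookup_bisect(cols, vals, 1024))  # value plus dependent probes
print(memory_side_filter(cols, vals, [42, 1024]))
```

In the scan, every entry read before the match is fetched and thrown away; in the binary search, fewer entries are read but each probe waits on the one before it. A neighbor list keyed by relationship id would behave the same way, which is why filtering next to the memory looks attractive.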
  • bit_user
    SockPuppet said:
    This is custom-purpose hardware. It will be many times faster than a GPU at the job it was designed for.
    Yeah, I get that. I just said not 1000x. I take the 1000x figure as maybe applying to server CPUs.
  • bit_user
    gc9 said:
    GPUs have higher memory bandwidth available than CPUs, and tolerate longer memory latency, but they don't address the problem that in graph processing much of the bandwidth is wasted on data that is compared and discarded. One idea is smarter memory chips that can compare and discard internally, reducing the time and energy shuttling results back to the processors.
    The reason I specifically mentioned it is that I see HBM2 as a partial solution to this problem.

    I was imagining the ultimate solution might look like a smart memory. However, there might be another way to slice it. Instead of using a few channels of fast DRAM that relies on large bursts to achieve high throughput, one could use many more channels of slower memory that has lower overhead and a correspondingly lower penalty for small reads.

    Intel's 3D XPoint is supposedly bit-addressable. Perhaps the selection of Intel as a partner wasn't quite as arbitrary as it might seem.

    gc9 said:
    On the other hand, the applications appear to be for analyzing large networks, so these are specialized hardware devices for governments and organizations with financially deep pockets. Unless someone can come up with a consumer application as intensive as video games and cryptomining have been for GPUs...
    Google's development of their own TPUs has shown even a single cloud operator can drive enough demand for specialized analytics hardware.