Researchers from MIT want to speed up the process of monitoring, diagnosing, and fixing problems with multimillion-dollar supercomputers by visualizing the hardware in the Unity 3D game engine, the same engine that powers many video games. Supercomputers are extremely complex machines built from thousands of interconnected systems. Bottlenecks can appear anywhere in such a system, and tracking them down costs supercomputing facilities both time and performance.
A typical supercomputer is made up of many building blocks called nodes, and each node contains a specific set of hardware components. As a vastly oversimplified explanation, some nodes are designed for storing data while others handle computation. The compute nodes typically contain processors and main system memory.
Engineers test the machine continuously during installation and run into problems along the way. Storage, processors, and even networking can all misbehave, and finding the root cause is difficult in systems of this scale. For example, the upcoming Frontier supercomputer should have around 100 racks containing tens of nodes each, resulting in thousands of nodes to diagnose and monitor.
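To get a feel for that scale, here is a minimal Python sketch that builds a toy inventory of racks and nodes and counts them. The 64 nodes per rack, the 10 storage racks, and the `Node` and `build_inventory` names are all assumptions made for illustration; they are not Frontier's real layout or anything taken from the MIT tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    rack: int   # index of the rack that holds this node
    slot: int   # position of the node inside its rack
    kind: str   # "compute" or "storage"


def build_inventory(racks: int, nodes_per_rack: int, storage_racks: int = 0) -> List[Node]:
    """Return one Node per rack slot; the first `storage_racks` racks hold storage nodes."""
    return [
        Node(rack=r, slot=s, kind="storage" if r < storage_racks else "compute")
        for r in range(racks)
        for s in range(nodes_per_rack)
    ]


# Assumed figures: 100 racks, 64 nodes per rack, 10 of the racks used for storage.
inventory = build_inventory(racks=100, nodes_per_rack=64, storage_racks=10)
compute = sum(1 for n in inventory if n.kind == "compute")
print(len(inventory), compute)  # prints: 6400 5760
```

Even with these modest made-up numbers, an administrator would have several thousand nodes to keep an eye on, which is the problem a visual overview tries to address.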
To help streamline this kind of work, researchers from MIT have developed a tool that visualizes node monitoring, offering real-time system reporting inside the Unity 3D game engine found in many video games. Called MM3D, it is part of the Data Center Infrastructure Management (DCIM) tools developed by the MIT SuperCloud division.
The paper notes: "The combination of supercomputing analytics and 3D gaming visualization enables real-time processing and visual data display of massive amounts of information that humans can process quickly with little training. Our system fully utilizes the capabilities of modern 3D gaming environments to create novel representations of computing hardware which intuitively represent the physical attributes of the supercomputer while displaying real-time alerts and component utilization."
In other words, the 3D application displays component utilization and any system alerts in real time. For instance, if a specific node starts overheating, an alert appears in the 3D view and the system administrator sees it immediately. You can see a system demonstration below, with the first image showing the actual hardware and the second showing its visualized counterpart.
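The overheating scenario can be sketched in a few lines of Python. This is only an illustration of the general idea of turning node readings into color-coded alerts, not the MM3D implementation: the `NodeReading` fields, the temperature thresholds, and the color mapping are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Status(Enum):
    # Colors a 3D view might use to tint a node (assumed mapping).
    OK = "green"
    WARNING = "yellow"
    CRITICAL = "red"


@dataclass
class NodeReading:
    node_id: str
    cpu_utilization: float  # fraction between 0.0 and 1.0
    temperature_c: float    # degrees Celsius


# Hypothetical thresholds; a real facility would tune these per hardware type.
WARN_TEMP_C = 75.0
CRIT_TEMP_C = 90.0


def classify(reading: NodeReading) -> Status:
    """Map one reading to a status based on temperature alone."""
    if reading.temperature_c >= CRIT_TEMP_C:
        return Status.CRITICAL
    if reading.temperature_c >= WARN_TEMP_C:
        return Status.WARNING
    return Status.OK


def active_alerts(readings: List[NodeReading]) -> List[Tuple[str, Status]]:
    """Return (node_id, status) for every node that is not OK."""
    return [(r.node_id, classify(r)) for r in readings if classify(r) is not Status.OK]


readings = [
    NodeReading("r01-n03", 0.42, 61.0),
    NodeReading("r01-n04", 0.97, 93.5),
    NodeReading("r02-n11", 0.80, 78.0),
]
for node_id, status in active_alerts(readings):
    print(f"{node_id}: {status.name} ({status.value})")
```

In a visualization like the one described, the returned status would drive the color of the matching node in the 3D scene, so an overheating node stands out without anyone having to read through logs.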
While this is not a commercial application yet, the academic project could be a step forward for supercomputer monitoring and make system administration easier. Given that academic institutions share their work with other entities, it's probably only a matter of time before we see a similar solution in the wild.