Date: 2024-02-24

Simultaneous and Heterogeneous Multithreading (SHMT) may be the solution that can harness the power of a device's CPU, GPU, and AI accelerator all at once, according to a research paper from the University of California, Riverside. The paper claims that this new multithreading technique can double performance and halve power consumption, which works out to four times the efficiency. However, SHMT is only a proof of concept in its early stages, so don't get excited too fast.

Many devices already use multithreading techniques like Simultaneous Multithreading (SMT), which divides a processor core into two threads for more efficient computing. However, SHMT spans multiple processors: a CPU, a GPU, and at least one AI accelerator. The idea is to get each processor working on separate pieces of work simultaneously and even spread GPU and AI resources across multiple tasks.

According to the paper authored by Hung-Wei Tseng and Kuan-Chieh Hsu, SHMT can improve performance by 1.95 times and reduce power consumption by 51%. These results were recorded on Nvidia's Maxwell-era Jetson Nano, which features a quad-core Arm Cortex-A57 CPU, 4GB of LPDDR4, and a 128-core GPU. Additionally, the researchers installed a Google Edge TPU into the Jetson's M.2 slot to provide the AI accelerator, as the Jetson doesn't come with one.
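To see how the headline figures combine into the "four times the efficiency" claim, here is a quick back-of-the-envelope check. It takes the article's two numbers at face value and assumes "efficiency" means performance per watt; the function name is ours, not the paper's.

```python
def efficiency_gain(speedup, power_reduction):
    """Performance-per-watt gain relative to the baseline.

    Assumption: efficiency = performance / power, with the baseline
    normalized to 1.0 for both quantities.
    """
    relative_power = 1.0 - power_reduction  # 51% less power -> 0.49x
    return speedup / relative_power


# Figures quoted from the paper: 1.95x performance, 51% less power.
gain = efficiency_gain(1.95, 0.51)
print(f"{gain:.2f}x")  # about 3.98x
```

So "four times" is a rounding of roughly 3.98 times under this reading.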

The researchers achieved this result by creating a quality-aware work-stealing (QAWS) scheduler. Essentially, the scheduler is tuned to avoid high error rates and to balance the workload evenly among all components. Under QAWS policies, tasks that require high precision and accuracy won't be assigned to sometimes error-prone AI accelerators, and tasks will be dynamically reassigned to other components if one isn't meeting performance expectations.
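The two rules described above (keep precision-sensitive work off the accelerator, and let faster components take over queued work from slower ones) can be sketched as a small simulation. This is a simplified illustration, not the paper's QAWS implementation: the Task and Device classes, the round-robin placement, the per-step speeds, and the task mix are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    needs_high_precision: bool


@dataclass
class Device:
    name: str
    high_precision: bool  # hypothetical flag: False for the AI accelerator
    queue: deque = field(default_factory=deque)
    done: list = field(default_factory=list)

    def can_run(self, task):
        # Quality rule: precision-sensitive tasks need a precise device.
        return self.high_precision or not task.needs_high_precision


def assign(tasks, devices):
    """Initial round-robin placement that respects the quality rule."""
    i = 0
    for task in tasks:
        for _ in range(len(devices)):
            dev = devices[i % len(devices)]
            i += 1
            if dev.can_run(task):
                dev.queue.append(task)
                break
        else:
            raise ValueError(f"no device can run {task.name}")


def steal(thief, devices):
    """Take one compatible task from the tail of the fullest other queue."""
    victims = sorted((d for d in devices if d is not thief),
                     key=lambda d: len(d.queue), reverse=True)
    for victim in victims:
        for task in reversed(victim.queue):
            if thief.can_run(task):
                victim.queue.remove(task)
                return task
    return None


def run(devices, speeds):
    """Simulate time steps; speeds[name] is tasks completed per step (>= 1)."""
    while any(d.queue for d in devices):
        for dev in devices:
            for _ in range(speeds[dev.name]):
                task = dev.queue.popleft() if dev.queue else steal(dev, devices)
                if task is None:
                    break  # nothing local and nothing it is allowed to steal
                dev.done.append(task)


devices = [Device("CPU", True), Device("GPU", True), Device("TPU", False)]
tasks = [Task(f"t{i}", needs_high_precision=(i % 3 == 0)) for i in range(12)]
assign(tasks, devices)
run(devices, {"CPU": 1, "GPU": 3, "TPU": 2})
for dev in devices:
    print(dev.name, [t.name for t in dev.done])
```

In this sketch the fast GPU drains its own queue and then steals from the slower CPU and the TPU, while the TPU never receives a precision-sensitive task, whether by initial placement or by stealing.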


With double the performance, half the power, and four times the efficiency, you might be wondering what the catch is. According to the paper, "the limitation of SHMT is not the model itself but more on whether the programmer can revisit the algorithm to exhibit the type of parallelism that makes SHMT easy to exploit." In other words, software must be written specifically to take advantage of SHMT, and not all software can use it to maximum effect.

Rewriting software is known to be a pain; for instance, Apple had to do lots of legwork when it switched its Macs from Intel processors to its in-house Arm chips. Multithreading in particular can take developers a while to adjust to. It took several years for software to take advantage of multi-core CPUs, and we may be looking at a similar timeline for developers to utilize multiple components for the same task.

Additionally, the paper details how SHMT's performance uplift hinges on problem size. The figure of 1.95 times faster comes from the largest problem size the paper tested, and smaller problem sizes see lower performance gains. At the smallest problem size, there was essentially no performance benefit, since smaller problems offer fewer opportunities for all components to work in parallel.
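A toy model helps show why small problems leave little to share. It is not taken from the paper: it assumes the work splits into equal chunks, that each chunk goes to whichever component would finish it soonest, and that the per-chunk times below (which are made up) hold for every chunk. The baseline is the fastest component working alone.

```python
def makespan(num_chunks, chunk_times):
    """Greedy schedule: each chunk goes to the device that finishes it first.

    chunk_times[d] is the (hypothetical) time device d needs for one chunk.
    Returns the time at which the last device finishes.
    """
    finish = [0.0] * len(chunk_times)
    for _ in range(num_chunks):
        d = min(range(len(chunk_times)),
                key=lambda i: finish[i] + chunk_times[i])
        finish[d] += chunk_times[d]
    return max(finish)


def speedup(num_chunks, chunk_times):
    """Speedup over running every chunk on the fastest device alone."""
    baseline = num_chunks * min(chunk_times)
    return baseline / makespan(num_chunks, chunk_times)


# Made-up per-chunk times for a GPU, an AI accelerator, and a CPU.
times = [1.0, 1.5, 4.0]
for n in (1, 2, 8, 64, 1000):
    print(n, round(speedup(n, times), 2))
```

With a single chunk, only one component has anything to do, so the speedup in this model is exactly 1. As the chunk count grows, the speedup approaches the combined throughput of all three components relative to the fastest one, which is about 1.92 for these invented numbers.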

As computers of all sorts are increasingly shipped with multiple computing devices like AI processors, it's probably inevitable that developers will want to use more hardware to speed things up. Even if SHMT doesn't live up to the best-case scenario that the paper outlines, it could still boost PCs and smartphones if and when it or a similar technology gains mainstream momentum.