Arm ML: Low-Power, High-Performance Machine Learning at The Edge

Editor's Note: This article is sponsored content from Arm and was not reported or published by the Tom's Hardware staff.

The machine learning (ML) market is advancing rapidly; by one estimate, the AI software market will grow to $59.8 billion by 2025. While ML-powered AI is already impacting almost every sector imaginable, much of this value is driven by the recent explosion in connected devices, which are now prevalent everywhere from farming and industry to healthcare and even our own connected homes.

Virtual assistants, for example, were practically unheard of a decade ago, yet at the start of 2019 Amazon announced that an incredible 100 million Alexa-equipped devices had been sold. It’s a vivid illustration of our enthusiasm for connected devices – and for issuing commands to a small, inanimate object. (Other virtual assistants are available. Stand up Apple’s Siri, Microsoft’s Cortana and the Google Assistant.)

But as connected devices proliferate and grow ever smarter, the race is on for developers to deliver innovation while simultaneously honing their platforms for optimal performance within an increasingly constrained power budget.

And as more and more ML happens on-device, the requirements of the platform shift to accommodate the challenges of edge-based compute.

Why ML is Moving to the Edge

On-device inference has been gaining ground recently, with an increasing shift of functionality from the cloud to the edge device. The benefits of edge ML are well documented: reliability; consistent performance, even without an internet connection; reduced latency, since all the processing is done right there on the device; and privacy, since data is less exposed when it stays on the device.

So what does a platform need to run ML workloads at the edge? The answer is dependent on the application, as well as the power and area constraints: many embedded devices will need nothing more than a small, low-power microcontroller unit (MCU), and the vast majority of smartphones on the market today are running accurate, performant ML on a CPU.

Because of continuous innovation to drive performance and efficiency, the CPU has evolved into a kind of mission control for ML, either single-handedly managing entire ML workloads or distributing selected tasks to specialized ML processors. Where responsiveness or power efficiency is critical, however, a CPU may struggle to keep up, and a dedicated neural processing unit (NPU) may be the most appropriate solution: an NPU comes into its own wherever the most intensive, most efficient performance is required. But how do you pick the right one?
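As a rough illustration of how simple CPU-only inference can be, the sketch below runs a TensorFlow Lite model entirely on the device’s CPU using the standard Python interpreter API; the model file name and the zeroed input are placeholders.

```python
# Minimal sketch: on-device inference on a CPU with the TensorFlow Lite interpreter.
# "model.tflite" is a placeholder for whatever model the application ships with.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input that matches the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```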

A Dedicated Processor for High Performance Requirements

There are four key features that must be present in any good NPU:

  • Static Scheduling

Neural networks are statically analyzable: they are deterministic, allowing everything to be laid out in memory ahead of time, and an ML processor design must take advantage of this. Where a CPU typically relies on complex cache hierarchies and is designed around optimizing non-deterministic memory accesses, an ML processor takes a deterministic NN and a command stream (crafted carefully by the compiler) and uses simplified flow control and simplified hardware to deliver relatively predictable performance. This approach nets a reduction in both power and area.
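To make the idea concrete, here is a deliberately simplified sketch of static scheduling: because the graph and every tensor size are known up front, an offline tool can fix the execution order and pre-assign every buffer offset before the first inference runs. The layer names and sizes are invented for illustration; this is not Arm’s actual compiler.

```python
# Illustrative toy "static scheduler": for a deterministic graph, execution order
# and buffer placement can be decided entirely at compile time.
layers = [                     # (layer name, output size in bytes) - placeholder values
    ("conv1", 512 * 1024),
    ("conv2", 256 * 1024),
    ("pool",  64 * 1024),
    ("fc",    4 * 1024),
]

command_stream = []
offset = 0
for name, out_bytes in layers:
    # Nothing here depends on run-time data, so no caches or dynamic allocation are needed.
    command_stream.append({"op": name, "out_offset": offset, "out_bytes": out_bytes})
    offset += out_bytes

for command in command_stream:
    print(command)             # this fixed command stream is all the hardware has to follow
```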

  • Efficient Convolutions

While network architectures have evolved over the years, certain things have remained constant and are expected to remain so over the next few years. Convolution is one such operation, and any ML processor must be optimized for it.

The Arm ML processor is optimized for CNN (convolutional neural network) and RNN (recurrent neural network) support, with a high degree of stability built into the architecture, and it is highly optimized for executing convolutions efficiently. It comes with a compiler that takes an NN and maps it to a command stream, which is then consumed by the ML processor – taking full advantage of the statically analyzable nature of NNs.
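The sketch below is a naive, single-channel 2D convolution in NumPy, shown only to highlight the multiply-accumulate pattern that a dedicated convolution engine is built to execute at scale; the shapes, stride and padding are kept trivially simple.

```python
import numpy as np

def conv2d(x, w):
    """Naive 2D convolution (one channel, stride 1, no padding) - for illustration only."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output element is a small multiply-accumulate (MAC) reduction;
            # an NPU performs enormous numbers of these MACs in parallel.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

print(conv2d(np.random.rand(8, 8), np.random.rand(3, 3)).shape)  # -> (6, 6)
```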

  • Bandwidth Reduction Mechanisms

Memory bandwidth is an important criterion in any ML system design: many NNs become memory bound, leaving the processor’s spare compute capability unused. To reduce memory bandwidth, trips to DRAM must be minimized, and computation can be made more efficient through optimizations such as pruning, clustering and compression.

To tackle this challenge, the Arm ML processor has a number of bandwidth-reduction mechanisms built in. It uses weight compression, activation compression and tiling to manage DRAM power, which can be nearly as high as the processor power itself; lossless compression alone saves an average of 3x. Likewise, onboard memory reduces traffic to external memory, and so reduces power.

Weight bandwidth can dominate the later layers of a network, and pruning during the training phase increases the number of zero weights. Clustering then “snaps” the remaining non-zero weights to a smaller set of possible values. Models are compressed offline during the compilation phase, exploiting both pruning and clustering, and weights stay compressed until they are read from the internal SRAM. With tiling, scheduling is tuned to keep the working set in SRAM: tiled or wide scheduling avoids trips to DRAM, while multiple outputs are calculated in parallel from the same input.
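As a rough, NumPy-only sketch of the pruning and clustering ideas above: small weights are zeroed and the survivors are snapped to a small codebook of values, which makes the tensor far more compressible. The threshold, the codebook size and the quantile-based codebook are invented for illustration and are not the processor’s actual scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

# Pruning: zero out small-magnitude weights (in practice this happens during training).
threshold = 0.5
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

# Clustering: "snap" the surviving weights to a small set of representative values.
n_clusters = 16
nonzero = pruned[pruned != 0]
codebook = np.quantile(nonzero, np.linspace(0, 1, n_clusters))   # crude stand-in for k-means
nearest = np.abs(nonzero[:, None] - codebook[None, :]).argmin(axis=1)
clustered = pruned.copy()
clustered[clustered != 0] = codebook[nearest]

sparsity = 1.0 - np.count_nonzero(clustered) / clustered.size
print(f"sparsity: {sparsity:.0%}, distinct non-zero values: {np.unique(clustered).size - 1}")
```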

  • Program Flexibility

NNs are still in their early stages, and it’s impossible to predict which operators or layers may need to be supported in the future. Yet any ML processor deployed today must be able to support the operators of tomorrow: the capacity to add operator support after the tape-out of an SoC is key to the success of any ML processor design.

Arm’s programmable layer engine is an important feature in this system: it helps future-proof the design, leverages existing Arm technology, and builds in no hardware assumptions about the operators it may be asked to run. It provides a way to run future operators on the processor after tape-out.
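To illustrate the general idea – not Arm’s actual programmable layer engine – the sketch below uses a simple dispatch registry: operators the fixed-function hardware knows about are sent there, while anything invented later falls back to a programmable path that can be extended after the silicon ships. All names here are hypothetical.

```python
# Conceptual sketch of post-tape-out operator support via a software-extensible registry.
import numpy as np

HARDWARE_OPS = {"conv2d", "relu", "pooling"}   # operators baked into fixed-function hardware
programmable_ops = {}                          # operators added after "tape-out"

def register_op(name):
    """Register an operator implementation for the programmable path."""
    def wrapper(fn):
        programmable_ops[name] = fn
        return fn
    return wrapper

@register_op("hard_swish")                     # an operator unknown when the chip was designed
def hard_swish(x):
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def run_op(name, x):
    if name in HARDWARE_OPS:
        return f"dispatched {name} to fixed-function hardware"
    if name in programmable_ops:
        return programmable_ops[name](x)       # executed on the programmable engine
    raise NotImplementedError(name)

print(run_op("hard_swish", np.array([-4.0, 0.0, 4.0])))
```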

The Importance of Software

Hardware alone isn’t enough to deliver an effective solution for machine learning at the edge; software is also an important component. The right software allows application developers to write ML applications using their favorite NN frameworks — such as Google’s TensorFlow or Facebook’s Caffe2 — and target a variety of processor types.

Arm NN is open-source software designed to bridge the gap between existing neural network (NN) frameworks and the underlying IP. It provides a translation layer between those frameworks – including TensorFlow, TFLite, PyTorch, ONNX, Caffe and Caffe2, Android NNAPI, and MXNet – and the hardware beneath, giving developers the ability to keep their existing workflows and tools. Through Arm NN, SoC and NN vendors can continue to differentiate on their key competitive advantages while leveraging the open-source framework to minimize duplication of effort.
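As a rough sketch of what keeping the existing workflow can look like in practice, the snippet below loads an unmodified TFLite model and hands execution to Arm NN through TensorFlow Lite’s external delegate mechanism. The delegate library name, the option keys and the model path are assumptions that depend on how Arm NN is built and installed on the target, so treat them as placeholders.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Assumed library name and option keys - check the Arm NN build on your target for exact values.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "CpuAcc,GpuAcc", "logging-severity": "warning"},
)

interpreter = tflite.Interpreter(
    model_path="model.tflite",                 # placeholder model
    experimental_delegates=[armnn_delegate],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(interpreter.get_output_details()[0]["index"]))
```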

Flexible Innovation for the Future

With an eye to driving future innovation, Arm donated Arm NN to Linaro’s Machine Intelligence Initiative, enabling the wider industry to benefit from the 100 man-years of effort and over 445,000 lines of code that have created this optimized, open framework for ML at the edge. With the support of key players in the ecosystem, Arm will continue to invest significantly in Arm NN and its supporting libraries, while allowing third-party IP developers to add their own support to the Arm NN framework.

As more and more ML moves to the edge, this kind of collaboration – and a standardized, open-source software approach – will become increasingly important. It’s estimated that Arm NN is already shipping in over 250 million Android devices; as the move to the edge extends to the very smallest microcontroller CPUs, the reach of Arm NN will grow to billions – and, eventually, trillions – of secure, connected devices.

Low-Power, High-Performance Machine Learning at The Edge

Selecting the right solution for each application entails a series of trade-offs, from cost and power constraints to performance and programmability. ML is a heterogeneous compute problem that requires diverse, ML-optimized hardware solutions and a common, open-source software framework to provide a spectrum of performance, power and cost points. Arm is unique in addressing all these challenges with a flexible, scalable platform that delivers the highest throughput and most efficient processing for ML inference at the edge.

To find out more, visit Arm’s ML Developer Community.