Imagination shifts AI strategy and abandons NPUs — company secures $100M in financing
That's a lot of money!
Imagination Technologies, the company now best known for the GPU IP used in Apple's iPhones and iPads, has just made a major shift in its AI strategy: it has dropped its dedicated NPUs and is adding similar functionality to its GPUs, according to EE Times. Coincidentally, it has just received $100 million in funding from Fortress Investment Group.
Imagination has been around for about 40 years and has done many things, but it became best known for its Kyro-branded PowerVR graphics processors for PCs and for its PowerVR GPU IP, which Apple, Intel, and dozens of other companies licensed for their chips. Imagination is also a player in the AI field: it has 'rebooted' its approach to AI by discontinuing its standalone neural network accelerators and concentrating on embedding AI capabilities into its GPU IP products.
In the EE Times interview, Imagination admitted that the strategic shift was made over the past 18 months because its proprietary software stack could not keep pace with the rapid evolution of AI and with customers' diverse needs, which required custom solutions.
Given the dominance of Nvidia's CUDA, the challenges extended beyond hardware development; creating an AI software stack that could expose the capabilities of ImgTec's NPUs to developers and customers proved to be a significant hurdle. Customers often arrived with their own AI models and expected Imagination to optimize them for the company's hardware. This created competing demands and constant pressure on Imagination's development team, straining it without generating corresponding revenue.
Software is key.
Recognizing these difficulties, Imagination shifted its focus to GPUs, which are inherently multithreaded and well suited to tasks that require efficient parallel processing and data movement. Given their flexibility, the company believes GPUs can be enhanced with additional AI-specific compute capabilities, making them a good fit for edge AI applications, particularly in devices that already have a GPU, such as smartphones, ImgTec's home turf.
To support this transition, Imagination is repurposing technology developed for its now-discontinued accelerators, such as its graph compiler, and integrating it into its GPU stack. This allows the company to leverage its strengths while competitively addressing AI workloads.
For example, Imagination collaborated with the UXL Foundation on SYCL, a framework designed to rival Nvidia’s CUDA. This shift also aligns with the needs of Imagination’s customers, who already use on-chip GPUs for AI and graphics processing.
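SYCL itself is an open, C++-based programming model from the Khronos Group, so in principle the same kernel source can target GPUs from multiple vendors. As a rough, vendor-neutral illustration (a minimal sketch, not tied to Imagination's toolchain), a simple SYCL vector addition looks something like this:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t N = 1024;
    sycl::queue q;                                   // default device: a GPU if one is available
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(N));
        sycl::buffer<float> B(b.data(), sycl::range<1>(N));
        sycl::buffer<float> C(c.data(), sycl::range<1>(N));
        q.submit([&](sycl::handler& h) {
            sycl::accessor ra(A, h, sycl::read_only);
            sycl::accessor rb(B, h, sycl::read_only);
            sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];               // element-wise add runs on the device
            });
        });
    }                                                // buffers go out of scope, results copy back to c
    std::cout << "c[0] = " << c[0] << std::endl;     // prints 3
    return 0;
}
```

Compiling and scheduling kernels like this efficiently for a particular GPU is the kind of software-stack work the article describes Imagination refocusing on.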
Perhaps that openness has enabled Imagination to secure a $100 million convertible term loan from Fortress Investment Group affiliates.
While Imagination is committed to this GPU-focused approach, the company remains open to revisiting dedicated AI accelerators in the future, depending on how AI software infrastructure evolves, according to EE Times. For now, the company believes that compute-oriented GPUs offer the best platform to meet their customers’ AI needs.
Anton Shilov is a contributing writer at Tom's Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
bit_user
I always preferred seeing phone SoCs double down on iGPUs as a way of increasing AI performance, rather than integrating special-purpose NPUs that can't be used for anything else.
I think the main disadvantage of GPUs might be that their data movement primitives are mostly DRAM-oriented and not tile-to-tile, like how a neural network prefers to operate. This loses data locality and ends up making GPUs less efficient.
AkroZ
bit_user said:
"I always preferred seeing phone SoCs double down on iGPUs as a way of increasing AI performance, rather than integrating special-purpose NPUs that can't be used for anything else. I think the main disadvantage of GPUs might be that their data movement primitives are mostly DRAM-oriented and not tile-to-tile, like how a neural network prefers to operate. This loses data locality and ends up making GPUs less efficient."
From my knowledge there is no "tile-to-tile" instruction on an NPU; to make that possible you would have to interconnect every tile with every other tile.
The architecture of an NPU is the same as a GPGPU (just watered down); the load/store instructions work on L2 or DDR (or sometimes TCM, like Qualcomm).
Compared to GPUs, NPUs currently have smaller operation sets (more density) and bigger L2 caches.
That's why NPUs are contested: their selling point is being more efficient for AI, but it's not like they're used 24/7, and for software it's a nightmare to have code that can run on every kind of processing unit for compatibility.
So only the device builders make some features use them; developers aren't interested, and they use the CPU or GPU if they need a neural network.
bit_user
AkroZ said:
"From my knowledge there is no "tile-to-tile" instruction on an NPU; to make that possible you would have to interconnect every tile with every other tile."
I'm sure they have DMA engines. Data access tends to follow more regular and predictable patterns than in either CPUs or GPUs, so they're generally not going to be using regular processor instructions for main memory access.
In that case, the DMA engine is just for communication between the unified scratchpad SRAM and main memory.
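As a rough sketch of that scratchpad-plus-DMA arrangement (purely illustrative C++, not any real NPU's API or firmware), the usual trick is double-buffering: kick off the DMA for the next tile of data while computing on the tile that's already resident in local SRAM:

```cpp
// Toy model of an NPU tile double-buffering DMA transfers from "DRAM"
// into its private scratchpad (all names here are hypothetical).
#include <cstdio>
#include <cstring>
#include <future>
#include <vector>

constexpr size_t TILE = 256;                                   // elements per scratchpad buffer

// Stand-in for a DMA engine: asynchronously copy one tile into the scratchpad.
std::future<void> dma_copy(const float* dram, float* scratch, size_t n) {
    return std::async(std::launch::async,
                      [=] { std::memcpy(scratch, dram, n * sizeof(float)); });
}

int main() {
    std::vector<float> dram(4 * TILE, 1.0f);                   // pretend this lives in DRAM
    float scratch[2][TILE];                                    // ping-pong scratchpad buffers
    float acc = 0.0f;

    auto pending = dma_copy(dram.data(), scratch[0], TILE);    // prefetch tile 0
    for (size_t t = 0; t < 4; ++t) {
        pending.wait();                                        // tile t is now in scratch[t % 2]
        if (t + 1 < 4)                                         // start the next transfer early
            pending = dma_copy(dram.data() + (t + 1) * TILE, scratch[(t + 1) % 2], TILE);
        for (size_t i = 0; i < TILE; ++i)                      // "compute" on the resident tile
            acc += scratch[t % 2][i];
    }
    std::printf("sum = %.1f\n", acc);                          // prints 1024.0
    return 0;
}
```

The point is that the transfer schedule is explicit and software-managed, which is what a cache hierarchy tries to do implicitly.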
As for data organization, NPUs tend to be organized in tiles, with each tile usually having its own direct-mapped memory (i.e. not a cache). Here's the Xilinx AI Engine that formed the basis of the XDNA NPU in AMD's Phoenix (see the block diagram in the Chips and Cheese article linked in the references below):
You can see the data paths between the tiles, and you can see the so-called memory tiles; what this view doesn't make obvious is that those "DM" blocks are 64 kiB of local memory that's exclusive to each tile. The article text says each "Memory Tile" has 512 kiB of SRAM.
What I find interesting about that is how they apparently have two levels of DMA engines! Each AI Engine has one, and then the global memory tiles also each seem to have one.
AkroZ said:
"The architecture of an NPU is the same as a GPGPU (just watered down)."
Sure, if you squint, they have a lot of similarities. That's why AI works pretty well on GPUs in the first place. But instead of counting the similarities and concluding they're the same, it's more profitable to look at where and why they differ. That is, if you'd like to gain any sort of insight into why so many companies independently arrived at the conclusion that AI is best served by a bespoke architecture.
AkroZ said:
"The load/store instructions work on L2 or DDR (or sometimes TCM, like Qualcomm)."
TCM is somewhat akin to the local SRAM in the Xilinx/XDNA blocks, and it's clearly drained/filled to/from DRAM using DMA, not individual load/store instructions (see the Hexagon NPU overview linked in the references below).
AkroZ said:
"Compared to GPUs, NPUs currently have smaller operation sets (more density) and bigger L2 caches."
I think you mistakenly assume NPUs' local RAM is a cache. It's not. Wherever you see a DMA engine, it's there to service a direct-mapped memory. The difference might seem subtle, but it's not, either from a power perspective or for programmability.
By contrast, GPUs have local scratchpad memories, but not fine-grained DMA engines, as far as I've seen. They use ordinary load/store instructions, but then have a very deep SMT implementation to hide the long DRAM read latencies. SMT is cheap, but not free. You can see one of its costs in the massive size of GPUs' register files.
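For contrast, here's the GPU-style version of staging data in a scratchpad (an illustrative SYCL sketch, not specific to any vendor): each work-group copies a tile into group-local memory with ordinary loads and stores, synchronizes, and then works out of that tile, with the DRAM latency hidden by the many groups in flight. The classic example is a tiled matrix transpose:

```cpp
#include <sycl/sycl.hpp>
#include <cassert>
#include <vector>

int main() {
    constexpr size_t N = 512, T = 16;                          // N x N matrix, T x T tiles
    sycl::queue q;
    std::vector<float> in(N * N), out(N * N);
    for (size_t i = 0; i < N * N; ++i) in[i] = float(i);
    {
        sycl::buffer<float, 2> bin(in.data(), sycl::range<2>(N, N));
        sycl::buffer<float, 2> bout(out.data(), sycl::range<2>(N, N));
        q.submit([&](sycl::handler& h) {
            sycl::accessor src(bin, h, sycl::read_only);
            sycl::accessor dst(bout, h, sycl::write_only, sycl::no_init);
            sycl::local_accessor<float, 2> tile(sycl::range<2>(T, T), h);  // group-local scratchpad
            h.parallel_for(sycl::nd_range<2>({N, N}, {T, T}), [=](sycl::nd_item<2> it) {
                size_t r = it.get_global_id(0), c = it.get_global_id(1);
                size_t lr = it.get_local_id(0), lc = it.get_local_id(1);
                tile[lr][lc] = src[r][c];                      // ordinary load into local memory
                sycl::group_barrier(it.get_group());           // wait until the whole tile is staged
                size_t gr = it.get_group(0) * T, gc = it.get_group(1) * T;
                dst[gc + lr][gr + lc] = tile[lc][lr];          // write the tile back transposed
            });
        });
    }
    assert(out[1] == in[N]);                                   // spot-check the transpose
    return 0;
}
```

Here the threads themselves move the data with plain loads and stores; there's no separate engine that can run ahead and fill the scratchpad for the next tile, which is the data-movement gap I was getting at above.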
AkroZ said:
"That's why NPUs are contested: their selling point is being more efficient for AI, but it's not like they're used 24/7."
Intel developed an entire, dedicated block for "ambient AI", called GNA:
"Intel® GNA is not intended to replace typical inference devices such as the CPU, graphics processing unit (GPU), or vision processing unit (VPU). It is designed for offloading continuous inference workloads including but not limited to noise reduction or speech recognition to save power and free CPU resources."
I guess they feel those use cases can now be adequately addressed by their NPU, because they've ceased developing GNA beyond version 3.0. However, Qualcomm also added a dedicated, always-on AI engine they call the "Sensing Hub":
"we added a dedicated always-on, low-power AI processor and we’re seeing a mind-blowing 5x AI performance improvement.
The extra AI processing power on the Qualcomm Sensing Hub allows us to offload up to 80 percent of the workload that usually goes to the Hexagon processor, so that we can save even more power. All the processing on the Qualcomm Sensing Hub is at less than 1 milliamps (mA) of power consumption."
AkroZ said:
"For software it's a nightmare to have code that can run on every kind of processing unit for compatibility. So only the device builders make some features use them; developers aren't interested, and they use the CPU or GPU if they need a neural network."
Android has a simplified API for using NPUs that's akin to popular deep learning frameworks. It's the NPU equivalent of what APIs like OpenGL and Vulkan are for GPUs.
As mentioned above, Qualcomm flexibly migrates AI jobs between its sensing hub and Hexagon DSP, which they can only do because they both have the same API. They or others also enable flexible AI workload sharing/migration involving the GPU block.
References:
https://www.anandtech.com/show/21425/intel-lunar-lake-architecture-deep-dive-lion-cove-xe2-and-npu4/4
https://chipsandcheese.com/2023/09/16/hot-chips-2023-amds-phoenix-soc/
https://chipsandcheese.com/2023/10/04/qualcomms-hexagon-dsp-and-now-npu/hexagon_npu_overview/
https://www.qualcomm.com/news/onq/2020/12/exploring-ai-capabilities-qualcomm-snapdragon-888-mobile-platform