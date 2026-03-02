Rebellions, a designer of AI inference accelerators from South Korea, recently detailed its multi-chiplet Rebel 100 AI accelerator, which relies on Universal Chiplet Interconnect Express (UCIe) technology, at the International Solid-State Circuits Conference (ISSCC). The processor is one of the industry's first multi-chiplet designs to use UCIe-A interconnects to stitch four chiplets together.

Multi-chiplet designs are the future of high-performance AI and HPC accelerators, now that demand for performance by far outpaces the ability of foundries to scale their process technologies. Large developers of CPUs and GPUs like AMD, Intel, and Nvidia have recognized the value of multi-chiplet designs and their latest products fully embrace the methodology.

The industry-standard approach to multi-chiplet interconnections, the UCIe interface, is meant to enable high-bandwidth, low-latency connections between chiplets. However, adoption of the standard has been slow so far, which makes the ISSCC 2026 paper from Rebellions even more valuable.

The Rebel 100: A quad-chiplet, 2 FP8 PFLOPS accelerator at a glance

The Rebellions Rebel 100 is a four-chiplet AI accelerator built for large language model inference. It adopts a multi-chiplet design to maximize die yield and performance, and ultimately to offer the right balance between price and throughput.

The Rebel 100 system-in-package (SiP) comprises four 320 mm² neural processing unit (NPU) dies, each of which is paired with a 12Hi HBM3E 36 GB memory stack (for 144 GB of HBM3E per package) and connected to the others in a mesh topology. The NPU dies are made using Samsung's performance-enhanced SF4X process technology and packaged on an interposer using Samsung's I-CubeS (CoWoS-S–class) advanced packaging method. For power integrity reasons, the SiP also features four integrated silicon capacitor (ISC) dies, which also serve a mechanical purpose.

The chiplets are interconnected using a UCIe-Advanced die-to-die interface running at 16Gbps and providing an aggregate bandwidth of 4 TB/s. The interconnect achieves roughly 11ns Flit-Aware Die-to-Die Interface (FDI) to FDI latency and extends memory load–store semantics transparently across chiplets, so the SiP behaves as a single processor rather than a cluster of discrete dies.
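To put the two headline interconnect figures in perspective, the sketch below multiplies them to get a bandwidth-delay product, that is, how much data has to be in flight to keep the die-to-die links busy. It is back-of-the-envelope arithmetic, not something taken from the ISSCC paper: it treats the 4 TB/s as one aggregate pipe, and the round-trip case simply doubles the 11ns one-way figure, ignoring memory access time at the far end. The function name is made up for illustration.

```python
# Figures quoted in the article.
AGGREGATE_BW_BYTES_PER_S = 4e12  # 4 TB/s aggregate UCIe-A bandwidth
FDI_LATENCY_S = 11e-9            # ~11 ns FDI-to-FDI latency, one way


def bytes_in_flight(bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Bandwidth-delay product: bytes that must be outstanding to fill the link."""
    return bandwidth_bytes_per_s * latency_s


# One-way case uses the article's latency as-is.
one_way = bytes_in_flight(AGGREGATE_BW_BYTES_PER_S, FDI_LATENCY_S)
# ASSUMPTION: a remote load costs two FDI traversals and nothing else.
round_trip = bytes_in_flight(AGGREGATE_BW_BYTES_PER_S, 2 * FDI_LATENCY_S)

print(f"one way: {one_way / 1e3:.0f} KB, round trip: {round_trip / 1e3:.0f} KB")
```

Under these assumptions only a few tens of kilobytes need to be outstanding across the whole package, which helps explain why a latency of this order makes remote memory practical to treat like local memory.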

On the system side, the Rebel 100 connects to hosts via two PCIe 5.x x16 interfaces that support SR-IOV and peer-to-peer operation.

One Rebel 100 SiP can deliver 2 FP8 PFLOPS or 1 FP16 PFLOPS of performance without sparsity at 600W, which is in line with what Nvidia's H200 can deliver at 700W. Rebellions also claims that the unit can achieve 56.8TPS on LLaMA v3.3 70B with single-batch 2k/2k input/output sequences, though these numbers come from the vendor itself, not from an independent tester. In any case, the focus of this story is how one of the first multi-chiplet UCIe-based AI accelerators works.
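The vendor figures above translate into efficiency and latency numbers with a few lines of arithmetic. The sketch below uses only the values quoted in this article; for the 700W comparison it assumes the same 2 FP8 PFLOPS as a stand-in for "in line with", which is an assumption for illustration and not an Nvidia specification.

```python
# Vendor-claimed figures quoted in the article.
FP8_PFLOPS = 2.0       # dense FP8 throughput per SiP
TDP_W = 600.0          # rated power of the Rebel 100
TOKENS_PER_S = 56.8    # LLaMA v3.3 70B, single batch, 2k/2k sequences

# ASSUMPTION: a 700 W part with the same 2 PFLOPS, used only as a comparison point.
COMPARISON_TDP_W = 700.0

tflops_per_watt = FP8_PFLOPS * 1000 / TDP_W
comparison_tflops_per_watt = FP8_PFLOPS * 1000 / COMPARISON_TDP_W
ms_per_token = 1000 / TOKENS_PER_S

print(f"Rebel 100: {tflops_per_watt:.2f} FP8 TFLOPS/W")
print(f"700 W comparison point: {comparison_tflops_per_watt:.2f} FP8 TFLOPS/W")
print(f"Per-token latency at 56.8 TPS: {ms_per_token:.1f} ms")
```

On paper that is roughly a 17% advantage in peak throughput per watt, and a per-token time of under 18 ms for a single stream, again going by the vendor's own numbers.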

The company positions its Rebel 100 quad-chiplet package as a foundational unit for cross-node and rack-level systems capable of supporting trillion-parameter models and million-token contexts. So, while it is unclear whether Rebellions plans to build bigger SiPs using existing chiplets, it certainly envisions its partners building scale-up and scale-out clusters containing from dozens to tens of thousands of such AI accelerators.

Rebel 100 NPU and data movement

Each chiplet integrates two Neural Core Clusters, each packing eight neural cores and 32 MB of shared memory. According to the ISSCC paper, the shared memory is partitioned into 16 slices and features an aggregate bandwidth of 64 TB/s, and the chiplet contains 64 routers that form an 8×4 granular mesh topology with three logically separate channels: Data (D), Request (R), and Control (C). In addition, each SiP contains 256 MB of scratchpad memory (at 128 TB/s).

The on-chip 2D network-on-chip (NoC) uses a straightforward XY routing scheme, so packets first travel along one axis and then the other, with turn restrictions applied to avoid deadlocks. Arbitration inside routers is handled using a weighted round-robin mechanism, so traffic from different sources gets serviced fairly, but with adjustable priority. The quality-of-service weights can be modified at runtime to make the system favor certain traffic types depending on whether the workload is compute-heavy or memory-intensive.
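The two mechanisms described above, XY routing and weighted round-robin arbitration, are textbook NoC techniques and can be sketched in a few lines. This is a generic illustration, not Rebellions' implementation: the function and class names, the port labels, and the weights are all hypothetical, and the arbiter uses a simple credit scheme that is only one of several ways to build weighted round-robin.

```python
from typing import Dict, List, Optional, Set, Tuple


def xy_route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Dimension-ordered routing: finish all X hops, then do the Y hops.

    Because a packet never turns from Y back to X, cyclic channel
    dependencies cannot form, which is what avoids deadlock.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


class WeightedRoundRobin:
    """Credit-based weighted round-robin arbiter (hypothetical sketch)."""

    def __init__(self, weights: Dict[str, int]) -> None:
        self.weights = dict(weights)   # grants per round for each port
        self.credits = dict(weights)   # grants left in the current round
        self.order = list(weights)
        self.pos = 0                   # port the search starts from

    def set_weight(self, port: str, weight: int) -> None:
        """Runtime QoS change; takes effect at the next credit refill."""
        self.weights[port] = weight

    def grant(self, requesting: Set[str]) -> Optional[str]:
        if not requesting:
            return None
        for _ in range(2):  # the second pass runs after a credit refill
            for i in range(len(self.order)):
                idx = (self.pos + i) % len(self.order)
                port = self.order[idx]
                if port in requesting and self.credits[port] > 0:
                    self.credits[port] -= 1
                    # Stay on this port while it has credit, else move on.
                    self.pos = idx if self.credits[port] else (idx + 1) % len(self.order)
                    return port
            self.credits = dict(self.weights)  # start a new round
        return None


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    arbiter = WeightedRoundRobin({"data": 4, "request": 2, "control": 1})
    print([arbiter.grant({"data", "request", "control"}) for _ in range(7)])
```

Raising a weight through `set_weight` is the software analogue of the runtime QoS adjustment described above: a port with a larger weight receives proportionally more grants per round, but every port with a nonzero weight is still served.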

The 2D NoC mesh inside each chiplet logically expands over UCIe, so the full quad-chiplet system-in-package behaves like one large mesh-connected processor on the logical level. Keeping in mind the low chiplet-to-chiplet latency (or rather FDI-to-FDI latency), this greatly simplifies life for software developers. Interestingly, while all chiplets feature three UCIe-A interfaces for versatility (or maybe redundancy?), the full configuration scales to 256 routers across the entire mesh, so it remains to be seen whether Rebellions can build accelerators with more than four chiplets using the existing architecture.

Although the UCIe 1.0 specification includes mappings for PCIe 6.0 and for the CXL.io, CXL.mem, and CXL.cache protocols, those are optional protocol mappings, not mandatory requirements. The spec also supports vendor-defined streaming and memory-semantics protocols, which is the route Rebellions took with the Rebel 100.

Rebellions built a fairly aggressive data-movement engine to keep its quad-chiplet design fed. Each NPU die integrates a configurable DMA subsystem with eight execution engines that can pull data from local HBM3E, remote HBM3E located on another chiplet, or from distributed shared memory. Bandwidth per DMA can reach up to 2.6 TB/s, which is arguably enough for an inference-focused accelerator. Meanwhile, to prevent certain tasks from starving others, the company implemented task-level QoS controls designed to reduce long-tail latency and avoid congestion when different workloads are running simultaneously.

Coordinating work across four chiplets requires careful synchronization. Rather than relying on a single dedicated scheduler, Rebellions implemented synchronization managers in each NPU. Each chiplet integrates a dedicated hardware synchronization manager with hardwired control logic that can coordinate activity across dies, either under centralized control or in a more autonomous manner. The architecture specifically avoids direct peer-to-peer communications between units and inter-unit dependencies to cut down unnecessary traffic and coordination overhead and keep overall utilization high during different execution phases of LLM inference.

To improve the reliability of its die-to-die interface, Rebellions implemented multiple loopback modes, transaction-level tracking, and channel-level diagnostics in addition to standard UCIe functionality. These features are generally intended to simplify validation and fault isolation in a multi-die package during debugging. For commercial deployments, Rebellions added a configurable switching mode that uses the aforementioned features to sacrifice a small amount of performance in exchange for improved MTBF and MTTF characteristics, which is important for large AI clusters where uptime matters more than marginal throughput gains.

An unorthodox approach to power delivery

The Rebel 100 accelerator is rated for a 600W thermal design power (TDP), but instantaneous transient surges, which occur when multiple neural cores switch on, exceed the nominal level by a factor of two. As currents rise quickly and sharply, they create voltage dips, which pose significant challenges for the power integrity of the quad-chiplet AI accelerator.

To mitigate this, Rebellions implemented a hardware staggering technique that offsets start times of neural cores instead of activating them simultaneously, which smooths current ramps and reduces supply noise. Measurements show that synchronized switching produces steep current spikes and noticeable voltage disturbance, whereas staggered activation results in gentler transitions and a more stable power rail, according to Rebellions. Additional control logic dynamically limits instruction issue rate over short time windows to further reduce sudden load changes both within a chiplet and across dies.
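The effect of staggering on current ramps can be shown with a toy model. The sketch below is not based on Rebellions' measurements: it assumes each core ramps linearly from zero to one arbitrary unit of current over four time steps, and every name, the core count, and the offsets are illustrative. It compares the largest step-to-step jump in total current when eight cores start together against the case where each start is offset by one ramp time.

```python
from typing import List

RAMP_STEPS = 4       # ASSUMPTION: time steps for one core to reach full current
AMPS_PER_CORE = 1.0  # ASSUMPTION: arbitrary current unit per active core


def total_current(start_times: List[int], t: int) -> float:
    """Sum of per-core currents at time t, each core ramping linearly."""
    total = 0.0
    for start in start_times:
        progress = min(max((t - start) / RAMP_STEPS, 0.0), 1.0)
        total += AMPS_PER_CORE * progress
    return total


def max_step(start_times: List[int]) -> float:
    """Largest increase in total current between consecutive time steps."""
    horizon = max(start_times) + RAMP_STEPS
    samples = [total_current(start_times, t) for t in range(horizon + 1)]
    return max(b - a for a, b in zip(samples, samples[1:]))


cores = 8
synchronized = [0] * cores                          # every core starts at t = 0
staggered = [i * RAMP_STEPS for i in range(cores)]  # one core starts per ramp time

print("synchronized peak step:", max_step(synchronized))
print("staggered peak step:   ", max_step(staggered))
```

Both schedules end at the same total current, but in this model the staggered one rises eight times more gently per step, at the cost of taking longer to bring all cores online. That is the basic trade-off behind the technique; the real hardware will use its own offsets and ramp characteristics.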