Ex-Twitter dev reminisces about finding 700 unused Nvidia GPUs after takeover — forgotten cluster was 'powered on and idle'
Nothing to do.
An engineer who worked at Twitter during the seismic Agrawal-Musk transition has been publicly reminiscing about finding a cluster of 700 Nvidia V100 GPUs. Tim Zaman, who now works as a software engineer at Google DeepMind, discovered this significant chunk of GPU power to be powered up but idle in the data center of X’s chirpy ancestor.
"A few weeks post Twitter-acquisition in 2022 we found 700 V100 gpus (pcie, lol) in the datacenter. They were powered on and idle, and had been for ages: the forgotten remains of a honest attempt to make a cluster within Twitter 1.0. Haha how times have changed! 100k gpus on the… https://t.co/zSChG0BvVZ" (July 22, 2024)
The warm, humming mass of Nvidia silicon and PCBs in the Twitter data center was poetically described as “the forgotten remains of an honest attempt to make a cluster within Twitter 1.0” by Zaman in a Twitter/X post on Monday. The engineer had been spurred to write about his surprise discovery of this silicon treasure trove after reading about xAI’s Memphis Supercluster getting to work training Grok 3, powered by 100,000 liquid-cooled Nvidia H100 accelerators on a single RDMA fabric.
Zaman underlined what many of you will be thinking: Twitter had 700 of what were then some of the world's most powerful data center GPUs humming along without purpose for years. "How times have changed!" he exclaimed. Indeed, the first Nvidia Volta-architecture V100 data center GPUs arrived on the market during the first great GPU shortage of 2017, and Zaman found the 700-card V100 cluster still running idle in late 2022, a few weeks after the acquisition closed. That's a lot of computing time and resources wasted.
Another moment of mirth for Zaman was discovering that the 700 Nvidia V100s were PCIe cards rather than the far higher-bandwidth, NVLink-interfaced SXM2 variety; the SXM2 parts offer up to 300 GB/s of NVLink bandwidth per GPU, versus roughly 32 GB/s for a PCIe 3.0 x16 slot. Of course, we don't know why 2017-era Twitter bought PCIe rather than SXM2 V100s for this sizable installation, and perhaps we never will.
Zaman's post also contained some interesting musings about Musk's new 'Gigafactory of Compute.' Running "100k GPUs on the same fabric must be an epic challenge," commented the engineer. "At that scale, the only guarantee is failure, and it's all about graceful failure management." With this in mind, Zaman pondered disaggregating resources into distinct failure domains so that a failure in one wouldn't bring the whole house down.
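Zaman didn't describe an implementation, but the failure-domain idea can be illustrated with a minimal sketch; the partition size, names, and scheduler below are hypothetical and are not xAI's actual design.

```python
# Minimal illustration of the failure-domain idea Zaman describes.
# All sizes and names are hypothetical -- this is not xAI's scheduler.
from dataclasses import dataclass
import random

@dataclass
class Domain:
    domain_id: int
    gpus: list          # GPU ids assigned to this failure domain
    healthy: bool = True

def build_domains(total_gpus: int, domain_size: int):
    """Partition GPU ids 0..total_gpus-1 into fixed-size failure domains."""
    return [
        Domain(i, list(range(start, min(start + domain_size, total_gpus))))
        for i, start in enumerate(range(0, total_gpus, domain_size))
    ]

def usable_gpus(domains):
    """Training carries on with whatever domains are still healthy."""
    return sum(len(d.gpus) for d in domains if d.healthy)

domains = build_domains(total_gpus=100_000, domain_size=1_000)  # 100 domains of 1,000 GPUs
random.choice(domains).healthy = False                          # one domain fails...
print(usable_gpus(domains))                                     # ...99,000 GPUs keep working
```

The point of carving the cluster up this way is that a bad switch, rack, or domain costs you a bounded slice of capacity rather than stalling the entire training run.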
The engineer also mused about the maximum number of GPUs that could feasibly sit on a single fabric. With tech titans racing to build bigger and bigger AI training clusters, both the predictable and the unforeseen limits on how many GPUs can share the same fabric are bound to become apparent.
Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech; from business and semiconductor design to products approaching the edge of reason.
elforeign: How much do they consume at idle? 20W? 5W? Either way, 700 of these consuming that much 24/7 is a sizeable amount of power draw. Do they not look at their power bill?
coromonadalix: lol they honestly don't give a damn
A waste of energy and $$$, a sitting duck, pffff... and now their big baby will be obsolete in 1 year lol
vanadiel007: Maybe, just maybe, they were not sitting idle all the time. Maybe they were used for crypto mining and were only idling when they were found.
I highly doubt a single person would be able to purchase 700 high-end GPUs and the associated equipment without anybody knowing anything about them...
CmdrShepard (replying to AkroZ, who wrote: "As for the electrical bill, this is a side project in a datacenter, the power draw is just a fraction of the total and it's not like you have an expected total."):
Let's say you had 8 cards per server -- that's 88 servers, which are at least 2U high if not more. That's at minimum four full-sized racks of equipment. With a max of 300 W per card, that should be something like a 3,000 W PSU per server (2,400 W plus some slack). Multiply by 88 and that's a potential peak load of 264,000 W. Even at idle, each of those servers probably pulled at least 100 W (8 GPUs + CPU), which works out to roughly 8.8 kW running 24/7/365.
Keep in mind that this is just back-of-the-napkin math and a rough estimate, but in no way is that "a fraction," because regular servers without GPUs don't pull anywhere near that peak power and pull even less at idle.
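For what it's worth, here is a quick sanity check of that back-of-the-napkin math in Python, using the same assumed figures from the thread (700 PCIe V100s, 8 per server, ~300 W peak per card, ~100 W idle per server, and the 20 W-per-card idle guess floated elsewhere in the comments); none of these wattages are measured values.

```python
# Rough power estimate for a hypothetical 700-card V100 cluster.
# All wattages are the thread's assumptions, not measured figures.
import math

cards = 700
cards_per_server = 8
servers = math.ceil(cards / cards_per_server)      # 88 servers

peak_psu_w = cards_per_server * 300 + 600          # ~2,400 W of GPU + headroom ~= 3,000 W PSU
peak_load_w = servers * peak_psu_w                 # ~264,000 W potential peak load

idle_server_w = 100                                # GPUs + CPU ticking over per server
idle_load_kw = servers * idle_server_w / 1000      # ~8.8 kW continuous
idle_kwh_per_year = idle_load_kw * 24 * 365        # ~77,000 kWh/year doing nothing

per_card_idle_kw = cards * 20 / 1000               # 20 W per card scenario -> 14 kW

print(servers, peak_load_w, idle_load_kw, idle_kwh_per_year, per_card_idle_kw)
```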
wunshot: I bet it was a crypto mining rig run right under their noses by some crafty developer.
KyaraM (replying to elforeign's question about the power bill):
I would guess that among all their servers and other stuff running, even that much didn't really register, or could have been attributed to other sources. Even if we say they pull 20 W each at idle, that's "only" around 14 kW. Imagine that against several hundred server CPUs running at 300 W+ under load and ask that question again. It likely wasn't more than a blip on the radar, if even that much.
bill001g: The headlines said Twitter cut 80% of its staff when Musk took over. It could easily be that they cut the whole team, including any manager who knew about the cluster.
There are likely lots of abandoned projects that current employees are afraid to shut off because they don't know what they do.