Dev Boots Linux 292,612 Times to Find Intel, AMD Kernel Bug

Red Hat Linux developer Richard WM Jones has shared an eyebrow-raising tale of Linux bug hunting. Jones noticed that Linux 6.4 has a bug that causes it to hang on boot about once in every 1,000 attempts. He set out to pinpoint the bug and prove he had caught it red-handed. However, his headlining travail, which involved booting Linux 292,612 times (and another 1,000 times to confirm the bug), apparently "only took 21 hours." It also seems that the bug is less common on Intel hardware than on AMD-based machines.
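To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the function names are made up for illustration, and the assumption that every boot hangs independently with the same probability is a simplification rather than anything Jones has stated.

```python
# Figures quoted in the article; everything else here is illustrative.
BOOTS = 292_612          # boots in the main test run
HOURS = 21               # reported duration of that run
HANG_RATE = 1 / 1_000    # "about 1 in 1,000" boots hang


def boots_per_second(boots: int, hours: float) -> float:
    """Average boot throughput over the whole run."""
    return boots / (hours * 3600)


def chance_of_seeing_hang(n_boots: int, hang_rate: float) -> float:
    """Probability that at least one of n_boots hangs.

    Assumes each boot hangs independently with the same probability,
    which is a simplification, not something the article states.
    """
    return 1 - (1 - hang_rate) ** n_boots


if __name__ == "__main__":
    print(f"{boots_per_second(BOOTS, HOURS):.2f} boots/sec on average")
    for n in (1_000, 10_000):
        chance = chance_of_seeing_hang(n, HANG_RATE)
        print(f"{n:>6} boots: {chance:.1%} chance of at least one hang")
```

The throughput comes out a little under four boots per second, which only makes sense if many boots ran in parallel. Under the same assumptions, a run of 1,000 boots has roughly a two-in-three chance of hitting the hang even once, which is why a bug this rare needs runs far longer than that to confirm or rule out.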

Jones caught the first whiff of this elusive but replicable Linux boot bug when some server software tests with nbdkit (a toolkit for building servers that speak NBD, a protocol for accessing block devices over a network) seemed to be "randomly hanging" when used with libguestfs (a tool for accessing and modifying virtual machine disk images). Though we know the looping test phase was a measly 21 hours long (even though an astronomical 293,612 boot processes were initiated), Jones says that getting to this point "took many days." The Linux developer recounts that a painful bisection between Linux v6.0 and v6.4-rc6 helped him narrow down the boot hang culprit. That culprit is claimed to be a regression in the printk time feature. Reverting the offending commit "fixes the problem," asserts Jones.

A clue to the cause was that the bug always appeared at the same early stage of the boot process when booting the latest qemu. Jones's report shows that the easiest way to replicate the hanging issue is to run a guestfish command in a loop, with many instances in parallel, and parse the output to detect a boot hang. He usually ran the guestfish loop 10,000 times, as a workable threshold for gathering useful log data.
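For readers who want a feel for what such a test harness looks like, here is a minimal Python sketch of a parallel boot loop. It is not Jones's script: it swaps his output parsing for a simple timeout, so any run that fails to exit in time is counted as a hang. The helper names, the worker count and the timeout are all hypothetical, and the guestfish command line shown in the usage comment is an assumption.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def boot_once(cmd, timeout_s):
    """Run one boot attempt; return True if it did not exit within timeout_s."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; any processes
        # that child spawned may need separate cleanup.
        return True


def count_hangs(cmd, iterations, workers, timeout_s):
    """Run cmd `iterations` times, `workers` at a time, and count the hangs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda _: boot_once(cmd, timeout_s), range(iterations))
        return sum(results)


# Hypothetical usage (command line and numbers are assumptions):
#   hangs = count_hangs(["guestfish", "-a", "/dev/null", "run"],
#                       iterations=10_000, workers=8, timeout_s=60)
#   print(f"{hangs} hangs in 10,000 boots")
```

A timeout is cruder than inspecting the boot log, since a slow but healthy boot would be miscounted, but it keeps the sketch short and shows why running many instances side by side makes a one-in-a-thousand failure practical to chase.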

Perhaps of some interest to hardware fans, Jones remarks that this weird boot hang issue occurs less often on Intel systems than on AMD systems. Whatever the case, hopefully the exposure and pinpointing of this bug means that it will be squashed, never to return.

Mark Tyson
News Editor

Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.

  • ezst036
    Thank you!
  • BigBig5
    This seems to happen more often with Linux 6.4 on my newly built AMD desktop.
  • bit_user
    Works out to 3.88 boots/sec. Obviously, he must've been using a collection of VMs. I wonder how many and on what hardware.
  • Royalgipsy
    So can we get an update? Is there a certain range of hardware? I don't know, but this felt like giving us the problem and not telling us what kind of hardware is likely to replicate the issue.
  • bit_user
    Royalgipsy said:
    So can we get an update? Is there a certain range of hardware? I don't know, but this felt like giving us the problem and not telling us what kind of hardware is likely to replicate the issue.
    As usual, the article has a source link, near the top.
    https://rwmj.wordpress.com/2023/06/14/i-booted-linux-292612-times/
    I had already clicked on it to find out more details, but that's just a short blog post with more links and some comments. I didn't bother to go any further, but I'm sure the answers are there to be found.

    I thought this might be covered better on Phoronix, so I just checked. Saw no mention of it, but this related work sounds promising:
    https://www.phoronix.com/news/Linux-x86-Boot-Process-Mess
  • Steve Nord_
    Royalgipsy said:
    So can we get an update? Is there a certain range of hardware? I don't know, but this felt like giving us the problem and not telling us what kind of hardware is likely to replicate the issue.
    Since the link is from the RedHat blog, I'd expect it on mainline linux and any supported RedHat release. Last linked in top linked page to kernel org, so there's your log.

    In other news, it's a miracle they didn't go into a martial trance demonstrating it for the kernel WG at the 293,000th BEEEP or whatever.