Build Crashes Under Heavy Load

Hello All,

Not really sure if this is the correct category to place this topic in but since I'm running a *nix OS thought I'd try here first. Even though it doesn't look like the issue is *nix related.
I'm dealing with a really weird issue on a build that I finished assembling a couple of months ago.
Under heavy load my system becomes completely un-responsive and the only thing I can do is just a hard reset and bring it back up. I've ran memtest to check for any issues with RAM and it comes back with a pass, not seeing any issues there.
Syslog on this machine doesn't show anything that stands out, as a matter of fact, the time frame the system is unresponsive there's a huge gap of no syslog either locally or from anywhere on my network.
Here's what I'm running in terms of hardware:

i3770k processor
32GB RAM
120GB Samsung SSD
Asus P877-V Lk motherboard
No Graphics Card (Nvidia or AMD), just whatever's on the board

From an OS perspective, i'm running Debian Wheezy.

I run one Virtualbox VM which causes the heavy load, between .89 and 1.20 load average. When monitoring via Cacti, each core was sitting between 20-30%.
Just wondering if anyone's come across this sort of an issue and possibly point me in the right direction to troubleshoot/fix, etc. would gladly accept any feedback

thank you in advance.
11 answers Last reply
More about build crashes heavy load
  1. I have had cases where a server fails under heavy load, It took me a week to debug but was cause by a SATA port becoming unresponsive under stress. It was caused by a intel chipset bug that was compounded by firmware issues of the SSD I was using at the time.

    the sata port would become deaf, this prevented logs and debug information from being written for some time before the system would actually crash because it could not page in something from the hard drive. I limited the System ram so I could reproduce the issue faster until I isolated the problem. I think micrsoft windows now resets the port when they become deaf (over 30 seconds) this still leaves chipset issues and firmware issues to deal with.
  2. Have you checked CPU temperature? If too high it may be throttling back.
  3. thank you for responding.
    Temperature does not rise above 40C under load, at least that's what Cacti's measuring with lm-sensors.

    Johnbl, what was the fix to your issue , new firmware , new mb ? or move to another SATA port ?
  4. in my case this is what I think occurred on my computer:
    1. SATA firmware was calling clean up routines and they took too long
    2. windows detected this as a failure and reset the SATA port
    3. the Drive did not auto reconnect to the port.
    4. eventually the OS had to page in something it required and bugchecked.

    a short term workaround was fixing the issue of not reconnecting. That was a setting in BIOS to enable hotswapping for the port the drive was on.

    After this the system did not bugcheck but would have delays of 1 to 2 mins after about 8 hours.
  5. cool, thank you for the feedback. I'm going to try the BIOS hot swap method and see if that fixes anything.
  6. What exactly do you mean unresponsive? Could it just be X crashing? Or is the system hard locked?
  7. no Xwindows running on this install. The system just hard locks, the display sleeps, so no access to CLI either via a keyboard connected or remotely via SSH. the upstream switch shows no arp entry or mac address on the associated switchport.
  8. Ok well in this case you will want to be looking close at your [kernel] logs.

    add this lines to syslog.conf

    *.* /var/log/all_logs
    kern.*   /var/log/kernel_logs


    This will output all messages to '/var/log/kernel_logs' and '/var/log/all_logs' so hopefully when you reboot you can see what is happening up until it died.

    *also* does REISUB still work? might be better if you can gracefully reboot instead of hard reset...
  9. i have the following besides others in my rsyslog.conf

    *.*;auth,authpriv.none -/var/log/syslog
    kern.* -/var/log/kern.log


    I've pasted an output from my syslog.conf before during and after the event. Before the event, i'm not noticing anything suspicious, during the event there's nothing posted to syslog, and then when i hard reset, there's the regular bootup sequence:


    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43744->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41705->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:49680->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59175->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38675->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:45439->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:39959->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44478->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57972->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43843->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43801->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41895->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55053->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:52307->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:39207->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:54043->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:56768->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:33081->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:51886->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44823->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:53332->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:56814->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:58543->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:54367->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:36440->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38542->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44020->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41299->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:33946->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:35397->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38619->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:42402->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57409->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55289->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57717->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:46942->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:60828->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55208->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59140->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55798->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:34648->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59934->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:60999->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:53434->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:51296->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55647->[127.0.0.1]:161
    Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:58020->[127.0.0.1]:161
    Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa /usr/sbin/cron[48613]: (root) CMD (/usr/local/pkg/vnstat2.sh)
    Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa /usr/sbin/cron[48659]: (root) CMD (/etc/ping_hosts.sh)
    Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48613]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
    Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48659]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
    Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48677]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
    Oct 29 15:55:10 br01.rl0.171.10.10.in-addr.arpa sockstat: NSSWITCH(_nsdispatch): nis, passwd_compat, setpwent, not found, and no fallback provided
    Oct 29 15:55:11 br01.rl0.171.10.10.in-addr.arpa sockstat: NSSWITCH(_nsdispatch): nis, passwd_compat, setpwent, not found, and no fallback provided
    Oct 29 15:55:20 br01.rl0.171.10.10.in-addr.arpa dhcpd: DHCPINFORM from 10.10.181.101 via rl2
    Oct 29 15:55:20 br01.rl0.171.10.10.in-addr.arpa dhcpd: DHCPACK to 10.10.181.101 (3c:97:0e:9b:26:45) via rl2
    Oct 29 15:55:21 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:24.305530 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
    proto TCP (6), length 60)
    Oct 29 15:55:21 br01.rl0.171.10.10.in-addr.arpa pf: 204.0.87.128.80 > 67.0.176.16.17797: Flags , cksum 0x05ce (correct), seq 3117505549, ack 3
    122374063, win 14480, options [mss 1452,sackOK,TS val 2352636702 ecr 623452941,nop,wscale 1], length 0
    Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:26.624699 rule 1/0(match): block in on rl0: (hlim 255, next-header ICMPv6 (58) payload lengt
    h: 56) fe80::a60:6eff:fe81:4f26 > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 56
    Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011hop limit 64, Flags [none], pref medium, router lifetime 1800s, reachable time 0s, retrans tim
    e 0s
    Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011 prefix info option (3), length 32 (4): 2001:470:4b:18::/64, Flags [onlink, auto, router], va
    lid time 86400s, pref. time 14400s
    Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011 source link-address option (1), length 8 (1): 08:60:6e:81:4f:26
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:06.476743 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 57, id 0, offset 0, flags [DF],
    proto TCP (6), length 60)
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 209.8.115.111.80 > 67.0.176.16.31360: Flags , cksum 0xba35 (correct), seq 2684273102, ack
    3218244999, win 14480, options [mss 1452,sackOK,TS val 1045165278 ecr 623485793,nop,wscale 4], length 0
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.197511 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
    proto TCP (6), length 60)
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 204.2.222.128.80 > 67.0.176.16.13308: Flags , cksum 0x95fd (correct), seq 4025671382, ack
    1785815943, win 14480, options [mss 1452,sackOK,TS val 387853062 ecr 623485824,nop,wscale 1], length 0
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.151995 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
    proto TCP (6), length 60)
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 204.0.3.33.80 > 67.0.176.16.64524: Flags , cksum 0xf74a (correct), seq 248986274, ack 4271
    48573, win 14480, options [mss 1452,sackOK,TS val 2993009505 ecr 623486615,nop,wscale 1], length 0
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.463198 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 57, id 0, offset 0, flags [DF],
    proto TCP (6), length 60)
    Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 209.8.115.102.80 > 67.0.176.16.44467: Flags , cksum 0x143b (correct), seq 2251575889, ack
    3427976252, win 14480, options [mss 1452,sackOK,TS val 1045145027 ecr 623485791,nop,wscale 4], length 0
    Oct 29 23:59:26 lab-box rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="2490" x-info="http://www.rsyslog.com"] start
    Oct 29 23:59:26 lab-box rsyslogd-3003: ID for user 'syslog' could not be found or error [try http://www.rsyslog.com/e/3003 ]
    Oct 29 23:59:26 lab-box rsyslogd-3003: ID for user 'syslog' could not be found or error [try http://www.rsyslog.com/e/3003 ]
    Oct 29 23:59:26 lab-box rsyslogd-3003: ID for group 'syslog' could not be found [try http://www.rsyslog.com/e/3003 ]
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] Initializing cgroup subsys cpuset
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] Initializing cgroup subsys cpu
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1
    SMP Debian 3.2.46-1+deb7u1
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=49f3e1e1-9f64-4a9b-a9e3-0cf36636cfc1 ro
    quiet
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-provided physical RAM map:
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009d800 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000000100000 - 0000000020000000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000020000000 - 0000000020200000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000020200000 - 0000000040004000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000040004000 - 0000000040005000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000040005000 - 00000000cc3bf000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc3bf000 - 00000000cc999000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc999000 - 00000000cc9a9000 (ACPI data)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc9a9000 - 00000000ccad2000 (ACPI NVS)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000ccad2000 - 00000000cd7f5000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd7f5000 - 00000000cd7f6000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd7f6000 - 00000000cd839000 (ACPI NVS)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd839000 - 00000000cdc6d000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cdc6d000 - 00000000cdff4000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cdff4000 - 00000000ce000000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cf000000 - 00000000df200000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000f8000000 - 00000000fc000000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fed00000 - 00000000fed04000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000100000000 - 000000081fe00000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] NX (Execute Disable) protection: active
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] SMBIOS 2.7 present.
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] DMI: System manufacturer System Product Name/P8Z77-V LK, BIOS 1104 08/23/2013
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] No AGP bridge found
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] last_pfn = 0x81fe00 max_arch_pfn = 0x400000000
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR default type: uncachable
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR fixed ranges enabled:
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] 00000-9FFFF write-back
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] A0000-BFFFF uncachable
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] C0000-CFFFF write-protect
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] D0000-E7FFF uncachable
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] E8000-FFFFF write-protect
    Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR variable ranges enabled:
  10. you might try to boot with kernel option 'noapic nolapic apci=off' ; see if that helps at all, APIC can occasionally cause hard locks.

    this certainly sounds more like hardware error causing lock since no helpful info written to logs...
  11. hi johnbl,

    i've enabled hotplugging for my ssd under BIOS and looks like that's resolved the issue. I've been running at 15+days of uptime with several vm's running in the background and no crash.

    thanks for all your help.
Ask a new question

Read More

Debian X64 Build