Build Crashes Under Heavy Load

brokenflea

Honorable
Oct 30, 2013
6
0
10,510
Hello All,

Not really sure if this is the correct category to place this topic in but since I'm running a *nix OS thought I'd try here first. Even though it doesn't look like the issue is *nix related.
I'm dealing with a really weird issue on a build that I finished assembling a couple of months ago.
Under heavy load my system becomes completely un-responsive and the only thing I can do is just a hard reset and bring it back up. I've ran memtest to check for any issues with RAM and it comes back with a pass, not seeing any issues there.
Syslog on this machine doesn't show anything that stands out, as a matter of fact, the time frame the system is unresponsive there's a huge gap of no syslog either locally or from anywhere on my network.
Here's what I'm running in terms of hardware:

i3770k processor
32GB RAM
120GB Samsung SSD
Asus P877-V Lk motherboard
No Graphics Card (Nvidia or AMD), just whatever's on the board

From an OS perspective, i'm running Debian Wheezy.

I run one Virtualbox VM which causes the heavy load, between .89 and 1.20 load average. When monitoring via Cacti, each core was sitting between 20-30%.
Just wondering if anyone's come across this sort of an issue and possibly point me in the right direction to troubleshoot/fix, etc. would gladly accept any feedback

thank you in advance.

 
I have had cases where a server fails under heavy load, It took me a week to debug but was cause by a SATA port becoming unresponsive under stress. It was caused by a intel chipset bug that was compounded by firmware issues of the SSD I was using at the time.

the sata port would become deaf, this prevented logs and debug information from being written for some time before the system would actually crash because it could not page in something from the hard drive. I limited the System ram so I could reproduce the issue faster until I isolated the problem. I think micrsoft windows now resets the port when they become deaf (over 30 seconds) this still leaves chipset issues and firmware issues to deal with.
 

brokenflea

Honorable
Oct 30, 2013
6
0
10,510
thank you for responding.
Temperature does not rise above 40C under load, at least that's what Cacti's measuring with lm-sensors.

Johnbl, what was the fix to your issue , new firmware , new mb ? or move to another SATA port ?
 
in my case this is what I think occurred on my computer:
1. SATA firmware was calling clean up routines and they took too long
2. windows detected this as a failure and reset the SATA port
3. the Drive did not auto reconnect to the port.
4. eventually the OS had to page in something it required and bugchecked.

a short term workaround was fixing the issue of not reconnecting. That was a setting in BIOS to enable hotswapping for the port the drive was on.

After this the system did not bugcheck but would have delays of 1 to 2 mins after about 8 hours.
 

brokenflea

Honorable
Oct 30, 2013
6
0
10,510
no Xwindows running on this install. The system just hard locks, the display sleeps, so no access to CLI either via a keyboard connected or remotely via SSH. the upstream switch shows no arp entry or mac address on the associated switchport.
 
Ok well in this case you will want to be looking close at your [kernel] logs.

add this lines to syslog.conf

Code:
*.* /var/log/all_logs
kern.*   /var/log/kernel_logs

This will output all messages to '/var/log/kernel_logs' and '/var/log/all_logs' so hopefully when you reboot you can see what is happening up until it died.

*also* does REISUB still work? might be better if you can gracefully reboot instead of hard reset...
 

brokenflea

Honorable
Oct 30, 2013
6
0
10,510
i have the following besides others in my rsyslog.conf

*.*;auth,authpriv.none -/var/log/syslog
kern.* -/var/log/kern.log


I've pasted an output from my syslog.conf before during and after the event. Before the event, i'm not noticing anything suspicious, during the event there's nothing posted to syslog, and then when i hard reset, there's the regular bootup sequence:

Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43744->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41705->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:49680->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59175->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38675->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:45439->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:39959->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44478->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57972->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43843->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:43801->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41895->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55053->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:52307->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:39207->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:54043->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:56768->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:33081->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:51886->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44823->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:53332->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:56814->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:58543->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:54367->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:36440->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38542->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:44020->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:41299->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:33946->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:35397->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:38619->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:42402->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57409->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55289->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:57717->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:46942->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:60828->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55208->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59140->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55798->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:34648->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:59934->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:60999->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:53434->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:51296->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:55647->[127.0.0.1]:161
Oct 29 15:55:01 lab-box snmpd[2748]: Connection from UDP: [127.0.0.1]:58020->[127.0.0.1]:161
Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa /usr/sbin/cron[48613]: (root) CMD (/usr/local/pkg/vnstat2.sh)
Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa /usr/sbin/cron[48659]: (root) CMD (/etc/ping_hosts.sh)
Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48613]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48659]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
Oct 29 15:55:01 br01.rl0.171.10.10.in-addr.arpa cron[48677]: NSSWITCH(_nsdispatch): nis, passwd_compat, endpwent, not found, and no fallback provided
Oct 29 15:55:10 br01.rl0.171.10.10.in-addr.arpa sockstat: NSSWITCH(_nsdispatch): nis, passwd_compat, setpwent, not found, and no fallback provided
Oct 29 15:55:11 br01.rl0.171.10.10.in-addr.arpa sockstat: NSSWITCH(_nsdispatch): nis, passwd_compat, setpwent, not found, and no fallback provided
Oct 29 15:55:20 br01.rl0.171.10.10.in-addr.arpa dhcpd: DHCPINFORM from 10.10.181.101 via rl2
Oct 29 15:55:20 br01.rl0.171.10.10.in-addr.arpa dhcpd: DHCPACK to 10.10.181.101 (3c:97:0e:9b:26:45) via rl2
Oct 29 15:55:21 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:24.305530 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
Oct 29 15:55:21 br01.rl0.171.10.10.in-addr.arpa pf: 204.0.87.128.80 > 67.0.176.16.17797: Flags [S.], cksum 0x05ce (correct), seq 3117505549, ack 3
122374063, win 14480, options [mss 1452,sackOK,TS val 2352636702 ecr 623452941,nop,wscale 1], length 0
Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:26.624699 rule 1/0(match): block in on rl0: (hlim 255, next-header ICMPv6 (58) payload lengt
h: 56) fe80::a60:6eff:fe81:4f26 > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 56
Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011hop limit 64, Flags [none], pref medium, router lifetime 1800s, reachable time 0s, retrans tim
e 0s
Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011 prefix info option (3), length 32 (4): 2001:470:4b:18::/64, Flags [onlink, auto, router], va
lid time 86400s, pref. time 14400s
Oct 29 15:55:47 br01.rl0.171.10.10.in-addr.arpa pf: #011 source link-address option (1), length 8 (1): 08:60:6e:81:4f:26
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:06.476743 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 57, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 209.8.115.111.80 > 67.0.176.16.31360: Flags [S.], cksum 0xba35 (correct), seq 2684273102, ack
3218244999, win 14480, options [mss 1452,sackOK,TS val 1045165278 ecr 623485793,nop,wscale 4], length 0
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.197511 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 204.2.222.128.80 > 67.0.176.16.13308: Flags [S.], cksum 0x95fd (correct), seq 4025671382, ack
1785815943, win 14480, options [mss 1452,sackOK,TS val 387853062 ecr 623485824,nop,wscale 1], length 0
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.151995 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 60, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 204.0.3.33.80 > 67.0.176.16.64524: Flags [S.], cksum 0xf74a (correct), seq 248986274, ack 4271
48573, win 14480, options [mss 1452,sackOK,TS val 2993009505 ecr 623486615,nop,wscale 1], length 0
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 00:00:00.463198 rule 1/0(match): block in on pppoe0: (tos 0x0, ttl 57, id 0, offset 0, flags [DF],
proto TCP (6), length 60)
Oct 29 15:55:54 br01.rl0.171.10.10.in-addr.arpa pf: 209.8.115.102.80 > 67.0.176.16.44467: Flags [S.], cksum 0x143b (correct), seq 2251575889, ack
3427976252, win 14480, options [mss 1452,sackOK,TS val 1045145027 ecr 623485791,nop,wscale 4], length 0
Oct 29 23:59:26 lab-box rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="2490" x-info="http://www.rsyslog.com"] start
Oct 29 23:59:26 lab-box rsyslogd-3003: ID for user 'syslog' could not be found or error [try http://www.rsyslog.com/e/3003 ]
Oct 29 23:59:26 lab-box rsyslogd-3003: ID for user 'syslog' could not be found or error [try http://www.rsyslog.com/e/3003 ]
Oct 29 23:59:26 lab-box rsyslogd-3003: ID for group 'syslog' could not be found [try http://www.rsyslog.com/e/3003 ]
Oct 29 23:59:26 lab-box kernel: [ 0.000000] Initializing cgroup subsys cpuset
Oct 29 23:59:26 lab-box kernel: [ 0.000000] Initializing cgroup subsys cpu
Oct 29 23:59:26 lab-box kernel: [ 0.000000] Linux version 3.2.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1
SMP Debian 3.2.46-1+deb7u1
Oct 29 23:59:26 lab-box kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=49f3e1e1-9f64-4a9b-a9e3-0cf36636cfc1 ro
quiet
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-provided physical RAM map:
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009d800 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 000000000009d800 - 00000000000a0000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000000100000 - 0000000020000000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000020000000 - 0000000020200000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000020200000 - 0000000040004000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000040004000 - 0000000040005000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000040005000 - 00000000cc3bf000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc3bf000 - 00000000cc999000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc999000 - 00000000cc9a9000 (ACPI data)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cc9a9000 - 00000000ccad2000 (ACPI NVS)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000ccad2000 - 00000000cd7f5000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd7f5000 - 00000000cd7f6000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd7f6000 - 00000000cd839000 (ACPI NVS)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cd839000 - 00000000cdc6d000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cdc6d000 - 00000000cdff4000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cdff4000 - 00000000ce000000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000cf000000 - 00000000df200000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000f8000000 - 00000000fc000000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fed00000 - 00000000fed04000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fed1c000 - 00000000fed20000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] BIOS-e820: 0000000100000000 - 000000081fe00000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] NX (Execute Disable) protection: active
Oct 29 23:59:26 lab-box kernel: [ 0.000000] SMBIOS 2.7 present.
Oct 29 23:59:26 lab-box kernel: [ 0.000000] DMI: System manufacturer System Product Name/P8Z77-V LK, BIOS 1104 08/23/2013
Oct 29 23:59:26 lab-box kernel: [ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
Oct 29 23:59:26 lab-box kernel: [ 0.000000] No AGP bridge found
Oct 29 23:59:26 lab-box kernel: [ 0.000000] last_pfn = 0x81fe00 max_arch_pfn = 0x400000000
Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR default type: uncachable
Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR fixed ranges enabled:
Oct 29 23:59:26 lab-box kernel: [ 0.000000] 00000-9FFFF write-back
Oct 29 23:59:26 lab-box kernel: [ 0.000000] A0000-BFFFF uncachable
Oct 29 23:59:26 lab-box kernel: [ 0.000000] C0000-CFFFF write-protect
Oct 29 23:59:26 lab-box kernel: [ 0.000000] D0000-E7FFF uncachable
Oct 29 23:59:26 lab-box kernel: [ 0.000000] E8000-FFFFF write-protect
Oct 29 23:59:26 lab-box kernel: [ 0.000000] MTRR variable ranges enabled:
 
you might try to boot with kernel option 'noapic nolapic apci=off' ; see if that helps at all, APIC can occasionally cause hard locks.

this certainly sounds more like hardware error causing lock since no helpful info written to logs...
 

brokenflea

Honorable
Oct 30, 2013
6
0
10,510
hi johnbl,

i've enabled hotplugging for my ssd under BIOS and looks like that's resolved the issue. I've been running at 15+days of uptime with several vm's running in the background and no crash.

thanks for all your help.