Very strange random network cable unplugged problem! (NOT UNPLUGGED)

bail3yz

Reputable
Sep 30, 2014
3
0
4,520
I run an IPCOP router with 8 servers (5 win2k8 and 4 win2k3).

Recently I added a new server (win2k8), and ever since then 3 of the win2k3 servers randomly go offline with a "network cable unplugged" error. All 3 affected servers have the same hardware: an M5A99X EVO motherboard, using the onboard Realtek 8111E LAN.

Originally I thought it was the router being overloaded or something, but every other server / comp is fine.

Once one of the affected servers goes into "network cable unplugged" mode, the only way I can make it work again is to disable and re-enable the LAN connection. Unplugging / replugging the cable does nothing. Plugging into a new network also does nothing. IPs on the servers are set statically and DHCP leases last a week. The problem occurs randomly, up to every few hours, so it's not related to DHCP.

When it does occur, I can still ping 127.0.0.1. Both the amber and green lights are on on the network card. The light is also on on the switch. During troubleshooting I replaced all cables and switches (just in case).
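
To pin down when the drops happen, the symptom above (loopback answers, nothing else does) could be logged with a small probe. This is only a rough sketch, not something that was run on these servers: the function names are made up for illustration, and it assumes a `ping` binary on the PATH and that exit code 0 means a reply came back.

```python
import platform
import subprocess


def ping_command(host, system=None):
    """Build a one-echo ping command; Windows uses -n, most others use -c."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def can_ping(host):
    """Return True if a single echo request exits with code 0.

    Assumption: exit code 0 means a reply arrived. Windows ping can also
    return 0 for some "destination unreachable" replies, so this is approximate.
    """
    code = subprocess.call(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return code == 0


def classify(loopback_ok, gateway_ok):
    """Turn two ping results into a rough description of where the fault is."""
    if not loopback_ok:
        return "local TCP/IP stack not responding"
    if not gateway_ok:
        return "stack up, no path to gateway"
    return "ok"
```

Calling `classify(can_ping("127.0.0.1"), can_ping(gateway_ip))` on a timer and writing the result to a log with a timestamp would show whether the outages line up with anything else on the network; `gateway_ip` here stands for whatever address the router has.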

I updated all BIOSes and drivers.
I have another win2k8 server with the same hardware that doesn't have this problem.

I am not sure how adding a new server to the network could possibly cause this problem. If relevant, the new server hosts 30 VMs, so it has 30 MAC addresses and 30 IPs.
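
With 30 extra static addresses on the network, one thing that could be ruled out quickly is an overlap with an address already in use. A minimal sketch of that check, assuming the assignments are kept in some list of (host, IP) pairs; the host names and addresses below are invented for illustration:

```python
from collections import defaultdict


def find_conflicts(assignments):
    """Return {ip: [hosts]} for every IP claimed by more than one host."""
    by_ip = defaultdict(list)
    for host, ip in assignments:
        by_ip[ip.strip()].append(host)
    return {ip: hosts for ip, hosts in by_ip.items() if len(hosts) > 1}


# Hypothetical inventory, not the real addresses from this network.
inventory = [
    ("w2k3-a", "192.168.1.11"),
    ("w2k3-b", "192.168.1.12"),
    ("vm-07", "192.168.1.12"),
]

print(find_conflicts(inventory))  # {'192.168.1.12': ['w2k3-b', 'vm-07']}
```

This only catches overlaps in the written-down assignments; a VM that was configured differently from the list would not show up here.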

My only idea now is that maybe win2k3 + this network card do some network mapping and there's some sort of bug? My next plan is to upgrade them to win2k8 or add a PCI network card. But I am curious if anyone has experienced anything similar.

Lastly, I also tried to manually set speed/duplex, and disabled all power/green settings. No luck.

Any suggestions?
 
The only external things I can think of are heat and/or power. You've added to both. Assuming you have everything on UPSs, maybe you are overloading one of them and/or heat from the new server has pushed you into a problem area. But I don't know why the latter would affect the other machines at the same time unless the extra heat is affecting the UPS. Try the new server on a different electrical outlet in another room maybe and see what happens.
 

bail3yz



I run HWMonitor on the servers and heat is lower than normal (it's getting cold here now lol).

I have the servers on UPSs, but the load for the 3 servers that are experiencing this problem didn't change; the new server is in a different room on a different UPS.


 

bail3yz

I believe I fixed it. Not entirely sure what happened, but it turns out Winsock was corrupted on those 3 servers.
Since those 3 servers had the same OS + network card + driver, my guess is that when I added the new server, there was an IP conflict or something strange that corrupted Winsock on those specific setups.

netsh winsock reset catalog

seems to have solved it, and hopefully the BIOS/driver updates prevent it from occurring again in the future.
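
For anyone repeating this on several machines, the reset could be wrapped so that the existing catalog is saved first. This is a sketch under assumptions, not the exact procedure used here: the backup file name is arbitrary, the script has to be run as an administrator, and Windows asks for a reboot after the reset before the change fully takes effect.

```python
import platform
import subprocess

# Both are standard netsh winsock subcommands.
SHOW_CATALOG = ["netsh", "winsock", "show", "catalog"]
RESET = ["netsh", "winsock", "reset", "catalog"]


def reset_winsock(backup_path="winsock_catalog_before.txt"):
    """Save the current Winsock catalog listing to a file, then reset it.

    backup_path is a made-up default; the saved listing is only for later
    comparison and cannot be used to restore the catalog.
    """
    if platform.system() != "Windows":
        raise RuntimeError("netsh is only available on Windows")
    with open(backup_path, "w") as backup:
        subprocess.check_call(SHOW_CATALOG, stdout=backup)
    subprocess.check_call(RESET)
    return backup_path
```

Keeping the "before" listing makes it possible to see afterwards which catalog entries disappeared, which might hint at what corrupted it in the first place.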
 
Solution