GTX 670 won't idle properly

imrazor

Distinguished
I don't know if anyone here can help me, but my GTX 670 is running at full speed while doing nothing more than displaying the desktop. So clock speed at idle is 1019MHz and temps are running at 54C, but GPU load is 0% according to GPU-Z.

My config is very unusual. I'm running the GTX 670 in a virtual machine with PCIe passthrough. Until recently I was running with an evaluation version of Windows 8.1 for testing and to get the config right. It worked nearly perfectly, including idling and a much lower idle temp of 38C. So I recently installed a retail copy of Win 8.1 Pro in the VM (using exactly the same settings and driver version), but my card is not idling properly.

Just to be clear the card works perfectly for playing games - 45 fps in the Witcher 3 with Medium-High settings at 1080p. It just won't calm down after the game is over.
 

nzalog

Respectable
BANNED
Jan 2, 2017
541
0
2,160
First off, neat. Which hypervisor? I'm assuming KVM, since ESXi doesn't let you hide the VT-D bit from the OS?

Have you compared the behavior of the card in a non-virtualized environment?
 

imrazor

Well, it appears that it was a driver issue. I noticed a driver update available in GeForce Experience, so I went ahead and installed it (378.66, if anyone is curious). As soon as I did, the idle clock speed dropped to ~300MHz.

I'm running this under ESXi 6.0.0 Update 2. I tried to get Linux/KVM working but just couldn't get passthrough working, though I certainly tried. I would've actually preferred Linux because I'm much more familiar with it than I am with ESXi.
 

imrazor

Yes, when I first tried to set up GPU passthrough with the GeForce, I got an error 43 in Device Manager. Fortunately I had an old Quadro for testing, so I knew there wasn't anything wrong with my setup. I found that if you add this parameter to the .vmx config file for the VM:

hypervisor.cpuid.v0 = "FALSE"

the fact that the GPU is running in a VM is hidden from the Nvidia driver. Error 43 went away, and I was able to use the Nvidia driver.
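For anyone who would rather script the change than hand-edit the file, here is a minimal Python sketch. It assumes the .vmx file is plain `key = "value"` text, one setting per line, and that you edit it with the VM powered off. The helper name `ensure_vmx_setting` and the sample file contents are made up for illustration; only the `hypervisor.cpuid.v0` line comes from this thread.

```python
def ensure_vmx_setting(vmx_text: str, key: str, value: str) -> str:
    """Return vmx_text with exactly one `key = "value"` line (hypothetical helper)."""
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        # The setting name is whatever sits left of the first '='.
        name = line.split("=", 1)[0].strip()
        if name == key:
            if not replaced:
                out.append(new_line)  # overwrite the first existing entry
                replaced = True
            # any further duplicates of the key are dropped
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)  # key was absent, so append it
    return "\n".join(out) + "\n"


# Made-up sample contents, not a real .vmx file.
sample = 'displayName = "gaming"\nmemSize = "12288"\n'
print(ensure_vmx_setting(sample, "hypervisor.cpuid.v0", "FALSE"))
```

Running it twice on the same text gives the same result, so it should be safe to re-run.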
 

imrazor

I hope it works for you. One peculiar side effect is that Task Manager shows 0% CPU usage for *everything*. Nothing else seems affected. Tools like MSI Afterburner can still report overall CPU usage. Performance impact seems very minimal, and I get near native FPS.
 

imrazor

I spoke too soon. Immediately after installing the driver, the clock speed went down to 300MHz. But after 30 minutes or so of use, it climbed back to 1GHz and won't go down. When playing a game, the boost kicks in and ramps up to 1.2GHz, and then drops to 1GHz when exiting the game. But it won't drop to 300MHz like it used to. Windows power management is set to balanced and I tried changing the power management mode in the Nvidia control panel to "Adaptive" and "Optimal", but neither made a difference.
 

imrazor

It's really wacky. After exiting Discord, the GPU ramped back up yet again. I rebooted the VM (not the whole ESXi host), and so far it seems to be behaving. I'll have to try firing up a few games, and see if the GPU clocks back down normally.

Let me know if you get PCIe passthrough working...
 

nzalog

Thanks much, dude. I just tried an old Quadro 600 card that I bought for this purpose a while ago (thinking it would work because it was a Quadro, but I was wrong). That card didn't work before because it was not high-end enough to support the feature. I had that same Device Manager error, which basically meant it was disabled by the driver.

Now I added hypervisor.cpuid.v0 = "FALSE" and the driver error is gone...
https://dl.dropboxusercontent.com/u/7655543/IMG_0258.jpg

Feel like a bit of a dunce though. I looked up the advanced VM parameter and it led me to a reddit page that I've read before. Problem is I misunderstood what they were saying. They were talking about using the GPU to accelerate 3D across multiple VMs and not just doing VT-D to one single VM. So while the card can't be used to accelerate many other VMs, it can be passed through to one VM without issue.

Really appreciate you taking the time to share the info. =]

Now I'm really tempted to put my 1070 in here to see how much more the idle power draw would be. I could completely get rid of my gaming PC and have the VM take care of that instead.
 

nzalog

So my 1070 worked =] Not sure if I'm going to keep the setup though. Idle power draw went from ~55W to ~80W, which is pretty significant for something that's to stay on 24/7. Part of the reason it went up that much is because I needed to add a USB card as well. Any chance you know how to pass through a mouse and keyboard without passing through an entire USB card? I know this used to be possible in older versions of ESXi.
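To put that ~25W difference in perspective, here is a quick back-of-the-envelope sketch in Python. It assumes the box really does idle around the clock all year; the function name is just for illustration.

```python
def extra_kwh_per_year(before_w: float, after_w: float, hours_per_day: float = 24.0) -> float:
    """Extra energy per year (kWh) from a change in steady power draw (watts)."""
    extra_watts = after_w - before_w
    return extra_watts * hours_per_day * 365 / 1000.0


# ~55W before the GPU and USB card, ~80W after, running 24/7
print(extra_kwh_per_year(55, 80))  # 219.0
```

Multiply the result by your own electricity rate to get a yearly cost.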

Unfortunately the before and after benchmarks would be too much work. The ESXi server I'm using didn't have a native Windows install and my gaming rig is a different gen CPU, so it would not really be a good comparison. Played Overwatch and I can't tell anything is different from my i7-3770 system with 16GB RAM.

esxi 6 host:
Xeon e3-1230 v5
64GB ECC
Supermicro Board

I'm running a 32GB, 1 vCPU FreeNAS VM for my iSCSI storage (on the same host). The LSI SAS card is passed through to this VM so it has direct storage access.

My gaming VM is a Windows 10 VM with 12GB RAM and 2 vCPUs.
https://dl.dropboxusercontent.com/u/7655543/vt-d/2017-02-23%2023_17_49-Program%20Manager.png
https://dl.dropboxusercontent.com/u/7655543/vt-d/Screen%20Shot%202017-02-23%20at%2010.41.19%20PM.png
https://dl.dropboxusercontent.com/u/7655543/vt-d/IMG_0260.jpg

 

imrazor

Thanks for sharing your experience. Yes, I have noticed my office get a bit warm when running ESXi and the 670 in passthrough mode. Unfortunately, I don't have a kill-a-watt meter to test my power draw. However, I suspect that it's still consuming much less power than running two separate PCs.

For my specs, check my signature. Currently, the only VMs I'm running 24/7 are a pfSense VM (providing dynamic DNS and an OpenVPN service, as well as connecting a couple of other VMs to my network), the Win8.1 gaming VM and a Debian VM for studying for some certs. Sadly, the Dell system I'm using doesn't have many drive bays. (From your photo, it looks like you've put your drive bays to good use.) It came with a Dell SAS card, but so far I've just opted to use onboard SATA.

EDIT: Went back over your post, and saw the question about USB. Yes, I also had to drop in a PCI USB card. I just couldn't get PCI passthrough to work with the onboard USB controllers, nor could I get USB passthrough to work with a keyboard and mouse. From my reading, it seems that VMware wants to make sure that no VM can take over a keyboard and mouse from the host OS (i.e., ESXi.)

EDIT2: My original gaming system is actually beefier than the old Dell I've got for this setup. However, the Dell has two advantages over my i5-3570K gaming rig. 1) I can upgrade to a six-core processor, or even two six-core processors and 2) it has VT-D. Intel decided to cripple the K-class Ivy Bridge CPUs by disabling VT-D. If I had a plain 3570, I could've just used that.
 

imrazor

Just wanted to report an update to this thread. I tested and confirmed that the 'hypervisor.cpuid.v0 = "FALSE"' parameter completely disabled CPU frequency scaling. So even if all your VMs are completely idle, your CPU will keep chugging away at FULL speed.

It's really disappointing that Nvidia decided to deliberately cripple their Geforce driver in this manner. Fortunately I have an old Quadro I can use until I can afford a Radeon, but it's a sad old thing that has trouble running classic Skyrim. At least I'll have a GUI management console for my ESXi box...
 

nzalog

I went back and forth on whether I should update you on my experience. I figured you'd kind of been through this already, but I have invested a bit of time into it.

I bought a GTX 1050 and am now using my ESXi host as an HTPC. Previously I was only streaming from the ESXi box to other things like Raspberry Pis running Kodi. This solution was pretty elegant because I was able to get things like 4K playback and HDR, which many low-power devices could not do well, and it seems to be pretty low power.

So it's been about a week now and I'm seeing absolutely no problems. The GTX 1050 is sipping power, about 5W at idle. I'm also not experiencing the high CPU usage (unless I'm not looking at the correct stats). The GPU did have some weird issues at first with the idle clocks not dropping, but they seem to have gone away. I think messing with interrupt settings helped, but since I made a lot of changes I can't pinpoint what made the idle better.

So... the reason I started messing with IRQ (interrupt) related settings: video playback from the VM was perfect, but HDMI audio would start OK and then slowly degrade to the point where it would just completely stop. So I found a setting called MSI (Message-Signaled Interrupts) for IRQ modes. Read here how to enable that: https://msdn.microsoft.com/en-us/library/windows/hardware/ff544246(v=vs.85).aspx
The hardest part was finding the correct device ID associated with the video card and sound. I've tried this on a couple of systems, even enabling it on every device, with no negative effects, and it did fix the audio issue. I would not be surprised if it contributed to fixing the idle. I highly suggest giving this a shot.
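As a rough illustration of what that registry change looks like, here is a small Python sketch that only builds the text of a .reg file; it does not touch the registry itself. It assumes the `MSISupported` value lives under the device's `Interrupt Management\MessageSignaledInterruptProperties` key, as the linked Microsoft page describes. The device instance path below is a made-up placeholder; the real one comes from Device Manager (Details tab, "Device instance path"). The function name is hypothetical.

```python
MSI_SUBKEY = r"Device Parameters\Interrupt Management\MessageSignaledInterruptProperties"


def msi_reg_file(device_instance_path: str, enable: bool = True) -> str:
    """Build .reg file text that sets MSISupported for one device (sketch only)."""
    key = "\\".join([
        r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum",
        device_instance_path.strip("\\"),
        MSI_SUBKEY,
    ])
    value = 1 if enable else 0
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key}]\n"
        f'"MSISupported"=dword:{value:08x}\n'
    )


# Placeholder instance path, NOT a real device; substitute your own.
print(msi_reg_file(r"PCI\VEN_0000&DEV_0000\EXAMPLE"))
```

Review the generated text before importing it, back up the key first, and reboot the VM afterwards so the driver picks up the new interrupt mode.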

As far as the CPU usage goes, I'm looking at the VM stats (vCenter client), and while 100% CPU usage by the VM would be around 6000+ MHz, it's only hovering around 350MHz while watching a Blu-ray file, so I'm pretty sure I'm not seeing the same problem as you...

Anyways I hope that this helps you... keep me updated.

TL;DR version: I'm not seeing the same issues as you. I was having weird audio issues that turned out to be related to system interrupts. Enabling "Message-Signaled Interrupts" resolved them and possibly fixed my idle issues on the GPU. I highly recommend you give it a shot: https://msdn.microsoft.com/en-us/library/windows/hardware/ff544246
 

imrazor

I didn't see the full-bore frequency scaling until I looked at Windows Resource Monitor in the VM. While it showed minimal usage, the CPU frequency was pegged at 100%. I don't think VMware's MHz stat reflects actual clock speed, but rather an interpretation of CPU load. In my case, even though the CPU load is very low, the chip seems to be running at full speed. I.e., Intel SpeedStep is not functioning when that 'hypervisor.cpuid.v0 = "FALSE"' parameter is enabled.

Are you seeing the same weirdness with Task Manager that I reported above? Every process shows 0% usage, even when the system is operating at full load. The only way to get the true system load is to use Resource Monitor.

I'll certainly check into MSI, but I'm thinking this is an ESXi characteristic rather than anything solvable by tweaking Windows settings.
 

nzalog

So I can't check that out right now, but I'm judging a lot of this based on total power usage. My system (6 x 3.5" HDDs, 2 x SSDs, 64GB ECC, E3-1230 v5, GTX 1050, LSI 9207-8i and 5 Noctua fans) is only pulling ~60 to 65W at idle, and I see this number rise significantly when I really start taxing the system. I don't think I could get power usage that low without some kind of CPU throttling. Also, if the issue you are describing was caused by the "hypervisor.cpuid.v0" setting, I would have seen a significant increase in power consumption after making the passthrough VM, but I'm only seeing an increase of about 4 to 5W (idle), which I attribute to the video card. I'm also thinking that ESXi does not manage CPU speed the same way that Windows does; it's designed for a different purpose where speed is probably more important. However, I'm just assuming, and I'd have to do more research on that. I am pretty confident that what you're describing doesn't have anything to do with that setting, though.
 

nzalog

One thing I should mention, not sure if you understand how ESXi schedules CPU cycles... and why it's important to right-size a VM's vCPU count. Let me see if I can simplify it.

In ESXi, unlike RAM, the scheduler needs to allocate all the vCPUs you have assigned to a VM before the VM can run a single cycle (even if the VM only needs one core at the moment and the rest are idle). Also, a VM's vCPUs must all run on the same host CPU cycle; they cannot run out of sync.

So if you create 2 VMs with 3 vCPUs each on a 4-core host, only one VM will get the CPU per cycle, and 1 core will remain idle on the host. What I'm getting at is that if you have a 4-vCPU VM on a 4-core host, to ESXi all 4 cores are in use even if only 1 core is used in the VM. It's a shitty limitation, but also a huge reason to right-size VMs on a host.

Another time this is important is if you have a 4-core host and 2 VMs with 3 vCPUs each. Each time a VM needs any CPU time, all 3 cores will need to be used AT THE SAME TIME, meaning the last core cannot be used by the other VM.
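The picture above can be sketched as a toy model in Python. It assumes strict "all vCPUs or nothing" scheduling exactly as described in this post; as far as I know, the real ESXi scheduler is more relaxed than this, so treat it as an illustration of the right-sizing argument and not as how ESXi actually behaves. The function name is made up.

```python
def schedule_one_slot(host_cores: int, vm_vcpus: list[int]) -> tuple[list[int], int]:
    """Toy strict co-scheduling: a VM runs in a time slot only if ALL its vCPUs fit.

    Returns (indexes of VMs that run in this slot, host cores left idle).
    """
    free = host_cores
    running = []
    for index, vcpus in enumerate(vm_vcpus):
        if vcpus <= free:
            running.append(index)
            free -= vcpus
    return running, free


# Two 3-vCPU VMs on a 4-core host: only one runs, one core sits idle.
print(schedule_one_slot(4, [3, 3]))  # ([0], 1)

# Right-sized to 2 vCPUs each: both run, no core is wasted.
print(schedule_one_slot(4, [2, 2]))  # ([0, 1], 0)
```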

I hope this makes sense...
 

nzalog

I am seeing the same weirdness with 0% but it doesn't seem to be affecting how power saving is working.

Edit:
OK, so my host is in fact using SpeedStep.
If you get an SSH session to your host, run "esxtop", then just press "p".

This should be the output, and from looking at this it hits P-state 15 (or 800MHz) pretty frequently. Just make sure P-states are enabled in your BIOS.


6:45:26am up 4 days 22:49, 510 worlds, 4 VMs, 4 vCPUs; CPU load average: 0.27, 0.28, 0.28
Power Usage: N/A , Power Cap: N/A
PSTATE MHZ: 3401 3400 3200 3000 2800 2700 2500 2300 2100 1900 1700 1500 1400 1200 1000 800
CPU %USED %UTIL %C0 %C1 %C2 %C3 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7 %P8 %P9 %P10 %P11 %P12 %P13 %P14 %P15
0 0.9 1.5 1 25 28 45 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 75
1 2.0 2.2 2 5 4 89 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 58
2 1.3 2.7 3 22 27 47 86 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14
3 0.3 0.8 1 5 8 86 87 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13
4 1.6 100.0 100 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0.0 100.0 0 3 0 97 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0.9 2.0 2 21 26 50 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17
7 1.7 2.2 2 28 7 63 71 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29
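If you want to pull the numbers out of a paste like this instead of eyeballing it, here is a small Python sketch. It assumes the table is whitespace-separated with a header row starting with `CPU`, as in the output above, and it takes the last `%P` column to be the lowest P-state. The sample uses three of the rows from this post (CPUs 0, 1 and 4); the function names are made up.

```python
SAMPLE = """\
CPU %USED %UTIL %C0 %C1 %C2 %C3 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7 %P8 %P9 %P10 %P11 %P12 %P13 %P14 %P15
0 0.9 1.5 1 25 28 45 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 75
1 2.0 2.2 2 5 4 89 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 58
4 1.6 100.0 100 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""


def parse_rows(text: str) -> list[dict[str, float]]:
    """Turn the pasted table into one dict per CPU, keyed by column name."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    parsed = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"column count mismatch in row: {row}")
        parsed.append(dict(zip(header, (float(cell) for cell in row))))
    return parsed


def lowest_pstate_share(rows: list[dict[str, float]]) -> dict[int, float]:
    """Percent of time each CPU spent in the lowest (last listed) P-state."""
    pstate_cols = [name for name in rows[0] if name.startswith("%P")]
    lowest = pstate_cols[-1]
    return {int(row["CPU"]): row[lowest] for row in rows}


print(lowest_pstate_share(parse_rows(SAMPLE)))  # {0: 75.0, 1: 58.0, 4: 0.0}
```

A CPU that shows 0 in the last column and 100 under `%P0`, like CPU 4 here, never left its top P-state during the sample.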
 

imrazor

You seem to be right. Just ran esxtop (nice utility, BTW) and looked at the P-states. Sample output:

10:53:22am up 1 day 19:34, 526 worlds, 2 VMs, 5 vCPUs; CPU load average: 0.09, 0.12, 0.11
Power Usage: N/A , Power Cap: N/A
PSTATE MHZ: 2395 2394 2261 2128 1995 1862 1729 1596

CPU %USED %UTIL %C0 %C1 %C2 %C3 %P0 %P1 %P2 %P3 %P4 %P5 %P6 %P7
0 1.7 2.5 4 74 8 14 85 0 0 0 0 0 0 15
1 2.3 2.8 3 50 5 41 63 0 0 0 0 0 0 37
2 6.0 6.9 7 47 6 39 95 0 0 0 0 0 0 5
3 0.1 0.1 0 0 0 100 56 0 0 0 0 0 0 44
4 3.3 3.6 2 11 3 84 72 0 0 0 0 0 0 28
5 2.8 3.6 3 39 3 55 61 0 0 0 0 0 0 39
6 1.3 1.8 2 8 2 88 32 0 0 0 0 0 0 68
7 8.2 10.1 10 72 6 12 100 0 0 0 0 0 0 0

Looks like I only have 8 P-states (P0 through P7), the lowest of which is ~1600MHz. I'm not sure how to explain the 100% CPU frequency I'm seeing with Windows Resource Monitor when 'hypervisor.cpuid.v0 = "FALSE"' is set. I'd still like to know why the desktop is pumping out as much heat as it is. Maybe because my lowest P-state only goes down to 1596MHz?
 

nzalog

I have a feeling that the formula used to get the CPU usage metric is just different between Resource Monitor and Task Manager. 0 of 0 could be looked at as 100% or 0%. I think you're just seeing the same side effect of not being able to get metrics from the CPU, but it doesn't translate to real CPU usage. I'd say it's pretty safe to rely on the vCenter metric when it comes to CPU usage, because it's ESXi that controls the physical CPU speed and not a VM running on it.

As far as your P-states not going lower, it could just be a limitation of that specific CPU. Not sure if Intel would have the specs available on its SpeedStep range.
 

imrazor

I know 1600MHz was the lowest speed my old Q6600 could hit when idling, so maybe this Westmere CPU also can't clock any lower than that.

Re: sound issues. I've also had some issues with HDMI output. In my case, I just used an old USB sound card I had lying around. I don't think I've had any issues with it, and if I have they're more easily resolved with the USB sound card, usually just by switching audio outputs.
 

nzalog

Have you tried VGA passthrough on an ESXi 6.5 host?

I'm asking because I just swapped to an Nvidia card on my work test machine, but unlike at home, the work machine is on ESXi 6.5. I can't seem to get the driver to work, and I'm getting Code 43 with the Nvidia card. I didn't have issues with an older Radeon card. I confirmed there were no typos in the hypervisor.cpuid.v0 = "FALSE" setting.