Soft Machines' 'Virtual Cores' Promise 2-4x Performance/Watt Advantage Over Competing CPUs

Status
Not open for further replies.

Puiucs

Honorable
Jan 17, 2014
66
0
10,630
There's a reason single-threaded instruction streams aren't already being spread across multiple cores. While I think their design can work, I feel they're omitting a lot of details about real performance and limitations.
What I'm more interested in is the dynamic balancing of the cores when there are multiple threads; that's something that could really improve performance if done right.
 

Mahdi_B

Reputable
Feb 4, 2016
1
0
4,510
OK, but I don't quite understand what we're talking about: is this an x86 desktop processor, an ARM one, or perhaps a new ISA called VISC? (In which case I don't really see how it will work with existing legacy apps and operating systems.)
 

Epsilon_0EVP

Honorable
Jun 27, 2012
1,350
1
11,960
OK, but I don't quite understand what we're talking about: is this an x86 desktop processor, an ARM one, or perhaps a new ISA called VISC? (In which case I don't really see how it will work with existing legacy apps and operating systems.)

It's a new ISA, yes. It uses a "custom 64-bit ISA," but the article also says they might support other architectures.

Backwards compatibility isn't an issue, since this is mostly designed for servers. Since most servers run Linux and open-source software, recompiling the necessary code isn't particularly difficult once a proper compiler is set up for the architecture. However, it will indeed take much longer to arrive on consumer PCs (if ever).
 

bourgeoisdude

Distinguished
Dec 15, 2005
1,240
25
19,320
I wonder if this would be helpful with console/system emulation as well. If this were licensed to Microsoft or Sony, for example, it could make it easier to emulate software from prior consoles.
 

ammaross

Distinguished
Jan 12, 2011
269
0
18,790
It will be interesting to see how they manage. The big thing preventing single-threaded apps from being well-threaded is sequential dependence, which also limits out-of-order execution (you can't do C until A+B is calculated). Many things in an app CAN be threaded, but lazy programming usually leads to shared heap access, and managing that shared memory makes threading difficult. They must feel their tech can outperform branch prediction and the like, though, or else they'd likely not be trying to bring it to market.
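A minimal sketch of the dependence problem (my own illustration, not Soft Machines' technique): a loop-carried dependency serializes iterations, so they can't naively be handed to different cores, while an independent loop can be split freely.

```python
# Toy illustration: loop-carried dependence vs. an independent loop.

def prefix_sum(xs):
    # Each iteration depends on the previous result (A+B before C),
    # so iterations cannot simply be distributed across cores.
    out = []
    total = 0
    for x in xs:
        total += x
        out.append(total)
    return out

def scale(xs, k):
    # No iteration depends on another; any chunk of this loop could
    # run on a separate core with no communication.
    return [x * k for x in xs]

print(prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(scale([1, 2, 3, 4], 2))    # [2, 4, 6, 8]
```

The first function is the kind of chain that keeps single-threaded code single-threaded; the second is the easy case that compilers and runtimes already parallelize.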
 

bit_user

Polypheme
Ambassador
I think it's going to be quite workload dependent. For instance, DSP code that already has wide instruction-level parallelism and is tuned to make effective use of the native hardware might even be a bit slower, due to increased cache misses from their scheduler chopping up & migrating the threads around.

That said, I think it's clever to try to intelligently pair complementary threads on SMT cores. I think Intel, AMD, ARM, etc. should try to add some analysis capabilities to their hardware, to enable OS thread schedulers to do similar. Of course, I don't expect to see them chop threads into threadlets, but compilers could certainly do that.

In other words, they have some neat ideas, but all implementable without the need for a "Virtual Core" abstraction layer. In fact, the biggest benefit from the "Virtual Core" construct comes from the capability it gives them to do these things on existing CPUs. So, I was a bit surprised to see them building custom silicon. Perhaps they're going to do something radical, like a transport triggered architecture. Something really wide, simple, and highly-dependent on good profile-driven optimization.
 

Kewlx25

Distinguished


I haven't finished the video, but HT does this. The main issue with HT is how it splits the CPU core's resources. Older HT used to be more dynamic, but that often caused one hardware thread to starve the other. Later they made it do a 50/50 split, but that means a heavy thread doesn't get enough resources and a light thread gets too much. If there were a way for the OS to manage the resource split, it could do more advanced scheduling.

My feeling also. Dynamic balancing is just another way of saying thread scheduling. It's fairly expensive for an OS to context-switch among many light threads. This gets important for hypervisors. Current VM tech requires that if you have a guest with 8 cores, your host must have 8 cores free. If a guest has mostly idle cores but one core is using a lot of CPU, you still have to consume as much host CPU time as if all of the cores were under load. This is incredibly inefficient. If instead the host could map some of those cores to weaker virtual cores, the guest would get the number of cores it expects without wasting all of the CPU time of the idle guest cores.
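The weaker-virtual-cores idea above can be sketched as a simple weighted allocation (the function name and the proportional policy are my invention, not how any shipping hypervisor works): each guest vCPU gets a share of host time proportional to its recent load, so mostly-idle vCPUs cost the host almost nothing while the guest still sees the core count it expects.

```python
def allocate_host_time(vcpu_loads, host_budget=1.0):
    # vcpu_loads: recent utilization of each guest vCPU (0.0-1.0).
    # Returns the fraction of the host budget given to each vCPU,
    # proportional to its load; idle vCPUs receive nothing.
    total = sum(vcpu_loads)
    if total == 0:
        return [0.0] * len(vcpu_loads)
    return [host_budget * load / total for load in vcpu_loads]

# A guest with one busy vCPU and three mostly-idle ones: the busy
# vCPU gets nearly the whole host budget.
shares = allocate_host_time([0.9, 0.05, 0.05, 0.0])
print(shares)
```

A real scheduler would need floors, fairness across guests, and hysteresis, but the shape of the win is the same: idle guest cores stop consuming host time.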

I still have no idea how they plan to allow a single thread to be computed over multiple cores. That's effectively what out-of-order instruction scheduling already does within a pipelined CPU. It is effectively impossible to do this efficiently: cross-core communication will ALWAYS be more expensive than intra-core, because moving data further is more expensive. No way around that in our universe, unless you start messing with space-time.

OoO CPUs already trade efficiency for performance, because it takes extra work to inspect instructions for dependencies, split them across different execution units, and then recombine the results. All of that work takes more electricity and more transistors. Within a core it's a relatively small loss of efficiency for a potentially large gain in performance, but doing the same thing across cores would be horrible. Either way, it's a trade-off.
 

bit_user

Polypheme
Ambassador
Wow! What VM hypervisor are you using? I've never seen that, and I've heard lots about people running dozens and even hundreds of VMs on the same machine. I don't know how that would be possible if the host had to burn CPU time for the idle time of its guests.

That's actually a good analogy. And like OoO execution at the instruction level, this requires data-flow analysis. However, the platform's ability to do that with compiled binary code is much less than that of a compiler, which can analyze the original, high-level source code. The only benefit of operating at the platform level is the ability to analyze actual runtime behavior. For instance, the compiler often has no idea how many times a loop will iterate. Some compilers feature "profile-driven optimization," though, where profiling data from a test run is used to inform the compiler in a second pass. I think most JavaScript engines do something like this internally.
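A toy sketch of the two-pass idea described above (the counter scheme and strategy names are invented for illustration): a first pass records observed loop behavior, and a second pass picks a specialization from the profile, roughly the way a PGO compiler or a JIT would.

```python
from collections import Counter

# Pass 1: instrument the code and record runtime behavior the
# compiler cannot know statically (here, typical loop trip counts).
profile = Counter()

def process(items):
    key = "short" if len(items) < 8 else "long"
    profile["trip_count_" + key] += 1
    return sum(items)

# Profiling run on representative inputs.
for batch in ([1, 2], [3], list(range(100))):
    process(batch)

# Pass 2: choose a specialization based on the observed profile.
if profile["trip_count_short"] >= profile["trip_count_long"]:
    strategy = "unroll"      # mostly short loops: full unrolling pays off
else:
    strategy = "vectorize"   # mostly long loops: wide SIMD pays off

print(profile, strategy)
```

The point is only that the decision is driven by measured behavior rather than static guesses, which is exactly the information a hardware-level or platform-level optimizer has and a one-pass compiler lacks.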

Right. For this feature to be a net-win, they need to estimate time savings from added parallelism and the overhead of the added communication. If it's not a net win, then don't parallelize that part. Modern compilers make lots of cost vs. benefit decisions, so this is nothing new.

Correct, again. VLIW processors try to do the instruction scheduling in software, to save the transistors needed to do it in hardware. I think Soft Machines might be doing sophisticated runtime analysis & optimization in software, in order to use a VLIW or similar architecture. The benefit is that the software scheduling and analysis only needs to be rerun when the program's behavior changes significantly. It can even cache the optimized version (or versions) on disk. Whereas if you do it in hardware, you basically burn power re-analyzing the same code again and again.
 

Kewlx25

Distinguished


Xen, VMware, Hyper-V, and bhyve all do it this way. This is also why they recommend all guests have the same number of cores. Say you have an 8-core CPU. You can have two 4-core guests running at the same time. But if a 5-core guest came along, it would have to wait for all 8 cores to become free, and while it was running, no 4-core guest could run.
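The 8-core example above can be written as a toy feasibility check, assuming strict co-scheduling ("gang scheduling"), where a guest runs only when ALL of its vCPUs can run at once:

```python
# Toy model of strict co-scheduling: a guest is schedulable only if
# every one of its vCPUs can get a physical core simultaneously.

PHYSICAL_CORES = 8

def schedulable(running_guests, new_guest_vcpus):
    # running_guests: vCPU counts of guests currently running.
    used = sum(running_guests)
    return used + new_guest_vcpus <= PHYSICAL_CORES

print(schedulable([4], 4))  # True:  4 + 4 = 8 fits
print(schedulable([4], 5))  # False: the 5-vCPU guest must wait
print(schedulable([5], 4))  # False: while the 5-vCPU guest runs,
                            #        no 4-vCPU guest can co-run
```

Under this model a 5-vCPU guest and a 4-vCPU guest can never overlap on 8 cores, which is the starvation pattern being described (relaxed co-scheduling, discussed below, loosens exactly this constraint).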

This is yet another reason Docker or Jails are getting popular.
 

bit_user

Polypheme
Ambassador
Thanks for clarifying. I've only run fairly light-weight tasks in low core-count VMs, using VirtualBox and Xen. So, I hadn't really hit that case.

I also thought you were saying that the host's cores are spinning during guests' idle cycles, but I now see what you're saying: it's more like a scheduling constraint. That must require pretty deep hooks into the OS' thread scheduler, because I don't know of a user-space way to require that more than one thread be running simultaneously.
 

Northtag

Reputable
Sep 10, 2014
5
0
4,510
Xen, VMware, Hyper-V, and bhyve all do it this way. This is also why they recommend all guests have the same number of cores. Say you have an 8-core CPU. You can have two 4-core guests running at the same time. But if a 5-core guest came along, it would have to wait for all 8 cores to become free, and while it was running, no 4-core guest could run.

This is yet another reason Docker or Jails are getting popular.
Eh?

No, ESXi just advances the clock of virtual cores that the guest OS has put into an idle state; it doesn't have to schedule them on a real core at all. It's done that for quite a few releases now. A 5-core VM and a 4-core VM will run happily enough simultaneously on an 8-physical-core CPU, provided they have at least one idle virtual core between them at all times.

VMware recommends against assigning more virtual cores to a VM than you need, because it's more overhead for the hypervisor to check their idle status and schedule them. If you have 4 virtual cores in a VM, then at some point there'll be enough background processes running on the guest OS to demand all of them at the same time, and that becomes a pain to schedule on a busy system. If need be, ESXi won't even run all busy virtual cores at the same time; it just won't let the virtual-core clocks get too far out of whack with each other (which is something that adds to the hypervisor's scheduling overhead).
 

Kewlx25

Distinguished


I don't work with VMs much, but in the past few months someone was having VM performance issues, and an official VMware engineer chimed in with the issues I described. The nutshell of it: if at all possible, have all guests use the same number of cores; if you can't, make sure the counts are multiples of each other, so don't pair a 6-core guest with a 4-core guest; increase it to 8 cores. If you don't follow these recommendations, you can get guest starvation despite low CPU usage, and negative performance gains from adding more cores: 4 cores may be faster than 6 or 8.

And this was on the newest, bestest, flashiest VMware and a beast of a server. When I googled guest scheduling, I came across this exact same advice for all the big-name hypervisors. Guest starvation is a common issue, entirely caused by core-count differences.

I'm talking strictly about Type 1 bare-metal hypervisors, not software Type 2 hypervisors.
 