Soft Machines' 'Virtual Cores' Promise 2-4x Performance/Watt Advantage Over Competing CPUs

Status
Not open for further replies.

Puiucs

Honorable
Jan 17, 2014
66
0
10,630
There's a reason single-threaded instruction streams aren't already being spread across multiple cores. While I think their design can work, I feel they're omitting a lot of details about real performance and limitations.
What I'm more interested in is the dynamic balancing of the cores when there are multiple threads; that's something that could really improve performance if done right.
 

Mahdi_B

Reputable
Feb 4, 2016
1
0
4,510
OK, but I don't quite understand what we're talking about: is this an x86 desktop processor, an ARM one, or perhaps a new ISA called VISC? (In which case I don't really see how it will work with existing legacy apps and operating systems.)
 

Epsilon_0EVP

Honorable
Jun 27, 2012
1,350
1
11,960
OK, but I don't quite understand what we're talking about: is this an x86 desktop processor, an ARM one, or perhaps a new ISA called VISC? (In which case I don't really see how it will work with existing legacy apps and operating systems.)

It's a new ISA, yes. It uses a "custom 64-bit ISA," but the article also says they might support other architectures.

Backwards compatibility isn't an issue, since this is mostly designed for servers. Since most servers run Linux and open-source software, recompiling the necessary code isn't particularly difficult once a proper compiler is set up for the architecture. However, it will indeed take much longer to arrive on consumer PCs (if ever).
 

bourgeoisdude

Distinguished
Dec 15, 2005
1,240
25
19,320
I wonder if this would be helpful with console/system emulation as well. If this were licensed to Microsoft or Sony, for example, it could make it easier to emulate software from prior consoles.
 

ammaross

Distinguished
Jan 12, 2011
269
0
18,790
It will be interesting to see how they manage. The big thing preventing single-threaded apps from being well-threaded is sequential dependence, which also limits out-of-order execution (you can't do C until A+B is calculated). Many things in an app CAN be threaded, but lazy programming usually leads to shared heap access, and managing that shared memory makes threading difficult. They must feel their tech can outperform branch prediction and the like, though, or else they'd likely not be trying to bring it to market.
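A minimal sketch of the dependence problem (my own illustration, not Soft Machines' technique): a loop-carried dependency serializes iterations, so they can't naively be handed to different cores, while an independent loop can be split freely.

```python
# Toy illustration: loop-carried dependence vs. an independent loop.

def prefix_sum(xs):
    # Each iteration depends on the previous result (A+B before C),
    # so iterations cannot simply be distributed across cores.
    out = []
    total = 0
    for x in xs:
        total += x
        out.append(total)
    return out

def scale(xs, k):
    # No iteration depends on another; any chunk of this loop could
    # run on a separate core with no communication.
    return [x * k for x in xs]

print(prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(scale([1, 2, 3, 4], 2))    # [2, 4, 6, 8]
```

The first function is the kind of chain that keeps single-threaded code single-threaded; the second is the easy case that compilers and runtimes already parallelize.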
 

bit_user

Polypheme
Ambassador
I think it's going to be quite workload dependent. For instance, DSP code that already has wide instruction-level parallelism and is tuned to make effective use of the native hardware might even be a bit slower, due to increased cache misses from their scheduler chopping up & migrating the threads around.

That said, I think it's clever to try to intelligently pair complementary threads on SMT cores. I think Intel, AMD, ARM, etc. should try to add some analysis capabilities to their hardware, to enable OS thread schedulers to do similar. Of course, I don't expect to see them chop threads into threadlets, but compilers could certainly do that.

In other words, they have some neat ideas, but all implementable without the need for a "Virtual Core" abstraction layer. In fact, the biggest benefit from the "Virtual Core" construct comes from the capability it gives them to do these things on existing CPUs. So, I was a bit surprised to see them building custom silicon. Perhaps they're going to do something radical, like a transport triggered architecture. Something really wide, simple, and highly-dependent on good profile-driven optimization.
 

Kewlx25

Distinguished


I haven't finished the video, but HT does this. The main issue with HT is how it splits the CPU core's resources. Older HT used to be more dynamic, but that often caused one hardware thread to starve the other. Later they made it do a 50/50 split, but that means a heavy thread doesn't get enough resources and a light thread gets too much. If there were a way for the OS to manage the resource split, it could do more advanced scheduling.

My feeling also. Dynamic balancing is just another way of saying thread scheduling. It's fairly expensive for an OS to context-switch among many light threads. This gets important for hypervisors. Current VM tech requires that if you have a guest with 8 cores, your host must have 8 cores free. If a guest has mostly idle cores but one core is using a lot of CPU, you still have to consume as much host CPU time as if all of the cores were under load. This is incredibly inefficient. If instead the host could map some of those cores to weaker virtual cores, the guest would get the number of cores it expects without wasting all of the CPU time of the idle guest cores.
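The weaker-virtual-cores idea above can be sketched as a simple weighted allocation (the function name and the proportional policy are my invention, not how any shipping hypervisor works): each guest vCPU gets a share of host time proportional to its recent load, so mostly-idle vCPUs cost the host almost nothing while the guest still sees the core count it expects.

```python
def allocate_host_time(vcpu_loads, host_budget=1.0):
    # vcpu_loads: recent utilization of each guest vCPU (0.0-1.0).
    # Returns the fraction of the host budget given to each vCPU,
    # proportional to its load; idle vCPUs receive nothing.
    total = sum(vcpu_loads)
    if total == 0:
        return [0.0] * len(vcpu_loads)
    return [host_budget * load / total for load in vcpu_loads]

# A guest with one busy vCPU and three mostly-idle ones: the busy
# vCPU gets nearly the whole host budget.
shares = allocate_host_time([0.9, 0.05, 0.05, 0.0])
print(shares)
```

A real scheduler would need floors, fairness across guests, and hysteresis, but the shape of the win is the same: idle guest cores stop consuming host time.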

I still have no idea how they plan to allow a single thread to be computed over multiple cores. That's effectively what out-of-order instruction scheduling already does within a pipelined CPU. It is effectively impossible to do this efficiently: cross-core communication will ALWAYS be more expensive than intra-core, because moving data further is more expensive. No way around that in our universe, unless you start messing with space-time.

OoO CPUs already trade efficiency for performance, because it takes extra work to inspect instructions for dependencies, split them across different execution units, and then recombine the results. All of that work takes more electricity and more transistors. Within a core it's a relatively small loss of efficiency for a potentially large gain in performance, but doing the same thing across cores would be horrible. Either way, it's a trade-off.
 

bit_user

Polypheme
Ambassador
Wow! What VM hypervisor are you using? I've never seen that, and I've heard lots about people running dozens and even hundreds of VMs on the same machine. I don't know how that would be possible if the host had to burn CPU time for the idle time of its guests.

That's actually a good analogy. And like OoO execution at the instruction level, this requires data-flow analysis. However, the platform's ability to do that with compiled binary code is much less than that of a compiler, which can analyze the original, high-level source code. The only benefit of operating at the platform level is the ability to analyze actual runtime behavior. For instance, the compiler often has no idea how many times a loop will iterate. Some compilers feature "profile-driven optimization," though, where profiling data from a test run is used to inform the compiler in a second pass. I think most JavaScript engines do something like this internally.
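A toy sketch of the two-pass idea described above (the counter scheme and strategy names are invented for illustration): a first pass records observed loop behavior, and a second pass picks a specialization from the profile, roughly the way a PGO compiler or a JIT would.

```python
from collections import Counter

# Pass 1: instrument the code and record runtime behavior the
# compiler cannot know statically (here, typical loop trip counts).
profile = Counter()

def process(items):
    key = "short" if len(items) < 8 else "long"
    profile["trip_count_" + key] += 1
    return sum(items)

# Profiling run on representative inputs.
for batch in ([1, 2], [3], list(range(100))):
    process(batch)

# Pass 2: choose a specialization based on the observed profile.
if profile["trip_count_short"] >= profile["trip_count_long"]:
    strategy = "unroll"      # mostly short loops: full unrolling pays off
else:
    strategy = "vectorize"   # mostly long loops: wide SIMD pays off

print(profile, strategy)
```

The point is only that the decision is driven by measured behavior rather than static guesses, which is exactly the information a hardware-level or platform-level optimizer has and a one-pass compiler lacks.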

Right. For this feature to be a net-win, they need to estimate time savings from added parallelism and the overhead of the added communication. If it's not a net win, then don't parallelize that part. Modern compilers make lots of cost vs. benefit decisions, so this is nothing new.

Correct, again. VLIW processors try to do the instruction scheduling in software, to save the transistors needed to do it in hardware. I think Soft Machines might be doing sophisticated runtime analysis & optimization in software, in order to use a VLIW or similar architecture. The benefit is that the software scheduling and analysis only needs to be rerun when the program's behavior changes significantly. It can even cache the optimized version (or versions) on disk. Whereas if you do it in hardware, you basically burn power re-analyzing the same code again and again.
 

Kewlx25

Distinguished


Xen, VMware, Hyper-V, and bhyve all do it this way. This is also why they recommend all guests have the same number of cores. Say you have an 8-core CPU. You can have two 4-core guests running at the same time. But if a 5-core guest came along, it would have to wait for all 8 cores to become free, and while it was running, no 4-core guest could run.
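The 8-core example above can be written as a toy feasibility check, assuming strict co-scheduling ("gang scheduling"), where a guest runs only when ALL of its vCPUs can run at once:

```python
# Toy model of strict co-scheduling: a guest is schedulable only if
# every one of its vCPUs can get a physical core simultaneously.

PHYSICAL_CORES = 8

def schedulable(running_guests, new_guest_vcpus):
    # running_guests: vCPU counts of guests currently running.
    used = sum(running_guests)
    return used + new_guest_vcpus <= PHYSICAL_CORES

print(schedulable([4], 4))  # True:  4 + 4 = 8 fits
print(schedulable([4], 5))  # False: the 5-vCPU guest must wait
print(schedulable([5], 4))  # False: while the 5-vCPU guest runs,
                            #        no 4-vCPU guest can co-run
```

Under this model a 5-vCPU guest and a 4-vCPU guest can never overlap on 8 cores, which is the starvation pattern being described (relaxed co-scheduling, discussed below, loosens exactly this constraint).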

This is yet another reason Docker or Jails are getting popular.
 

bit_user

Polypheme
Ambassador
Thanks for clarifying. I've only run fairly light-weight tasks in low core-count VMs, using VirtualBox and Xen. So, I hadn't really hit that case.

I also thought you were saying that the host's cores are spinning during guests' idle cycles, but I now see what you're saying: it's more like a scheduling constraint. That must require pretty deep hooks into the OS' thread scheduler, because I don't know of a user-space way to require that more than one thread be running simultaneously.
 

Northtag

Reputable
Sep 10, 2014
5
0
4,510
Xen, VMware, Hyper-V, and bhyve all do it this way. This is also why they recommend all guests have the same number of cores. Say you have an 8-core CPU. You can have two 4-core guests running at the same time. But if a 5-core guest came along, it would have to wait for all 8 cores to become free, and while it was running, no 4-core guest could run.

This is yet another reason Docker or Jails are getting popular.
Eh?

No, ESXi just advances the clock of virtual cores that the guest OS has put into an idle state; it doesn't have to schedule them on a real core at all. It's done that for quite a few releases now. A 5-core VM and a 4-core VM will run happily enough simultaneously on an 8-physical-core CPU, provided they have at least one idle virtual core between them at all times.

VMware recommends against assigning more virtual cores to a VM than you need, because it's more overhead for the hypervisor to check their idle status and schedule them. If you have 4 virtual cores in a VM, then at some point there'll be enough background processes running on the guest OS to demand all of them at the same time, and that becomes a pain to schedule on a busy system. If need be, ESXi won't even run all busy virtual cores at the same time; it just won't let the virtual-core clocks get too far out of whack with each other (which is something that adds to the hypervisor's scheduling overhead).
 

Kewlx25

Distinguished


I don't work with VMs much, but in the past few months someone was having VM performance issues, and an official VMware engineer chimed in with the issues I described. The nutshell of it: if at all possible, have all guests use the same number of cores; if you can't, make sure the counts are multiples of each other, so don't pair a 6-core guest with a 4-core guest; increase it to 8 cores. If you don't follow these recommendations, you can get guest starvation despite low CPU usage, and negative performance gains from adding more cores: 4 cores may be faster than 6 or 8.

And this was on the newest, bestest, flashiest VMware and a beast of a server. When I googled guest scheduling, I came across this exact same advice for all the big-name hypervisors. Guest starvation is a common issue, entirely caused by core-count differences.

I'm talking strictly about Type 1 bare-metal hypervisors, not software Type 2 hypervisors.
 