Intel's Knights Corner: 50+ Core 22nm Co-processor
1 Teraflop. 1 Chip. Many integrated cores.
Today on our desktop computers, we have CPUs with core counts we can count on our fingers. Intel, however, has just presented at SC'11 the first silicon of its "Knights Corner" co-processor, which is capable of delivering more than 1 TFLOPS of double-precision floating-point performance.
Such power from Intel's MIC (many integrated core) architecture won't be used to play Crysis; rather, it will be put towards highly parallel applications such as weather modelling, tomography, protein folding and advanced materials simulation.

"Intel first demonstrated a Teraflop supercomputer utilizing 9,680 Intel Pentium Pro Processors in 1997 as part of Sandia Lab's 'ASCI RED' system," said Rajeeb Hazra, general manager of Technical Computing, Intel Datacenter and Connected Systems Group. "Having this performance now in a single chip based on Intel MIC architecture is a milestone that will once again be etched into HPC history."
Knights Corner, the first commercial Intel MIC architecture product, will be manufactured using Intel’s latest 3-D Tri-Gate 22nm transistor process and will feature more than 50 cores. Furthermore, Intel promises compatibility with the existing x86 programming model and tools.

Hazra boasted that the Knights Corner co-processor is unlike traditional accelerators in that "it is fully accessible and programmable like a fully functional HPC compute node, visible to applications as though it were a computer that runs its own Linux-based operating system independent of the host OS."
Intel says that its MIC architecture benefits from the ability to run existing applications without porting the code to a new programming environment. Intel believes this will allow scientists to use both CPU and co-processor performance simultaneously with existing x86-based applications, without needing to rewrite them in alternative proprietary languages.

It's a co-processor, an accelerator, not a main CPU; they are not the same thing.
More or less, this took the design elements from the scrapped Larrabee project.
Not an absolute flop (it does provide a good price/performance ratio), but not good either. There's just no getting around the inherent flaws in the current revision of the Bulldozer architecture, even in the highly parallel workloads of the server/workstation market:
http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200
And Knights Corner isn't serving the same market as Interlagos, so they're not really directly comparable.
You know what this is?
I think this is Intel's answer to ARM's server bids.
Think about it.
50+ cores at 1.2 GHz? That sounds a lot like what ARM will be promising in the near future.
Except that everyone who wants to go the low-power route with ARM needs to rewrite their programs for the ARM instruction set. With this, they don't have to, and the tools are the same ones used for Xeon optimization.
So you can keep a powerful 4/6/8/10-core Xeon processor (one you probably already own) and bolt this on; combined with Intel's advancements in power consumption (Sandy Bridge already delivers very good idle battery life in notebooks), that should make a changeover to ARM technology a hard sell.
GPUs like the 6970 have around 2500 vector cores. Like FPUs in the OP, they can't do the full spectrum of x86 instructions and can only do a specialized subset for one task.
Likewise, we have growing numbers of do-everything cores on a die.
One important abstraction is that "cores" are just an FPU, SPU, TLB, etc, all on a die. A 4 core chip is basically 4 processors on one piece of silicon with one bus. A GPU is 2500 VPUs with shared memory, shared FPUs, and a shared bus and output.
The end game is that we have processor chips with specialized parts doing different, specialized tasks, all on one die. Like how Sandy Bridge had integrated graphics, that is just fancy abstraction for throwing a bunch of VPUs on the die that the CPU cores can access with their own bit of l3.
In a decade, expect processor chips to have much more cache, and a collection of VPUs / SPUs / etc on top of some registers and TLBs representing the limits of parallelism.
You merge the cores and get processors of, say, 256 cores, where 64 of them are general-purpose TLB/register sets and 192 are mixed FPU/VPUs doing hard computations for the general cores. If you add some floats, that work would be sent to an FPU; if you had a bunch of float math in parallel, the process would have each operation delegated to an FPU.
That's in mainstream computing, I think. Server markets are going towards specialized, subset-instruction-set hardware that can't do normal computing tasks, but doesn't need to and actually shouldn't, to save power. Every instruction you throw on the CPU pile means more transistors dedicated to operation decoding that you could be using for more FPUs and such.
This is a co-processor. The 6990 does 1.37 TFLOPS double precision and 5.40 TFLOPS single precision. The 7990 and above should be much higher.
Doesn't that sound like a GPU?
Why wait 10 years? Take a look at the mid-range graphics cards these days; they have at least 200+ cores. Intel and AMD are a little behind.