Massively parallel hardware designed to run generic (non-graphics) code, with appropriate drivers for doing so.
A programming language based on C for programming said hardware, and an assembly language that other programming languages can use as a target.
A software development kit that includes libraries, various debugging, profiling and compiling tools, and bindings that let CPU-side programming languages invoke GPU-side code.
The point of CUDA is to write code that can run on compatible massively parallel SIMD-style architectures: this includes several consumer GPU lines as well as dedicated compute hardware such as the nVidia Tesla boards. Massively parallel hardware can run a significantly larger number of operations per second than the CPU, at a comparable financial cost, yielding performance improvements of 50× or more in situations that allow it.
One of the benefits of CUDA over the earlier methods is that a general-purpose language is available, instead of having to use pixel and vertex shaders to emulate general-purpose computers. That language is based on C with a few additional keywords and concepts, which makes it fairly easy for non-GPU programmers to pick up.
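To make that concrete, here is a minimal sketch of what the C-based language looks like (the kernel name, variable names, and sizes are mine, not from any SDK sample; `__global__`, the built-in thread-index variables, and the `<<<…>>>` launch syntax are the additional keywords and concepts in question):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* __global__ marks a function that runs on the GPU but is launched
   from CPU code.  Each thread computes one element of the result. */
__global__ void add(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; /* this thread's index */
    if (i < n)                                     /* guard the tail */
        out[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float host_a[1024], host_b[1024], host_out[1024];
    float *dev_a, *dev_b, *dev_out;

    for (int i = 0; i < n; i++) { host_a[i] = i; host_b[i] = 2 * i; }

    /* Allocate GPU memory and copy the inputs over. */
    cudaMalloc((void **)&dev_a, bytes);
    cudaMalloc((void **)&dev_b, bytes);
    cudaMalloc((void **)&dev_out, bytes);
    cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, host_b, bytes, cudaMemcpyHostToDevice);

    /* <<<blocks, threads-per-block>>>: enough 256-thread blocks to
       cover n elements. */
    add<<<(n + 255) / 256, 256>>>(dev_a, dev_b, dev_out, n);

    cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[10] = %f\n", host_out[10]); /* 10 + 20 = 30 */

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_out);
    return 0;
}
```

Everything outside the kernel and the launch line is plain C, which is exactly why the learning curve is shallow for C programmers.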
It's also a sign that nVidia is willing to support general-purpose parallelization on their hardware: it now sounds less like "hacking around with the GPU" and more like "using a vendor-supported technology", and that makes its adoption easier in the presence of non-technical stakeholders.
To start using CUDA, download the SDK, read the manual (seriously, it's not that complicated if you already know C), and buy CUDA-compatible hardware. You can use the emulator at first, but since performance is the ultimate point of all this, it's better if you can actually try your code out.
"If I have 128 CUDA cores @ 700 MHz vs 256 @ 100 MHz, all other factors aside, would the 256 perform ~2X as fast? The GPU is just one big fat processor, but it is programmable, unlike older fixed-function GPUs that had predetermined areas for certain functions. A CUDA core can be programmed to do anything the programmer wants it to do. Video encoding software can use all of the cores and optimize them for encoding video. A video game will use some of the cores for texture processing, some for physics calculations, etc., etc.
If the difference in speed is THAT much, then the 128 cores would win. If they were the same speed, the 256 cores would obviously get twice as much done in the same amount of time (or the same amount of work in 1/2 the time), but since the 128-core part is more than twice as fast per core (all other factors aside), the 128 cores would win. However, no two graphics cards have THAT much difference between their core clocks or memory clocks. Most typical graphics cards have speeds from 700 MHz to 1.x GHz, just a few hundred MHz of difference, and usually the cards with more cores also have higher clocks. Make sure you look at the EFFECTIVE speed and not the CORE speed. The 100 MHz may be 800 MHz effective."