There are some actual engineer-level definitions for processing units. A processor must contain at least three separate elements to be defined as a distinct processing core: an instruction control unit, an instruction execution unit, and an input/output unit. You can have a lot more, but those are the three core requirements to be considered a separate processing element, aka a "core". This was a very handy description until superscalar CPUs came along and abstracted the internal processing elements from the machine code coming in, so it gets tweaked a little bit. AMD's BD design is definitely two cores per module: each core has a control unit (scheduler), four execution units (2 ALU + 2 AGU), and an I/O unit to get data in and out. The front-end scheduler that is shared is the external scheduler, not the internal one. The FPU is actually a separate processor entirely; it's bolted on and uses the same I/O, but otherwise has its own registers and scheduler.
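To make the layout concrete, here's a rough sketch in C of how a BD module is organized. Purely illustrative: the names and grouping are my own shorthand, not AMD's.

```c
/* Rough sketch of a Bulldozer module's organization.
 * Illustrative only: names and structure are my own shorthand. */

typedef struct {
    int alu[2];              /* 2 integer ALUs */
    int agu[2];              /* 2 address generation units */
} ExecUnits;

typedef struct {
    const char *scheduler;   /* per-core (internal) integer scheduler */
    ExecUnits   exec;        /* 4 execution units: 2 ALU + 2 AGU */
    const char *load_store;  /* per-core I/O: gets data in and out */
} IntegerCore;

typedef struct {
    const char *front_end;   /* shared fetch/decode + external scheduler */
    IntegerCore core[2];     /* two full integer cores per module */
    const char *fpu;         /* shared FPU: its own registers and
                                scheduler, riding on the cores' I/O */
} BDModule;
```

Note that each IntegerCore carries all three required elements (control, execution, I/O), which is the argument for counting a module as two cores.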
Anyhow, the primary difference between AMD's and Intel's designs lies in Intel's caching technology, which is often mistaken for the memory controller, though the IMC also helps Intel quite a bit. I would say a good 90%+ of the "memory benchmarks" people run are actually cache benchmarks, since CPUs don't read directly from memory. When you do a memory read/write, it's the cache unit that performs the operation in local cache, then reports back to the CPU that it's finished while asynchronously writing the result back to main memory. The only time you predictably hit main memory is when your working set is much larger than the cache.
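You can see this for yourself with a pointer-chasing loop: average access time stays flat while the working set fits in a cache level, then jumps each time you spill to the next level and finally to DRAM. A minimal sketch (the random cycle is there to defeat the prefetcher; exact sizes and numbers will vary by chip):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chasing latency test: walk a randomly-permuted cycle so each
 * load depends on the previous one and the prefetcher can't guess the
 * pattern. When the array outgrows a cache level, ns/access jumps. */
int main(void)
{
    /* Working sets from well inside L1 to well past a typical L3. */
    for (size_t bytes = 16 * 1024; bytes <= 64 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Build a random single-cycle permutation (Sattolo's algorithm). */
        for (size_t i = 0; i < n; i++)
            next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }

        /* Chase pointers; every load is serially dependent. */
        size_t steps = 50 * 1000 * 1000;
        size_t idx = 0;
        clock_t t0 = clock();
        for (size_t s = 0; s < steps; s++)
            idx = next[idx];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Print idx so the loop can't be optimized away. */
        printf("%8zu KiB: %6.2f ns/access (idx=%zu)\n",
               bytes / 1024, secs * 1e9 / steps, idx);
        free(next);
    }
    return 0;
}
```

Run it and you'll get a staircase, not a flat line, which is exactly why calling this a "memory benchmark" is misleading until the working set clears the last cache level.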
The differences between their cache technologies are pretty radical. Intel's is an inclusive cache: the contents of each level are also held in the level above it, so the L1 contents are kept in L2 and L3. That makes cache lookups really fast when working with small, repetitive datasets, and it leverages Intel's strength: a superior branch predictor and prefetch unit. They are really good at ensuring the instructions and data you need are there before your code requests them. AMD uses an exclusive cache design: the contents of each level are separate from the rest. When L1D fills up, the excess is moved to L2; when that fills up, it's moved to L3; when that's full, it gets dumped. AMD's predictor and prefetcher are significantly less advanced than Intel's, so to compensate they try to maximize the total size of the cache, and thus the probability that something you need is already in it. Their cache is much slower than Intel's, and BD's L2/L3 is even slower than Phenom II's due to the modular design. The BD CPU often stalled waiting on cache returns.
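The capacity trade-off is easy to put numbers on. With an inclusive design, effective capacity is just the last-level cache, since everything below it is duplicated there; with an exclusive design, the levels add up. A back-of-the-envelope sketch, using the commonly quoted sizes for a Sandy Bridge quad-core and a 4-module BD chip (treat those figures as assumptions):

```c
#include <stdio.h>

/* Napkin math on effective cache capacity, in KiB.
 * Inclusive: lower levels are duplicated in the LLC, so effective
 * capacity ~= LLC size. Exclusive: levels hold distinct lines, so
 * effective capacity ~= sum of all levels.
 * Sizes below are the commonly quoted figures (assumptions). */
int main(void)
{
    /* Sandy Bridge quad-core (inclusive): 4 x 32 KiB L1D,
       4 x 256 KiB L2, 8 MiB shared L3. */
    int sb_l3 = 8192;

    /* Bulldozer 4-module (exclusive-ish): 8 x 16 KiB L1D,
       4 x 2 MiB L2, 8 MiB shared L3. */
    int bd_l1 = 8 * 16, bd_l2 = 4 * 2048, bd_l3 = 8192;

    printf("Sandy Bridge inclusive: %d KiB effective (L3 only; L1/L2 duplicated)\n",
           sb_l3);
    printf("Bulldozer   exclusive:  %d KiB effective (L1+L2+L3 all distinct)\n",
           bd_l1 + bd_l2 + bd_l3);
    return 0;
}
```

Roughly 8 MiB effective versus 16+ MiB effective, which is the "maximize the size of the cache" compensation in action; AMD trades lookup speed for raw capacity.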
There are some other uArch differences. Intel uses 3 general-purpose ALUs per core; AMD uses 2 ALUs and 2 AGUs per core. AMD has four 256-bit FPUs per chip (one per module) that can act as eight 128-bit FPUs, while Intel has four 256-bit FPUs that can each process 2~3 instructions at once. It boils down to AMD making some design decisions more suited to server environments while trying to lower design costs by making their chips extremely easy to modify/customize. That last one is no small matter: typically you would spend thousands of engineering hours post-design doing timing tweaks and optimizing a chip's microcode, which results in a tightly timed, efficient CPU. AMD mostly skipped that step and used an automated design process; the result is BD's rather loose timings. AMD is continuing to work on those timings, which is why PD and SR have had fairly large efficiency (what you guys call IPC) improvements. Going modular had the benefit of letting them customize the chip for vendor-specific sales; it's no coincidence that the XBONE and PS4 both ended up designing around AMD.
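To put the FPU comparison in rough numbers, here's the lane arithmetic implied by the figures above. Big caveat: this ignores issue limits, FMA versus separate add/mul pipes, and scheduling, so it's napkin math, not a benchmark:

```c
#include <stdio.h>

/* SIMD lanes of single-precision (32-bit) math available per cycle,
 * taken straight from the unit counts above. Ignores issue width,
 * FMA vs add/mul split, and scheduling -- napkin math only. */
int main(void)
{
    int sp_bits = 32;

    /* BD chip: eight 128-bit FPU pipes (four 256-bit units split in half). */
    int bd_lanes = 8 * (128 / sp_bits);

    /* Intel quad-core: four 256-bit FPUs. */
    int intel_lanes = 4 * (256 / sp_bits);

    printf("BD    SP lanes/cycle: %d\n", bd_lanes);    /* 8 * 4 = 32 */
    printf("Intel SP lanes/cycle: %d\n", intel_lanes); /* 4 * 8 = 32 */
    return 0;
}
```

Same raw width on paper. The practical difference is that AMD splits it into more independent 128-bit pipes, which suits lots of narrow server threads, while Intel has fewer, wider units fed by a stronger front end.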