I think there should be a reasonable gain from the newer architecture of the DX10 chips. Since they use stream processors instead of dedicated pipes, it should be much easier for the API to route data to get processed. It should be much more efficient at taking advantage of all available resources. At least, that's what I'm hoping.
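To put rough numbers on that idea, here is a toy sketch of dedicated pipes versus a shared pool of stream processors. Everything in it is an assumption for illustration: the unit counts and workloads are made up, the function names are hypothetical, and it ignores scheduling overhead, memory bandwidth, and everything else a real chip has to deal with.

```python
import math

def fixed_pipeline_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Dedicated pipes: each kind of work can only run on its own units,
    so the frame takes as long as the slower of the two sides."""
    return max(math.ceil(vertex_work / vertex_units),
               math.ceil(pixel_work / pixel_units))

def unified_time(vertex_work, pixel_work, total_units):
    """Stream processors: any unit can take any kind of work,
    so all work is spread across the whole pool."""
    return math.ceil((vertex_work + pixel_work) / total_units)

# Hypothetical chip: 8 vertex + 24 pixel pipes vs. 32 unified units.
VERTEX_UNITS, PIXEL_UNITS = 8, 24
TOTAL_UNITS = VERTEX_UNITS + PIXEL_UNITS

# Made-up workloads as (vertex_work, pixel_work) in arbitrary work items.
workloads = {
    "balanced":     (160, 480),
    "pixel-heavy":  (40, 600),
    "vertex-heavy": (400, 240),
}

for name, (v, p) in workloads.items():
    fixed = fixed_pipeline_time(v, p, VERTEX_UNITS, PIXEL_UNITS)
    unified = unified_time(v, p, TOTAL_UNITS)
    print(f"{name}: fixed={fixed} steps, unified={unified} steps")
```

In this toy model the two designs tie when the workload happens to match the fixed split, and the shared pool only pulls ahead when the mix is lopsided, since otherwise some dedicated pipes sit idle. Whether real DX10 hardware sees gains like that is exactly the open question.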
Your theory is interesting, but if that is possible in DX10, I don't see why they wouldn't have made it work in DX9 as well.