Intel's Admirable Reaction To Our Last Article
So far I have only presented what some of you may call 'idle talk'. The sentence "Pentium 4 would be faster if ... " doesn't sound particularly helpful, because it is based on nothing more than wishful thinking. It would be a lot more helpful to look at some facts and hard numbers.
Luckily there's a guy by the name of 'Alex', who works at Intel's German headquarters in Munich. After he had read my Pentium 4 update article on Thursday, he decided to take a close look at the source code of Flask MPEG, which is openly available under the GNU software license. Thursday night Alex didn't go home, but spent his time re-compiling the iDCT code of FlasK with several different options. He built a special version of FlasK that behaves very differently. On Friday afternoon I received this little gem from Intel Germany, and I was extremely impressed with what I saw.
Hi Tom,

here's the version of Flask with the Intel additions built in.

Our engineer quickly integrated and tested the following iDCT modules with Flask:

FastIDCT MMX - Flask's original
FastIDCT Int - Flask's original
AP-945 FastIDCT SSE2 - this is the Vec-class version. The engineer would have been able to squeeze some more speed with assembler, but we rather wanted to make a point that "SSE2 is easy to implement", especially compared to the author's comment on how he had to suffer to implement the MMX version.
IEEE1180 SSE2/Single Precision - this is the straight IEEE1180 reference algorithm using single precision FP SSE + some SSE2 icing. Again, the point is that "SSE2 port is easy".
IEEE1180 SSE2/DP - the same straight IEEE1180 algorithm, now with double precision FP SSE2. Same point "SSE2 port is easy".
IEEE1180 x87i - Flask's original IEEE1180 code using plain x87, compiled with Intel compiler 5.1 B18, with Katmai & Willamette instructions disabled for compatibility with "non-SSE products".

You find all the variants in this order selectable from the UI of the attached version.

Here are the measurements on producing a non-compressed AVI (so the codec isn't in the picture) out of a 1:02 min DVD title. The size of the produced AVI is 446 kB. The system is a 1.5 GHz Win98SE.

iDCT Module               Time, m:s
============================
1: FastIDCT MMX           1:28
2: FastIDCT Int           1:39
3: AP-945 FastIDCT SSE2   1:24
4: IEEE1180 SSE2/SP       1:35
5: IEEE1180 SSE2/DP       1:45
6: IEEE1180 x87i          2:34
============================

The new version was compiled with different compiler options. So only by different options No.6 improved significantly. We measured 10:36 with the unchanged Flask v0.594.

Tom, please keep in mind that this program was edited in only half a day. So the engineer didn't incorporate a CPU identification. If you select one of the SSE2 optimized paths running Flask on a non-SSE2-enabled machine, you might run into issues.

Also we didn't improve the User Interface or the like. The engineer concentrated on coding the given algorithms with SSE2. And finally we didn't make extensive testing.

As agreed on the phone please don't distribute this version of flask to anybody else. We still haven't got hold on the author of Flask and we don't want to distribute this version without permission.

If you find any strange things or irregularities we would be very happy if you contact George or ourselves in that case.

Thank you and best regards,
Hans & Christian
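To put the timings from the e-mail in perspective, here is a minimal Python sketch that converts the quoted m:s values to seconds and expresses each module relative to the recompiled x87 reference (No. 6). This is only an illustration and not part of Intel's package: the names TIMINGS, UNCHANGED_FLASK, to_seconds and relative_speed are hypothetical helpers, and the only data used are the figures quoted in the e-mail.

```python
# Timings quoted in the e-mail, as "m:ss" strings (illustrative helper data).
TIMINGS = {
    "FastIDCT MMX": "1:28",
    "FastIDCT Int": "1:39",
    "AP-945 FastIDCT SSE2": "1:24",
    "IEEE1180 SSE2/SP": "1:35",
    "IEEE1180 SSE2/DP": "1:45",
    "IEEE1180 x87i": "2:34",
}

# Run time the e-mail reports for the unchanged Flask v0.594.
UNCHANGED_FLASK = "10:36"


def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string such as '1:28' to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_speed(timings: dict, reference: str) -> dict:
    """Return reference_time / module_time for every module.

    A value above 1.0 means the module finished faster than the reference.
    """
    ref = to_seconds(reference)
    return {name: ref / to_seconds(t) for name, t in timings.items()}


if __name__ == "__main__":
    reference = TIMINGS["IEEE1180 x87i"]
    for name, factor in relative_speed(TIMINGS, reference).items():
        print(f"{name:<22} {TIMINGS[name]:>5}  {factor:.2f}x vs. No. 6")

    # Effect of the compiler options alone on the IEEE1180 x87 path.
    gain = to_seconds(UNCHANGED_FLASK) / to_seconds(reference)
    print(f"Unchanged Flask vs. recompiled No. 6: {gain:.2f}x")
```

The 10:36 of the unchanged build corresponds to 636 seconds and the 2:34 of the recompiled No. 6 to 154 seconds, so the different compiler options alone make that path roughly four times faster. The ratios for the other modules are only meaningful relative to each other on this one test system.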