Fillrate confusion

eden

Champion
OK, now this is getting confusing.
If multitexturing happens when a card has 2 TMUs per pipe (logically), then why the heck:
1) Do the single-texturing results of current cards not even match the theoretical figure for their number of pixel pipes?
2) Do multitexturing tests yield higher MTexels for R300 cards when they don't even support MT, having 8 pipes with 1 TMU each? Which logically makes point number 1 all the more confusing?

I know the fillrate is usually calculated as the number of pixels rendered per clock times the clock frequency. So if the R9700 PRO has 8 pipes and runs at 325 MHz, then why the heck does single texturing in 3DMark03 yield 1500 MTexels/sec?!
Furthermore, in the so-called multitexturing test, the cards perform closer to the theoretical single-texturing fillrate of 8 * 325 = 2600 MTexels/sec, with 2270 MTexels/sec.
So, not only are the cards not rendering as expected, but, like the FX5800, they don't even reach the full theoretical fillrate at all. Could anyone explain whether my reasoning about the fillrate calculation is right, and why nothing is making sense?
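
For reference, the theoretical figure used in this post can be written as a small Python sketch. The function name and parameters are made up for illustration; the only inputs are the numbers quoted above (8 pipes, 1 TMU per pipe, 325 MHz, and the 1500 and 2270 MTexels/sec results).

```python
def theoretical_fillrate_mtexels(pipes: int, tmus_per_pipe: int, core_mhz: float) -> float:
    """Peak texel rate in MTexels/sec: pipes * TMUs per pipe * core clock in MHz."""
    return pipes * tmus_per_pipe * core_mhz


# Radeon 9700 PRO figures as quoted in the post.
r9700_peak = theoretical_fillrate_mtexels(pipes=8, tmus_per_pipe=1, core_mhz=325)

print(f"theoretical peak:   {r9700_peak:.0f} MTexels/sec")
print(f"single texturing:   {1500 / r9700_peak:.0%} of peak")
print(f"multitexturing:     {2270 / r9700_peak:.0%} of peak")
```

This only restates the question in numbers: the measured results are compared against a peak that assumes nothing else in the card ever stalls the pipes.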

--
This post is brought to you by Eden, on a Via Eden, in the garden of Eden. :smile:
Edited by Eden on 03/18/03 11:09 PM.
 

phsstpok

Splendid
Dec 31, 2007
5,600
1
25,780
"So if the R9700PRO has 8 pipes and runs at 325MHZ, then why the heck does single texturing in 3dMark 03 yield 1500 MTexels"
The answer is that fillrate is limited by memory bandwidth.

Here's a link to an old article but I believe it has the explanation you wanted.

http://www.sincero.com/editorials/

This seems to be the key.

"Since every pixel that the 3D chip produces assumes a Z-buffer read and write, I called this a "worst-case" type condition. However, I also ignored texturing and color reads so this condition can actually be worse. Creative Labs defines Z-buffer as the following:

A Z-buffer is a part of the video frame buffer memory that stores the Z (depth) -axis value (front to back position) of a pixel on-screen. This value is compared with Z-value data of other pixels to determine whether the pixel is behind or in front of another pixel and thus will be drawn or not, and how overlapping pixels will be presented.

With these assumptions in mind, each pixel will require one color write (2 bytes), one Z-buffer read (2 bytes) and one Z-buffer write (2 bytes) for a total of 6 bytes."

I'm guessing that with modern graphics cards everything is calculated for 32-bit color (and, I assume, 32-bit Z as well), so each of those three accesses is 4 bytes instead of 2, and 12 bytes of memory access are needed for each textured pixel.

Going on that assumption, R300 can theoretically process 8 * 325 = 2600 MTexels/sec. IIRC the 9700 PRO has 310 MHz memory, making the raw bandwidth 310 * 32 (bytes, for 256-bit) * 2 (for DDR) = 19,840 MB/sec. But 12 bytes of memory access are needed for each texel, so the fillrate should be 19,840 MB/sec divided by 12 bytes [per texel processed] = 1653 MTexels/sec (theoretical). 1500 MTexels/sec might be a realistic number in testing.
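
The estimate above can be sketched in a few lines of Python. This is only the simple model from this post (one color write, one Z read, one Z write per pixel, 4 bytes each); the function names are made up for illustration, and texture reads are ignored just as they are in the quoted article.

```python
def memory_bandwidth_mb_s(mem_mhz: float, bus_bits: int, ddr: bool = True) -> float:
    """Raw bandwidth in MB/sec: clock (MHz) * bus width in bytes * 2 if DDR."""
    bytes_per_transfer = bus_bits // 8
    return mem_mhz * bytes_per_transfer * (2 if ddr else 1)


def memory_limited_fillrate(bandwidth_mb_s: float, bytes_per_pixel: int = 12) -> float:
    """Pixels per second (in millions) the memory can sustain in this model."""
    return bandwidth_mb_s / bytes_per_pixel


bandwidth = memory_bandwidth_mb_s(mem_mhz=310, bus_bits=256)   # 9700 PRO, as quoted
limit = memory_limited_fillrate(bandwidth)                      # 12 bytes per pixel

print(f"raw bandwidth:        {bandwidth:.0f} MB/sec")
print(f"memory-limited rate:  {limit:.0f} MTexels/sec")
```

Changing `bytes_per_pixel` to 6 gives the 16-bit case from the article, which is a quick way to see how sensitive the estimate is to that assumption.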

This is my best guess.

99% is great, unless you are talking about system stability
Edited by phsstpok on 03/19/03 00:14 AM.
 

chuck232

Distinguished
Mar 25, 2002
3,430
0
20,780
But why is the multitexturing so much faster?... It's still only got 1 TMU and still the same memory bandwidth..

...And all the King's horses and all the King's men couldn't put my computer back together again...
 

phsstpok

Splendid
I'm already in the guessing stage so I'll take it a little further.

[/warning]
[/enter novice]

A single textured pixel is 1 texel (I think the article defines a texel as a textured pixel per second but I'm using the 3DMark definition).

A pixel with multi-textures, two in this case, has a fillrate of 2. Note: one pixel, two textures = two texel fillrate.

You're putting down two textures for every pixel. The GPU can't work any faster because of 1 TMU per pipe, but while each pixel has two textures, memory is only accessed one time for the single pixel.

This puts less emphasis on the memory bandwidth. Now, with half the memory demands of single texturing per texel, the multi-texturing (two textures per pixel only) fillrate that memory allows doubles.

In theory, the 9700 Pro's memory (@310 MHz) would permit 1653 * 2 = 3306 MTexels/sec. However, remember that R300 (@325 MHz) is limited to 2600 MTexels/sec and this does not change with multitexturing (because there is still only 1 TMU per pipe).

So with plenty of memory bandwidth, R300 is now capable of reaching its full potential, but only when it is doing multi-texturing.
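
The guess above boils down to "whichever limit is lower wins". A minimal Python sketch of that model, assuming (as this post does) that frame-buffer and Z traffic are paid once per pixel regardless of how many textures are applied; `modelled_fillrate` is a name invented here, not anything from 3DMark or ATI.

```python
def modelled_fillrate(gpu_peak_mtexels: float,
                      memory_limited_mpixels: float,
                      textures_per_pixel: int) -> float:
    """Texel rate in this toy model: the lower of the GPU and memory limits.

    Memory traffic is counted per pixel, so the memory limit expressed in
    texels scales with the number of textures applied to each pixel.
    """
    memory_limit_mtexels = memory_limited_mpixels * textures_per_pixel
    return min(gpu_peak_mtexels, memory_limit_mtexels)


gpu_peak = 2600      # R300: 8 pipes * 1 TMU * 325 MHz
memory_limit = 1653  # 19,840 MB/sec / 12 bytes per pixel

print(modelled_fillrate(gpu_peak, memory_limit, textures_per_pixel=1))  # memory-bound
print(modelled_fillrate(gpu_peak, memory_limit, textures_per_pixel=2))  # GPU-bound
```

In this model single texturing is memory-bound and two-texture multitexturing is GPU-bound, which is the pattern the measured 1500 and 2270 MTexels/sec figures hint at.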

Does any of my guess seem plausible?

 

eden

Champion
It makes sense, but how do you explain multitexturing bringing the card to its theoretical max without even using any secondary TMUs?

 

phsstpok

Splendid
I don't know, so I'll ask a couple of questions. After processing the first texture, does the GPU have to access memory, or can it just reprocess the pixel, adding the second texture, i.e. push it through the pipe a second time? How long does the second pass take relative to the first pass?

Remember that the GPU has some extra time because it has to wait for slow memory.

Referring back to R300 with its theoretical fillrate of 2600 MTexels/sec, now picture if it did have 2 TMUs per pipe. Well, then the multitexture fillrate would be double, or 5200 MTexels/sec. So the real-world fillrate isn't as high as it sounds. When you don't have two TMUs per pipe, think of multitexturing as running through the process twice, but with caching. The GPU can probably do this while it waits for memory to catch up. [more guessing!!!]

Edited by phsstpok on 03/19/03 06:43 PM.
 

phsstpok

Splendid
This is kind of fun. Once you go out on a limb with a wild guess you can take it pretty far.

You said,
"Furthermore, when the so-called multitexturing test is there, the cards perform closer to the theoretical single texturing fillrate of 8*325= 2600 Mtexels, with 2270 Mtexels/sec."
Then I said,
"The fillrate [of memory] should be 19,840 MB/sec divided by 12 bytes [per Texel processed] = 1653 MTexel/sec (theoretical)."
Well, look at this: the theoretical single-texture fillrate of R300 is much higher than the theoretical max fillrate of the R9700 Pro's memory, 2600 MTexels/sec vs 1653 MTexels/sec. Now look at the ratio, 1653 to 2600. In the time it takes the GPU to do one pass, only 1653/2600ths of the time needed to write it to memory has elapsed. So while the GPU waits, it can complete 947/2600ths (2600 - 1653 is 947) of a second pass. What does that mean? It means the GPU can do (1653 + 947) MTexels/sec in the time it has to wait for memory, or 2600 MTexels/sec of multitexturing (theoretical), minus any latencies for the second pass.

Does 2270 MTexel/sec (measured) seem more plausible now?

 

Willamette_sucks

Distinguished
interesting.
however, phsstpok ur math is a little messed up.
2600-1653=947, 947+1653=2600. with the fractions and stuff it may have looked good, but you just canceled yourself out.

now check out the FX.

The FX's (5800 ultra) theoretical max fillrate for single texturing (GPU) is 4 x 1 x 500 = 2000 mtexels, it achieves 1540.

its max theoretical multitexturing (gpu) is 4 x 2 x 500 = 4000 mtexels, it achieves 3483 (42.74s i think)

its memory bandwidth should allow for 500 x 16(128 bit) x 2(ddr) = 16000/12(bytes) = 1333 mtexels

why does the single texturing performance exceed the memory's theoretical maximum? we're not dealing with multitexturing, so only 1 texel per pipe*tmu per clock.
now if this is dumb please correct me, but all these cards use Z-compression techniques, therefore the theoretical maximums we've just calculated for the memory are relatively meaningless.
so technically either of these cards should be able to achieve their theoretical maximums limited by the GPU if everything was perfect and optimized and happy, but it isn't.
but even with this, i still think the reason for the performance increase from ST to MT on the 9700 pro even with only 1 TMU, is because the GPU is just SO fast at doing these operations (although there are certain limitations) it can double pass the texel before it even gets more info from the memory (to some extent).
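
The FX 5800 Ultra numbers in this post can be lined up against both limits with a short Python sketch. The helper name is invented for illustration, and the memory limit uses the same 12-bytes-per-pixel model from earlier in the thread, so it inherits that model's assumptions.

```python
def compare_to_limits(measured: float, gpu_limit: float, memory_limit: float) -> dict:
    """Relate a measured fillrate to the GPU peak and the modelled memory limit."""
    return {
        "gpu_efficiency": measured / gpu_limit,
        "exceeds_memory_limit": measured > memory_limit,
    }


# 500 MHz * 16 bytes (128-bit) * 2 (DDR) = 16000 MB/sec, at 12 bytes per pixel.
fx_memory_limit = 500 * 16 * 2 / 12

single = compare_to_limits(measured=1540, gpu_limit=4 * 1 * 500, memory_limit=fx_memory_limit)
multi = compare_to_limits(measured=3483, gpu_limit=4 * 2 * 500, memory_limit=fx_memory_limit)

print(f"memory limit (model): {fx_memory_limit:.0f} MTexels/sec")
print("single texturing:", single)
print("multitexturing:  ", multi)
```

The single-texturing result lands above the modelled memory limit, which is the observation that makes the simple 12-byte model look too pessimistic for this card.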



Long live ATI.
 

phsstpok

Splendid
This is getting more fun.

"2600-1653=947, 947+1653=2600. with the fractions and stuff it may have looked good, but you just canceled yourself out."
Did I go too fast? I'll simplify that example.

Suppose the GPU could pump out texels twice as fast as memory could handle it and the fillrates were 2000 MTexels/sec and 1000 MTexels/sec, respectively. The ratio would now be 1000/2000. One texture pass on the GPU would require 1000/2000ths (one half) of the time required by memory and would leave (2000-1000)/2000ths or another 1000/2000ths (exactly half). Time enough to complete a second pass. It's more obvious now that the GPU is twice as fast as the memory.

Look carefully at my other example. I switched the relative time from units of one memory pass to units of one GPU texturing pass. In memory-pass units it takes 1653/1653 (equals 1) units to complete a memory pass. However, the GPU can complete more than one pass in that amount of time. Using fractions, one GPU pass takes 1653/2600ths as long as one memory pass. This means the remaining time to complete the memory pass is 947/2600ths of the whole. This gives you a total of (1653+947)/2600ths, which equals exactly 1, as it should (one complete memory pass). However, this is relative to one memory pass.

To make it relative to one GPU pass you divide by the conversion ratio of 1653/2600. This gives you (1653+947)/2600 divided by 1653/2600, which equals (1653+947)/1653 (a fraction more than one GPU pass).
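
The unit conversion in this post is easier to check with exact fractions. A small Python sketch, using invented helper names and the two rate pairs from the thread (the simplified 2000 vs 1000 case and the R300 case of 2600 vs 1653):

```python
from fractions import Fraction


def gpu_passes_per_memory_pass(gpu_rate: int, mem_rate: int) -> Fraction:
    """How many GPU texturing passes fit in the time of one memory pass."""
    return Fraction(gpu_rate, mem_rate)


def spare_fraction_of_memory_pass(gpu_rate: int, mem_rate: int) -> Fraction:
    """Share of one memory pass still left after the GPU finishes one pass."""
    return 1 - Fraction(mem_rate, gpu_rate)


for gpu_rate, mem_rate in [(2000, 1000), (2600, 1653)]:
    passes = gpu_passes_per_memory_pass(gpu_rate, mem_rate)
    spare = spare_fraction_of_memory_pass(gpu_rate, mem_rate)
    print(f"{gpu_rate} vs {mem_rate}: {float(passes):.3f} GPU passes per memory pass, "
          f"spare time {spare} of a memory pass")
```

The sketch only verifies the bookkeeping; whether the GPU can really use that spare time for a second texture is the open question of the thread.
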
"The FX's (5800 ultra) theoretical max fillrate for single texturing (GPU) is 4 x 1 x 500 = 2000 mtexels, it achieves 1540.

its max theoretical multitexturing (gpu) is 4 x 2 x 500 = 4000 mtexels, it achieves 3483 (42.74s i think)

its memory bandwidth should allow for 500 x 16(128 bit) x 2(ddr) = 16000/12(bytes) = 1333 mtexels"
Your math is right but you forgot one thing about GFX. It uses Lightspeed Memory Architecture (I think that's the name). IIRC it has 4 independent memory controllers but each of them accesses memory 64 bits at a time (not the full 128-bit).

Now try to follow me (because I don't know where I'm going). Four memory controllers (full 128-bit width) would quadruple memory bandwidth (in theory). However, being half as wide, at 64-bit, you lose half that bandwidth. The bandwidth is reduced to only double the original. So now we get

500 x 16(128 bit) x 2(ddr) x 2 (for LMA3) = 32000/12(bytes) = 2666 mtexels (theoretical)

Real world ??? Don't know. Maybe LMA3 doesn't come close to the theoretical bandwidth.



Edited by phsstpok on 03/19/03 09:40 PM.
 

eden

Champion
Holy crap this is confusing. Guys I just can't keep up lol. I am more visual anyways, but thanks for the theory from you.

I think maybe Dave could give us info, since he works in the programming of graphics.

Also, why did you previously reply to your own self, and in a way as if someone else was talking to you?

 

phsstpok

Splendid
Are you talking to me or Phsstpok?

Shut-up! He ain't talking to you!! Get out of here!!

It's OK. I'm feeling much better now!





I replied to myself because I had too much to add and didn't want to just edit the original.

 

Willamette_sucks

Distinguished
ty phsstpok, i knew what you were getting at and you did clear it up, but your post before last mustve been a little rushed.
interesting point about the lma 3, i didnt take that into consideration, but obviously it directly affects memory bandwidth/fillrate.
i still think z-compression eliminates memory as a bottleneck for fillrate performance.
this could be tested by clocking down the memory and leaving the GPU at stock speeds.
ne1 w/a radeon 9700 pro want to test this?

 

phsstpok

Splendid
Well, if I made a mistake in my first post, I'm not going to fix it now that it's been explained.

As for LMA 3, I'm not even sure I have my facts right. It was LMA 2, that I was remembering, i.e. 4, 64-bit, memory controllers.

Don't know how Z-compression works, whether it just conserves video memory or if it significantly improves memory performance. Same with color compression, something else that nVidia was touting.

Your comment about testing fillrate vs memory speed is an interesting idea. We can all test this with any video card. Hold the GPU speed and try different memory speeds, then hold the memory speed and try different GPU speeds. See how the various combinations affect fillrate.

I don't see how this would test Z-compression as changing memory speed will affect general bandwidth not just Z-compression.
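
The test plan described here (hold one clock, sweep the other) is just a grid of clock pairs. A minimal Python sketch for generating it; the function name and the clock values are placeholders chosen for illustration, not settings recommended for any particular card.

```python
from itertools import product


def clock_test_matrix(core_clocks_mhz, memory_clocks_mhz):
    """Every (core, memory) combination, grouped by core clock."""
    return list(product(core_clocks_mhz, memory_clocks_mhz))


# Placeholder values; substitute whatever your card runs stably.
core_clocks = [250, 275, 300, 325]
memory_clocks = [500, 550, 600, 625]

for core, memory in clock_test_matrix(core_clocks, memory_clocks):
    # Set the clocks with your tweaking tool, run the fillrate test,
    # and note the single- and multi-texturing results next to this pair.
    print(f"{core}/{memory}")
```

Reading the results down a column (fixed memory, varying core) and across a row (fixed core, varying memory) shows which clock each fillrate test actually follows.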

 

Willamette_sucks

Distinguished
Yeah i know anyone could test it, i was just too lazy and thought some1 else could do it. plus so far this thread has been relating to the 9700 pro and FX specifically.

i refer back to your (phsstpok) post in which you gave this link "http://www.sincero.com/editorials/" and an excerpt from it for my assumption about the z-buffer.
i do know however, that z-compression does not only conserve memory but increases theoretical bandwidth, via real time compression.
i suggest someone try lowering their ram speed to exactly HALF of gpu speed (sdr) (to try and keep latencies to a minimum) and do the single and multitexturing fillrate tests on 3dmark 2001. I have no idea how they will turn out, but for the hell of it, ill say that they should stay ALMOST the same. the only reason they should score any lower, is because of increased latency times between the GPU and the memory.
the z-buffer is closely related to fillrate performance and therefore optimizations and compression techniques used in the z-buffer (as high as 24:1) should at least make this portion of the data unreliant on memory bandwidth.
but z-buffer data isnt the only data going through the memory bus, so this should give us some indication of what percentages (pure fillrate) of data for this specific test are stored/passed through the z-buffer, and which are not.



 

phsstpok

Splendid
Now that you have explained it, I'd like to see it too.

Do you have any links on how Z-compression (or any compression for that matter) can increase bandwidth? I'm having trouble picturing how this can work.

Looking back at your first post, I'm wondering what's going on with GeForce FX. Memory should not be slowing fillrates at all. 15xx MTexels/sec is a far cry from the theoretical 2000 MTexels/sec that NV30 can pump out at 500 MHz.

Remember when nVidia was claiming 48 GB/sec of memory bandwidth with LMA and compression methods? GeforceFX fillrates seem to be showing that real-world bandwidth is much lower.


 

cleeve

Illustrious
z-compression? I don't think it actually increases physical bandwidth; I think Z-compression increases "Practical" bandwidth.

Hmmm... how do I explain it... ok:

Let's say you have a data path with a bandwidth of 10 Megabytes per second.

And let's say you're using it to transfer a CD-quality .wav file. The wav file is 50 Megs.

SO... logically, it should take you 5 seconds to transfer that wav file through that data path. (50 Megabytes/10 Megabytes per second = 5 seconds)

OK... now let's say you run a compression algorithm on that wav file. Let's say you convert it to an MP3. The MP3 file is only 5 Megabytes, but sounds like the exact same song.

It takes only 1/2 of a second to transfer that MP3 file through that data path. (5 megabytes/10 Megabytes per second = 1/2 second)

SO... even though the physical constrictions of that data path are 10 Megabytes per second and haven't changed, you've used compression technology to deliver what sounds like the exact same data in a smaller stream.

You might even say that your data path is capable of transferring the equivalent of 100 Megabytes per second of uncompressed data... know what I mean? (Two MP3 files, 10 MB in total, sent in one second are equivalent to two wav files, 100 MB in total, sent in one second.)

This is a crude metaphor that oversimplifies a bit, but I think it relays the idea.
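
The metaphor can be put into numbers with a short Python sketch. The function names are invented for illustration, and the figures are the ones from this post (a 10 MB/sec path, a 50 MB wav file, a 5 MB MP3).

```python
def transfer_seconds(size_mb: float, link_mb_per_s: float) -> float:
    """Time to push a file of the given size through the data path."""
    return size_mb / link_mb_per_s


def effective_bandwidth(link_mb_per_s: float, compression_ratio: float) -> float:
    """Uncompressed-equivalent throughput when the data is sent compressed."""
    return link_mb_per_s * compression_ratio


link = 10.0              # MB/sec, the physical limit, which never changes
wav_mb, mp3_mb = 50.0, 5.0

print(transfer_seconds(wav_mb, link))                # uncompressed
print(transfer_seconds(mp3_mb, link))                # compressed
print(effective_bandwidth(link, wav_mb / mp3_mb))    # "practical" bandwidth
```

The physical bandwidth stays at 10 MB/sec in every line; only the amount of useful data represented by each transferred megabyte changes.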

- Cleeve
 

phsstpok

Splendid
Your example is pretty much the way I think about compression. It's pretty straight forward.

What I don't quite understand (not at all, really) is how Z-compression is used. This is Z-buffer compression, is it not? The Z-buffer is used to determine "hidden" pixels, isn't it? I just don't see how compressing the data improves "practical bandwidth". I definitely don't see how it would improve real bandwidth. Plus, I don't really understand how the Z-buffer works at all, so that is definitely part of my problem.


 

cleeve

Illustrious
Well, that I can help you with.

First off, in a video card without LMA or Hyper-Z, every object in a scene is rendered... even if it is behind another object.
i.e. if there is a car behind a building, the video card will render that car, then render the building on top of it.

Well, with LMA and Hyper-Z, the video card checks to see if something is visible before rendering it. This uses bandwidth more effectively.

I also believe LMA and Hyper-Z compress textures, but I'm not sure on that point.
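
The "check before rendering" idea can be shown with a toy Python model. This is only a sketch of a per-pixel depth test that skips hidden fragments; it is not a description of how LMA or Hyper-Z is actually implemented, and the function name and scene are made up. Note that in this toy version the building is drawn first, since a simple depth test can only reject what arrives after something closer.

```python
def count_shaded(fragments, use_depth_test: bool) -> int:
    """Count fragments that get fully rendered.

    fragments: (pixel_id, depth) pairs in draw order; smaller depth = closer.
    """
    nearest = {}   # closest depth seen so far for each pixel
    shaded = 0
    for pixel, depth in fragments:
        if use_depth_test and pixel in nearest and depth >= nearest[pixel]:
            continue  # hidden behind something already drawn, so skip the work
        shaded += 1
        nearest[pixel] = min(depth, nearest.get(pixel, depth))
    return shaded


building = [(pixel, 1.0) for pixel in range(4)]   # close, covers pixels 0..3
car = [(pixel, 5.0) for pixel in (1, 2)]          # far away, behind the building
scene = building + car

print(count_shaded(scene, use_depth_test=False))  # everything is rendered
print(count_shaded(scene, use_depth_test=True))   # hidden car pixels are skipped
```

Every skipped fragment is texturing and frame-buffer traffic that never happens, which is the sense in which bandwidth is used "more effectively".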

- Cleeve
 

phsstpok

Splendid
Sorry, I guess I didn't phrase my questions well.

I didn't mean,"what does the Z-Buffer do?" I meant how does a modern Z-buffer impact fillrate? Compressed data or not, something has to be checked to determine if a pixel is visible. The simple model, which we used for this discussion, assumes a simple memory read of a simple Z-buffer. How do R300 and NV30 implement this function and how do they actually improve upon it? Even compressed, some data has to be looked up. How does this improve performance?

 

Willamette_sucks

Distinguished
OK. I got un-lazy and here are a few benchmark results.
This is using a Ti4400 (41.09's) at various speeds.
I tried both 3D Mark 2001 (build 330) and 3D Mark 2003.
2003 had a more predictable and consistent pattern, and it seems like its a better (more accurate) fillrate tester.
I will include some 2001 results anyways.

3D Mark 2003 Fillrate Tests (ST + MT)
Clocks     ST     MT
250/500    543    1490
250/550    606    1554
250/600    636    1537
250/625    666    1578
275/500    544    1546
275/550    600    1619
275/600    637    1667
275/625    671    1701
300/500    541    1581
300/550    596    1699
300/600    648    1776
300/625    679    1824
325/500    539    1581
325/550    595    1737
325/600    644    1840
325/625    673    1882

In this test, both the GPU and RAM speed have approximately the same impact on the ST and MT performance. However, as both core and RAM speeds get higher, ST becomes limited by memory bandwidth more so than MT.
I only had time to test each setting 1 time, but i never restarted or changed any settings, and 3D Mark stayed open the whole time while i changed clock speeds via riva tuner, which was closed before each test was run.
Some apparent inconsistencies most likely can be partially attributed to gpu/memory latencies at different frequencies.
I wanted to see if latencies played a big role in performance of either ST or MT so...

Clocks     ST     MT
300/600    660    1798
300/601    648    1776

I ran the 3d mark 2003 fillrate test 5 times for each of these speeds, and took the average. Again, the computer was never restarted, etc.
This clearly shows the impact that gpu/memory latencies can have over performance (in this case fillrate).

3D Mark 2001 Fillrate (ST + MT) (less thorough/complete)
Clocks     ST     MT
250/500    733    1626
250/550    760    1580
250/625    800    1631
275/550    782    1772
300/500    774    1917
300/550    831    1908
300/600    863    1908
300/625    900    1915
325/500    779    1993
325/550    858    2050
325/625    904    2134

The overall pattern here shows that Single-Texturing relies more on memory bandwidth than Multi-Texturing does.

I think that if each test at each speed was run 10 times and the average was taken, with the computer under the exact same conditions during each trial, the results would show a more accurate pattern, but this pretty much dismisses my idea that memory bandwidth, apart from latencies, should not affect fillrate.
I think this shows that a much larger portion of the data used for the fillrate tests than I thought must not be z-data, because otherwise memory bandwidth should be less of an influence.

geeze, all that testing made me tired:)

 

phsstpok

Splendid
A few benchmarks! That must have taken you hours!!!

Er, thanks for all the effort.

You might find this interesting. I took the liberty of using your data and did some quick ratios: Single-Texture Fillrate/GPU clock, STF/Memory clock, Multi-Texture Fillrate/GPU clock, and MTF/Memory clock.

Here they are
CLOCKS      FILLRATES     RATIOS
GPU  MEM    ST   MT       ST/GPU  ST/MEM  MT/GPU  MT/MEM
250  500    543  1490     2.172   1.086   5.960   2.980
250  550    606  1554     2.424   1.102   6.216   2.825
250  600    636  1537     2.544   1.060   6.148   2.562
250  625    666  1578     2.664   1.066   6.312   2.525
275  500    544  1546     1.978   1.088   5.622   3.092
275  550    600  1619     2.182   1.091   5.887   2.944
275  600    637  1667     2.316   1.062   6.062   2.778
275  625    671  1701     2.440   1.074   6.185   2.722
300  500    541  1581     1.803   1.082   5.270   3.162
300  550    596  1699     1.987   1.084   5.663   3.089
300  600    648  1776     2.160   1.080   5.920   2.960
300  625    679  1824     2.263   1.086   6.080   2.918
325  500    539  1581     1.658   1.078   4.865   3.162
325  550    595  1737     1.831   1.082   5.345   3.158
325  600    644  1840     1.982   1.073   5.662   3.067
325  625    673  1882     2.071   1.077   5.791   3.011

Very strong correlation between Single-Texture Fillrate and memory speeds.
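
One way to put a number on that correlation is to look at how much each ratio varies across the 16 runs. A minimal Python sketch using the 3D Mark 2003 results posted above; `relative_spread` is a helper invented here, and a small spread simply means the fillrate tracks that clock closely.

```python
# (core MHz, memory MHz, single-texture MTexels/s, multi-texture MTexels/s)
RESULTS = [
    (250, 500, 543, 1490), (250, 550, 606, 1554), (250, 600, 636, 1537), (250, 625, 666, 1578),
    (275, 500, 544, 1546), (275, 550, 600, 1619), (275, 600, 637, 1667), (275, 625, 671, 1701),
    (300, 500, 541, 1581), (300, 550, 596, 1699), (300, 600, 648, 1776), (300, 625, 679, 1824),
    (325, 500, 539, 1581), (325, 550, 595, 1737), (325, 600, 644, 1840), (325, 625, 673, 1882),
]


def relative_spread(values):
    """(max - min) / min: how much a ratio varies across all runs."""
    values = list(values)
    return (max(values) - min(values)) / min(values)


st_per_gpu = [st / gpu for gpu, mem, st, mt in RESULTS]
st_per_mem = [st / mem for gpu, mem, st, mt in RESULTS]
mt_per_gpu = [mt / gpu for gpu, mem, st, mt in RESULTS]
mt_per_mem = [mt / mem for gpu, mem, st, mt in RESULTS]

for name, series in [("ST/GPU", st_per_gpu), ("ST/MEM", st_per_mem),
                     ("MT/GPU", mt_per_gpu), ("MT/MEM", mt_per_mem)]:
    print(f"{name}: spread {relative_spread(series):.1%}")
```

ST/MEM barely moves while ST/GPU swings widely, so on this card single-texture fillrate follows the memory clock; the multitexture ratios sit in between, which fits a result that depends on both clocks.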

 

Willamette_sucks

Distinguished
yep, thats the trend i was seeing.
ty for that nice graph.
it did take a little time:) but i wouldnt have done it unless I was interested in seeing the results, which i was, and then i simply passed them on to you.
