Tri-cubic interpolation speed on video card?

Archived from groups: alt.comp.periphs.videocards.ati, alt.comp.periphs.videocards.nvidia, alt.lang.asm, comp.soft-sys.matlab, hr.comp.hardver

I'm wondering what number of tri-cubic (not trilinear) interpolations
per second can be achieved on a video card. Does anybody have actual,
test-proven numbers rather than estimates? The only number I could find
on the Internet is the tri-cubic speed on a GeForce3, which seems very
slow (1,800,000 tri-cubics per second); see:
< http://wwwvis.informatik.uni-stuttgart.de/vmv01/dl/papers/8.pdf >
Even on a general-purpose CPU the speed is around 7,000,000 tri-cubics
per second (P4 3.8 GHz).

Task description:
- Given a 512x512x2048 (12-bit) volume
- An oblique/arbitrarily oriented 1024x1024 (12-bit) plane crosses this
volume
- Each pixel of the plane has to be filled with a tri-cubic interpolated
value (see the sketch below)
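
For scale: each tri-cubic sample reads a 4x4x4 = 64-voxel neighborhood,
and the plane has about a million pixels, so the GeForce3 figure above
works out to under two such planes per second, and the P4 figure to
roughly seven. A rough C++ sketch of one sample, assuming Catmull-Rom
weights (the Volume struct and all names are mine, purely for
illustration):

#include <cstdint>
#include <cstddef>

struct Volume {
    const uint16_t* data;       // 512 x 512 x 2048 voxels, 12 bits used
    size_t nx, ny, nz;
    uint16_t voxel(size_t x, size_t y, size_t z) const {
        return data[(z * ny + y) * nx + x];
    }
};

// Catmull-Rom weights for a fractional offset t in [0,1).
static void cubicWeights(float t, float w[4]) {
    float t2 = t * t, t3 = t2 * t;
    w[0] = 0.5f * (-t3 + 2.0f*t2 - t);
    w[1] = 0.5f * ( 3.0f*t3 - 5.0f*t2 + 2.0f);
    w[2] = 0.5f * (-3.0f*t3 + 4.0f*t2 + t);
    w[3] = 0.5f * ( t3 - t2);
}

// One tri-cubic sample: 64 voxel fetches and a 4x4x4 weighted sum.
// NOTE: no border handling; assumes 1 <= x,y,z < size - 2.
float tricubic(const Volume& v, float x, float y, float z) {
    size_t ix = (size_t)x, iy = (size_t)y, iz = (size_t)z;
    float wx[4], wy[4], wz[4];
    cubicWeights(x - ix, wx);
    cubicWeights(y - iy, wy);
    cubicWeights(z - iz, wz);
    float sum = 0.0f;
    for (int k = 0; k < 4; ++k)
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < 4; ++i)
                sum += wz[k] * wy[j] * wx[i] *
                       v.voxel(ix + i - 1, iy + j - 1, iz + k - 1);
    return sum;
}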

Thanks,
George
 
-----

buyanovsky@attbi.com wrote:
> I'm wondering what number of tri-cubic (not trilinear) interpolations
> per second can be achieved on a video card. Does anybody have actual,
> test-proven numbers rather than estimates?
> [snip]

You can achieve a 25-fold efficiency increase by reversing the polarity
on the influx capacitors and cross-coupling the warp-field signatures.
;-)

Nathan.
 
-----

Evenbit wrote:

> You can achieve a 25-fold efficiency increase by reversing the polarity
> on the influx capacitors and cross-coupling the warp-field signatures.
> ;-)

Nope. No go.

In that case, you'll also need a transphasic inductor with a tricobalt
injector, to simulate a secondary phase-field capacitor, which results
in a three-times-slower reaction during tri-cubic interpolation.

--
Apples rule. If it weren't for a conspiracy on the part of fruit
manufacturers we'd all have apples.
 
-----

"Kristijan Korazija" <ime.prezime@globalnet.hr> wrote in message
news:w7468ipn13xp.dlg@trashcan.hr...
> In that case, you'll also need a transphasic inductor with a tricobalt
> injector, to simulate a secondary phase-field capacitor, which results
> in a three-times-slower reaction during tri-cubic interpolation.

Don't forget to brew a Very Strong, Very Hot cup of tea for the Infinite
Improbability Generator.

Vaughn L. Porter
 
-----

Herman Dullink wrote:
> [snip]

Thanks for the prompt reply.

> Are you sure you want to do this on a video card?

No, I'm not sure, and that is why I'm looking for the opinion of
people who have expertise in programming video cards to do custom
jobs.

> That's a lot of data. And video processors don't often support
> 12-bit... Using 16-bit, this is more than 1GB.

It is exactly 1 GB = 2^(9+9+11+1) bytes. It is very affordable today to
have 4 GB (~3.2 GB usable under Windows XP with /3GB), so memory size
is not the problem. The bottleneck is memory latency. Just today I
finished accurate benchmarking of brute-force tri-cubic MIP performance
on a dual Xeon 3.6 GHz/800 MHz FSB: 19 million tri-cubic MIP samples
per second (SSE2, 4 threads) with coherent memory access, and only half
that speed for the slowest oblique MIP.

The numbers I've seen for tri-linear performance on the NVIDIA 6800
make me wonder whether there is a way to harness this power.

> Medical data?

Yes; today the typical CT dataset size is 512x512x(300...1000), and
taking into account the new 64-slice CT scanners, near-isotropic
volumes of 3000-4000 slices are going to be pretty common in 2-3
years.

Thanks,
George
 
-----

Ben Pope wrote:
> The problem here is that they are designed to do *linear* operations.
> In moving to cubic from linear, you are increasing your workload
> considerably.
> [snip]
> Perhaps the best thing to do here is to take a look at what operations
> can be performed in a DX9 pixel shader and decompose a cubic from there.

On a P4 it takes a consecutive block of 49 SSE2 instructions to compute
one tri-cubic result. If the memory reads (voxel fetches) are replaced
with fictitious register contents, the speed goes up dramatically, from
19 million to 55 million samples per second. Note: I switched off only
the voxel reads in the tri-cubic part; it still writes the result, and
the Maximum Intensity blending still reads/writes memory. So the main
bottleneck is memory latency, not computation. Even if we assume the
tri-cubic computations take zero time, the speed for this specific task
cannot go higher than 36 million per second. I guess the same problem
holds for video cards, but maybe a video card has a more efficient
memory organization for these kinds of tasks. Anyway, before I invest
time in the video-card approach I would like to gather as much
information as possible.
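
Roughly, the substitution experiment looks like this in C++ (a sketch
with invented names and a toy core; the real core is the 49-instruction
SSE2 block):

#include <cstdint>

// Toy stand-in for the tri-cubic core: blend 64 samples with weights.
static float core(const uint16_t s[64], const float w[64]) {
    float r = 0.0f;
    for (int i = 0; i < 64; ++i) r += w[i] * s[i];
    return r;
}

// Variant A: real voxel fetches (the memory-bound path).
float sampleReal(const uint16_t* volume, const int idx[64],
                 const float w[64]) {
    uint16_t s[64];
    for (int i = 0; i < 64; ++i) s[i] = volume[idx[i]]; // 64 memory reads
    return core(s, w);
}

// Variant B: "fictitious register contents" - same arithmetic, no voxel
// reads. Timing A vs. B separates memory latency from computation.
float sampleFake(const float w[64]) {
    static const uint16_t s[64] = { 1000 };             // constant stand-in
    return core(s, w);
}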

Thanks, everybody, for the useful replies.
-------
George
 
-----

> I'm wondering what number of tri-cubic (not trilinear) interpolations
> per second can be achieved on a video card?
Are you sure you want to do this on a video card?
True, it can do specific tasks very quickly, but only very specific tasks...
Also, it's optimised to output data to the screen, not back to the CPU
and/or system memory.

> Task description:
> - Given 512x512x2048 (12-bit) volume
That's a lot of data. And video processors don't often support 12-bit...
Using 16-bit, this is more than 1GB.

With currently available mainstream hardware, I don't think it will be
worth the effort. I think that a smart addressing system, where you map
the cube in system memory (but don't load it unless necessary), will be
more effective (unless, of course, you want to do this many times, or
in real time).

Medical data?

H
 
-----

In hr.comp.hardver, Kristijan Korazija <ime.prezime@globalnet.hr> writes:
>> You can achieve a 25-fold efficiency increase by reversing the polarity
>> on the influx capacitors and cross-coupling the warp-field signatures.
>> ;-)

> Nope. No go.

> In that case, you'll also need a transphasic inductor with a tricobalt
> injector, to simulate a secondary phase-field capacitor, which results
> in a three-times-slower reaction during tri-cubic interpolation.

You forgot again... Compensate!!! Then you can achieve peak performance of
the warp core during tri-cubic interpolation...

--
On a bicycle, for every New Year, a bald man paints the balcony.
By runf

Damir Lukic, calypso@_MAKNIOVO_fly.srk.fer.hr
a member of hr.comp.hardver FAQ-team
 
-----

>> That's a lot of data.
> It is exactly 1 GB = 2^(9+9+11+1) bytes. It is very affordable today
Yes, for system memory. But I haven't seen many consumer-class graphics
adapters yet with at least this much memory. A graphics adapter also
needs some on-screen memory for the GUI, and some off-screen buffers
for the result view(s).

>> Are you sure you want to do this on a video card?
> No, I'm not sure, and that is why I'm looking for the opinion of
> people who have expertise in programming video cards to do custom
> jobs.
I have some expertise, but not in 3D (yet).
So I can't really help you with the implementation details of modern 3D
GPUs, but I know a bit about busses, the data channels in a system. The
main problem with most architectures is that they perform best when you
'push' the data through a channel (e.g. from CPU to graphics adapter).
Pulling data is very bad for performance: the CPU has to wait many
cycles for one fetch to complete. A cache helps (and only helps) if
certain data is fetched multiple times, and as long as no more data is
used than the cache size (i.e. it's very effective with 'looping'
algorithms).
DMA techniques are used for better performance; a device is then
programmed to push the data through a channel without further CPU
intervention.
If you use a large sequential stream of data, prefetching can be used.
You probably know about that; MMX/SSE/3DNow! have some prefetch
instructions.
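
For example, a minimal C++ sketch using the SSE prefetch intrinsic (the
loop and names here are just an illustration, not your kernel):

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstdint>
#include <cstddef>

// Stream through a slab of 16-bit voxels, hinting the cache a few
// hundred bytes ahead so the loads overlap with computation.
uint32_t sumSlab(const uint16_t* voxels, size_t count) {
    uint32_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        // Prefetch ~8 cache lines (512 bytes) ahead of the current read.
        if (i + 256 < count)
            _mm_prefetch((const char*)(voxels + i + 256), _MM_HINT_T0);
        sum += voxels[i];
    }
    return sum;
}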

Maybe somebody else can give you some info about 3D specifics. You might
even contact the manufacturers of GPUs. Theoretically, the newest generation
of 3D GPUs should be able to address more than a GB of data. These GPUs are
programmable, so it should be possible to implement the whole algorithm in
GPU code.
All that's needed is (somebody with) the right (programming) information...
Because of the competition between these manufacturers, it'll be very
hard to get detailed info. Maybe someone working there can see the
challenge :)

Another approach is to look at the implementation of your algorithm.
Rewrite (parts of) it so that memory cycles and caches are used
optimally. You might e.g. split the volume up into smaller subvolumes,
and/or use tile-based rendering, i.e. split the screen up into smaller
rectangular (or square) tiles, so that the chance of the data still
being in cache is higher; see the sketch below.
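
A minimal C++ sketch of that tiling idea (kTile, renderPixel and the
rest are hypothetical names, not from any real code):

// Walk the 1024x1024 output plane in 64x64 tiles so the voxels touched
// by one tile have a better chance of still being in cache.
const int kPlaneSize = 1024;
const int kTile = 64;

// Stub: in real code this would do one tri-cubic sample and a store.
void renderPixel(int x, int y) { (void)x; (void)y; }

void renderPlaneTiled() {
    for (int ty = 0; ty < kPlaneSize; ty += kTile)
        for (int tx = 0; tx < kPlaneSize; tx += kTile)
            for (int y = ty; y < ty + kTile; ++y)
                for (int x = tx; x < tx + kTile; ++x)
                    renderPixel(x, y);
}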


Herman
 
-----

buyanovsky@attbi.com wrote:
> The numbers I've seen for tri-linear performance on the NVIDIA 6800
> make me wonder whether there is a way to harness this power.

The problem here is that they are designed to do *linear* operations.
In moving to cubic from linear, you are increasing your workload
considerably.

With linear you have f(x) = ax + b
With cubic you have f(x) = ax^3 + bx^2 + cx + d

Now extend them to 3 dimensions and see how much trickier it gets.

Off the top of my head, I can't think of a way to make use of trilinear
operations that can count towards your tricubic result.

Perhaps the best thing to do here is to take a look at what operations
can be performed in a DX9 pixel shader and decompose a cubic from there.
You can safely ignore the other 2 dimensions to start with, as the
dimensions are linearly separable; see the sketch below.
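
To make the separability concrete, here is a rough sketch, in C++
rather than shader code and with made-up names, of a tri-cubic built
from 1-D cubics: 16 along x, then 4 along y, then 1 along z, 21 in
total:

// One 1-D cubic: blend 4 samples with precomputed weights w[0..3].
static inline float cubic1d(const float s[4], const float w[4]) {
    return s[0]*w[0] + s[1]*w[1] + s[2]*w[2] + s[3]*w[3];
}

// Separable tri-cubic over a 4x4x4 neighborhood n[z][y][x].
float tricubicSeparable(const float n[4][4][4], const float wx[4],
                        const float wy[4], const float wz[4]) {
    float alongX[4][4];                  // 16 cubics along x
    for (int z = 0; z < 4; ++z)
        for (int y = 0; y < 4; ++y)
            alongX[z][y] = cubic1d(n[z][y], wx);
    float alongY[4];                     // 4 cubics along y
    for (int z = 0; z < 4; ++z)
        alongY[z] = cubic1d(alongX[z], wy);
    return cubic1d(alongY, wz);          // 1 cubic along z
}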

The problem of getting the data from the card to the CPU is one that
also needs considering, but I suspect that PCI Express will help
considerably as it is faster than AGP.

Sounds like an interesting problem.

Ben
--
A7N8X FAQ: www.ben.pope.name/a7n8x_faq.html
Questions by email will likely be ignored, please use the newsgroups.
I'm not just a number. To many, I'm known as a String...
 

-----
FRiK:

"Herman Dullink" <hd7@hetnet.nl> wrote in message
news:d2s7bm$og1$1@reader13.wxs.nl...
>>> That's a lot of data.
>> It is exactly 1GB = (2^(9+9+11+1)) bytes. It is very affordable today
> Yes, for system memory. But I haven't seen many consumer class graphics
> adapters yetvwith at least this size of memory. A graphics adapter also
> needs some on-screen memory for the GUI, and some off-screen buffers for
> the
> result view(s).
>
>>> Are you sure you want to do this on a video card?
>> No, I'm not sure, and this is the reason I'm looking for the
>> opinion of people who has an expertise in programming of video card to
>> do a custom job.
> I have some expertise, but not in 3D (yet).
> So I can't really help you with the implementation details using modern 3D
> GPUs, but I know a bit about busses, the data channels in a system. The
> main
> problem with most architectures is that it performs best when you 'push'
> the
> data through a channel (e.g from CPU to graphics adapter). Pulling data is
> very bad for performance, the CPU has to wait many cycles for one fetch to
> complete. A cache helps (and only helps) if certain data is fetched
> multiple
> times, and as long no more data is used than the cache size (ie. it's very
> effective with 'looping' algorithms).
> DMA techniques are used for some better performace; a device is then
> programmed to push the data through a channel without further CPU
> intervention.
> If you use a large sequential stream of data, prefetching can be used. You
> probably know about that, MMX/SSE/3Dnow have some prefetch instructions.
>
> Maybe somebody else can give you some info about 3D specifics. You might
> even contact the manufacturers of GPUs. Theoretically, the newest
> generation
> of 3D GPUs should be able to address more than a GB of data. These GPUs
> are
> programmable, so it should be possible to implement the whole algorithm in
> GPU code.
> All that's needed is (somebody with) the right (programming)
> information...
> Because of competition of these manufacturers, it'll very hard to get
> detailed info. Maybe someone working there can see the challenge :)
>
> Another approach is to look at the implementation of your algorithm.
> Rewrite
> (parts of) it so that memory cycles and caches are used optimal.
> You might e.g. split up the volume in to smaller subvolumes, and/or use
> tile-based rendering, ie split the screen up in smaller rectangular (or
> square) parts, so that chances of data still in cache is higher.

There is one big problem with most GPUs, and that is CRC, or rather the
lack of it. AGP doesn't have hardware CRC, so in order to keep the data
safe, drivers have huge tables (a big part of the 20+ MB you have to
download every time a new driver is out) that check the response to
every command given to the GPU. That's why it's very hard to write
programs that use the GPU's huge power. I tried adding two numbers
using the ATI SDK and I can tell you it's hard work. However, PCI
Express (I'm 99% positive of this) has hardware CRC, so it may be
easier to write a program for a PCI Express GPU. The problem Herman
mentioned (getting the data from the GPU's memory) is not as big with
PCI Express, since upload and download are pretty much the same speed.
Then there are TurboCache models that use system memory and only have
16 MB of memory onboard. Those models are not as powerful as the top
models, but they show that there might be a way to use system memory
for GPU data.

Hope this information helps you with your project.
Greetz
 
-----

>
> You forgot again... Compensate!!! Then you can achieve peak performance of
> the warp core during tri-cubic interpolation...

Why would you need compensation?... or interpolation???...
You just need to plug in your favorite pickup... and let it rip!!...
DISKOOO DISKOOO MILE LOVES DISKOOOOO...
 
-----

Blento wrote:

> Why would you need compensation?... or interpolation???...
> You just need to plug in your favorite pickup... and let it rip!!...
> DISKOOO DISKOOO MILE LOVES DISKOOOOO...

ROTFL :o)))
 
-----

buyanovsky@attbi.com wrote:
> [snip]
> So the main bottleneck is memory latency, not computation. Even if we
> assume the tri-cubic computations take zero time, the speed for this
> specific task cannot go higher than 36 million per second.

OK, so if the computations are effectively starved by limited memory
bandwidth, you have two options:

1. Read from memory less (I'm not sure how clever your data structure is)

2. Use faster memory.

The first technique applies to both a CPU and a GPU.

The second would be helped considerably by the graphics card, if the
access patterns are suitable for GDDR3.

I'm not an expert on these new memories or on what types of memory
addressing GPUs have, but I suspect you have something similar to a
CPU, just wider and faster. So I'm guessing your biggest concern is to
keep your access patterns to something that will at least fit in cache
for the duration of a computation "unit".
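
One standard way to get that locality with volume data is a bricked
layout; this is an illustrative aside of mine, not something from this
thread, and all the names are made up:

#include <cstdint>
#include <cstddef>

// Bricked volume: voxel (x,y,z) lives in brick (x/B, y/B, z/B), and
// the voxels of one brick are contiguous, so a 4x4x4 tri-cubic
// neighborhood usually stays inside a single cache-friendly brick.
const size_t B = 32;             // brick edge: 32^3 * 2 bytes = 64 KB

struct BrickedVolume {
    const uint16_t* data;
    size_t bx, by, bz;           // volume size measured in bricks
    uint16_t voxel(size_t x, size_t y, size_t z) const {
        size_t brick = ((z / B) * by + y / B) * bx + x / B;
        size_t inner = ((z % B) * B + y % B) * B + x % B;
        return data[brick * B * B * B + inner];
    }
};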

Ben
--
A7N8X FAQ: www.ben.pope.name/a7n8x_faq.html
Questions by email will likely be ignored, please use the newsgroups.
I'm not just a number. To many, I'm known as a String...
 
-----

> The problem Herman mentioned (getting the info from the GPU's memory)
> is not as big with PCI Express since upload and download are pretty
> much the same speed.
Yes and no, the communication on the bus can be at the same speed. But
fetching the data from the adapter's memory by the CPU will add many delays
to the overall system.
When writing, the CPU writes the data to a write buffer (or cache), so it's
like fire and forget. But when reading, the CPU has to wait until the data
has found its way back to the CPU via all bridges and other controllers. The
adapter's memory is not cached, so timing-wise we're looking at the
worst-case scenario. The CPU may write GB/s, but will only be able to
read MB/s when there's no caching.
If the adapter has a function (DMA) to write the data directly into system
memory, that would be a great improvement.

I wonder if that's needed in this case, though. If it's used for
real-time display, all data can stay in the adapter's memory; only when
a snapshot is required does data have to be copied to system memory.

H
 
-----

> So the main bottleneck is memory latency (not computation).
A couple of things crossed my mind:

- Latency: the Xeon architecture uses a shared bus between CPUs.
Therefore, memory access is shared too, and there's a memory controller
(north bridge) between the CPUs and the memory. This will increase
latency a bit. Try running this on an AMD64 platform; it should give
better results if latency is indeed the main bottleneck.

- Threads: because of the shared bus, multiple threads doing the same
task won't do much good... they'll block each other on memory access.

- Cache size: the AMD64 and Intel Xeons have on-chip caches between
512 KB and 2 MB. Try splitting the work up into smaller rectangular
'tiles', where processing each tile requires accessing less data than
the cache size, e.g. 64×64 or 256×256 pixels. Cache latency is very
low :)

H
 

-----
FRiK:

"Herman Dullink" <hd7@hetnet.nl> wrote in message
news:d2ufpc$9bq$1@reader13.wxs.nl...
>> The problem Herman mentioned (getting the info from the GPU's memory) is
>> not as big with PCI Express since upload and download are pretty much
>> same speed.
> Yes and no, the communication on the bus can be at the same speed. But
> fetching the data from the adapter's memory by the CPU will add many
> delays to the overall system.
> When writing, the CPU writes the data to a write buffer (or cache), so
> it's like fire and forget. But when reading, the CPU has to wait until the
> data has found its way back to the CPU via all bridges and other
> controllers. The adapter's memory is not cached, so we're looking at worst
> case scenario timing wise. The CPU may write GB/s, but will only be able
> to read MB/s when there's no caching.
> If the adapter has a function (DMA) to write the data directly into system
> memory, that would be a great improvement.
>
> I wonder if that's needed in this case though. If it's used for real-time
> displaying, then all data can stay in the adapter's memory. Only when a
> snapshot is required, data has to be copied to system memory.

You are correct. I checked some of my test data, and PCI Express GPUs
are twice as fast at downloading as the fastest AGP GPUs, but
downloading data from the GPU's memory to system memory is somewhere
between 15 and 20 times slower (250-380 MB/s) than uploading. However,
there must be a DMA function available, since nVidia 6200 TurboCache
GPUs use system memory, and their performance is not much slower
(certainly not as much slower as you would expect if the memory write
speed were 250-380 MB/s) compared to the plain 6200 model that has
onboard memory.