ATI GPUs: a major step forward

RichPLS

Folding@Home on ATI GPUs: a major step forward

INTRODUCTION
Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. FAH has targeted the study of protein folding and protein folding diseases, and numerous scientific advances have come from the project.

Now in 2006, we are looking forward to another major advance in capabilities. This advance utilizes the new, high performance Graphics Processing Units (GPUs) from ATI to achieve performance previously only possible on supercomputers. With this new technology (as well as the new Cell processor in Sony’s PlayStation 3), we will soon be able to attain performance on the 100 gigaflop scale per computer. With this new software and hardware, we will be able to push Folding@Home a major step forward.

Our goal is to apply this new technology to dramatically advance the capabilities of Folding@Home, applying our simulations to the further study of protein folding and related diseases, including Alzheimer's Disease, Huntington's Disease, and certain forms of cancer. With these computational advances, coupled with new simulation methodologies to harness the new techniques, we will be able to address questions previously considered impossible to tackle computationally, and make even greater impacts on our knowledge of folding and folding-related diseases.

A BRIEF HISTORY OF FAH: FROM TINKER TO GROMACS TO GPUs
Folding@home debuts with the Tinker core (October 2000)
In October 2000, Folding@home was officially released. The main software core engine was the Tinker molecular dynamics (MD) code. Tinker was chosen as the first scientific core due to its versatility and well-laid-out software design. In particular, Tinker was the only code to support a wide variety of MD force fields and solvent models. With the Tinker core, we were able to make several advances, including the first folding of a small protein starting purely from sequence (subsequently published in Nature).

A major step forward: the Gromacs core (May 2003)
After many months of testing, Folding@home officially rolled out a new core based on the Gromacs MD code in May 2003. Gromacs is the fastest MD code available, and likely one of the most optimized scientific codes in the world. By using hand-tuned assembly code and exploiting hardware present in most PCs and Intel-based Macs (the SSE instructions), Gromacs is roughly 10x faster than most MD codes, and approximately 20x to 30x faster than Tinker (which was written for flexibility and functionality, but not for speed).
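To give a rough sense of what that hand tuning involves (a minimal sketch, not Gromacs's actual source; the function name and data layout are made up for illustration), SSE lets a single instruction operate on four single-precision floats at once. The C fragment below uses the standard SSE intrinsics to compute 1/r for four particle pairs in one pass, the kind of operation that dominates an MD inner loop:

    #include <xmmintrin.h>  /* SSE intrinsics, available on SSE-capable PCs and Intel Macs */

    /* Illustrative only: compute 1/r for four particle pairs at once.
       dx, dy, dz hold the x, y, z separations of four pairs; rinv receives 1/r. */
    static void four_pair_rinv(const float dx[4], const float dy[4],
                               const float dz[4], float rinv[4])
    {
        __m128 x = _mm_loadu_ps(dx);
        __m128 y = _mm_loadu_ps(dy);
        __m128 z = _mm_loadu_ps(dz);

        /* r^2 = dx^2 + dy^2 + dz^2 for all four pairs in parallel */
        __m128 r2 = _mm_add_ps(_mm_mul_ps(x, x),
                    _mm_add_ps(_mm_mul_ps(y, y), _mm_mul_ps(z, z)));

        /* fast approximate reciprocal square root, the workhorse of nonbonded inner loops */
        _mm_storeu_ps(rinv, _mm_rsqrt_ps(r2));
    }

A plain scalar loop would handle each of these operations one pair at a time; doing four at once, with a cheap hardware reciprocal square root, is where much of this kind of speedup comes from.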

However, while Gromacs is faster than Tinker, it has limits to what it can do; for example, it does not support many implicit solvent models, which play a key role in our folding simulations with Tinker. Thus, while Gromacs significantly sped up certain calculations, it was not a replacement for Tinker, and so the Tinker core will continue to play an important role in Folding@Home (including a recent paper in Science). For these reasons, points for Gromacs WUs were set to be consistent with points for Tinker WUs, as both play an important role in the science of FAH. Moreover, we switched the benchmark machine to a 2.8 GHz Pentium 4 (from a 500 MHz Celeron) in order to allow us to fairly benchmark these types of WUs (as the benchmark machine needed to have hardware support for SSE).
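For readers unfamiliar with implicit solvent: rather than simulating every water molecule explicitly, these models replace the solvent with a smooth analytic energy term. A common textbook example (not necessarily the exact variant the Tinker core uses) is the Generalized Born approximation of Still et al.:

    \Delta G_{\mathrm{pol}} \approx -\frac{1}{2}\left(1 - \frac{1}{\varepsilon}\right)\sum_{i,j}\frac{q_i q_j}{f_{GB}},
    \qquad f_{GB} = \sqrt{r_{ij}^2 + R_i R_j \, e^{-r_{ij}^2/(4 R_i R_j)}}

where the q_i are atomic partial charges, the R_i are effective Born radii, and \varepsilon is the solvent dielectric constant. Support for terms like this is part of what kept Tinker (and not Gromacs, at the time) essential for certain folding simulations.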

The next major step forward: Streaming Processor cores (September 2006)
Much like the Gromacs core greatly enhanced Folding@home by a 20x to 30x speed increase via a new utilization of hardware (SSE) in PCs, in 2006, Folding@home has developed a new streaming processor core to utilize another new generation of hardware: GPUs with programmable floating-point capability. By writing highly optimized, hand-tuned code to run on ATI X1900 class GPUs, the science of Folding@home will see another 20x to 30x speed increase over its previous software (Gromacs) for certain applications. This great speed increase is achieved by running essentially the complete molecular dynamics calculation on the GPU; while this is a challenging software development task, it appears to be the way to achieve the highest speed improvement on GPUs.
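To see why MD maps well onto a streaming processor, consider the dominant cost: the pairwise nonbonded force sum. The force on each particle can be accumulated independently of every other particle's force, which is exactly the per-pixel data parallelism a pixel shader provides. The plain C reference sketch below (illustrative only, not Folding@home's GPU code; the function name and the simplified Lennard-Jones-style force are made up) shows the structure; on the GPU, the outer loop disappears and each shader invocation handles one particle i, reading particle data from textures:

    /* Reference sketch of the O(N^2) nonbonded force accumulation.
       Names and the simplified Lennard-Jones-style force (parameters omitted)
       are illustrative, not the actual Folding@home kernel. */
    void accumulate_forces(int n, const float x[], const float y[], const float z[],
                           float fx[], float fy[], float fz[])
    {
        for (int i = 0; i < n; ++i) {      /* on the GPU: one pixel-shader invocation per i */
            float fxi = 0.0f, fyi = 0.0f, fzi = 0.0f;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                float dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
                float r2 = dx * dx + dy * dy + dz * dz;
                float inv_r2 = 1.0f / r2;
                float inv_r6 = inv_r2 * inv_r2 * inv_r2;
                /* scalar force divided by r, Lennard-Jones-like */
                float f = (12.0f * inv_r6 * inv_r6 - 6.0f * inv_r6) * inv_r2;
                fxi += f * dx;
                fyi += f * dy;
                fzi += f * dz;
            }
            fx[i] = fxi;
            fy[i] = fyi;
            fz[i] = fzi;
        }
    }

With many pixel shaders each working on a different particle, the whole calculation can stay on the card, which is why a part like the R580 with its 48 pixel shaders is the launch target.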

In addition, through a collaboration with the Pande Group, Sony has developed an analogous core for the PS3's Cell processor (another streaming processor), which should also see a significant speed increase for the science over the types of calculations we could previously do with the x86/SSE Gromacs core. Following what we did with the introduction of Gromacs, we will now switch benchmark machines and include an ATI X1900XT GPU in order to be able to benchmark streaming WUs (which cannot be run on non-GPU machines). This machine will also benchmark CPU units (which continue to be of value, since GPUs work only for certain simulations) without using its GPU.

FREQUENTLY ASKED QUESTIONS
GPU and OS support

Will X1800 cards be supported in the new client as well? What about any other ATI models (i.e. X1600 cards/RV530)?
At first, we will launch with support for X1900 cards only. X1800 cards do not provide the performance needed. These cards are actually quite different -- they have different processors (R520, RV530 vs. the R580 [in the X1900 series]). The R580 makes a huge difference in performance -- its 48 pixel shaders are key, as we use pixel shaders for our computations. However, we are working to get reasonable performance from the X1800 cards (1/2 to 1/3 of the X1900), and we will likely support them soon (hopefully 1 month after the initial beta test roll out).

What about video cards with other (non-ATI) chipsets?
The R580 (in the X1900XT, etc.) performs particularly well for molecular dynamics, due to its 48 pixel shaders. Currently, other cards (such as those from nVidia, and other ATI cards) do not perform well enough for our calculations, as they have fewer pixel shaders. Also, nVidia cards in general have some technical limitations beyond the number of pixel shaders which make them perform poorly in our calculations.


Is the GPU client for Windows XP only, or has it been tested on other OSes like Linux, Mac, and Vista?
We will launch with Windows XP (32 bit only) support only due to driver and compiler support issues. In time, we hope to support Linux as well. Macintosh OSX support is much further out, as the compilers and drivers we need are not supported in OSX, and thus we cannot port our code until that has been resolved.

Are there any plans to enable the client to take advantage of dual-GPU systems like CrossFire, or even 3-slot systems that can support three GPUs?
We will not support this at launch, but we are aggressively working to support multi-GPU systems.

http://folding.stanford.edu/FAQ-ATI.html

I am not sure if this has been posted in this section, but it is interesting at the very least that ATI's latest GPUs have more pixel shaders and their design enables utilization for more than just graphics rendering, while nVidia's GPUs in general have some technical limitations extending beyond the number of pixel shaders which make them perform poorly for this use...
 

ara

baaah
They should try to get this working on other graphics cards regardless. I mean, my GPU is still more powerful than my CPU in the processing type they need, and it's not an especially powerful GPU (X700...). I think they could easily do this and they would probably get a major increase in their output.

Does this thing utilize both the CPU and GPU, or just the GPU?

Ara
 

casewhite

Good find, since this, not quad core, is the new cutting edge. The most interesting thing about this is that it renders quad core and Kentsfield obsolete. No one has written any of the special code in an application for quad core, whereas the code which enables this can be made part of the driver for something like CrossFire: http://www.ati.com/technology/crossfire/physics/index.html

One card for graphics and one for GPU compute. It allows someone with an X1800XT or higher to salvage that card when DirectX 10 comes out. The only limitation until HTT 3.0 comes out is that you are limited to the memory installed on the card; HTT 3.0 will allow the GPU to access the entire memory on the motherboard. The configuration that seems to benefit the most is the Socket 939/940 line with high-quality DDR, since latency and timings keep DDR2 from showing the gains that this does with DDR.

Lawrence Livermore has upgraded Gauss using accelerators so that it will be able to display the full 1 petaflop of Blue Gene/L. Remove one Opteron 252 and insert an IBM Cell chip and you are instantly five times faster. The Cell upgrade does not work with Intel without some major hardware changes, since the Cell chip needs access to system memory; that is why a 252 with two HTT channels is used. http://www.llnl.gov/pao/news/news_releases/2005/NR-05-11-04p.html The Cell chip avoids the issues of the nVidia chipset. You will notice that Stanford is setting this up for the IBM Cell as well; Cell+ will push this up close to 200 gigaflops.

The articles at these links will give you an idea of where this is going and why quad core is the next Netburst. Note on TSUBAME how much power was required to get a 24% increase in output. Quad cores require a 24% increase in power, and that power difference will kill off the use of quad cores in servers. http://techreport.com/onearticle.x/10993
http://www.supercomputingonline.com/article.php?sid=11894
http://www-03.ibm.com/technology/splash/qs20/
http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/proceedings/&toc=comp/proceedings/ipdps/2005/2312/04/2312toc.xml&DOI=10.1109/IPDPS.2005.121#search=%22ron%20sass%20reconfigured%22