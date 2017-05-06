
The History Of Nvidia GPUs

NV1: Nvidia Enters The Market

Nvidia formed in 1993 and immediately began work on its first product, the NV1. After two years of development, the NV1 officially launched in 1995. An innovative chipset for its time, the NV1 handled both 2D and 3D video and included audio processing hardware. Following Sega's decision to use the NV1 inside of its Saturn game console, Nvidia also incorporated support for the Saturn controller, which enabled desktop graphics cards to also use the controller.

A unique aspect of the NV1's graphics accelerator is that it used quadratic surfaces as the most basic geometric primitive. This made it difficult for game designers to add support for the NV1 or to design games for it. The problem grew when Microsoft released its first revision of the DirectX gaming API, which was designed with polygons as the most basic geometric primitive.

Desktop cards used the common PCI interface with 133 MB/s of bandwidth. Cards could use EDO memory clocked at up to 75 MHz, and the graphics accelerator was capable of a max resolution of 1600x1200 with 16-bit color. Thanks to the combination of Sega Saturn and desktop market sales, Nvidia was able to stay in business, but the NV1 was not particularly successful. Its graphics and audio performance were lackluster, and the various hardware components made it expensive compared to other graphics accelerators.

Nvidia started work on the NV2 as a successor to the NV1, but after a series of disagreements with Sega, Sega opted to use PowerVR technology inside of its Dreamcast console and the NV2 was cancelled.

NV3: Riva 128

The Riva 128, also known as the NV3, launched in 1997 and was considerably more successful. It switched from using quadratic surfaces as the most basic geometric primitive to the far more common polygon. This made it easier to add support for the Riva 128 in games. The GPU also used polygon texture mapping with mixed results: it allowed the GPU to render frames more quickly, but at reduced image quality.

The GPU was available in two main variants: the Riva 128 and the Riva 128ZX. The Riva 128ZX graphics accelerators used higher-quality binned chips that enabled Nvidia to raise the RAMDAC frequency. Both models used SDRAM memory clocked at 100 MHz accessed over a 128-bit bus, giving the GPUs 1.6 GB/s of bandwidth. The Riva 128ZX chips, however, had 8MB of VRAM compared to 4MB on the Riva 128. The Riva 128ZX's RAMDAC also operated at a higher 250 MHz compared to the Riva 128's 206 MHz.
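The 1.6 GB/s figure follows directly from the memory clock and the bus width. A minimal sketch of that arithmetic, assuming single-data-rate memory (one transfer per clock) and decimal gigabytes; the function name is illustrative only:

```python
def sdr_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for memory that transfers once per clock."""
    bytes_per_transfer = bus_width_bits / 8   # 128 bits -> 16 bytes
    return clock_mhz * 1e6 * bytes_per_transfer / 1e9


# Riva 128 / Riva 128ZX: 100 MHz SDRAM on a 128-bit bus
riva_128 = sdr_bandwidth_gb_s(clock_mhz=100, bus_width_bits=128)
print(f"Riva 128 peak memory bandwidth: {riva_128:.1f} GB/s")
```

This is a theoretical peak; real-world throughput is lower.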

These GPUs were fairly popular because they were capable of both 2D and 3D graphics acceleration, though they were clocked lower than alternatives from Nvidia's leading competitor, 3dfx.

NV4: The Plot To Drop The Bomb

In 1998, Nvidia introduced its most explosive card to date, the Riva TNT (code named "NV4"). Similar to the NV3, the NV4 was capable of rendering both 2D and 3D graphics. Nvidia improved over the NV3 by enabling support for 32-bit "True Color," expanding the RAM to 16MB of SDR SDRAM and increasing performance. Although the AGP slot was becoming increasingly popular, a large number of systems didn't contain one, so Nvidia sold the NV4 primarily as a PCI graphics accelerator and produced a relatively small number of AGP-compatible cards. Starting with the Riva TNT, Nvidia made a strong effort to regularly update its drivers in order to improve compatibility and performance.

At the time it was released, 3dfx's Voodoo2 held the performance crown, but it was relatively expensive and was limited to 16-bit color. The Voodoo2 also required a separate 2D video card, which raised its cost of ownership even higher. Needing a separate 2D video card was common in the 1990s, but as the Riva TNT was capable of processing both 2D and 3D video, the card was considerably more budget friendly than the Voodoo2.

Nvidia planned to ship the Riva TNT clocked at 125 MHz in an attempt to take the performance crown from the Voodoo2, but the core simply ran too hot and wasn't sufficiently stable. Instead, Nvidia was forced to ship at 90 MHz with RAM clocked at 110 MHz, resulting in the Riva TNT being slower than the Voodoo2. The Riva TNT still offered decent performance for its time, and after the release of Nvidia's "Detonator" drivers, performance increased significantly, making it even more competitive.

Overall, the Riva TNT was extremely successful due to its performance and features. The increased driver support from Nvidia also helped to attract customers, as anyone who used a PC in the 1990s can tell you what a nightmare dealing with drivers used to be.

NV5: Another Explosion

In 1999, Nvidia made another grab for the performance crown with the Riva TNT2 (codenamed "NV5"). The Riva TNT2 was architecturally similar to the original Riva TNT, but thanks to an improved rendering engine it was able to perform about 10 to 17 percent faster than its predecessor at the same clock speed. Nvidia also added support for AGP 4X slots, which provided more bandwidth to the card, and doubled the amount of VRAM to 32MB. Probably the most significant improvement was the transition to 250 nm, which allowed Nvidia to clock the Riva TNT2 up to 175 MHz.

The Riva TNT2's main competitor was the 3dfx Voodoo3. These two products traded blows with each other for years without either card being a clear victor in terms of performance or features.

NV10: Use The GeForce Luke!

In late 1999, Nvidia announced the GeForce 256 (code-named "NV10"). Prior to the GeForce 256, essentially all video cards were referred to as "graphics accelerators" or simply as "video cards," but Nvidia opted to call the GeForce 256 a "GPU." Nvidia packed in several new features with this card including hardware T&L (Transform and Lighting) processing, which allowed the GPU to perform calculations that were typically relegated to the CPU. Since the T&L Engine was fixed-function hardware designed specifically for this task, its throughput was roughly five times higher than a then high-end Pentium III processor clocked at 550 MHz.

The design also differed from the Riva TNT2 in that it contained four pixel pipelines instead of just two. It was unable to match the clock speed of the Riva TNT2, but because of the additional pipelines it was still able to perform roughly 50% faster than its predecessor. The GPU was also Nvidia's first card to use between 32 and 64MB of DDR SDRAM, which contributed to its performance increase. The GPU's transistors shrank to 220 nm and the core itself operated at 120 MHz, with the RAM ranging from 150 to 166 MHz.

The GeForce 256 also represents the first time Nvidia included video acceleration hardware, but it was limited to motion acceleration for MPEG-2 content.

NV11, NV15, NV16: GeForce2

Nvidia followed the NV10 GeForce 256 up with the GeForce2. The architecture of the GeForce2 was similar to its predecessor, but Nvidia was able to double the TMUs attached to each pixel pipeline by further shrinking the die with 180 nm transistors. Nvidia used three different cores, codenamed NV11, NV15, and NV16, inside of GeForce2-branded cards. All of these cores used the same architecture, but NV11 contained just two pixel pipelines while the NV15 and NV16 cores had four, and NV16 operated at higher clock rates than NV15.

The GeForce2 was also the first Nvidia line-up to support multiple monitor configurations. GeForce2 GPUs were available with both SDR and DDR memory.

NV20: The GeForce3

In 2001, the GeForce3 (codenamed "NV20") arrived as Nvidia's first DirectX 8-compatible card. The core contained 60 million transistors manufactured at 150 nm, which could be clocked up to 250 MHz. Nvidia introduced a new memory subsystem on the GeForce3 called "Lightspeed Memory Architecture" (LMA), which was designed to compress the Z-buffer and reduce the overall demand on the memory's limited bandwidth. It was also designed to accelerate FSAA using a special algorithm called "Quincunx." Overall performance was higher than the GeForce2, but due to the complexity of the GPU it was fairly expensive to produce, and thus carried a high price tag in comparison.

NV2A: Nvidia And The Xbox

Nvidia would once again find itself back in the home console market as a key component of Microsoft's original Xbox in 2001. The Xbox used hardware nearly identical to what you would find inside of modern PCs at that time, and the GPU designed by Nvidia was essentially a tweaked GeForce3. Just like the NV20 GPU, the NV2A inside of the Xbox contained four pixel pipelines with two TMUs each. Nvidia also created the Xbox's audio hardware known as MCPX, or "SoundStorm".

NV17: GeForce4 (Part 1)

Nvidia started to shake things up in 2002 by introducing several GPUs based on different architectures. All of these were branded as GeForce4. At the low-end of the GeForce4 stack was the NV17, which was essentially an NV11 GeForce2 die that had been shrunk using 150 nm transistors and clocked between 250 and 300 MHz. It was a drastically simpler design compared to the NV20, which made it an affordable product that Nvidia could push to both mobile and desktop markets.

Nvidia later released two revisions of the NV17 core called NV18 and NV19. NV18 featured an upgraded bus to AGP 8X, while NV19 was essentially an NV18 chip with a PCIe bridge to support x16 links. The DDR memory on these chips was clocked anywhere between 166 and 667 MHz.

NV25: GeForce4 (Part 2)

With the NV17 covering the lower half of the market, Nvidia launched NV25 to cover the high end. The NV25 was developed as an improvement upon the GeForce3's architecture, and essentially had the same resources with four pixel pipelines, eight TMUs, and four ROPs. The NV25 did have twice as many vertex shaders (an increase from one to two), however, and it featured the updated LMA-II system. Overall, the NV25 contained 63 million transistors, just 3 million more than the GeForce3. The GeForce4 NV25 also had a clock speed advantage over the GeForce3, ranging between 225 and 300 MHz. The 128MB DDR memory was clocked between 500 and 650 MHz.

Benchmarks of the NV25 in DirectX 7 titles showed modest performance gains over the GeForce3 of around 10%. However, DirectX 8 games that took advantage of the vertex shaders saw the performance advantage held by the NV25 grow to 38%.

Nvidia later released a revised NV25 chip, called the NV28. Similar to the NV18 mentioned in the previous section, the NV28 only differed from the NV25 in that it supported AGP 8X.

NV30: The FX 5000 (Part 1)

In 2002, the gaming world welcomed the arrival of Microsoft's DirectX 9 API, which was one of the most heavily used and influential gaming APIs for several years. ATI and Nvidia both scrambled to develop DX9-compliant hardware, which meant the new GPUs had to support Pixel Shader 2.0. ATI beat Nvidia to the market in August 2002 with its first DX9-capable cards, but by the end of 2002 Nvidia launched its FX 5000 series.

Although Nvidia launched its DX 9 cards later than ATI, they came with a few additional features that Nvidia used to attract game developers. The key difference was the use of Pixel Shader 2.0A, Nvidia's own in-house revision. Pixel Shader 2.0A featured a number of improvements over Microsoft's Pixel Shader 2.0, such as unlimited dependent textures, a sharp increase in the number of instruction slots, instruction predication hardware, and support for more advanced gradient effects. Essentially, Pixel Shader 2.0A contained several improvements that would become part of Microsoft's Pixel Shader 3.0.

Crafted using 130 nm transistors, NV30 operated between 400 and 500 MHz, and it had access to 128 or 256MB of DDR2 RAM over a 128-bit bus operating at either 800 or 1000 MHz. The NV30 itself continued to use a four-pipeline design with two vertex shaders, eight TMUs, and four ROPs. Nvidia followed it up with lower-end variants that had four pixel pipelines and just one vertex shader, four TMUs, and four ROPs and could use less expensive DDR memory.

NV35: The FX 5000 Series (Part 2)

Although the NV30 was the original FX 5000-series flagship, just a few months later Nvidia released a faster model that had an extra vertex shader and could use DDR3 connected via a wider 256-bit bus.

The FX 5000 series was heavily criticized despite its advanced feature set because its lackluster performance lagged behind ATI's competing GPUs. It also had poor thermal characteristics, causing the GPU to run extraordinarily hot, requiring OEMs to sell the FX 5000 series with large air coolers.

NV40: Nvidia GeForce 6800

Just one year after the launch of the FX 5000 series, Nvidia released the 6000 series. The GeForce 6800 Ultra was Nvidia's flagship powered by the NV40. With 222 million transistors, 16 superscalar pixel pipelines (with one pixel shader, TMU, and ROP on each), six vertex shaders, Pixel Shader 3.0 support, and 32-bit floating-point precision, the NV40 had vastly more resources at its disposal than the NV30. That is before counting native support for up to 512MB of GDDR3 over a 256-bit bus, which gave the GPU more memory and better memory performance than its predecessor. These GPUs were produced with the same 130 nm technology as the FX 5000 series.

The 6000 series was highly successful, as it could be twice as fast as the FX 5950 Ultra in some games, and was often roughly 50% faster in most tests. At the same time, it was also more energy efficient.

NV43: The GeForce 6600

After Nvidia secured its position at the high-end of the GPU market, it turned its attention to producing a new mid-range graphics chip known as the NV43. This GPU was used inside of the Nvidia GeForce 6600, and it had essentially half of the execution resources of the NV40. It also relied on a narrower 128-bit bus. The NV43 had one key advantage, however, as it was shrunk using 110 nm transistors. The reduced number of resources made the NV43 relatively inexpensive to produce, while the new fabrication technology helped reduce power consumption and boost clock speeds by roughly 20% compared to the GeForce 6800.

G70: The GeForce 7800 GTX And GeForce 7800 GTX 512

The GeForce 6800 was succeeded by the GeForce 7800 GTX, which used a new GPU code-named G70. Based on the same 110 nm technology as NV43, the G70 contained a total of 24 pixel pipelines with 24 TMUs, eight vertex shaders, and 16 ROPs. The GPU could access up to 256MB of GDDR3 clocked at up to 600 MHz (1.2 GHz DDR) over a 256-bit bus. The core itself operated at 430 MHz.
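The "600 MHz (1.2 GHz DDR)" notation reflects that double-data-rate memory moves data twice per clock cycle. As a rough sketch of what that implies for peak bandwidth on the 256-bit bus (decimal gigabytes assumed; the function name is made up for illustration):

```python
def ddr_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for double-data-rate memory."""
    effective_mt_s = clock_mhz * 2            # 600 MHz -> 1200 MT/s ("1.2 GHz DDR")
    bytes_per_transfer = bus_width_bits / 8   # 256 bits -> 32 bytes
    return effective_mt_s * 1e6 * bytes_per_transfer / 1e9


# GeForce 7800 GTX: GDDR3 at 600 MHz over a 256-bit bus
gtx_7800 = ddr_bandwidth_gb_s(clock_mhz=600, bus_width_bits=256)
print(f"GeForce 7800 GTX peak memory bandwidth: {gtx_7800:.1f} GB/s")
```

Under these assumptions the card peaks at about 38.4 GB/s.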

Although the GeForce 7800 GTX was quite powerful for its time, Nvidia managed to improve upon its design shortly after its release with the GeForce 7800 GTX 512. With this card, Nvidia reworked the layout of the core and transitioned over to a new cooler design, which enabled the company to push clock speed up to 550 MHz. It also improved its memory controller by reducing latency, increasing the bus width to 512-bit, and pushing the memory frequency up to 850 MHz (1.7 GHz DDR). Memory capacity increased to 512MB, too.

G80: GeForce 8000 Series And The Birth Of Tesla

Nvidia introduced its Tesla microarchitecture with the GeForce 8000 series - the company's first unified shader design. Tesla would become one of Nvidia's longest-running architectures, as it was used inside the GeForce 8000, GeForce 9000, GeForce 100, GeForce 200, and GeForce 300 series of GPUs.

The GeForce 8000 series' flagship was the 8800 GTX, powered by Nvidia's G80 GPU manufactured at 80 nm with over 681 million transistors. Thanks to the unified shader architecture, the 8800 GTX and the rest of the 8000 series featured full support for Microsoft's new DirectX 10 API and Pixel Shader 4.0. The 8800 GTX came with 128 shaders clocked at 575 MHz and connected to 768MB of GDDR3 over a 384-bit bus. Nvidia also increased the number of TMUs up to 64 and raised the ROP count to 24. All of these enhancements allowed the GeForce 8800 GTX to perform more than twice as fast as its predecessor in high-resolution tests.

As yields improved, Nvidia later replaced the 8800 GTX with the 8800 Ultra as the flagship. Although both graphics cards used the same G80 core, the 8800 Ultra was clocked at 612 MHz, giving it a slight edge over the 8800 GTX.

G92: GeForce 9000 Series And Tesla Improved

Nvidia continued to use the Tesla architecture in its GeForce 9000 series products, but with a few revisions. Nvidia's G92 core inside the 9000-series flagship was essentially just a die shrink of G80. By fabricating G92 at 65 nm, Nvidia was able to hit clock speeds ranging from 600 to 675 MHz all while reducing overall power consumption.

Thanks to the improved energy efficiency and reduced heat, Nvidia launched a dual-G92 GPU called the GeForce 9800 GX2 as the flagship in the 9000 series. This was something Nvidia was unable to do with the power-hungry G80. In tests, the 9800 GX2 outperformed the 8800 Ultra by 29 to 41% on average when AA was turned off. When AA was active, however, the 9800 GX2's performance lead shrank to 13% due to RAM limitations. Each G92 on the 9800 GX2 had access to 512MB of GDDR3, while the 8800 Ultra came with 768MB. The card was also considerably more expensive than the 8800 Ultra, which made it a tough sell.

G92 And G92B: GeForce 9000 Series (Continued)

Nvidia later released the GeForce 9800 GTX with a single G92 core clocked at 675 MHz and 512MB of GDDR3. This 9800 GTX was slightly faster than the 8800 Ultra thanks to its higher clock speed, but it also ran into issues due to its limited RAM capacity. Eventually, Nvidia created the GeForce 9800 GTX+ with a new 55 nm chip code-named G92B. This allowed Nvidia to push clock speed up to 738 MHz, but the most significant improvement that the 9800 GTX+ possessed was its 1GB of memory.

G92B: The GeForce 100 Series

Towards the end of the 9000 series, Nvidia introduced the GeForce 100 series targeted exclusively at OEMs. Individual consumers were not able to buy any of the 100-series cards directly from retailers. All 100-series GPUs were re-branded variants of the 9000 series with minor alterations to clock speed and card design.

GT200: GeForce 200 Series And Tesla 2.0

Nvidia introduced the GT200 core based on an improved Tesla architecture in 2008. Changes made to the architecture included an improved scheduler and instruction set, a wider memory interface, and an altered core ratio. Whereas the G92 had eight Texture Processor Clusters (TPCs) with 16 EUs and eight TMUs each, the GT200 used ten TPCs with 24 EUs and eight TMUs each. Nvidia also doubled the number of ROPs from 16 in the G92 to 32 in the GT200. The memory bus was extended from a 256-bit interface to a 512-bit wide connection to the GDDR3 memory pool.
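To make the "altered core ratio" concrete, the per-TPC figures can be multiplied out into chip-wide totals. A minimal sketch using only the counts given above; the class and field names are illustrative, not Nvidia terminology:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeslaChip:
    name: str
    tpcs: int           # Texture Processor Clusters
    eus_per_tpc: int    # execution units per TPC
    tmus_per_tpc: int   # texture mapping units per TPC

    @property
    def total_eus(self) -> int:
        return self.tpcs * self.eus_per_tpc

    @property
    def total_tmus(self) -> int:
        return self.tpcs * self.tmus_per_tpc


g92 = TeslaChip("G92", tpcs=8, eus_per_tpc=16, tmus_per_tpc=8)
gt200 = TeslaChip("GT200", tpcs=10, eus_per_tpc=24, tmus_per_tpc=8)

for chip in (g92, gt200):
    ratio = chip.total_eus / chip.total_tmus
    print(f"{chip.name}: {chip.total_eus} EUs, {chip.total_tmus} TMUs "
          f"({ratio:.0f} EUs per TMU)")
```

The totals work out to 128 EUs and 64 TMUs for the G92 against 240 EUs and 80 TMUs for the GT200, so the ratio of shader to texture resources moved from 2:1 to 3:1.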

The GT200 launched in the Nvidia GeForce GTX 280, which was significantly faster than the GeForce 9800 GTX+ due to the increased resources. It could not cleanly outperform the GeForce 9800 GX2, but because the 9800 GX2 had considerably higher power consumption and less memory, the GTX 280 was still considered the superior graphics card. The introduction of the GeForce GTX 295 with two GT200 cores in 2009 further cemented the market position of the 200 series.

GT215: The GeForce 300 Series

The GeForce 300 series was Nvidia's second OEM-only line of cards. It was composed entirely of medium-range and low-end GPUs from the GeForce 200 series. All GeForce 300 series desktop GPUs use a 40 nm process and are based on the Tesla 2.0 architecture.

GF100: Fermi Arises In The GeForce 400

Tesla and the GeForce 8000, 9000, 100, 200, and 300 series were followed by Nvidia's Fermi architecture and the GeForce 400 series in 2010. The largest Fermi chip ever produced was the GF100, which contained four GPCs. Each GPC had four Streaming Multiprocessors, with 32 CUDA cores, four TMUs, three ROPs, and a PolyMorph Engine. A perfect GF100 core shipped with a total of 512 CUDA cores, 64 TMUs, 48 ROPs, and 16 PolyMorph Engines.

However, the GeForce GTX 480 (the original Fermi flagship) shipped with just 480 CUDA cores, 60 TMUs, 48 ROPs, and 15 PolyMorph Engines enabled. Due to the hardware resources present in GF100, it was an enormous 529 square millimeters in area. This made it rather difficult to produce perfect samples, and forced Nvidia to use slightly defective cores instead. The GeForce GTX 480 also gained a reputation for running excessively hot. Nvidia and its board partners typically used beefy thermal solutions on the GTX 480 as well, which tended to be loud and earned the graphics card a reputation as one of the noisiest GPUs in recent years.
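The GF100 totals follow from multiplying the per-SM resources by the number of enabled SMs. A rough sketch of that arithmetic, using only the figures above; the function and key names are illustrative, and ROPs are left out because the GTX 480 keeps all 48 even with one SM disabled:

```python
CORES_PER_SM = 32
TMUS_PER_SM = 4
POLYMORPH_PER_SM = 1


def gf100_resources(enabled_sms: int) -> dict:
    """Chip-wide totals for a GF100 with the given number of SMs enabled."""
    return {
        "cuda_cores": enabled_sms * CORES_PER_SM,
        "tmus": enabled_sms * TMUS_PER_SM,
        "polymorph_engines": enabled_sms * POLYMORPH_PER_SM,
    }


full_gf100 = gf100_resources(enabled_sms=4 * 4)   # four GPCs, four SMs each
gtx_480 = gf100_resources(enabled_sms=15)         # one SM disabled

print("Full GF100:", full_gf100)
print("GTX 480:   ", gtx_480)
```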

GF104, 106, 108: Fermi's Altered Cores

To reduce production costs and increase yields of its smaller Fermi GPUs, Nvidia rearranged the resource count of its SMs. Each of the eight SMs inside the GF104 and four inside the GF106 contains 48 CUDA cores, four TMUs, and four ROPs. This reduced the overall die size, as fewer SMs were needed on each die. It also reduced shader performance somewhat, but these cores were nonetheless competitive. With this core configuration, an Nvidia GeForce GTX 460 powered by a GF104 was able to perform nearly identically to an Nvidia GeForce GTX 465 that contained a GF100 core with just 11 SMs enabled.

When Nvidia created the GF108, it again altered the number of resources in each SM. The GF108 has just two SMs, which contain 48 CUDA cores, four TMUs, and two ROPs each.
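For a sense of scale, the SM counts above can be turned into CUDA core totals and set against the cut-down GF100 in the GTX 465. This is only a sketch built from the numbers in the text (48 cores per SM on the smaller chips, 32 per SM on GF100); the dictionary layout is an arbitrary choice:

```python
# name -> (enabled SMs, CUDA cores per SM)
fermi_configs = {
    "GF104": (8, 48),
    "GF106": (4, 48),
    "GF108": (2, 48),
    "GF100 as used in GTX 465": (11, 32),
}

core_counts = {
    name: sms * cores_per_sm
    for name, (sms, cores_per_sm) in fermi_configs.items()
}

for name, cores in core_counts.items():
    print(f"{name}: {cores} CUDA cores")
```

By this count a full GF104 carries slightly more CUDA cores than the 11-SM GF100, which fits with the two cards performing so closely.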

GF110: Fermi Re-Worked

Nvidia continued to use the Fermi architecture for the GeForce 500 series, but managed to improve upon its design by re-working each GPU on a transistor level. The underlying concept of this re-working process was to use slower, more efficient transistors in parts of the GPU that are less critical to performance, and to use faster transistors in key areas that strongly affect performance. This reduced power consumption and enabled an increase in clock speed.

Under the hood of the GTX 580 (the GeForce 500-series flagship) was GF110. In addition to the aforementioned transistor re-working, Nvidia also improved FP16 and Z-cull efficiency. These changes made it possible to enable all 16 SMs on the GF110, and the GTX 580 was considerably faster than the GTX 480.

GK104: Kepler And The 600 Series

The GeForce GTX 580 was succeeded by the GTX 680, which used a GK104 based on the Kepler architecture. This marked a transition to 28 nm manufacturing, which is partially responsible for the GK104 being far more efficient than GF110. Compared to the GF110, GK104 also has twice as many TMUs and three times as many CUDA cores. The increase in resources didn't triple performance, but it did increase performance by between 10 and 30% depending on the game. Overall efficiency increased even more.

GK110: Big Kepler

Nvidia's plan for the GeForce 700 series was essentially to introduce a larger Kepler die. The GK110, which was developed for compute work inside of supercomputers, was perfect for the job. This massive GPU contained 2880 CUDA cores and 240 TMUs. It was first introduced inside the GTX Titan, which had a single SMX disabled, dropping the resource count to 2688 CUDA cores and 224 TMUs, and which shipped with 6GB of RAM. However, the Titan had an unusually high price of $1000, which limited sales. The GPU was later re-introduced in the GTX 780 with just 3GB of RAM and a somewhat more affordable price tag.

Later, Nvidia would ship the GTX 780 Ti, which fully utilized the GK110 with all 2880 CUDA cores and 240 TMUs.
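Because the GTX Titan differs from the full GK110 by exactly one SMX, the size of an SMX can be inferred from the two sets of figures. A small sketch of that deduction, assuming every SMX is identical (variable names are illustrative):

```python
full_cores, full_tmus = 2880, 240       # fully enabled GK110 (GTX 780 Ti)
titan_cores, titan_tmus = 2688, 224     # GTX Titan, one SMX disabled

# Resources lost by disabling a single SMX
cores_per_smx = full_cores - titan_cores
tmus_per_smx = full_tmus - titan_tmus

# Both resource types should imply the same whole number of SMX units
smx_from_cores, cores_remainder = divmod(full_cores, cores_per_smx)
smx_from_tmus, tmus_remainder = divmod(full_tmus, tmus_per_smx)

print(f"Per SMX: {cores_per_smx} CUDA cores, {tmus_per_smx} TMUs")
print(f"SMX count implied by cores: {smx_from_cores}, by TMUs: {smx_from_tmus}")
```

Both calculations agree on 15 SMX units of 192 CUDA cores and 16 TMUs each, so the published figures are internally consistent.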

GM204: Maxwell

Nvidia introduced its Maxwell architecture in 2014 with a focus on efficiency. The initial flagship, the GM204, launched inside the GeForce GTX 980. A key difference between Maxwell and Kepler is the memory sub-system. GM204 has a narrower 256-bit bus, but Nvidia achieved greater utilization of the available bandwidth by implementing a powerful memory compression algorithm. The GM204 also uses a large 2MB L2 cache that further reduces the impact of the narrower memory interface.

The GM204 contained a total of 2048 CUDA cores, 128 TMUs, 64 ROPs, and 16 PolyMorph engines. Because it had fewer resources than the GK110 inside the GeForce GTX 780 Ti, the GM204 wasn't dramatically faster than that card. It achieved just a 6% performance advantage over the GTX 780 Ti, but it also consumed roughly 33% less power.
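Those two percentages combine into a single efficiency figure. As a back-of-the-envelope sketch, treating the 6% and 33% numbers as exact and applying to the same workload (the function name is hypothetical):

```python
def relative_perf_per_watt(perf_gain: float, power_reduction: float) -> float:
    """Performance-per-watt ratio of a new card relative to an old one.

    perf_gain: fractional speedup (0.06 means 6% faster)
    power_reduction: fractional power saving (0.33 means 33% less power)
    """
    relative_perf = 1.0 + perf_gain
    relative_power = 1.0 - power_reduction
    return relative_perf / relative_power


gm204_vs_780ti = relative_perf_per_watt(perf_gain=0.06, power_reduction=0.33)
print(f"GM204 performance per watt vs. GTX 780 Ti: {gm204_vs_780ti:.2f}x")
```

On those assumptions the GM204 delivers roughly 1.6 times the performance per watt, which is where Maxwell's real advantage lay.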

Nvidia later released the GM200 inside the GeForce GTX 980 Ti. The GM200 was essentially a more resource-rich GM204 with 2816 CUDA cores. It managed to increase performance over the GM204, but it was not quite as efficient.

GP104: Pascal

The Pascal architecture succeeded Maxwell, and marked Nvidia's transition to a new 16 nm FinFET process. This helped to increase the architectural efficiency and drive up clock speed. The 314 square millimeter GP104 used inside the GeForce GTX 1080 contains a whopping 7.2 billion transistors. With 2560 CUDA cores, 160 TMUs, 64 ROPs, and 20 PolyMorph engines, the GeForce GTX 1080 was far more powerful than the GeForce GTX 980 Ti.

Nvidia also produced four other GPUs based on Pascal with lower core counts. Like the GTX 1080, the GTX 1070 is targeted at the high-end gaming segment, while the GTX 1060 handles the mid-range segment, and the GeForce GTX 1050 and 1050 Ti handle the low-end of the market.

GP102: Titan X, 1080 Ti & Titan XP

Nvidia pushed the performance of the 1000 series further with the release of its GP102 GPU. This part features 3,840 CUDA cores with a 352-bit memory interface, and it is also produced on a 16nm process. It first appeared inside of the Titan X with a partially disabled die that left 3,584 cores clocked at 1,531MHz. It was equipped with 12GB of GDDR5X memory clocked at 10Gbps and had a max TDP of 250W.

The GP102 eventually made its way to the consumer market in the form of the Nvidia GeForce GTX 1080 Ti. This graphics card again had a partially disabled die, and it featured the same number of CUDA cores as the Titan X. It is able to outperform the Titan X, however, as it has a higher boost clock speed of 1,582MHz, and its 11GB of GDDR5X RAM is clocked higher at 11Gb/s.

As yields improved, Nvidia was able to release a new GPU called the Titan XP that uses a fully enabled GP102 core. This brings the CUDA core count up to 3,840. The Titan XP comes equipped with 12GB of GDDR5X clocked at 11.4Gb/s. The card is clocked identically to the GTX 1080 Ti, but should perform better thanks to the increased CUDA core count.

