The Dominance of GPUs
Three decades ago, the landscape of computing was dominated by CPUs and specialized processors tasked with virtually all computational duties. During this time, GPUs existed mainly to enhance the rendering speed of 2D shapes within Windows and other applications, lacking the versatility we see today.
Fast forward to the present day, and GPUs have firmly established themselves as one of the most prominent chip types in the tech industry.
Ironically, the era when graphics chips served purely graphical purposes is now a thing of the past. Today, machine learning and high-performance computing rely heavily on the processing power of GPUs, which have morphed from humble pixel pushers into powerful floating-point computing engines. Join us as we explore how this single chip evolved so significantly over the years.
Back in the late 1990s, the computational domain was ruled entirely by CPUs. High-performance computing, encompassing scientific work in supercomputers, data processing on standard servers, and engineering tasks on workstations, depended exclusively on two primary types of CPUs: 1) dedicated processors developed for specific functions and 2) off-the-shelf chips from giants like AMD, IBM, and Intel.
The ASCI Red supercomputer, one of the most powerful machines around 1997, was made up of 9,632 Intel Pentium II OverDrive CPUs. Each unit operated at a frequency of 333 MHz, giving the entire system a theoretical peak computing performance slightly exceeding 3.2 TFLOPS, or trillions of floating-point operations per second.
Since we will reference TFLOPS throughout this article, it's worth clarifying what the term entails.
In computer science, floating-point numbers (often referred to as floats) signify non-integer values, such as 6.2815 or 0.0044. Integers, on the other hand, are commonly used for controlling computations within software and systems.
Floats assume critical importance in scenarios demanding precision, particularly in scientific or engineering contexts. Even straightforward calculations, like determining the circumference of a circle, necessitate at least one floating-point value.
For decades, CPUs came equipped with separate circuitry to conduct logic operations on both integer and floating-point numbers. In the case of the aforementioned Pentium II OverDrive, it could execute a single basic floating-point operation (either a multiplication or an addition) during each clock cycle. This is why ASCI Red's theoretical peak floating-point performance was calculated as 9,632 CPUs x 333 MHz x 1 operation/cycle = approximately 3.2 million million FLOPS, or 3.2 TFLOPS.
Real-world implementations rarely achieve these numbers, as they are based on ideal conditions—like using the simplest instructions on data suited for caching. However, these metrics effectively highlight the system's potential capabilities.
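To make that arithmetic concrete, here is a minimal C sketch of the peak-throughput calculation. The helper function and the 1 operation/cycle figure simply restate the idealized assumptions described above; it is an illustration, not a benchmark.

```c
#include <stdio.h>

/* Theoretical peak = processors x clock (Hz) x floating-point ops per cycle.
   This is the idealized figure discussed above, not a measured result. */
static double peak_tflops(double processors, double clock_hz, double ops_per_cycle)
{
    return processors * clock_hz * ops_per_cycle / 1e12;
}

int main(void)
{
    /* ASCI Red: 9,632 Pentium II OverDrive CPUs at 333 MHz, 1 FP op per cycle */
    printf("ASCI Red peak: %.2f TFLOPS\n", peak_tflops(9632, 333e6, 1));
    return 0;
}
```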
Other supercomputers operated similarly, utilizing vast numbers of standard processors—Lawrence Livermore National Laboratory's Blue Pacific relied on 5,808 IBM PowerPC 604e chips, while Los Alamos National Laboratory's Blue Mountain incorporated 6,144 MIPS Technologies R10000 processors.
Achieving a trillion floating-point operations per second demands thousands of CPUs, continually supported by extensive RAM and hard disk storage.
This has always been the case owing to the mathematical demands of high-performance machines.
During initial encounters with equations in physics, chemistry, and other subjects in school, all computations were one-dimensional, meaning we employed a single number to represent distance, speed, mass, time, and so on. However, accurately modeling and simulating phenomena requires additional dimensions and leads us into vector, matrix, and tensor mathematics.
In mathematics, these constructs are treated as unified entities that incorporate multiple values, meaning any computer performing the calculations must process vast quantities of numbers simultaneously. Given that CPUs at the time could process only one or two floating-point numbers per cycle, the demand for hardware that could handle thousands of them soared.
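As a simple illustration of why these workloads involve so many floating-point operations, the hypothetical C routine below multiplies a 4x4 matrix by a 4-element vector. Even this tiny transform takes 16 multiplications and 12 additions, and a simulation may apply something like it to millions of elements every time step.

```c
/* Multiply a 4x4 matrix by a 4-element vector: 16 multiplies + 12 adds.
   A single transform like this is trivial, but simulations apply such
   operations to millions of elements per step. */
void mat4_mul_vec4(const float m[4][4], const float v[4], float out[4])
{
    for (int row = 0; row < 4; ++row) {
        float sum = 0.0f;
        for (int col = 0; col < 4; ++col)
            sum += m[row][col] * v[col];   /* one multiply-add per element */
        out[row] = sum;
    }
}
```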
Then SIMD entered the fray...
Technologies like MMX, 3DNow!, and others took center stage.
In 1997, Intel released an update to its Pentium CPU series in the form of a technology named MMX, introducing a set of instructions that made use of eight additional registers within the core, each able to hold up to eight packed integer values depending on their size. This allowed the processor to execute one instruction across multiple numbers simultaneously, a process referred to as SIMD (Single Instruction, Multiple Data).
Within a year, AMD debuted its own version, named 3DNow!, which offered notably better performance since its registers could also accommodate floating-point values. Intel addressed this gap a full year later with the introduction of SSE (Streaming SIMD Extensions) in the Pentium III.
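To give a sense of what SIMD means in practice, here is a small, hypothetical C example using Intel's SSE intrinsics (the instruction set just mentioned): a single `_mm_add_ps` instruction adds four 32-bit floats at once, where a scalar loop would need four separate additions.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    /* Pack four 32-bit floats into each 128-bit SSE register */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);

    /* One instruction performs four additions in parallel */
    __m128 sum = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, sum);
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```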
As the calendar entered the new millennium, high-performance computer designers could finally exploit standard processors capable of efficiently managing vector mathematics.
Scaled up to the thousands, such processors could tackle matrices and tensors just as adeptly.
Despite advancements, the supercomputing realm still favored older or dedicated chips, as these new extensions had not been explicitly designed for such tasks. This also held true for another rapidly emerging processor type, the GPU, which was increasingly better at SIMD workloads than CPUs from AMD or Intel.
In the early days of graphics processing, the CPU handled calculations for the triangles composing the scene (hence AMD's naming for its SIMD capabilities as 3DNow!). Yet, tasks related to pixel shading and texturing were solely the domain of GPUs, and much of this work relied heavily on vector mathematics.
Top-tier consumer-grade graphics cards from over two decades ago, like the 3dfx Voodoo5 5500 and Nvidia GeForce 2 Ultra, were exceptional SIMD devices. However, they were created not for general computation but specifically to generate 3D graphics for gaming. Even graphics cards for specialist markets focused solely on rendering tasks.
For instance, the ATI FireGL 3, priced at $2,000, came equipped with two IBM chips (a GT1000 geometry engine and an RC1000 rasterizer), a gigantic 128 MB of DDR-SDRAM, and a claimed 30 GFLOPS of processing power. However, all this capability was directed at accelerating graphics within applications like 3D Studio Max and AutoCAD, using the OpenGL rendering API.
At that time, GPUs were restricted to a narrow lane of functionality, as converting 3D objects into images on a monitor did not require extensive floating-point mathematics. In fact, much of the processing was conducted at the integer level, and it would be years before graphics cards began utilizing floating-point values extensively throughout their pipelines.
One of the first exceptions was ATI's R300 processor, equipped with eight distinct pixel pipelines that handled all mathematical operations at 24-bit floating-point precision.
Unfortunately, aside from graphics, there was no other mechanism to exploit this capability as both hardware and related software were entirely image-centric.
Computer engineers had not overlooked the noteworthy SIMD capabilities of GPUs but faced difficulties finding methods to apply them across different fields. It was particularly fascinating that a gaming console would ultimately showcase how to solve this puzzle.
The dawn of a unified era occurred with Microsoft’s Xbox 360, launched in November 2005, whose CPU was designed and manufactured by IBM based on the PowerPC architecture, with the GPU conceived by ATI and built by TSMC.
The graphics chip, codenamed Xenos, was revolutionary due to its entirely different approach: it abandoned the classic model of separate vertex and pixel pipelines in favor of a cluster of three SIMD arrays.
Specifically, each array comprised 16 vector processors, each containing five mathematical units. This layout allowed every array to execute two sequential instructions from a thread on 80 floating-point data values simultaneously during each cycle.
Referred to as a unified shader architecture, each array could handle any type of shader work. Despite adding complexity to other aspects of the chip, this design established a blueprint that remains in use to this day. Operating at a clock speed of 500 MHz, the entire cluster theoretically delivered a processing rate of 240 GFLOPS (500 MHz x 3 arrays x 80 units x 2 ops) for multiplication and addition commands issued from three threads.
To put this figure into perspective, some of the world's top supercomputers from a decade earlier could not match this speed.
Sandia National Laboratories' Paragon XP/S140, for instance, boasted a peak speed of 184 GFLOPS using 3,680 Intel i860 CPUs and topped the list of supercomputers back in 1994. The pace of chip development had rapidly outstripped that of such machines.
CPUs would continue to grow their own SIMD arrays for years to come; Intel's first Pentium MMX included a dedicated unit for executing instructions on vectors holding at most eight 8-bit integers. By the time Microsoft's Xenos was finding widespread use in homes around the globe, such units had at least doubled in size, yet they were still tiny compared to those in Xenos.
Gradually, consumer-grade graphics cards began embracing GPUs with unified shader architectures, leading to processing rates that significantly outpaced that of the graphics chip found in the Xbox 360.
The Nvidia G80 used in the GeForce 8800 GTX (2006), for example, had a theoretical peak of 346 GFLOPS, while the ATI R600 in the Radeon HD 2900 XT (2007) boasted 476 GFLOPS.
Soon after, the two graphics chip manufacturers leveraged this computational prowess in their professional models. Despite being exorbitantly priced, ATI's FireGL V8650 and Nvidia's Tesla C870 were splendid for high-end scientific computing. Nonetheless, at the uppermost echelons, supercomputers globally still relied predominantly on standard CPUs. In fact, it took several years before GPUs began appearing in the most powerful systems.
The design, construction, and operation of supercomputers and similar systems demand extraordinary investments.
Over the years, they have primarily been constructed around vast arrays of CPUs, making the integration of an additional processor type notably intricate. Effective planning and preliminary small-scale testing were prerequisites before increasing chip volumes.
Additionally, ensuring that all components functioned harmoniously, particularly regarding software integration, posed considerable challenges—this marked a significant vulnerability for GPUs during this time. While these chips had become highly programmable, the software available for them was still limited.
Microsoft's HLSL (High Level Shading Language), Nvidia's Cg library, and OpenGL's GLSL simplified access to the processing capabilities of graphics chips, although predominantly for rendering tasks.
The introduction of unified shader architecture GPUs revolutionized the situation.
In 2006, ATI, which had become a subsidiary of AMD, along with Nvidia released software development kits designed to expand this capability far beyond graphics. Their APIs were named CTM (Close To Metal) and CUDA (Compute Unified Device Architecture), respectively.
However, what the scientific and data processing community truly needed was a comprehensive software package that considered numerous CPUs and GPUs (referred to as heterogeneous platforms) as a single entity composed of various computing devices.
Their needs were finally met in 2009 with the release of OpenCL, initially developed by Apple and published by the Khronos Group, which had taken over stewardship of OpenGL a few years earlier.
This established an effective software platform for utilizing GPUs beyond everyday graphics—GPGPU, a term coined by Mark Harris, referred to general-purpose computation performed on GPUs.
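For a flavor of what a GPGPU program looks like, here is a minimal, hypothetical OpenCL example in C that adds two small float arrays on whichever OpenCL device is found first. It is a sketch only: error handling and resource cleanup are omitted, and the 1.x-era clCreateCommandQueue call is used for brevity (newer OpenCL versions use clCreateCommandQueueWithProperties).

```c
#include <stdio.h>
#include <CL/cl.h>

/* OpenCL C kernel: each work-item adds one pair of elements */
static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 8 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 10.0f * i; }

    /* Pick the first platform/device available (error checks omitted) */
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Copy inputs to the device, build the kernel, and launch N work-items */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);
    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    for (int i = 0; i < N; ++i) printf("%.1f ", c[i]);
    printf("\n");
    return 0;   /* cleanup (clRelease* calls) omitted for brevity */
}
```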
As GPUs entered the computation arena, the supercomputing world stood apart from the consumer tech landscape: there is no army of reviewers testing the performance claims of supercomputers.
Instead, an ongoing project started at the University of Mannheim in Germany in the early 1990s takes on that role. Known as TOP500, it publishes a list of the 500 most powerful supercomputers worldwide twice a year.
The first notable entries flaunting GPU reliance emerged in 2010, featuring two systems from China—Nebulae and Tianhe-1, which respectively relied on the Nvidia Tesla C2050 and the AMD Radeon HD 4870, with the former achieving a theoretical peak of 2,984 TFLOPS.
In the early days of high-end GPGPU, Nvidia emerged as the go-to vendor for computational heavyweights, not purely on performance (AMD's Radeon cards often offered more raw processing power) but on software support. CUDA grew rapidly, while AMD took longer to produce a suitable alternative, eventually pushing its users toward OpenCL.
Nonetheless, Nvidia did not monopolize the market entirely; Intel's Xeon Phi processors aimed to carve out a niche. Descended from a cancelled GPU project named Larrabee, these large chips represented a peculiar CPU-GPU hybrid, built from multiple Pentium-like cores (the CPU aspect) paired with wide floating-point units (the GPU element).
A closer look at the Nvidia Tesla C2050 revealed 14 blocks known as Streaming Multiprocessors (SMs), separated by a cache and a central controller.
Each contained 32 pairs of logic circuits (termed CUDA cores by Nvidia) to execute all mathematical computations—one set for integer values, the other for floating-point operations. In the latter case, each core could manage a single (32-bit) FMA (fused multiply-add) operation per clock cycle; double-precision (64-bit) operations required at least two clock cycles.
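As a rough sketch of how those figures turn into headline numbers, the C snippet below uses the standard fmaf() routine to show a single fused multiply-add (which counts as two floating-point operations) and then estimates the C2050's peak FP32 rate. The roughly 1.15 GHz shader clock is an assumption not quoted in the text above.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* A fused multiply-add computes a*b + c in one operation and
       counts as two floating-point operations (a multiply and an add). */
    float d = fmaf(2.0f, 3.0f, 4.0f);   /* 2*3 + 4 = 10 */
    printf("fma result: %.1f\n", d);

    /* Assumed figures: 14 SMs x 32 FP32 cores, one FMA (2 FLOPs) per
       core per cycle, ~1.15 GHz shader clock on the Tesla C2050. */
    double peak = 14.0 * 32.0 * 2.0 * 1.15e9 / 1e12;
    printf("Estimated FP32 peak: ~%.2f TFLOPS\n", peak);
    return 0;
}
```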
In contrast, the floating-point units within the Xeon Phi chips bore some similarities, although each core processed half as many data values as those in the C2050's SMs. However, because it had 32 such cores to the Tesla's 14, a single Xeon Phi processor could handle a greater number of values per clock cycle overall. Nonetheless, Intel's first generation of the chip functioned more as a prototype and struggled to show its potential, and Nvidia's products proved faster and ultimately superior.
This competitive dynamic reemerged between AMD, Intel, and Nvidia in the GPGPU sector, where one model could boast more processing cores, while another may possess higher clock speeds or stronger caching systems.
CPUs, of course, remain critical, and many supercomputers and high-end computing systems are still built around processors from AMD or Intel. While an individual CPU cannot compete with the SIMD performance of an average GPU, thousands of them connected together can do the job; however, such systems lack efficiency.
For instance, compare the Radeon HD 4870 GPU that Tianhe-1 deployed with AMD's largest server CPU at the time, the 12-core Opteron 6176 SE. The CPU could theoretically achieve 220 GFLOPS with a power draw of around 140 W, whereas the GPU could easily deliver 1,200 GFLOPS at peak for only around 10 W more, and at a fraction of the cost.
A small graphics card could do more.
Years later, the use of GPUs for massively parallel computing extended well beyond the world's supercomputers.
Nvidia actively promoted its GRID platform, a GPU virtualization service for scientific and other applications. Initially launched as a system for hosting cloud-based gaming, the ever-growing demand for large-scale, cost-effective GPGPU made this shift inevitable. At its annual technology conference, GRID was presented as a vital tool for engineers across various fields.
At the same event, Nvidia unveiled an architecture named Volta. While scant details were disclosed, the prevailing assumption was that the chip would serve Nvidia's entire market spectrum.
Simultaneously, AMD followed suit, leveraging its regularly updated Graphics Core Next (GCN) architecture across its gaming-focused Radeon series alongside its FirePro and Radeon Sky server cards. By this time, the performance figures had become staggering—the FirePro W9100 achieved a peak FP32 throughput of 5.2 TFLOPS, a number that would have been beyond comprehension for even the most advanced supercomputers just two decades prior.
Even though GPUs were still fundamentally designed for 3D graphics, advancements in rendering technologies meant these chips were becoming increasingly adept at handling general computational workloads. The sole hindrance remained their limited ability to perform high-precision floating-point mathematics (i.e., FP64 or higher).
Through 2015, the number of supercomputers relying on accelerators such as Intel's Xeon Phi or Nvidia's Tesla cards remained relatively small compared to those running entirely on CPUs.
This trend drastically transformed when Nvidia debuted its Pascal architecture in 2016, a product explicitly engineered for the high-performance computing market, diverging from previous architectures aimed at multiple sectors.
Only one chip in this architecture, the GP100, targeted that market, and it appeared in just five products. Where prior architectures contained only a handful of FP64 cores, this one housed nearly 2,000 of them.
The Tesla P100 provided over 9 TFLOPS of FP32 processing power, with FP64 capacity at half that rate, underscoring its formidable capabilities. Conversely, AMD's Radeon Pro WX 9100, built on the Vega 10 chip, was 30% faster in FP32 but lagged 800% behind in FP64 processing. At this juncture, Intel was close to discontinuing the Xeon Phi due to lackluster sales.
The subsequent year witnessed the release of Volta from Nvidia, signaling that the company aimed not only to expand its GPUs into HPC and data processing landscapes but was setting sights on new markets.
Within the domain of deep learning, a specialized field in the larger framework of machine learning—which itself is a subset of artificial intelligence—complex mathematical models (termed neural networks) are employed to extract information from available datasets.
An illustrative example involves determining the likelihood that a particular animal is depicted in a presented image. For this process, the model requires "training," which consists of showing it millions of images with and without the animal in question. This involves sophisticated mathematics rooted in matrix and tensor calculations.
For decades, such workloads were strictly suited for large CPU-based supercomputers, yet as early as the 2000s, it became evident that GPUs were remarkably primed for such tasks.
All the while, Nvidia bet heavily on the significant growth of the deep learning market, incorporating additional functionalities into its Volta architecture to enhance its distinctiveness in this domain.
These features took the form of banks of FP16 logic units, marketed as tensor cores, which operate together as one large array but offer very limited functionality.
Indeed, their competencies were very narrow, limited to a single function: multiplying two FP16 4x4 matrices and then adding another FP16 or FP32 4x4 matrix to the result (an operation known as a GEMM). Nvidia's earlier GPUs, as well as those from competitors, could also perform such calculations but nowhere near as quickly as Volta. The only GPU to use this architecture, the Tesla V100, carried 640 tensor cores, each able to execute 64 FMAs per clock cycle.
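As an illustration of the operation just described, here is a small, hypothetical C routine performing the same kind of 4x4 GEMM, D = A x B + C. Plain float stands in for the FP16 inputs and FP32 accumulation the real hardware uses, and a tensor core completes the whole thing in a single hardware operation rather than in nested loops.

```c
/* D = A * B + C for 4x4 matrices: the GEMM a single Volta tensor core
   performs in hardware each clock. Plain float stands in for the FP16
   inputs / FP32 accumulator used by the actual units. */
void gemm4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];               /* start from the addend */
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];      /* 64 multiply-adds in total */
            D[i][j] = acc;
        }
    }
}
```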
The Tesla V100 cards could theoretically deliver up to 125 TFLOPS in these tensor calculations, depending on the size of the matrices in the data set and the floating-point precision employed. While Volta appeared targeted at a niche market, and its predecessor the GP100 had made only sporadic inroads into supercomputing, the new Tesla models were eagerly adopted en masse.
Nvidia subsequently brought tensor cores to general consumer products with its Turing architecture and developed a technology called DLSS (Deep Learning Super Sampling), which uses those cores to run a neural network on the user's PC, upscaling images and correcting artifacts in frames.
In short order, Nvidia cornered the market for GPU-accelerated deep learning, and its data center revenue soared, with growth rates of 145% in FY2017, 133% in FY2018, and 52% in FY2019. By the end of FY2019, sales to HPC, deep learning, and related sectors totaled $2.9 billion, a remarkably favorable outcome.
However, market growth was inevitable, leading to intensified competition.
Though Nvidia remains the foremost GPU provider to date, other sizable tech corporations weren’t stagnant either.
In 2018, Google began providing access through its cloud services to its internally developed tensor processing chips, quickly followed by Amazon with its dedicated CPU, the AWS Graviton. Simultaneously, AMD reorganized its GPU division into two distinct product lines: one focused primarily on gaming (RDNA) and the other concentrated on computation (CDNA).
While RDNA exhibited marked differences from its predecessors, CDNA predominantly evolved from GCN, albeit significantly scaled up. Looking at today's GPUs used in supercomputers, data servers, and AI machines reveals a colossal leap in size and capability.
For instance, AMD's MI250X, powered by CDNA 2, packs 220 Compute Units, offering nearly 48 TFLOPS of double-precision FP64 throughput along with 128 GB of high-bandwidth memory (HBM2e), both highly coveted features for HPC applications. Meanwhile, Nvidia's GH100, using the Hopper architecture with 576 Tensor Cores, can reach nearly 4,000 TOPS in AI matrix computations using the low-precision INT8 number format.
Intel's Ponte Vecchio GPU is equally massive, featuring 100 billion transistors, while AMD's MI300 packs 146 billion transistors across multiple CPU, GPU, and memory chiplets.
However, one shared trait among them is that they are not GPUs in the traditional sense. Long before Nvidia marketed the term, the acronym stood for Graphics Processing Unit, yet AMD's MI250X contains no render output units (ROPs), and even the GH100 offers Direct3D performance only comparable to a GeForce GTX 1050, rendering the "G" in GPU largely inconsequential.
What should we refer to them as then?
"GPGPU" seems lacking, since it awkwardly refers to the use of GPUs in general computation rather than to the devices themselves. "HPCU" (High-Performance Computing Unit) is scarcely better.
But perhaps that distinction lacks significance.
After all, the term "CPU" itself is highly generic, encompassing various processor types and applications.
So, what could GPUs conquer next?
Nvidia, AMD, Apple, Intel, and dozens of other tech giants have poured billions into GPU research, and today's graphics processors are unlikely to change fundamentally in the foreseeable future.
For rendering tasks, modern APIs and the software designed for them (like game engines and CAD applications) often operate independently of the hardware, meaning they could theoretically adapt to brand new technologies.
Nevertheless, the components within a GPU dedicated solely to graphics tasks are relatively few; triangle setup engines and ROPs are the most obvious, and the ray-tracing units featured in recent generations are also highly specialized. However, the rest of the architecture is fundamentally a large-scale parallel SIMD processor backed by robust and sophisticated memory/cache systems.
The fundamental design is as sound as ever—any future advancements in GPUs will be closely tied to advancements in semiconductor fabrication technology. In other words, they can only be enhanced by accommodating more logic units, operating at higher clock speeds, or a combination of both.
Of course, new functionalities might be integrated to enable GPUs to perform well in a broader array of scenarios.
Such shifts have occurred repeatedly throughout the history of GPUs, but the transition to a unified shader architecture remains particularly noteworthy. While it's ideal to possess dedicated hardware for tensor or ray-tracing computations, the cores at the heart of modern GPUs can adeptly manage both, albeit at reduced efficiency.
That's why products like AMD's MI250 and Nvidia's GH100 bear striking similarities to their consumer desktop counterparts, and future designs intended for HPC and AI are likely to follow this trajectory. So if the chips themselves won't undergo monumental alterations, how will their applications change?
Given that anything AI-related is ultimately a branch of computation, GPUs are likely to find use in any scenario requiring substantial SIMD calculation. While there are few areas within science and engineering that haven't already exploited such processors, a surge in GPU-derived applications remains probable.
Currently, users can procure mobile devices equipped with chips designed solely to accelerate tensor calculations. As the capabilities and popularity of tools like ChatGPT continue to grow, we can expect to see more devices featuring this kind of hardware.
Humble GPUs have transformed from merely faster devices for gaming into versatile accelerators powering workstations, servers, and supercomputers across the globe.
Millions around the world utilize them daily—not just in computers, phones, televisions, and streaming devices but also in services employing voice and image recognition or in providing music and video recommendations.
While the actual next steps for GPUs might reside in uncharted territory, it’s undoubtedly clear that graphics processing units will continue to be central tools for computation and artificial intelligence in the coming decades.