Heterogeneous systems have been an active topic of discussion and research in the computer architecture community for many years, so the idea is not new. It posits that a mix of processor types matched to the needs of a mix of workloads can deliver greater performance and/or overall efficiency than a comparable (e.g. equally priced) homogeneous design. In some very special cases, such as network appliances or the offloading of cryptographic processing, this has established value. However, for a number of reasons it has been very difficult to generalize the idea and bring it to widely deployed systems and workloads.
Up until about a decade ago, it was very hard to compete with the steady improvements in general-purpose microprocessors. Even if a special-purpose processor created an advantage, that advantage was short-lived because good software running on modern microprocessors always seemed to win in the end. To a degree this is still true, but the "good software" part is getting harder to achieve. While microprocessors continue to improve in their capabilities, it is becoming increasingly difficult to exploit them to their fullest potential, largely because most of the new performance is coming in the form of additional parallelism rather than improved thread speed. This means that software must not only expose enough parallelism but also manage parallel execution and synchronization efficiently enough to really use all the hardware, including the especially complicated task of arranging for data to flow efficiently between processors and main memory. The consequence is that the rate of performance improvement for most software running on modern general-purpose processors has slowed significantly. To put it a different way, getting good performance now requires extra effort on the part of software engineers, so they get to choose where to expend that effort - on general-purpose processors, on special-purpose processors, or maybe both.
Another reason that the idea of heterogeneous systems has been difficult to generalize is the pace of change in software. It's always been possible to come up with a clever algorithm or even a hardware architecture to solve a given problem. Unfortunately, it often takes some time for these ideas to come to market (especially hardware designs), so by the time they arrive, the requirements, interfaces or standards have often changed enough that the clever idea is now solving the wrong problem, or one whose importance to end-to-end performance has significantly diminished. We're not seeing the pace of change in software slow down, but we are seeing a sort of Cambrian explosion of new languages and programming frameworks, which gives me hope that we'll be able to attack this problem at the platform layer (e.g. with functional languages and implicitly parallel primitives). It may be that I just can't shake the idealism I developed as a developer of optimizing compilers, but I see hope here to separate the concerns of parallel algorithm development from those of parallel systems to a great enough extent that we can transform hardware acceleration from a slow, waterfall-like process into something much closer to agile software development. I'll have much more to say on that topic later.
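To make "implicitly parallel primitives" a little more concrete, here is a minimal sketch using the standard Java streams API. It is only an illustration and not code from any of the projects mentioned in this post; the class name `ImplicitParallelSketch` and the method `sumOfSquares` are made up for the example. The point is that the algorithm is written once as a map followed by a reduction, and the decision to run it in parallel is left to the platform.

```java
import java.util.stream.IntStream;

public class ImplicitParallelSketch {

    // Sum of i*i for i in 1..n, expressed as a map followed by a reduction.
    // No threads, locks or work partitioning appear in the algorithm itself.
    // (Hypothetical helper, named for this example only.)
    static long sumOfSquares(int n, boolean parallel) {
        IntStream range = IntStream.rangeClosed(1, n);
        if (parallel) {
            // Ask the runtime to split the work across its common thread pool.
            range = range.parallel();
        }
        return range.mapToLong(i -> (long) i * i).sum();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long sequential = sumOfSquares(n, false);
        long parallel = sumOfSquares(n, true);
        // Integer addition is associative, so both runs give the same total.
        System.out.println(sequential == parallel);
    }
}
```

Because the code says nothing about how the work is divided, the same source could in principle be mapped onto more cores or a different kind of processor by the platform underneath it, which is the separation of concerns I'm hoping for.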
Anyway, the reason for this blog post is to talk a bit about the work that's been happening in the partnership between IBM and NVIDIA on building a real heterogeneous system that exploits the best of the POWER architecture and the best of NVIDIA's GPU design. Check out Sumit Gupta's blog on the subject to get an overview of what we're doing together.
I'll say right here on the record that it's challenging to exploit a highly parallel and heterogeneous system but, equally, the rewards can be tremendous. So when we look at engineering some aspect of our software stack for high performance, the ability to pull in a massively parallel NVIDIA processor alongside POWER is very compelling.
We're already seeing amazing results.
At GTC in March, my colleague Keith Campbell showed a demo of a K-means clustering algorithm running 8x faster end-to-end by exploiting multiple levels of parallelism - cluster, SMP and GPU - and we did that with Java code! At the same conference, two more colleagues showed GPU acceleration results: Rajesh Bordawekar discussed hashing techniques on the GPU and Alon Shalev Housfater showed some great speedups for regular expression processing.
This is just the tip of the iceberg.
The message here is that, with the right engineering, amazing speedups and improvements in efficiency are possible, and not just in the domains traditionally associated with GPU acceleration, like graphics, gaming and scientific computation.