Software, meet hardware

Tuesday, May 6, 2014

Heterogeneous systems, for real

The idea of heterogeneous systems has been an active topic of discussion and research in the computer architecture community for many years. The idea is not new. It basically posits that having a mix of processor types matching the needs of a mix of workloads can deliver greater performance and/or overall efficiency than a comparable (e.g. equally priced) homogeneous design. In some very special cases, such as in network appliances or in the offloading of cryptographic processing, this has established value. However, for a number of reasons it has been very difficult to generalize this idea and to penetrate widely deployed systems and workloads.

Up until about a decade ago, it was very hard to compete with the steady improvements in general-purpose microprocessors. Even if a special-purpose processor might create an advantage, that advantage was always short-lived because good software running on modern microprocessors always seemed to win in the end. To a degree, this is still true but the "good software" part is getting harder to achieve. The reason is that while microprocessors continue improving in their capabilities, it is becoming increasingly difficult to exploit them to their fullest potential. A major reason for this is that most of the new performance is coming in the form of additional parallelism vs. improved thread speed. This means that software must be written not only to have enough parallelism but also to manage parallel execution and synchronization efficiently enough to really use all the hardware, including the especially complicated task of arranging for data to flow efficiently between processors and main memory. The consequence of all this is that the rate of improvement in performance for most software running on modern general-purpose processors has slowed significantly. To put it a different way, getting good performance requires extra effort on the part of software engineers so now they get to choose where to expend that effort - on general-purpose processors, on special-purpose processors, or maybe both.

Another reason that this idea of heterogeneous systems has been difficult to generalize is the pace of change in software. It's always been possible to come up with a clever algorithm or even hardware architecture to solve a given problem. Unfortunately, it often takes some time for these ideas to come to market (especially hardware designs) so, by the time they have, the requirements, interfaces or standards will often have changed enough that the clever idea is now solving the wrong problem, or one whose importance to end-to-end performance may have significantly diminished. We're not seeing the pace of change slowing down in software but we are seeing a sort of Cambrian explosion of new languages and programming frameworks emerging, which gives me hope that we'll be able to attack this problem at the platform layer (e.g. with functional languages and implicitly parallel primitives). It may be that I just can't shake the idealism I developed as a developer of optimizing compilers but I see hope here to separate the concerns of parallel algorithm development from parallel systems to a great enough extent that we can transform the task of hardware acceleration from a slow, waterfall-like process into something much closer to agile software development. I'll have much more to say on that topic later.

Anyway, the reason for my blog post is to talk a bit about the work that's been happening in the partnership between IBM and NVIDIA on building a real heterogeneous system which exploits the best of the POWER architecture and the best of NVIDIA's GPU design. Check out Sumit Gupta's blog on the subject to get an overview of what we're doing together.

I'll say right here on the record that it's challenging to exploit a highly parallel and heterogeneous system but, equally, the rewards can be tremendous. So when we look at engineering some aspect of our software stack for high performance, the ability to pull in a massively parallel NVIDIA processor alongside POWER is very compelling.

We're already seeing amazing results.

At GTC in March, my colleague Keith Campbell showed a demo of a K-means clustering algorithm running 8x faster end-to-end by exploiting multiple levels of parallelism - cluster, SMP and GPU - and we did that with Java code! At the same conference, two more colleagues showed GPU acceleration results: Rajesh Bordawekar discussed hashing techniques on the GPU and Alon Shalev Housfater showed some great speedups for regular expression processing.
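To make the SMP level of that parallelism a little more concrete, here is a minimal sketch of one K-means iteration in Java. To be clear, this is not the code from the GTC demo: the class name KMeansSketch and its methods are hypothetical, it assumes Java 8 parallel streams, and it covers only the multi-core level, not the cluster or GPU levels.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only; not the code shown at GTC.
class KMeansSketch {

    // Index of the centroid closest to point p (squared Euclidean distance).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Assignment step: every point is independent of the others,
    // so the runtime is free to spread the work across cores.
    static int[] assign(double[][] points, double[][] centroids) {
        return IntStream.range(0, points.length)
                .parallel()
                .map(i -> nearest(points[i], centroids))
                .toArray();
    }

    // Update step (sequential here): move each centroid to the mean of
    // its assigned points; a centroid with no points keeps its old position.
    static double[][] update(double[][] points, int[] labels, double[][] old) {
        int k = old.length;
        int dim = old[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) {
                sums[labels[i]][j] += points[i][j];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < dim; j++) {
                sums[c][j] = counts[c] > 0 ? sums[c][j] / counts[c] : old[c][j];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // Made-up toy data: two obvious clusters in the plane.
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{1, 1}, {9, 9}};
        int[] labels = new int[points.length];
        for (int iter = 0; iter < 5; iter++) {
            labels = assign(points, centroids);
            centroids = update(points, labels, centroids);
        }
        System.out.println(Arrays.toString(labels));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

The assignment step is the easy part to parallelize because no point depends on any other. The update step is a reduction, which is where the synchronization and data movement costs start to show up once you go beyond a single machine.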

This is just the tip of the iceberg.

The message here is that, with the right engineering, amazing speedups and improvements in efficiency are possible, and not just in the domains that have been traditionally associated with GPU acceleration like graphics, gaming and scientific computation.

Thursday, April 24, 2014

Why there will never be another mainframe

The mainframe celebrated an anniversary on April 8 that no family of computer systems ever has or perhaps ever will again – 50 years of continuous market availability. This anniversary is the story of a system that has become so entrenched in information technology that it has become irreplaceable.

Some may look at this statement and see something nefarious, fearing perhaps that IBM has cleverly tricked clients into becoming married to a technology so deeply that they can’t move away from it. However, if you talk to mainframe users, you won't hear them complaining that it is somehow old or obsolete, or that they are counting the days until they can remove it from their datacenters. Rather, you’ll hear that it is, in fact, the only way to solve specific, important IT challenges such as the need to be always available, or to always protect their client data or to handle transaction and data volumes that grow by leaps and bounds every day.

Misconceptions of the mainframe abound. This is in part because students are made so much less aware of the technology than of more prosaic alternatives during their education and in part because of the error that we make over and over in the IT industry, assuming that each new technology must somehow by virtue of its novelty be better than the one that preceded it. If you look at other areas of engineering, the approach to progress is more mature. Modern architects don't assume that they have built something better than Frank Lloyd Wright or Antoni Gaudi simply because they are using new materials or new design tools. Modern civil engineers don’t assume they can build bridges more innovative than the Golden Gate or Sydney Harbour bridges simply because those were built many decades ago. Instead, engineers in these disciplines look at these older technologies and they learn from their strengths and weaknesses, with a mixture of reverence and critical analysis.

So the question that we should be asking as computer architects is not why rational users and businesses don’t move away from the mainframe faster but what it is about the mainframe that they feel is so irreplaceable.

The simple idea of the mainframe since its inception is that a computer system should be engineered like a fine automobile. New models should be created continually, improving on speed, gas mileage, comfort and other important attributes but these new cars should never require the customer to learn all over again how to drive. Applied to computer architecture, this means that the system (i.e. the hardware and the core system software and subsystems) should continually get better but the applications need not change to adapt to these system-level improvements. The people in the core hardware and software engineering teams for the mainframe are consummate professionals and they remain to this day committed to this idea. It drives us crazy sometimes trying to figure out how we’ll create the next microprocessor or storage subsystem or how we’ll continue to capture all of that hardware value in the standard transactional and batch middleware and operating systems, all in a way that delivers value transparently to customers. However, it is this set of challenges that keeps us on the leading edge of computer architecture and software design.

Not to sound too much like an IBM commercial, there is certainly a valid and useful debate to be had around choices in computer system design and the mainframe is certainly not the right solution for every workload. One useful debate, for example, is the question of how to deal with failure. Failure is inherent in both hardware and software design, whether caused by design flaws, by human error or by environmental conditions such as radiation, temperature or the occasionally errant forklift. If a computer system must provide very high availability, how can it deal with this variety and frequency of faults?

There are many solutions to this problem, involving combinations of hardware and software checking and recovery and various forms of data replication and redundancy. Most of these techniques are well known and have been used in practice for decades. The critical question in designing the computer system and software is not whether such techniques will be used but who is responsible for implementing them.

The mainframe answer has consistently been to provide a high degree of both fault avoidance and fault tolerance directly in the system design, imposing as little responsibility for resiliency as possible onto the developers of application software. This has had two positive effects. First, the intricate engineering of resiliency is left largely to a small group of well-trained engineers rather than having to rely upon a population of application developers that is, on average, less well trained and certainly orders of magnitude larger. Second, the system itself becomes so robust that it can be trusted to take on more tasks at a higher intensity over longer periods of time than an inherently less reliable system, effectively driving up utilization.

There are other ways to engineer resiliency of course. The authors of the excellent book The Datacenter as a Computer argue that the sheer scale of Internet workloads requires a huge distributed system and, therefore, software must be designed to accommodate the distributed nature of the system. This changes one’s perspective on resiliency because, after all, if the software must be designed to tolerate faults anyway, why worry too much about the reliability of an individual computer? At huge scales, this is a perfectly valid argument but one should not infer from this that the complexity of managing distributed systems must always or even commonly be exposed to application software.

Almost 20 years ago to this day, the Parallel Sysplex clustering technology was created for the mainframe. It allows already capacious mainframes to be coupled both in physically close configurations for performance and resiliency and in geographically dispersed configurations for high availability and disaster recovery. Superficially, this shares many of the attributes of the large scale clusters used in high performance computing or in internet-scale cloud datacenters. However, the design choices in the Parallel Sysplex stay true to the idea that the computer system should implement scaling and resiliency, leaving applications free to focus principally on their business logic. Today, the largest banks in the world use the mainframe with Parallel Sysplex, some serving hundreds of millions of customers every day with fully transactional and secure services. I think it’s fair to say that there are very few workloads or customers who require anything larger than what the mainframe already provides.

A colleague asked me recently what we could be doing in the area of hardware or software innovation to mimic the kind of resiliency and availability provided by the mainframe. As I reflected on this, I came up with a number of specific ideas about how to improve the quality of specific components and how to emulate some of the software-based fault management done on the mainframe. But then I concluded that the mainframe's legendary reliability is really due to the sum of many hundreds of design decisions and to the care and professionalism of the people who build the mainframe and work with customers every day. It’s hard to see how 50 years of such excellent engineering and, of course, the hard-won lessons working with demanding clients could ever be reproduced. Even if we could, I’m pretty sure we’d end up with today’s mainframe anyway.

Friday, July 27, 2012

Your next mobile device will not need a password

In hindsight, this deal is fairly obvious.

I use a number of different mobile devices. They don't all have information that I care about but some have very personal and confidential information so I take special care to protect it - using encryption and strong passwords and rotating those passwords frequently. You never know when you'll leave that device on an airplane or in a Starbucks.

But passwords really are a pain. It's been years since I've been able to come up with non-repeating, reasonably mnemonic passwords that pass strength tests (e.g. mixed case, use of numerics & symbols, etc.). Luckily I have a pretty good memory so I can use randomly generated strong passwords. This also has the nice side effect of keeping my daughters off my iPhone :-)

Luckily, all of this will go away in favour of biometric authentication. Years ago, I used an IBM Thinkpad which had a fingerprint scanner and it was fantastic. I have no idea why that technology never found its way more pervasively into laptops but I think the universal adoption of biometrics on mobile devices is a near certainty and I for one can't wait!

Friday, January 7, 2011

Intel's chase film

Intel has produced some great content for the Sandy Bridge launch. Love this one:

http://www.youtube.com/watch?v=ZM0ptMqNhso

Tuesday, January 4, 2011

Maybe we'll see (full) Windows on ARM soon?

Mary-Jo Foley thinks so, based on speculations here that we'll see a demo at CES this week.

Embedded Windows on ARM isn't new of course, but full Windows on ARM would be a very interesting new platform. Initially, I suppose, this would be targeted at mobile devices like tablets but why not laptops, desktops or even servers?

Tuesday, November 30, 2010

CACM article on Stretch & ACS

Mark Smotherman & Dag Spicer's article on the Stretch & ACS in the latest CACM is worth a read. I've been handed down some of this history as part of my work at IBM but it's great to see it made more public. This was truly the golden age of computer architecture at IBM and in the industry at large. It is still amazing to see what was accomplished so many years ago and, perhaps shamefully, how little we've progressed beyond these innovations even today.

There are a couple of interesting points I would have made that Smotherman & Spicer missed. First, the work by Cocke & Allen (and many others) on the 801 led fairly directly to the IBM RT and then the IBM RS/6000 - now known as IBM Power systems. As the co-development of architecture and compiler was so fundamental to this RISC design, the conjoined compiler effort went on and evolved as well into what is now known as the XL compiler suite. The other point that I think I would have made is to challenge computer architects today to think as boldly as these pioneers did 40 years ago. If they could imagine superscalar and out-of-order execution, what could we be imagining and prototyping today?

Sunday, November 21, 2010

ARM servers begin to appear

Many rumblings have been heard over the past year about ARM chips showing up in servers. ZT Systems has just announced one. It's a 1U rack-mounted server which includes 8 dual-core ARM chips, each of which is essentially a distinct node in a cluster. ZT calls each node a "system on module" or SOM. Each SOM gets 1GB of DDR3 ECC DRAM and 1GB of NAND flash. The processors are dual-core Cortex-A9 parts, made by STMicro. These processors are pretty slow, maxing out at 600MHz, and the memory is very lean at roughly 500MB per core, so it won't be just any server side code that runs well on one of these. On the other hand, they draw very little power, apparently no more than 80W per 1U server node. So these are servers for special-purpose, very parallel apps running in very power-sensitive environments.
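For anyone who wants to check the per-core numbers, here is a small back-of-the-envelope sketch in Java. The class name ArmServerSketch is hypothetical, and it assumes that 1GB means 1024MB and that the 80W figure is the maximum for the whole 1U box; the other inputs are the figures quoted above.

```java
// Hypothetical helper; inputs are the figures quoted in the post.
class ArmServerSketch {
    static final int SOMS_PER_SERVER = 8;      // "system on module" nodes per 1U box
    static final int CORES_PER_SOM = 2;        // dual-core Cortex-A9
    static final int DRAM_MB_PER_SOM = 1024;   // assumes 1GB = 1024MB
    static final double MAX_WATTS = 80.0;      // assumed to cover the whole 1U box

    static int totalCores() {
        return SOMS_PER_SERVER * CORES_PER_SOM;
    }

    static int dramMbPerCore() {
        return DRAM_MB_PER_SOM / CORES_PER_SOM;
    }

    static int totalDramMb() {
        return SOMS_PER_SERVER * DRAM_MB_PER_SOM;
    }

    static double maxWattsPerCore() {
        return MAX_WATTS / totalCores();
    }

    public static void main(String[] args) {
        System.out.println("cores per 1U:   " + totalCores());
        System.out.println("DRAM per core:  " + dramMbPerCore() + " MB");
        System.out.println("DRAM per 1U:    " + totalDramMb() + " MB");
        System.out.println("watts per core: " + maxWattsPerCore());
    }
}
```

Under those assumptions the box works out to 16 cores, 512MB of DRAM per core and at most 5W per core. The power figure is an upper bound that also covers memory, flash and everything else in the chassis.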

From a compute space density perspective, this is uninteresting as a 2-socket 1U Westmere box would be much, much faster and can carry much more DRAM (18 DIMMs in the latest System x offering). Naturally, such a box draws much more power as well but it is an open question still whether the power-performance equation tilts in favour of the ARM-based box. Can't wait to see a side-by-side comparison.