Software, meet hardware: Why there will never be another mainframe

The mainframe celebrated an anniversary on April 8 that no family of computer systems ever has or perhaps ever will again – 50 years of continuous market availability. This anniversary is the story of a system that has become so entrenched in information technology that it has become irreplaceable.

Some may look at this statement and see something nefarious, fearing perhaps that IBM has cleverly tricked clients into becoming married to a technology so deeply that they can’t move away from it. However, if you talk to mainframe users, you won't hear them complaining that it is somehow old or obsolete, or that they are counting the days until they can remove it from their datacenters. Rather, you’ll hear that it is, in fact, the only way to solve specific, important IT challenges such as the need to be always available, or to always protect their client data or to handle transaction and data volumes that grow by leaps and bounds every day.

Misconceptions about the mainframe abound. This is in part because students are made far less aware of the technology than of more prosaic alternatives during their education, and in part because of an error that we make over and over in the IT industry: assuming that each new technology must, by virtue of its novelty, be better than the one that preceded it. If you look at other areas of engineering, the approach to progress is more mature. Modern architects don't assume that they have built something better than Frank Lloyd Wright or Antoni Gaudí simply because they are using new materials or new design tools. Modern civil engineers don’t assume they can build bridges more innovative than the Golden Gate or Sydney Harbour bridges simply because those were built many decades ago. Instead, engineers in these disciplines look at these older technologies and learn from their strengths and weaknesses, with a mixture of reverence and critical analysis.

So the question that we should be asking as computer architects is not why rational users and businesses don’t move away from the mainframe faster but what it is about the mainframe that they feel is so irreplaceable.

The simple idea of the mainframe since its inception is that a computer system should be engineered like a fine automobile. New models should be created continually, improving on speed, gas mileage, comfort and other important attributes but these new cars should never require the customer to learn all over again how to drive. Applied to computer architecture, this means that the system (i.e. the hardware and the core system software and subsystems) should continually get better but the applications need not change to adapt to these system-level improvements. The people in the core hardware and software engineering teams for the mainframe are consummate professionals and they remain to this day committed to this idea. It drives us crazy sometimes trying to figure out how we’ll create the next microprocessor or storage subsystem or how we’ll continue to capture all of that hardware value in the standard transactional and batch middleware and operating systems, all in a way that delivers value transparently to customers. However, it is this set of challenges that keeps us on the leading edge of computer architecture and software design.

So as not to sound too much like an IBM commercial, let me acknowledge that there is a valid and useful debate to be had around choices in computer system design, and the mainframe is certainly not the right solution for every workload. One useful debate, for example, is the question of how to deal with failure. Failure is inherent in both hardware and software design, whether caused by design flaws, by human error or by environmental conditions such as radiation, temperature or the occasionally errant forklift. If a computer system must provide very high availability, how can it deal with this variety and frequency of faults?

There are many solutions to this problem, involving combinations of hardware and software checking and recovery and various forms of data replication and redundancy. Most of these techniques are well known and have been used in practice for decades. The critical question in designing the computer system and software is not whether such techniques will be used but who is responsible for implementing them.

The mainframe answer has consistently been to provide a high degree of both fault avoidance and fault tolerance directly in the system design, imposing as little responsibility for resiliency as possible onto the developers of application software. This has had two positive effects. First, the intricate engineering of resiliency is left largely to a small group of well-trained engineers rather than having to rely upon a population of application developers that is, on average, less well trained and certainly orders of magnitude larger. Second, the system itself becomes so robust that it can be trusted to take on more tasks at a higher intensity over longer periods of time than an inherently less reliable system, effectively driving up utilization.
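To make that division of responsibility concrete, here is a minimal Python sketch. It is an illustration only, not how the mainframe implements resiliency: the names `execute_on`, `run_transaction`, `NodeFailure` and the node labels are all hypothetical, and the "fault" is simulated. The point is simply that the failover logic lives in one system-level function, while the application function contains business logic and nothing else.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical set of nodes that are currently faulty (simulated fault).
FAILED_NODES = {"node-a"}


class NodeFailure(Exception):
    """Raised by the simulated system layer when a node cannot run work."""


def execute_on(node: str, transaction: Callable[[], T]) -> T:
    """System layer: run the transaction on one node, or report a fault."""
    if node in FAILED_NODES:
        raise NodeFailure(f"{node} is unavailable")
    return transaction()


def run_transaction(transaction: Callable[[], T], nodes: Sequence[str]) -> T:
    """System layer: fail over across nodes until one completes the work."""
    last_error = None
    for node in nodes:
        try:
            return execute_on(node, transaction)
        except NodeFailure as err:
            last_error = err  # remember the fault, then try the next node
    raise RuntimeError("all nodes failed") from last_error


def post_payment() -> str:
    """Application layer: business logic only, with no fault handling."""
    return "payment posted"


print(run_transaction(post_payment, ["node-a", "node-b"]))
```

In this toy version the application developer writes `post_payment` and never sees the failure of `node-a`. The alternative design would push the loop and the exception handling into every application, which is the trade-off described above.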

There are other ways to engineer resiliency, of course. The authors of the excellent book The Datacenter as a Computer argue that the sheer scale of Internet workloads requires a huge distributed system and, therefore, software must be designed to accommodate the distributed nature of the system. This changes one’s perspective on resiliency because, after all, if the software must be designed to tolerate faults anyway, why worry too much about the reliability of an individual computer? At huge scales, this is a perfectly valid argument, but one should not infer from this that the complexity of managing distributed systems must always or even commonly be exposed to application software.

Almost 20 years ago to this day, the Parallel Sysplex clustering technology was created for the mainframe. It allows already capacious mainframes to be coupled both in physically close configurations for performance and resiliency and in geographically dispersed configurations for high availability and disaster recovery. Superficially, this shares many of the attributes of the large-scale clusters used in high performance computing or in Internet-scale cloud datacenters. However, the design choices in the Parallel Sysplex stay true to the idea that the computer system should implement scaling and resiliency, leaving applications free to focus principally on their business logic. Today, the largest banks in the world use the mainframe with Parallel Sysplex, some serving hundreds of millions of customers every day with fully transactional and secure services. I think it’s fair to say that there are very few workloads or customers that require anything larger than what the mainframe already provides.

A colleague asked me recently what we could be doing in the area of hardware or software innovation to mimic the kind of resiliency and availability provided by the mainframe. As I reflected on this, I came up with a number of specific ideas about how to improve the quality of specific components and how to emulate some of the software-based fault management done on the mainframe. I then concluded, however, that the mainframe's legendary reliability is really due to the sum of many hundreds of design decisions and to the care and professionalism of the people who build the mainframe and work with customers every day. It’s hard to see how 50 years of such excellent engineering and, of course, the hard-won lessons of working with demanding clients could ever be reproduced. Even if we could, I’m pretty sure we’d end up with today’s mainframe anyway.

Thursday, April 24, 2014
