On April 8 the mainframe celebrated an anniversary that no other
family of computer systems has reached, and perhaps none ever will – 50 years of
continuous market availability. This
anniversary is the story of a system that is now so entrenched in information
technology that it has become irreplaceable.
Some may look at this statement and see something nefarious,
fearing perhaps that IBM has cleverly tricked clients into becoming so deeply wedded to
a technology that they can’t move away from it. However, if you talk to mainframe users, you
won't hear them complaining that it is somehow old or obsolete, or that they
are counting the days until they can remove it from their datacenters. Rather, you’ll hear that it is, in fact, the
only way to solve specific, important IT challenges such as the need to be
always available, to always protect client data, or to handle
transaction and data volumes that grow by leaps and bounds every day.
Misconceptions about the mainframe abound. This is in part because students are exposed to
the technology so much less than to more prosaic alternatives during their
education and in part because of the error that we make over and over in the IT
industry, assuming that each new technology must somehow by virtue of its
novelty be better than the one that preceded it. If you look at other areas of engineering,
the approach to progress is more mature.
Modern architects don't assume that they have built something better
than Frank Lloyd Wright or Antoni Gaudí simply because they are using new
materials or new design tools. Modern
civil engineers don’t assume that their bridges are more innovative than the Golden
Gate or Sydney Harbour bridges simply because those were built many decades ago. Instead, engineers in these disciplines look
at these older technologies and learn from their strengths and weaknesses,
with a mixture of reverence and critical analysis.
So the question that we should be asking as computer
architects is not why rational users and businesses don’t move away from the
mainframe faster but what it is about the mainframe that they feel is so
irreplaceable.
The simple idea behind the mainframe since its inception is that
a computer system should be engineered like a fine automobile. New models should be created continually, improving
on speed, gas mileage, comfort and other important attributes, but these new
cars should never require the customer to learn all over again how to
drive. Applied to computer architecture,
this means that the system (i.e., the hardware and the core system software and
subsystems) should continually get better, but the applications need not change
to adapt to these system-level improvements.
The people in the core hardware and software engineering teams for the
mainframe are consummate professionals and they remain to this day committed to
this idea. It drives us crazy sometimes
trying to figure out how we’ll create the next microprocessor or storage subsystem or how we’ll continue to capture all of that hardware
value in the standard transactional and batch middleware and operating systems,
all in a way that delivers value transparently to customers. However, it is this set of challenges that
keeps us on the leading edge of computer architecture and software design.
Lest this sound too much like an IBM commercial, there is
certainly a valid and useful debate to be had about choices in computer system
design, and the mainframe is certainly not the right solution for every
workload. One useful debate, for
example, is the question of how to deal with failure. Failure is inherent in both hardware and
software design, whether caused by design flaws, by human error or by
environmental conditions such as radiation, temperature or the occasionally
errant forklift. If a computer system
must provide very high availability, how can it deal with this variety and
frequency of faults?
There are many solutions to this problem, involving
combinations of hardware and software checking and recovery and various forms
of data replication and redundancy. Most
of these techniques are well known and have been used in practice for
decades. The critical question in
designing the computer system and software is not whether such techniques will
be used but who is responsible for implementing them.
The mainframe answer has consistently been to provide a high
degree of both fault avoidance and fault tolerance directly in the system
design, imposing as little responsibility for resiliency as possible onto the
developers of application software.
This has had two positive effects.
First, the intricate engineering of resiliency is left largely to a
small group of well-trained engineers rather than having to rely upon a
population of application developers that is, on average, less well trained and
certainly orders of magnitude larger.
Second, the system itself becomes so robust that it can be trusted to
take on more tasks at a higher intensity over longer periods of time than an
inherently less reliable system, effectively driving up utilization.
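To make the question of "who is responsible" concrete, here is a minimal Python sketch. It is an illustration only, not a depiction of how any mainframe subsystem is actually implemented. Every name in it (`run_with_failover`, `ReplicaFailure`, `credit_account` and the two toy replicas) is hypothetical and invented for this example. The point is simply that the failover logic lives in one small platform-level function, while the application function contains nothing but business logic.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for this sketch: a "replica" is any callable that
# applies a credit to an account and returns the new balance.
Replica = Callable[[str, int], int]
Request = Tuple[str, int]


class ReplicaFailure(Exception):
    """Raised by a (hypothetical) replica that cannot complete a request."""


def run_with_failover(handler: Callable[[Replica, Request], int],
                      replicas: Sequence[Replica],
                      request: Request) -> int:
    """Platform layer: try each replica in turn until one succeeds."""
    last_error = None
    for replica in replicas:
        try:
            return handler(replica, request)
        except ReplicaFailure as err:
            last_error = err  # remember the fault and move to the next replica
    raise RuntimeError("all replicas failed") from last_error


def credit_account(replica: Replica, request: Request) -> int:
    """Application layer: business logic only, no fault handling."""
    account, amount = request
    return replica(account, amount)


def failed_replica(account: str, amount: int) -> int:
    """Toy replica that is always unavailable."""
    raise ReplicaFailure("replica offline")


def healthy_replica(account: str, amount: int) -> int:
    """Toy replica with a fixed starting balance."""
    balances = {"alice": 100}
    return balances[account] + amount


result = run_with_failover(credit_account,
                           [failed_replica, healthy_replica],
                           ("alice", 25))
print(result)  # 125
```

In this sketch the application developer writes only `credit_account`. The small team that owns `run_with_failover` can change the recovery strategy without any application code changing.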
There are other ways to engineer resiliency, of course. The authors of the excellent book The Datacenter as a Computer argue that the sheer scale of Internet workloads
requires a huge distributed system and, therefore, software must be designed to
accommodate the distributed nature of the system. This changes one’s perspective on resiliency
because, after all, if the software must be designed to tolerate faults anyway,
why worry too much about the reliability of an individual computer? At huge scales, this is a perfectly valid
argument but one should not infer from this that the complexity of managing
distributed systems must always or even commonly be exposed to application
software.
Almost exactly 20 years ago, the Parallel Sysplex
clustering technology was created for the mainframe. It allows already capacious mainframes to be
coupled both in physically close configurations for performance and resiliency
and in geographically dispersed configurations for high availability and
disaster recovery. Superficially, this
shares many of the attributes of the large-scale clusters used in
high-performance computing or in Internet-scale cloud datacenters. However, the design choices in the Parallel
Sysplex stay true to the idea that the computer system should implement scaling
and resiliency, leaving applications free to focus principally on their
business logic. Today, the largest banks
in the world use the mainframe with Parallel Sysplex, some serving hundreds of
millions of customers every day with fully transactional and secure
services. I think it’s fair to say that
there are very few workloads or customers that require anything larger than what the
mainframe already provides.
A colleague asked me recently what we could be doing in the
area of hardware or software innovation to mimic the kind of resiliency and
availability provided by the mainframe.
As I reflected on this, I came up with a number of specific ideas about
how to improve the quality of specific components and how to emulate some of
the software-based fault management done on the mainframe, but then concluded
that the mainframe’s legendary reliability is really due to the sum of
many hundreds of design decisions and to the care and professionalism of the
people who build the mainframe and work with customers every day. It’s hard to see how 50 years of such
excellent engineering and, of course, the hard-won lessons working with demanding
clients could ever be reproduced. Even
if we could, I’m pretty sure we’d end up with today’s mainframe anyway.