The great blackout of 2003. At about 4:00 Eastern time on
August 14, 2003, the lights
went out from Michigan to as far east as Massachusetts, as well as the
Canadian province of Ontario. Major cities without power included
Cleveland, Columbus, Detroit, Toronto, and Ottawa, as well as much of the
New York City area. Approximately 50 million people were affected by
the outage, making it the largest blackout in American history.
Power in some areas was not restored until two days later.
Inadequate
software played
a major role in the outage, reported the
government task force assigned with investigating the causes. Electric
companies monitor the condition of their transmission lines and use a
state estimator to process the data mathematically and make it
consistent with the electrical system model. At 12:15, the Indiana-based
Midwest Independent Grid Operator's state-estimator software produced a
solution that was obviously in error. This was caused by a transmission
line that had been taken out of service. The analyst found the problem in
about 45 minutes, and then left to go to lunch. Unfortunately, while
looking for the problem, he had turned off the automatic trigger that runs
the state estimator every five minutes. No one noticed that it was off
until 2:40 PM. When turned back on, it again failed to reach a solution.
This was because in the interim, a 597 MW generating station
near Lake Erie had gone out of service, and a Dayton Power & Light
high-voltage transmission line had tripped due to contact with trees. In
the meantime, power was being transmitted over the wrong lines by
operators who were not aware of these outages. By the time they figured
out what was wrong at 4:04 PM, the cascade of outages was about to
begin.
Meanwhile, over in Northeast Ohio, First Energy was having trouble with
their computer systems. At 2:14 PM, FE's control-room operators lost the
alarm function that provided visual and audible indications of trouble with
equipment. Soon therafter, they lost a number of remote consoles. Their
main server went down, followed quickly by the backup server at 2:54 PM.
However, for more than an hour, the control room had no indication that
anything was wrong. Although FE's IT staff was working to bring the
systems up, they didn't notify the control room. The lack of alarms left
the controllers with no clue as to the extent of the problems facing them,
and led them to discount phone calls from outside that warned
of their growing crisis.
The damage might have been averted if software had been more persistent
in issuing warnings. The systems should have been programmed to make it
obvious that the state-estimator was not running, and that the alarm system
was inoperative. Only three
people died in the great blackout. But in Panama in 2001, lack of
software-issued warnings may have cost up to 21 lives. In that case,
28 patients at the National Cancer Institute in Panama City
recieved
massive overdoses of gamma rays because a user interface did not
inform the operator of a radiation-therapy machine of problematic input
data.
Radiation overdoses. In the Panamanian case, a doctor places
metal shields, called "blocks," above the area where tumor is located. The
purpose is to protect normal or sensitive tissue from radiation. The
software for the radiation machine assumed that up to four blocks would be
placed over the patient. However, the doctor was trying to be more
cautious by adding a fifth block. The technician found a way to enter data
for five blocks, and the system displayed a diagram that seemed to confirm
their location. However, if the data for these blocks was entered in a
certain order, the software would miscalculate the treatment times, by a
large amount. This led to overdoses of 20% to 100%, which burned the
patients and left some disabled. In the forty months after the overdoses,
21 of these 28 patients died. The case has led to a trial of the machine's
operator for second-degree murder, and a lawsuit against the machine's
manufacturer, Multidata Systems International, that threatens to bankrupt
the company.
The London Ambulance fiasco. Health-related software has
frequently been involved in fatal malfunctions. Prior to 1991, the London
Ambulance Service was using an uncomputerized dispatching system. They
would write the details of calls on pieces of paper, and pass them to an
allocator. The allocator chose the nearest available ambulance and
forwarded details to a dispatcher, who telephoned or radioed directions.
Sixty-five percent of ambulances reached their destination in fifteen
minutes. A new system was ordered from two different vendors, Datatrak and
Systems Options. Initially, it was configured as a semicomputerized
system--paper reports of calls were still printed out and passed to
dispatchers. It crashed on its first training session. In February 1992,
an operator
inadvertently switched off a screen and lost four emergency calls.
In October, 1992 a new computer
system did away with direct human responses to telephone calls. On its
first two days, fewer than 20% of ambulances reached their destination
within 15 minutes. Some callers were put on hold for up to 30 minutes.
This was not because of an unusually high volume of calls, but because
worried callers would call back. Dispatching of ambulances was delayed by
up to 11
hours. Up to 20 people may have died as a result. Calls were lost.
One disabled woman was trapped in her chair by the body of her collapsed
husband. She called the LAS every 30 minutes. On each call, she was told
that there was no record of her earlier call. An ambulance eventually
arrived 2 hours and 45 minutes after the initial call. By this time, her
husband had died.
Design flaws. The system manifested some basic design
flaws. Following cutover to the automated system, there was no
backup procedure. The user interface gave the operator no
way to scroll through the list of pending calls. The effects
of system overload had not been properly anticipated; the exception
list kept growing. Requirements were not met. Robin Bloomfield, a
government consultant on this system, said it was a fundamental
requirement to have several layers of defense against failure. "With
about a million calls a year," he explained, "the system has to be
more reliable than a nuclear reactor backup system." As put into
operation, the only backup it seemed to have was that people would
make their own arrangements if the system failed.
Forewarnings of disaster. Senior officials of London
Ambulance Service were
warned that system would be an "expensive
disaster," by Michael Page, whose company submitted a competing bid
which had been rejected a year before. Page wrote a series of
memoranda to LAS in June and July 1991, warning that the tracking
subsystem, which tracks ambulances and dispatches the nearest one to
a call, was not up to the requirements. "The rule-based analytical
approach used by the LAS cannot," he wrote, "deal as well as an
experienced operator with the small minority of difficult cases.
The system wrongly reduces the influence of operators."
Why we can't be sure that software is correct. It is hard
to establish the correctness of software. First of all, we can't
ignore design failures. In physical systems, the hardware failure
rate probably greatly exceeds the probability of failure from design
errors. So in practice, we can ignore design-failure probabilities
in safety and reliability calculations. Secondly, redundancy is no
solution. In physical systems, redundancy reduces the impact of
random hardware failures. Unfortunately, software does not exhibit
random failures. "Common-mode" failures can occur. For example,
consider the hydraulic system on United Airlines
Flight 232, July 18,
1989. This was not digital technology, but it was a complex system.
An engine failed, and the resulting explosion disabled all three
hydraulic systems. After that, the jet could not be controlled. The
three hydraulic systems had been put there for the sake of
redundancy. However, lines for all three hydraulic systems converged
in one small area near the tail. An explosion near that site
disabled them all. Now on the Boeing 777, wiring for the digital
flight-control system is routed all over the plane, so no single
explosion can disable it.
Decision paths in software are complex. Even in a program of a
few hundred lines, there are dozens of branches and thousands of
paths. Not all of these paths can be tested. Maybe the designer
misunderstood the situation responsible for those inputs, or failed
to take the situation into account at all. For example, the Patriot
missile failed to intercept Scuds during the first Gulf War. This was
due to the cumulative effect of inaccuracies in timekeeping by a computer.
The system was meant to be turned off and restarted often enough for
accumulated error never to become dangerous.
A lot of oversights, it seems, involve time.
The Y2K problem arose because programmers who used two-digit date
fields never suspected that their code would live on till the year
2000. Maybe we shouldn't blame them for that. But surely we'd
expect programmers to anticipate the beginning of
daylight-savings time next spring! But sometimes they haven't. At the end
of April 1993, when Germany went on summer time, the computer clock of a
German steel producer went from 1:59 AM to 3:00 AM in one minute.
This resulted in a production line allowing molten ingots to cool for
one hour less than normal. When the process controller thought the
cooling time had expired, his actions splattered still-molten steel,
damaging part of the facility. Similarly, latitude and longitude
have sabotaged several computerized systems. Early American
air-traffic control software could not be used in Britain because it
was unable to cope with longitudes less than 0. A bug in the
simulator for the F-16 fighter aircraft caused it to flip over
whenever it crossed the equator.
Small changes in programs make big differences. In many physical
systems, a small change to stimuli will produce only a small change
in response. But in programs, changing a single bit from 0 to 1 can
radically change the operation of a program. The destruction of the
Mariner probe is an example. This was the first US interplanetary
probe, meant to explore Venus. There was a bug in control program
for Atlas rocket that launched it. On July 22, 1962, a single
incorrect character caused it to veer off course. An equation was
missing a "bar" that told it to use a set of averaged values instead
of raw data. This led the computer to decide the rocket was behaving
erratically, although it was not. When the computer tried to correct
the situation, it did cause erratic behavior. For the safety of
people on the ground, both rocket and spacecraft had to be destroyed
shortly after launch.
In 1992, a
Dutch chemical plant exploded due to a single mistyped
character. Three firemen were killed and eleven workers injured.
Fragments were found at a distance of 1 km. The damage was estimated
at several tens of millions of guilders (or about half that much in
dollar terms). A lab-worker trainee typed tank 634 instead of 632 as
the source tank for a chemical used in a reaction. Instead of resin
feed classic, he put in dicyclopentadiene. He failed to check
whether the tank contents were consistent with the recipe. When the
reactor overheated, he called the plant fire department. The fire
department expected the reactor contents to be released via a safety
valve. So they were connecting deluge guns to prevent the fire from
spreading. They did not have safety equipment on. Instead, the
reactor ruptured and exploded. They called in the city fire department,
but it had to let the fire burn out to prevent environmental damage
from polluted firefighting water.
We can't test software "long enough." In June and July 1991, a
series of outages affected telephone users in SF, DC, VA, WV,
Baltimore, and Greensboro. The problems were in a switching program
written by DSC Communications. The program consisted of several
million lines of code. They ran it through a 13-week series of
tests, and it passed. Then they changed three lines. They didn't
think it necessary to repeat the 13-week series of tests. They were
confident they understood the effects of that change. The program
crashed repeatedly.
Overconfidence was the culprit in the most widely known case of
unsafe software. It involved the
Therac-25
, a radiation-therapy
machine. Between June 1985 and January 1987, six known accidents
occurred, involving patients being given massive overdoses of
radiation. Two patients in Galveston, TX died. A machine like the
Therac-25 can deliver two kinds of radiation, X-rays and electrons.
Electrons work well for irradiating cancers near the surface. For
cancers further in, X-rays are used. To get the X-rays, a tungsten
shield is placed over the patient's body. A very powerful stream of
electrons is directed at it, which causes it to emit X-rays. The
X-rays must be emitted only when the shield is in place. Here is the
error that caused the problem: The operator prepared to send X-rays,
then realized she had made a mistake. She switched the machine over
to electrons, and the shield retracted. But it retracted before the
intensity of the beam was lowered. Patients felt a severe burning
sensation, and this was only the start of their problems. Radiation
sickness followed.
Some of the software from the Therac-20, a previous version, was
reused in the Therac-25. The Therac-20 had included a mechanical
interlock. But it had been run for years without any software
problems being noticed. So the mechanical interlock was removed on
the Therac-25. Turns out, when the bug occurred in the Therac-20,
before any harmful radiation could be emitted, it blew a fuse.
Achieving greater confidence. Even if we can't ensure that
software is perfectly correct, how can achieve greater confidence?
One approach is
formal specification. Natural language is a wondrous
instrument. It allows us to say things like, "Do you feel more like
you do now than when you came in?" The requirements for a complex
system will almost inevitably be complex. If expressed in English,
they can easily be incomplete or self-contradictory. In a large
document, these contradictions can be far from obvious.
Writing formal specs in a specification language is difficult and
time-consuming. Therefore, large, complex systems are virtually the
only ones for which it is undertaken. Unfortunately, specifications
must also be formal. If the specification is contradictory, we
should be able to detect it with analysis techniques. If the
specification is incomplete, design may reveal this incompleteness.
But if it is functionally wrong, this can only be determined by human
review of the functioning of the system. The problem is that our
"domain experts" are unfamiliar with specification languages and
mathematical logic. For example, Teletronic Pacing Systems develops
software for pacemakers. They wrote their specs in a formal
specification language. But the cardiologists couldn't understand
them. So they had to rewrite them in English.
We can attempt to verify our systems mathematically. This
approach relies on assertions, preconditions about what has to be
true on entry to routines and postconditions about what should be
true when they are exited. However, a program can only be proved
correct with respect to something else, e.g., a specification. So if
the specification is wrong, the proof is useless. You prove you have
correctly implemented the wrong thing.
Large programs are too tedious to proved correct "by hand."
Therefore, they must be proved by theorem-proving software. But
suppose this software has bugs ... Besides, programs can't be checked
the way mathematical proofs are checked. In math, some proofs are
complicated. There are even cases where two groups of mathematicians
have proved contradictory things, and neither was able to disprove
the others' work. So, our certainty about math relies upon
mathematicians checking the work of others. Prove them wrong, and
you get to publish the results. But program proofs are too long and
mechanical for anyone to verify by reading them. John Rushby, a
proponent of formal verification methods, says, "No one reads this
stuff with a lot of care--it's too boring."
One experiment was to prove emergency shutdown system for the
Darlington nuclear plant on Lake Ontario. The system was put in
software so they could save money on specialized hardware and achieve
greater safety, by providing better information to the operators.
They divided their programmers into four teams. One team converted
the programs into logic. Another did the same with the specs. A
third team verified that the programs implemented the specs. A
fourth team checked the work of the third. This delayed the project
by six months, and in the meantime, the plant could not be opened.
The plant cost its investors $20 million for each month it was idle.
So the delay cost $120M. But at least it saved $1M of hardware.
n-version
programming. OK, if it's too hard to prove that
one program is correct, maybe we can have different teams of
programmers write entirely separate programs and see if their results
agree. If, say, three versions of the program are written, on a
critical decision, the majority can rule. The idea is that
independently developed programs will have independent bugs, and
won't fail at the same time. A test conducted by Nancy Leveson
attempted to validate this approach experimentally. Students at
different grad schools wrote many versions of the same program and
tested them extensively. The same inputs caused many different
versions to fail. It turns out that the hard parts of programs are
just hard, no matter who writes them.
Living with unreliable software. Let's face facts: We'll
have to learn to live with software that is at least a little bit
unreliable. Here are four stratagems we can use to make the
situation more palatable. First, improve programming practices. Use
safer programming languages. As Nancy Leveson says, the easiest way
to find out what features are bad is just to look at C; it has all of
them. Boeing developed systems in both C and Ada. The Ada systems
were not only cheaper to produce, but had many fewer errors. C++
with its dangling pointers and memory leaks isn't much better. For a
safety-critical system, Java or Ada 95 would be a much better choice.
Second,
program more safely. Any memory that's unused should be
initialized to a pattern that will leave system in a safe state,
should it be referenced by accident. Set critical flags and
conditions as close as possible to the code they protect. Decouple
systems. A highly coupled system is one that's highly dependent: A
failure in one component rapidly affects others. Grade separations
decrease accidents because they decouple traffic. This principle
seems lost on the engineers who rebuilt the Raleigh Beltline; they
took out grade separations at three exits and put up traffic lights.
You may write code differently if you're opting for safety. For
example, you can put code in ROM or write-restricted RAM. But
there's a tradeoff--less tightly coupled systems are less efficient.
Third, make the role of software "not too critical." In computer
science, we are under a handicap compared to other engineering
disciplines. In other fields, they know that things will wear out
and fail. But, theoretically, we can get software "right." Nancy
Leveson opines, "Maybe after 40 years of trying, it is time to be a
little more realistic."
Reliability requirements should be modest enough so that
reliability can be demonstrated before a system is deployed. One
option is relying on non-computer-controlled backup systems. A
safety backup usually performs simpler functions than the main
system, so it can be built more reliably. It could be built with
different technology, or use other sensors, actuators and power
sources. For example, a system can use gravity or another physical
property to ensure that a vent valve opens when pressure is
excessive. Then, the probability that both systems will fail
simultaneously is minimized.
Finally, we can learn to live with more modest overall system
safety. Software may still be safer than the non-software system it
replaces. In a fly-by-wire aircraft, pilots control the aircraft
through a computer that may modify their commands before acting on
them. The Airbus A320 or Boeing 777 are examples. The possibility
that
software may cause accidents has to be weighed against the
likelihood that it may avoid mishaps that would otherwise be caused
by pilot error.
Drive-by-wire
systems will avert many auto accidents, but will be more of a challenge
than fly-by-wire. Pilots have to be highly trained and physically fit,
but anyone can drive a car. Suppose a drive-by-wire system is in
control of a car that has just had a blowout, and the driver attempts
to swerve to avoid an obstacle; what should it do?
Robotic surgery will soon be feasible. There may
be bugs, but human surgeons also have fairly high failure rates.
Anyway, robots will soon be able to perform surgery that is beyond
capabilities of humans.
Summary. To end our consideration of software safety, let's
review a few of the main points. Safety problems are likely if
software is tested inadequately or deployed too rapidly. Other
problems arise when the programmer ignores relevant conditions, like
the equator, daylight-savings time, or leap years. Software is
logically more complex than other engineering systems, so it is
necessary to get the design right. In practical terms, it's
impossible to know that a program is correct, so one must at
least follow good design principles and write code in languages
without well known safety vulnerabilities. If these principles are
adhered to assiduously, it should be possible to build computerized
systems that are at least safer than the non-computerized systems they
replace.