Lecture 13: Software Safety

The London Ambulance fiasco. In 1991, the London Ambulance Service scuttled a new computerized dispatch system. It was supposed to have been ready in Summer 1990. Costs had escalated from £2.5M to £7.5M. It failed two major tests, and the LAS sued the vendor--a subsidiary of British Telecom--and a subcontractor.

At the time, the ambulance service was using an uncomputerized dispatching system. They would write the details of calls on pieces of paper, and pass them to an allocator. The allocator chose the nearest available ambulance and forwarded details to a dispatcher, who telephoned or radioed directions. Sixty-five percent of ambulances reached their destination in fifteen minutes. A new system was ordered from two different vendors, Datatrak and Systems Options. Initially, it was configured as a semicomputerized system--paper reports of calls were still printed out and passed to dispatchers. It crashed on its first training session. In February 1992, an operator inadvertently switched off a screen and lost four emergency calls. On one such occasion, the caller called back 30 minutes later and an ambulance was dispatched. The patient later died, although "it is not proven that there was any link between the delay and the death."

In October 1992, a new computer system did away with direct human responses to telephone calls. On its first two days, Oct. 26 and 27, fewer than 20% of ambulances reached their destination within 15 minutes of being called. Some callers were put on hold for up to 30 minutes. The overload was caused not by an unusually high volume of genuine emergencies, but by worried callers calling back. There were delays of up to 11 hours in dispatching ambulances. Up to 20 people may have died as a result. Logged calls were lost. One disabled woman was trapped in her chair by the body of her collapsed husband. She called the LAS every 30 minutes. On each call, she was told that there was no record of her earlier call. An ambulance eventually arrived 2 hours and 45 minutes after the initial call. By this time, her husband had died.

Design flaws. The system manifested some basic design flaws. Following cutover to the automated system, there was no backup procedure. The user interface was poor: the operator had no way to scroll through the list of pending calls. The effects of system overload had not been properly anticipated; the exception list kept growing. Requirements were not met. Robin Bloomfield, a government consultant on this system, said it was a fundamental requirement to have several layers of defense against failure. "With about a million calls a year," he explained, "the system has to be more reliable than a nuclear reactor backup system." As put into operation, the only backup it seemed to have was that people would make their own arrangements if the system failed.

Forewarnings of disaster. Senior officials of the London Ambulance Service were warned that the system would be an "expensive disaster" by Michael Page, whose company had submitted a competing bid that was rejected a year earlier. Page wrote a series of memoranda to the LAS in June and July 1991, warning that the tracking subsystem, which tracks ambulances and dispatches the nearest one to a call, was not up to the requirements. "The rule-based analytical approach used by the LAS cannot," he wrote, "deal as well as an experienced operator with the small minority of difficult cases. The system wrongly reduces the influence of operators."

Why we can't be sure that software is correct. It is hard to establish the correctness of software. First of all, we can't ignore design failures. In physical systems, the probability of random hardware failure greatly exceeds the probability of failure from design errors, so in practice design-failure probabilities can be ignored in safety and reliability calculations. Software does not wear out; essentially every software failure is a design failure, so there is nothing we can ignore. Secondly, redundancy is no solution. In physical systems, redundancy reduces the impact of random hardware failures. Unfortunately, software does not exhibit random failures: identical copies fail identically, and "common-mode" failures can occur. For example, consider the hydraulic system on United Airlines Flight 232, July 19, 1989. This was not digital technology, but it was a complex system. An engine failed, and the resulting explosion disabled all three hydraulic systems. After that, the jet could not be controlled. The three hydraulic systems had been put there for the sake of redundancy. However, lines for all three converged in one small area near the tail, and an explosion at that site disabled them all. On the Boeing 777, by contrast, wiring for the digital flight-control system is routed along widely separated paths through the plane, so no single explosion can disable it.

Decision paths in software are complex. Even in a program of a few hundred lines, there are dozens of branches and thousands of paths. Not all of these paths can be tested, so some are exercised for the first time in operation, by inputs the designer never anticipated. Maybe the designer misunderstood the situation that produced those inputs, or failed to take it into account at all. For example, the Patriot missile failed to intercept Scuds during the first Gulf War because of the cumulative effect of inaccuracies in the computer's timekeeping. The system was meant to be turned off and restarted often enough that the accumulated error never became dangerous; in the field it was left running for days.
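
To see how a tiny per-tick error becomes a dangerous one, here is a minimal sketch that reproduces the arithmetic usually cited for the Patriot failure -- a clock kept in tenths of a second and converted to seconds with a 24-bit approximation of 0.1. It mirrors the reported numbers, not the deployed software itself:

    # Sketch of cumulative clock drift; the constants mirror the commonly
    # reported Patriot arithmetic, not the actual Patriot code.
    TENTH = int(0.1 * 2**24) / 2**24      # 0.1 truncated to 24 fractional bits

    ticks = 100 * 60 * 60 * 10            # tenths of a second in 100 hours of uptime
    true_seconds = ticks * 0.1
    computed_seconds = ticks * TENTH

    print(f"drift after 100 hours: {true_seconds - computed_seconds:.3f} s")  # about 0.34 s

A drift of a third of a second sounds harmless, but a Scud covers hundreds of meters in that time, which was enough to put the incoming missile outside the tracking window.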

A lot of oversights, it seems, involve time. The Y2K problem arose because programmers who used two-digit date fields never suspected that their code would live on till the year 2000. Maybe we shouldn't blame them for that. But surely we'd expect programmers to anticipate the switch to daylight-saving time next spring. Sometimes they haven't. In the spring of 1993, when Germany went on summer time, the computer clock of a German steel producer jumped from 1:59 AM to 3:00 AM in one minute. As a result, a production line allowed molten ingots to cool for one hour less than normal. When the process controller thought the cooling time had expired, his actions splattered still-molten steel, damaging part of the facility. Latitude and longitude have sabotaged several computerized systems in a similar way. Early American air-traffic control software could not be used in Britain because it was unable to cope with longitudes less than 0. A bug in the simulator for the F-16 fighter aircraft caused it to flip over whenever it crossed the equator.
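
The steel-mill failure is easy to reproduce in miniature. The sketch below is hypothetical -- it is not the plant's software -- and simply computes a cooling interval from naive local-clock timestamps that straddle the spring-forward jump, then from timezone-aware timestamps, which reveal that an hour of real time is missing:

    from datetime import datetime
    from zoneinfo import ZoneInfo   # Python 3.9+

    berlin = ZoneInfo("Europe/Berlin")

    # German clocks jump from 01:59 to 03:00 on the night summer time begins.
    start = datetime(1993, 3, 28, 1, 30)   # cooling begins (naive local time)
    end   = datetime(1993, 3, 28, 4, 30)   # controller later checks the clock

    # 3:00:00 by the wall clock...
    print(end - start)
    # ...but only 2:00:00 actually elapsed, because the clock skipped an hour.
    print(end.replace(tzinfo=berlin) - start.replace(tzinfo=berlin))

The safer pattern is to measure durations with a clock that never jumps -- time.monotonic() in Python -- and keep wall-clock time for display and logging only.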

Small changes in programs make big differences. In many physical systems, a small change in stimulus produces only a small change in response. But in programs, changing a single bit from 0 to 1 can radically change the operation of the program. The destruction of Mariner 1 is an example. It was the first US attempt at an interplanetary probe, meant to explore Venus, and there was a bug in the control program for the Atlas rocket that launched it. On July 22, 1962, a single incorrect character caused it to veer off course: an equation was missing a "bar" that told the program to use a set of averaged values instead of raw data. This led the computer to decide the rocket was behaving erratically, although it was not. When the computer tried to correct the situation, it did cause erratic behavior. For the safety of people on the ground, both rocket and spacecraft had to be destroyed shortly after launch.

In 1992, a Dutch chemical plant exploded because of a single mistyped character. Three firemen were killed and eleven workers injured; fragments were found up to 1 km away, and the damage was estimated at several tens of millions of guilders (roughly half that amount in dollars). A lab-worker trainee had typed tank 634 instead of 632 as the source tank for a chemical used in a reaction, so instead of resin feed classic, the reactor was charged with dicyclopentadiene. He failed to check whether the tank contents were consistent with the recipe. When the reactor overheated, he called the plant fire department, which expected the reactor contents to be released through a safety valve. So they were connecting deluge guns to keep the fire from spreading, and they were not wearing protective equipment. Instead, the reactor ruptured and exploded. The city fire department was called in, but it had to let the fire burn out to prevent environmental damage from contaminated firefighting water.

We can't test software "long enough." In June and July 1991, a series of outages affected telephone users in SF, DC, VA, WV, Baltimore, and Greensboro. The problems were in a switching program written by DSC Communications. The program consisted of several million lines of code. They ran it through a 13-week series of tests, and it passed. Then they changed three lines. They didn't think it necessary to repeat the 13-week series of tests. They were confident they understood the effects of that change. The program crashed repeatedly.

Overconfidence was the culprit in the most widely known case of unsafe software. It involved the Therac-25, a radiation-therapy machine. Between June 1985 and January 1987, six known accidents occurred in which patients were given massive overdoses of radiation. Two patients in Tyler, TX died. A machine like the Therac-25 can deliver two kinds of radiation, X-rays and electrons. Electrons work well for irradiating cancers near the surface; for cancers deeper in, X-rays are used. To produce the X-rays, a tungsten target is moved into the path of the beam and a very powerful stream of electrons is directed at it, which causes it to emit X-rays. The high-power beam must be turned on only when the target is in place. Here is the error that caused the problem: the operator prepared to send X-rays, then realized she had made a mistake. She switched the machine over to electrons, and the target retracted. But it retracted before the intensity of the beam was lowered. Patients felt a severe burning sensation, and that was only the start of their problems. Radiation sickness followed.

Some of the software from the Therac-20, a previous model, was reused in the Therac-25. The Therac-20 had included a hardware interlock, and it had been run for years without any software problems being noticed, so the interlock was left out of the Therac-25. It turned out that the same bug had been occurring on the Therac-20 all along; its interlock simply blew a fuse before any harmful radiation could be emitted.
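
The underlying defect was a timing window between the operator's editing of the prescription and the routine that set up the beam. The sketch below is a deliberately simplified, hypothetical model of that unsafe window -- the real machine was not written in Python, and its logic was far more involved -- but it shows the point: a software interlock has to check the physical configuration itself, immediately before the beam fires.

    class TherapyMachine:
        """Toy model of the hazard, not the Therac-25's actual design."""
        def __init__(self):
            self.target_in_place = True   # tungsten target sits in the beam path
            self.power = "HIGH"           # X-ray mode uses a far more intense beam

        def switch_to_electrons(self):
            self.target_in_place = False  # the target moves out immediately...
            # ...but the beam power is lowered later, by a separate task.  If the
            # beam fires inside this window, the patient gets a massive overdose.

        def fire(self):
            # Software interlock: check the configuration right before firing.
            assert self.target_in_place or self.power == "LOW", \
                "unsafe: high-power beam with target out of the beam path"

    machine = TherapyMachine()
    machine.switch_to_electrons()
    try:
        machine.fire()
    except AssertionError as err:
        print("interlock tripped:", err)

The Therac-20's fuse performed the same check in hardware; the Therac-25's mistake was to assume that software alone made such a check unnecessary.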

Achieving greater confidence. Even if we can't ensure that software is perfectly correct, how can we achieve greater confidence in it? One approach is formal specification. Natural language is a wondrous instrument; it allows us to say things like, "Do you feel more like you do now than when you came in?" But that flexibility comes at the price of precision. The requirements for a complex system will almost inevitably be complex, and if they are expressed in English, they can easily be incomplete or self-contradictory. In a large document, these contradictions can be far from obvious.

Writing formal specs in a specification language is difficult and time-consuming, so it is undertaken virtually only for large, complex systems. Unfortunately, even a formal specification must itself be validated. If the specification is contradictory, we should be able to detect that with analysis techniques. If the specification is incomplete, design may reveal the incompleteness. But if it is functionally wrong, that can only be determined by human review of the intended functioning of the system. The problem is that our "domain experts" are unfamiliar with specification languages and mathematical logic. For example, Telectronics Pacing Systems develops software for pacemakers. They wrote their specs in a formal specification language, but the cardiologists couldn't understand them, so they had to be rewritten in English.
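
To make "detecting a contradiction by analysis" concrete, here is a toy example with two made-up requirements for a tank controller; a brute-force check over the possible states finds the situation in which no behavior can satisfy both. Real specification tools do this over far richer state spaces, but the principle is the same.

    from itertools import product

    # Hypothetical requirements, written as predicates over the system state:
    #   R1: whenever pressure is high, the vent valve must be open.
    #   R2: the vent valve must never be open while the tank is being filled.
    def r1(pressure_high, filling, valve_open):
        return (not pressure_high) or valve_open

    def r2(pressure_high, filling, valve_open):
        return not (filling and valve_open)

    # For every situation the controller can face, is there any valve setting
    # that satisfies both requirements at once?
    for pressure_high, filling in product([False, True], repeat=2):
        satisfiable = any(r1(pressure_high, filling, v) and r2(pressure_high, filling, v)
                          for v in (False, True))
        if not satisfiable:
            print(f"contradiction: pressure_high={pressure_high}, filling={filling}")

The check reports that when the pressure is high during filling, no valve position satisfies both requirements; in English, that conflict could sit unnoticed fifty pages apart.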

We can also attempt to verify our systems mathematically. This approach relies on assertions: preconditions stating what must be true on entry to a routine, and postconditions stating what must be true when it exits. However, a program can only be proved correct with respect to something else, e.g., a specification. So if the specification is wrong, the proof is useless: you have proved that you correctly implemented the wrong thing.
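
In code, the idea looks something like this hypothetical routine: the first assert records the precondition, the second the postcondition, and a verification tool tries to prove that the second always follows from the first, instead of merely checking it one execution at a time the way assert does.

    def integer_sqrt(n: int) -> int:
        """Largest integer r such that r*r <= n."""
        # Precondition: what the caller must guarantee on entry.
        assert isinstance(n, int) and n >= 0, "precondition: n must be a non-negative integer"

        r = 0
        while (r + 1) * (r + 1) <= n:
            r += 1

        # Postcondition: what the routine guarantees on exit.
        assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
        return r

    print(integer_sqrt(17))   # 4

Note that the two assertions are themselves a miniature specification; if they say the wrong thing, a successful proof only certifies the wrong thing.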

Large programs are too tedious to prove correct "by hand," so they must be proved by theorem-proving software. But suppose this software has bugs ... Besides, programs can't be checked the way mathematical proofs are checked. In math, some proofs are complicated; there are even cases where two groups of mathematicians have proved contradictory things, and neither was able to disprove the other's work. So our certainty about math relies upon mathematicians checking the work of others: prove them wrong, and you get to publish the results. But program proofs are too long and mechanical for anyone to verify by reading them. John Rushby, a proponent of formal verification methods, says, "No one reads this stuff with a lot of care--it's too boring."

One experiment in formal verification was the proof of the emergency shutdown software for the Darlington nuclear plant on Lake Ontario. The shutdown system had been implemented in software to save money on specialized hardware and to achieve greater safety by providing better information to the operators. The programmers were divided into four teams. One team converted the programs into logic. Another did the same with the specs. A third team verified that the programs implemented the specs, and a fourth team checked the work of the third. This delayed the project by six months, and in the meantime the plant could not be opened. The plant cost its investors $20 million for each month it was idle, so the delay cost $120M. But at least it saved $1M of hardware.

n-version programming. OK, if it's too hard to prove that one program is correct, maybe we can have different teams of programmers write entirely separate programs and see whether their results agree. If, say, three versions of the program are written, then on a critical decision the majority can rule. The idea is that independently developed programs will have independent bugs and won't fail at the same time. An experiment by John Knight and Nancy Leveson attempted to validate this approach. Students at different graduate schools wrote many versions of the same program, which were then tested extensively. The same inputs caused many different versions to fail: it turns out that the hard parts of programs are just hard, no matter who writes them.
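
The voting machinery itself is trivial; the sketch below uses made-up stand-ins for the independently written versions. What the Knight-Leveson experiment showed is that the real difficulty lies elsewhere: if the versions tend to fail on the same hard inputs, the vote can confidently return a wrong answer.

    from collections import Counter

    # Toy stand-ins for three independently developed versions of the same
    # computation (here, a braking distance); real versions would come from
    # separate teams working from a common specification.
    def version_a(speed): return speed * speed / 20
    def version_b(speed): return speed ** 2 / 20
    def version_c(speed): return (speed * speed) // 20   # subtle difference: truncates

    def vote(results):
        """Return the majority answer; refuse to act if there is none."""
        answer, count = Counter(results).most_common(1)[0]
        if count * 2 <= len(results):
            raise RuntimeError(f"no majority among {results}")
        return answer

    speed = 25
    print(vote([version_a(speed), version_b(speed), version_c(speed)]))   # 31.25, by 2 votes to 1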

Living with unreliable software. Let's face facts: we'll have to learn to live with software that is at least a little bit unreliable. Here are four stratagems we can use to make the situation more palatable. First, improve programming practices. Use safer programming languages. As Nancy Leveson says, the easiest way to find out which language features are bad is just to look at C; it has all of them. Boeing developed systems in both C and Ada; the Ada systems were not only cheaper to produce but had many fewer errors. C++, with its dangling pointers and memory leaks, isn't much better. For a safety-critical system, Java or Ada 95 would be a much better choice.

Second, program more safely. Any memory that's unused should be initialized to a pattern that will leave the system in a safe state if it is referenced by accident. Set critical flags and conditions as close as possible to the code they protect. Decouple systems: a highly coupled system is one whose components are highly interdependent, so a failure in one component rapidly affects others. Grade separations decrease accidents because they decouple traffic; this principle seems lost on the engineers who rebuilt the Raleigh Beltline, who took out grade separations at three exits and put up traffic lights. You may write code differently if you're opting for safety. For example, you can put critical code in ROM or write-restricted RAM. But there's a tradeoff--less tightly coupled systems are less efficient.
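
Here is a small sketch of the "initialize to a safe state" habit, using a hypothetical command table; in C, the same idea would be a memset of unused buffers and tables to a known-safe pattern. Every slot starts out pointing at a deliberate no-op, so a stray or corrupted command code does nothing rather than something random.

    def safe_noop() -> None:
        """Deliberately does nothing -- the safe response to a stray command."""

    # Every slot starts out safe; only the commands we mean to support are wired up.
    command_table = [safe_noop] * 256
    command_table[0x10] = lambda: print("open vent valve")

    def dispatch(code: int) -> None:
        command_table[code & 0xFF]()    # unknown or corrupted codes fall through to safe_noop

    dispatch(0x10)   # open vent valve
    dispatch(0x42)   # unassigned: silently does nothing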

Third, make the role of software "not too critical." In computer science, we are under a handicap compared with other engineering disciplines. In other fields, engineers know that things will wear out and fail, so they design for failure. But because, in theory, we can get software "right," we are tempted not to plan for its failure. Nancy Leveson opines, "Maybe after 40 years of trying, it is time to be a little more realistic."

Reliability requirements should be modest enough that reliability can be demonstrated before a system is deployed. One option is to rely on backup systems that are not computer-controlled. A safety backup usually performs simpler functions than the main system, so it can be built more reliably, and it can be built with different technology or use different sensors, actuators, and power sources. For example, a system can use gravity or another physical property to ensure that a vent valve opens when pressure is excessive. Because the backup shares so little with the main system, the probability that both will fail simultaneously is minimized.

Finally, we can learn to live with more modest overall system safety. Software may still be safer than the non-software system it replaces. In a fly-by-wire aircraft, pilots control the aircraft through a computer that may modify their commands before acting on them; the Airbus A320 and Boeing 777 are examples. The possibility that software may cause accidents has to be weighed against the likelihood that it will avoid mishaps that would otherwise be caused by pilot error. Robotic surgery will soon be feasible. There may be bugs, but human surgeons also have fairly high failure rates. In any case, robots will soon be able to perform surgery that is beyond the capabilities of humans.

Summary. To end our consideration of software safety, let's review a few of the main points. Safety problems are likely if software is tested inadequately or deployed too rapidly. Other problems arise when the programmer ignores relevant conditions, like the equator, daylight-saving time, or leap years. Software is logically more complex than other engineered systems, so getting the design right is essential. In practical terms, it's impossible to know that a program is correct, so one must at least follow good design principles and write code in languages without well-known safety vulnerabilities. If these principles are adhered to assiduously, it should be possible to build computerized systems that are, at the least, safer than the non-computerized systems they replace.