BA for computer hardware faults, reliability and design



How reliable are our computer systems? As reliable as we make them.

Human beings are notoriously unreliable, so anything we make is unreliable too. A made object can never be one hundred percent reliable. So what can we do about it? Check, check and check again is the advice for professional designers and testers. That means all manufactured hardware and written software must be backed up at least five times. No one can be relied upon. Trust no one.

We have all bought a new computer thinking it is going to be wonderful only to find out that we are the people testing it and until we find what has been overlooked it will not work properly. Some new computers never work properly.

So what if a bigger product relies on a computer? How safe are our aeroplanes, our trains, our cars, our cookers, our televisions, or whatever they eventually turn into? Faults, errors and failures will occur, so what are they likely to be and how are they categorised?

  • Faults are the adjudged or hypothesised causes of errors.
  • Errors are part of a system state liable to lead to failure.
  • Failure is deviation of the delivered service from specified conditions.

    Human input starts with design faults, which can also be induced by outside influences such as component failures. Even when the designer thinks they have got it right and puts the design to the test, there can be as many as two hundred and fifty thousand million pathways to be looked at, leaving lots of possible problems that can be missed. A reasonable assumption for the time taken to check them all would be eight years.
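As a rough check on those figures, the short calculation below works out the test rate that the eight-year estimate implies. The function name and the 365.25-day year are assumptions made for illustration; the pathway count and the eight years are the figures quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average length of a year


def implied_checks_per_second(pathways, years):
    """Return how many pathways must be checked each second to finish in time."""
    return pathways / (years * SECONDS_PER_YEAR)


# Figures quoted in the text: 250,000 million pathways checked over eight years.
rate = implied_checks_per_second(250_000 * 10**6, 8)
print(f"about {rate:.0f} pathways per second, non-stop")
```

In other words, the estimate only holds if close to a thousand pathways are checked every second, around the clock, for the whole eight years.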

    To get dependability into a design, backup and fault tolerance must be built into a component and its software.

    Now we get to the word bug. The famous original bug was a moth that got trapped in a relay of the Harvard Mark II computer in 1947 and stopped it working. A bug cannot be blamed for every eventuality, so designers try to build components that do not have bugs. One way is to get several designers to solve the same problem independently, each using a different solution. The same problem-solving approach is used with scripts for software. If five versions of a solution are used, there is more chance that when a problem does arise in one of them it will be bypassed by the others.
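The idea of running several independently written versions and taking the answer most of them agree on can be sketched as follows. This is a minimal illustration, not a production design: the three `version_*` functions are hypothetical stand-ins for the work of separate teams, and `version_c` contains a deliberate bug so that the vote has something to bypass.

```python
from collections import Counter


def version_a(values):
    """First hypothetical team: uses the built-in sum."""
    return sum(values)


def version_b(values):
    """Second hypothetical team: adds the values up in a loop."""
    total = 0
    for v in values:
        total += v
    return total


def version_c(values):
    """Third hypothetical team: deliberately buggy, it skips the first value."""
    return sum(values[1:])


def vote(versions, values):
    """Run every version and return the answer a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(values))
        except Exception:
            # A version that crashes simply loses its vote.
            continue
    if not results:
        raise RuntimeError("every version failed")
    answer, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError("no majority agreement")
    return answer


print(vote([version_a, version_b, version_c], [3, 4, 5]))
```

The vote only helps when the versions fail in different ways, which is why the designers are asked to use different solutions rather than copies of one.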

    Examples of problems that were not solved in electronics include the Patriot missile that was used in the Gulf War of 1991. The Patriots were built to intercept Scud missiles, but a bug in the software that processed their clock times caused them to fail to intercept. The clocks were originally meant to be reset frequently, but when the system had been left running in one place for more than one hundred hours the timing error had built up far enough for the missiles to miss their Scud targets. Just a simple thing that had been overlooked.
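The widely reported explanation is that the software counted time in tenths of a second and held the value 1/10 in a fixed-point register that could not represent it exactly. The sketch below illustrates that arithmetic only; it is not the missile's code. It assumes 23 fractional bits, the precision usually quoted in accounts of the incident, and the function names are invented for this example.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumption: precision usually quoted for the register


def truncated_tenth(bits=FRACTION_BITS):
    """Return 1/10 chopped (not rounded) to the given number of binary places."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Return the accumulated clock error after counting tenths of a second."""
    ticks = hours * 3600 * 10  # one tick every tenth of a second
    per_tick_error = Fraction(1, 10) - truncated_tenth(bits)
    return float(ticks * per_tick_error)


for hours in (8, 100):
    print(hours, "hours:", round(clock_drift_seconds(hours), 4), "seconds of drift")
```

Under these assumptions the error after one hundred hours is about a third of a second, which is a long time when the target is a fast-moving missile, and it is why resetting the clocks frequently would have hidden the problem.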

    Fly-by-wire aircraft are another example. In 1989 a SAAB Gripen developed problems with pitch oscillations, with a delay in the pilot’s controls that left the commands out of phase with the plane. The computer took control from the pilot, who could then do nothing to regain control of the plane. This is an eventuality that had been overlooked by the software designers.

    In 1992 the London Ambulance Service switched to a voice and computer control system that logged all its activities. However, when the traffic to the computer increased, the software could not cope: it slowed down and lost track of the ambulances. In the end no one knew where anyone was, or should be, with disastrous results. The system had not been tested in all circumstances.

    Three Mile Island nuclear power station in the United States is another example of a human interface problem in computers. In 1979 a pump failed, causing the plant to shut down. The temperature increased and a valve opened. The temperature returned to normal, but the valve stayed open and the water started to evaporate. The human operator misinterpreted this. The temperature increased in the fuel chambers and more water was added. The problem was that cold water was added on to hot fuel rods and the rods cracked.

    A recent example of lack of testing is the Apple iMac. It was put in a beautifully designed case, but someone forgot to test the newly designed CD ejection software: if a CD is accidentally left in the slot when the computer is shut down, disaster results. When the computer is restarted it thinks the CD is a start-up disk and crashes. The operator then has to resort to pushing an unbent paper clip into the small hole in the CD slot to eject the CD. If they are lucky the computer will still work properly. All this because the software was not written to eject the disk on shutdown.

    To get over some of the problems in aeroplanes, an odd number of computers is fitted, up to five main computers. The reasoning behind this is that there will always be one that is still working and solves the problem if the others fail. A big in-flight problem is the mobile phones and laptop computers used by passengers, which can interfere with the antennae outside the plane that serve the plane’s systems.
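One common reason for choosing an odd number is that the computers can vote on the answer and a tie is impossible. Assuming, purely for illustration, that each computer works with the same probability and fails independently of the others, the chance that a majority is still working can be estimated as below. The function name and the 0.99 figure are hypothetical.

```python
from math import comb


def majority_reliability(n, p):
    """Probability that more than half of n independent computers are working."""
    if n % 2 == 0:
        raise ValueError("use an odd number of computers so a vote cannot tie")
    needed = n // 2 + 1  # smallest number of working computers that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Hypothetical figure: each computer works 99% of the time.
for n in (1, 3, 5):
    print(n, "computers:", round(majority_reliability(n, 0.99), 6))
```

The gain depends heavily on the independence assumption. If all the computers share the same design fault they fail together, and adding more of them does not help.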

    Another way to try to overcome some of these problems is parallel processing: the task is divided into smaller sub tasks, which are placed on separate processors that all work at the same time. This brings with it increased complexity, because there are multiple elements and communication links along which errors can migrate. Reliability is increased by restricting how far a fault can spread.
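A minimal sketch of splitting a task into sub tasks, and of keeping a failure in one sub task from spreading to the rest, might look like this. Threads stand in for the separate processors, and `process_chunk`, the chunk size and the sample data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Hypothetical sub task: sum one chunk, rejecting negative readings."""
    if any(x < 0 for x in chunk):
        raise ValueError("negative reading")
    return sum(chunk)


def run_contained(data, chunk_size=3, workers=4):
    """Split data into chunks, process them in parallel, and isolate failures."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, c) for c in chunks]
        for index, future in enumerate(futures):
            try:
                results.append(future.result())
            except ValueError:
                # The fault stays with this chunk; the others still complete.
                failed.append(index)
    return results, failed


results, failed = run_contained([1, 2, 3, 4, -5, 6, 7, 8, 9])
print("completed:", results, "failed chunks:", failed)
```

Recording which chunk failed, rather than letting the exception stop everything, is one simple way of stopping an error from migrating between the parts.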

    The latest trend is to copy nature and use biologically inspired systems to build more complex systems. Again the components are broken down into small parts and added together to form the whole. The fertilisation of an egg is an example of how this may be done. A cell replicates time after time until a whole is produced, with some of these cells changing to start running specific parts of the whole. This type of design is called self structuring (VLSI) or genome based design, and it gives an embryonic array. By using Immunotronics to create an immune system, disease within a system can be fought off and immunity acquired. The immune cells produced within the whole detect when things go wrong.

    Unreliability can never be eliminated, but giving different specifications to different design teams for the same problem improves the chances of a reliable result. Using different hardware and software on the same problem will increase reliability. Evolution tries many variants in its attempt to select the best solution, and so it should be with design. © BA Education

    Where will it all lead? Keep reading.
