Monday, July 19, 2010

Bhopal’s man-made disaster (Testing Tragedies #2: Learning from past)

Bhopal’s gas tragedy is one of the saddest examples of ‘things-went-wrong’ due to human negligence. We primarily blame the company and its management (and rightly so), but one thing we often overlook is that, besides human error, the systems (machines) failed us too. Had quality been given its due importance, this could possibly have been averted. This blog is part of a series in which we try to learn from past errors in various fields and apply those lessons to the software testing world. In words often attributed to Churchill, “Those that fail to learn from history are doomed to repeat it.”

 

History

The Bhopal disaster, or Bhopal Gas Tragedy, is the world's worst industrial catastrophe. It occurred on the night of December 2-3, 1984, at the Union Carbide India Limited (UCIL) pesticide plant in Bhopal, Madhya Pradesh, India.

Loss:

Government agencies estimate 15,000 deaths

Bad Quality

Attempts to reduce expenses affected the factory's employees and their conditions. Kurzman argues that "cuts ... meant less stringent quality control and thus looser safety rules. A pipe leaked? Don't replace it, employees said they were told ... MIC workers needed more training? They could do with less. Workers were forced to use English manuals, even though only a few had a grasp of the language."

Timeline, Summary

  • 21:00 Water cleaning of pipes starts.
  • 22:00 Water enters tank 610, reaction starts.
  • 22:30 Gases are emitted from the vent gas scrubber tower.
  • 22:30 First sensations due to the gases are felt—suffocation, cough, burning eyes and vomiting.
  • 1:00 Police are alerted. Residents of the area evacuate. Union Carbide director denies any leak.
  • 2:00 The first people reach hospital. Symptoms include visual impairment and blindness, respiratory difficulties, frothing at the mouth, and vomiting.
  • 2:10 The alarm is heard outside the plant.
  • 4:00 The gases are brought under control.
  • 7:00 Immediate death toll rises to 2,259.
  • 7:00 A police loudspeaker broadcasts: "Everything is normal".

 


Flaw:

During the night of December 2–3, 1984, large amounts of water entered tank 610, which contained 42 tons of methyl isocyanate (MIC). The resulting exothermic reaction increased the temperature inside the tank to over 200 °C (392 °F), raising the pressure to a level the tank was not designed to withstand. This forced the emergency venting of pressure from the MIC holding tank, releasing a large volume of toxic gases into the atmosphere. The gases flooded the city of Bhopal, causing great panic as people woke up with a burning sensation in their lungs. Thousands died immediately from the effects of the gas and many were trampled in the panic.

Incident Logs & My Observations:

Factors leading to the gas leak include:

  • The use of hazardous chemicals (MIC) instead of less dangerous ones
  • Plant's location near a densely populated area

      [raj]: It's a perfect example of poor risk management. We keep saying that a test plan should include a section on risk identification, risk assessment, risk mitigation and risk contingency, but do we really do our part with sincerity?

      When there are alternatives, one should always evaluate the worst-case impact of each risk and then pick the alternative that is least harmful to the end users, because the impact on users can matter far more than a cost-effective solution that looks like the better deal to us. In the software world, if there is a trade-off, choose the option that is slightly more time-consuming or costly over the cheapest one whenever the cheapest one poses a risk to end users, since that risk can ultimately prove far more expensive.

  • Storing these chemicals in large tanks instead of over 200 steel drums. Large-scale storage of MIC before processing

      [raj]: Poor capacity planning. Over-utilization of resources can lead to fatigue and fatal outcomes, as happened in this case. It is crucial to understand the threshold of each resource (software, hardware or human) and how far you can stretch it before it collapses or explodes. Resources are like a rubber band in your hand: it can be stretched to meet your needs as long as you stay within its threshold, but if you keep stretching, it will break and hurt your finger badly because of its rebound speed.

  • Safety systems being switched off to save money—including the MIC tank refrigeration system which alone would have prevented the disaster
  • Lack of skilled operators due to the staffing policy
  • Reduced safety management due to staff cuts, insufficient maintenance of the plant, and only very loose plans for the course of action in the event of an emergency

      [raj]: Cost control through resource reduction (software, hardware or human) should NOT be done blindly. This applies even more now, when economies are struggling, because it is often seen as the simplest and most obvious solution, yet it can have the kind of adverse impact we witnessed in this tragedy. We all have a moral responsibility toward our users and should think twice before taking such measures. There can be other ways to save cost, but leaving critical systems running unmaintained, or handing them to people who are not trained to operate them, is nothing less than playing with the lives of our users.

  • The MIC tank alarms had not worked for four years
  • There was only one manual back-up system, compared to a four-stage system used in the US

      [raj]: This only emphasizes the need to exercise safety and security measures constantly, to ensure that enough preventive mechanisms are in place, and actually working, to avoid such accidents. Examples in software are anti-virus software or scans/tests run periodically to detect possible threats in time.
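An alarm that has not worked for four years is only discovered when someone checks it. In software, that could be a small job that flags any safety check which has not passed a self-test recently. The sketch below is only an illustration: the check names, the 30-day limit and the `stale_checks` helper are all made up for this example and do not come from any real tool.

```python
from datetime import datetime, timedelta

# Assumed rule for this sketch: every safety check must have passed
# a self-test within the last 30 days.
MAX_AGE = timedelta(days=30)

def stale_checks(last_passed, now, max_age=MAX_AGE):
    """Return the names of checks whose last successful self-test is
    missing or older than max_age, in alphabetical order."""
    return sorted(
        name
        for name, when in last_passed.items()
        if when is None or now - when > max_age
    )

# Hypothetical inventory of checks and the date each one last passed.
now = datetime(2010, 7, 19)
last_passed = {
    "virus_scan": datetime(2010, 7, 12),
    "backup_restore_drill": datetime(2010, 3, 1),
    "alerting_pipeline": None,  # never verified at all
}

print(stale_checks(last_passed, now))
```

With this made-up data, the backup drill and the alerting pipeline are reported as stale, while the recent virus scan is not. The point is not the code itself but that "is the alarm working?" becomes a question asked on a schedule instead of during an emergency.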

Learning

      #1 Risk management should not be done as a mere formality. It's not something that should concern only management or a project manager.

           It should be one of the most important items, raised, managed, controlled and monitored by every team throughout the project lifecycle.

     #2 Choices made by humans CHANGE the course of the future

         When you do risk management and decide on a choice, always remember that each choice you make can decide the fate of millions of users, and hence ALWAYS think of your users while taking a call.

         Risk management is not just about you and your team; it's also about the impact of that risk on your end users.

     #3 Over-utilization beyond a limit can backfire. Know your boundaries and respect them. Cost reduction is important, but constantly overloading a system can make it fail without warning.

          For example, in the software world load testing is important, but stress testing is even more so, because it tells us when the software is going to break so that we can be better prepared for failures. “Hope for the best; prepare for the worst :P”

    #4  Cost reduction done without understanding the impact on the end users can be counterproductive.

    #5  For any application that involves direct or indirect physical interaction with users, SAFETY testing is a must and needs to be done constantly to avoid accidents, e.g. in the healthcare and aviation industries.

    #6  Security testing is extremely critical when there are chances of sabotage in the system. In software, it can be a malicious user who is trying to harm the system for their own interest or to hurt the users.

    #7  Testability should be built into every system so that the exact root cause can be found when something goes wrong.

         Even today, the interested parties haven’t been able to reproduce the exact conditions that resulted in this gas tragedy.

         Only if the software has a logging mechanism that records the exact sequence of events before the failure can the root cause be determined conclusively, without leaving room for speculation.
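To make learning #7 concrete, here is a minimal sketch of such a logging mechanism using Python's standard `logging` module. It keeps the most recent events in memory and hands them back when a failure occurs. The logger name `plant_sim`, the `RecentEvents` class and the `process`/`run` functions are assumptions made up for this illustration; a real system would also record timestamps (for example with `%(asctime)s`) and write to durable storage.

```python
import logging
from collections import deque

class RecentEvents(logging.Handler):
    """Keep the last `capacity` formatted log lines in memory."""

    def __init__(self, capacity=100):
        super().__init__()
        self.events = deque(maxlen=capacity)

    def emit(self, record):
        self.events.append(self.format(record))

log = logging.getLogger("plant_sim")  # hypothetical logger name
log.setLevel(logging.INFO)
log.propagate = False  # keep the example's output limited to our handler
recent = RecentEvents(capacity=50)
recent.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(recent)

def process(step, value):
    """Toy operation: log every input, fail on a negative value."""
    log.info("step=%s value=%s", step, value)
    if value < 0:
        raise ValueError(f"negative value at step {step}")

def run(values):
    """Process all values; on failure return the trail of events so far."""
    try:
        for step, value in enumerate(values):
            process(step, value)
    except ValueError:
        log.error("failure detected")
        return list(recent.events)
    return []

trail = run([5, 3, -1, 7])
for line in trail:
    print(line)
```

Here the trail shows the two successful steps, the step that received the bad value, and the failure itself, in the order they happened. That ordered record is what lets a team argue from evidence rather than speculation.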

Thoughts for you

[raj]: If you think we have really learnt from the past, then read the comments below. Forget about applying the lessons to other industries or fields; the very same company’s toxic waste is still lying there, waiting to explode again at any time.

In software testing, I have learnt that if a product crashes or causes a major loss to the customer, there is no guarantee that preventive or even corrective actions will be taken. History has proved that we humans are very good at repeating our mistakes and suffering from them again and again without really fixing them. If you want to be a good tester, remember that making a mistake once is acceptable, but repeating it again and again is sheer stupidity.

Clean-up operations: Lack of political willpower has led to a stalemate on the issue of cleaning up the plant and its environs of hundreds of tonnes of toxic waste, which has been left untouched. Environmentalists have warned that the waste is a potential minefield in the heart of the city, and the resulting contamination may lead to decades of slow poisoning, and diseases affecting the nervous system, liver and kidneys in humans. According to activists, there are studies showing that the rates of cancer and other ailments are high in the region.

1 comment:

Anonymous said...

the report are good but should be qualitated