Monday, July 26, 2010

A software testing stage act (funny) - written and directed by yours truly :)

I would be lying if I said I had always wanted to do this. As far as I can remember, I was never known for participating in extra-curricular activities, let alone plays or stage acts. Lately the idea started to grow on me as I began watching (and appreciating) movies and, more importantly, when my sister (Savita, the cinematographer) instilled confidence in me. Thankfully I got my first break when I was asked (let’s rather say forced) to write and direct a play for Microsoft’s Tester’s Day. Like many others, I thought of Googling (oops… Binging) for one, but I couldn’t find a single good play on software testing (good for me), so I finally gathered the courage to write one. I am sharing the script with you all in case you want to stage it again :)

Plot:

A bunch of unusual testers unintentionally making the life of developers really terrible.

Actors:

<Testing Team>

Quality Advocate :: Ranchoddas

He is an awesome force in testing who will stand for nothing but quality. Ranchoddas is often called the quality police and is at times known to be too quality focused. He will enforce quality everywhere, from the build and release to the documentation, even if no one asked for his input. He will rip any spec to pieces, enforce quality in every meeting he attends and settle for nothing but the most ardent quality in every aspect of the software development life cycle.

One liner: He lives, breathes and sleeps quality.

Ranchoddas is respected and required on every project, but he often overlooks commercial aspects, so he can come across as troublesome, unbending and inflexible. He often has a reputation for sticking his nose in or bullying. He also believes he is running the project.

Drama queen :: Sambhavna

The Drama Queen over-reacts to every bug found. Whether it is a spelling mistake or a catastrophic failure, the Drama Queen always reacts the same way.

The Drama Queen can be seen hurtling towards the programming team, returning with at least 5 of them whilst wildly pointing at the screen and shouting 'Look, Look, it's broken'. In reality it's often 'just a bug' that is fixed immediately with little or no fuss. The Drama Queen really gets programmers' backs up, as they are never sure whether their code is really that bad or whether they've just overlooked something. Cry Wolf?

Socialiser :: Ritesh

One Liner: Let's get a drink and chat about it.

Laughing, chatting, enjoying his work and having fun. The Socialiser is often referred to as The Morale Officer due to his key role as team outing organizer and coffee break manager.

The Socialiser basically knows everyone, everywhere.

EXPLORER :: Chaturlingam

Last but not least is our old Chaturlingam, who has a knack for finding unusual bugs. He is the tester developers fear most. All he has to say in his defense is: “I don't go looking for defects. Defects find me.”

He is what's known as a responsive thinker. He sets himself a charter, often defined from a test case, and explores the app looking for interesting scenarios and paths, reacting to the information the app gives back and altering his test ideas in turn. The defects he finds are often show stoppers, are truly inventive and are often so difficult to fix that programmers have exploded in rage. The Explorer can test anything, in any state and at any time. He isn't bound by the constraints of a checklister. He is free. He explores.

 

<Development team>

ANGRY DEVELOPER:: Rakhi -

She is an aggressive and impatient person by nature. She of course doesn’t like being bugged too much by the test team. Though she doesn’t say it openly she disdains the test team for their lack of technical knowledge and for doing a job which she feels is not worth appreciating.

 

SEASONED DEVELOPER:: Anil

All my sympathies to Anil, a senior campaigner who has grown tired of arguing with test over his long career and has decided to take the easier way out by avoiding confrontation. He tries to convince, fails miserably, and in the end the other party wins and he accepts it, whether he was wrong or right.

Anil doesn’t like confrontation with testers or, for that matter, with anyone. He is a composed person who is definitely not witty, and he often finds himself giving up in the end and accepting the other side’s terms. In short, people feel sympathy for him because he really can’t fight back even when he is innocent.

 

Opening Scene:

Developers are looking drowsy, sleepy and tired in the morning (as they were working till 6 in the morning). Testers are fresh and come late to start a new day.

< SCENE #1>: Installation

Chaturlingam to Sambhavna: (Chaturlingam looking excited to start the testing; Sambhavna looking tense and anxious as she runs the installation)

Hey Sambhavna, What’s taking the installation so long? Is it done yet?

Ranchoddas: (interrupts suddenly, as if he was just waiting for it)

Are you kidding? I am not going to let anyone touch the build till my design inspection bugs are closed and I see the signed-off Tech Design Spec doc. In the name of our Test Process Improvement, I can’t compromise on these key deliverables.

<in high pitch to developer>

Ranchoddas: <Stands up and looking at Rakhi who is looking very lazy at the moment>

Why does the bug tracker still show my design bugs as active, and why is the Tech Spec not updated? You have already wasted 15 minutes of our time this morning.

Rakhi: (trying her best to look reasonable)

Ranchoddas, Tech Design Specification is checked-in and complete. Check again.

Ranchoddas: (with sarcasm)

Yes, I can see the Tech Design Spec, but that was for the last release and the only thing you seem to have updated is the RELEASE VERSION NO. Don’t provoke me to reject the build.

Anil (waking up suddenly & trying to solve the matter; trying to get sympathy):

Guys… We only finished coding this morning at 6 a.m., so we must have missed it. You start with the installation and give us time till EOD, and we will update the tech spec.

Ranchoddas: (looking surprised; no pity that devs were working till morning)

Are you telling me that you have not done Unit Testing in that case? I want to see the Unit Test report now! Mister, quality is not only our responsibility.

Rakhi (feeling trapped):

There was a requirement change yesterday night and because of that we didn’t get enough time to code and unit test.

Ranchoddas: (Giving an expression as if he is back stabbed):

OMG!!! So there was a CR and nobody bothered to inform the test team? Why was the test team not in the loop? What is wrong with our PM? What about the impact analysis? And when will we update our test cases? I need to escalate this right away to senior management. Nobody breaks the law in my territory.

Ritesh: cool man. That’s ok. The CR is in my work item & I have updated it in my test case. I think we should start off with our testing

Anil (looking relaxed again. Passes a broad smile to Ritesh)

Ranchoddas, I am resolving your tech spec comments. It won’t happen again.

Ranchoddas (shrugs his shoulders)

Fine.

 

< PAUSE OF a few seconds. Team is doing their usual stuff>

 

Drama Queen: <doing the installation; she squeals and runs toward developers>

The build failed. Oh God why today? Deadline is so close. It always happens to me. I don’t know what we are going to do next.

Anil (Looking relaxed after seeing the error and speaks in a soft tone of contempt for testers):

Yes Sambhavna, It will fail coz you don’t have D drive on your machine

Drama Queen:

Mister, don’t get technical with me. How on earth was I supposed to know that?

<Rakhi gives Sambhavna a disgusted look. Test team looks embarrassed. Sambhavna doesn’t know what was wrong with that. Anil has no words and everyone gets back to work>

 

< PAUSE OF 10 seconds. Team is doing their usual stuff. Devs have gone back to their seats>

 

Sambhavna: Not again !!!! There is another problem. After I install the build, dialog box says “Installation recommends reboot. Do you want to reboot now?”

Rakhi (casually) What is the problem. Say NO !!!!

Sambhavna: (Looking irritated)

There is only one button available and that says. “YES”

Rakhi : oops…looks like a minor miss 

Rakhi gives a smirk, and Sambhavna is puzzled.

 

<ANNOUNCEMENT: So finally the installation succeeded. There were a few bugs here and there, but the team is still managing quite okay>

 

< PAUSE OF 10 second. Team is doing their usual stuff. >

 

< SCENE #2>: (BVT & Functional Testing)

 

Ranchoddas:

I can see two bugs where “how found” and “found in environment” are not entered. Can you fix it right away? I am not going to tolerate such process non-compliance, I tell you!

Ritesh:

Sorry!!!. That has to be me. Leave it to me dude, I will take care of it.

 

< PAUSE OF 10 second. Team is doing their usual stuff. >

 

Ritesh:

Guys, as you know, our last release had just 5 Severity 1 bugs in production, so our management is really happy with our performance and has given us a budget for an outing. How about lunch today?

Ranchoddas:

Did we scope for this time out in our schedule? Are you sure we will not be missing out on our time to test by such outings? I suggest we take an informed decision on this. Send an email with voting option, so that we have this documented somewhere for future reference.

Ritesh:

Ok boss. Whatever u say. Let’s do it fast. I have to check the reservation then.

Rakhi:

Lets go to KFC. Finger licking good.

Anil:

Today is Tuesday. Can we go to a Veg restaurant?

Ritesh:

Hmm.. How about Little Italy, Mozarella, Ohris or Malgudi ?

Ranchoddas:

Ohris? All the way to Banjara Hills? *sarcastically*

Do you want this to be a full day activity????

Sambhavna:

How can you even think about it? I am not confident about the code quality and there is so much to test. Too little time and too much to test such a crappy code!

Anil:

Hello madam? How can you call my code crappy code? Do you even have an idea about the amount of design and planning that has gone into it? For starters, each module is designed in a way that it can be reused!!!

Sambhavna:

Hello Mister! Let your code be usable first. We can think about reusing it later!

Ritesh:

<<Consoling Anil …>> Chill out dude .. it is not as crappy …<<little pause>> as it was last time .. !!! (Glancing at the test team members)

Chaturlingam:

I agree with our drama queen for once... lunch can wait... there are too many bugs. I need to call for an immediate triage. We can wind that up and then go for lunch.

TEST TEAM IN CHORUS: Then it will be DINNER !!!!

<ANNOUNCEMENT : Triage meeting is called out>

 

< SCENE #3>-- (THE TRIAGE)

Chaturlingam:

To start with, let’s look at Defect ID #3601. I get a system error when I give an invalid file (negative testing).

Anil:

Man..…go and see the error report first to see the root cause

Chaturlingam:

Really? This is what the error is!

[Image: screenshot of the error message]

 

Anil: (Looks embarrassed)

Err… I will take care of it.

Sambhavna:

Look, look, it’s not working. I swear, I saw it coming. I knew it would break. Thank god I found it in test, otherwise I don’t know what would have happened in production. How could you guys miss it?

Rakhi:

What’s the issue?

Ranchoddas:

Don’t show them. First log it in the tool. That’s the process.

Rakhi (Gives up and walks to drama queen’s cabin):

Can I have a look at it?

Ranchoddas:

Don’t you dare fix the code on her machine!

Rakhi:

Dude. I cannot change the code through an exe. Remember?

Ranchoddas:

Of course I know that! But I cannot trust you guys at all!

Sambhavna (finally explains);

See the spelling of “ORGANISE” . it has to be “ORGANIZE”.

Rakhi (Almost cursing herself):

What’s the big deal? I thought we have customers in the UK as well.

Sambhavna:

For your information, 51 % of the customers are in US and hence it has to be US English

Rakhi:

Plz go log a bug

Sambhavna:

S1 / P1 ?

Rakhi (seeing her patience tested):

S3/P3

Sambhavna:

How is it S3? Can you not see this is such a severe bug?

Ranchoddas (takes out a doc which he keeps ready all the time)

Wait a minute. I have a reference, exactly for such situations. Microsoft’s Severity /Priority guidelines

<Anil & Rakhi are surprised that he carries such a thing all the time>

Ranchoddas:

As per Slide 4, line 2, this bug is S2/P2 to say the least

Anil (seeing they are losing the ground)

Agreed.

Ritesh: (to developer):

Boss …, looks like there is indeed some problem, but not big enough

Anil (looking tensed):

What happened?

Ritesh:

Look at this dialog box. Is it not silly ?

….. <<getting a call in between …>> … <<resume after little pause>>

I will log a S2-P3 bug, I’m sure you can fix it!

[Image: screenshot of the dialog box]

Ranchoddas:

Can’t you take personal calls later?

Ritesh:

Oh..dude, we can manage this triage meeting. Managing home triage is more difficult...!!

Ranchoddas:

Okay … I wanted to ask, how is it an S2-P3, man? Can you not see that it’s a Fatal error? It has to be an S1!

Ritesh:

Oh come on, it sure is not a show-stopper! No panic, dude! <<little pause>> Btw …, I have one more.

[Image: screenshot of the “Keyboard failure” error]

Look at this screen, it’s throwing a “Keyboard failure” error but asking me to press F1 to continue …. No big deal. Your code is not crappy, dude. I think I should change the keyboard & try this scenario.

Ranchoddas:

Dude, what is a Sev 1 bug in your dictionary!!

Chaturlingam:<Mighty pleased with himself>

Okay. I now have an awesome bug. Something which stands up to my reputation and will give you guys sleepless nights

<Rakhi & Anil make a face and wait for Chaturlingam to continue>

I put up a 30GB file for copy from the Source to the Destination folder, and then I put some load on the system while it was copying. I then removed the network cable to see what happens, and voilà, I have the bug! See this! Do you expect my grandkids to come and test this?

[Image: screenshot of the bug]

And this? What were you doing when you created this dialog box in your code?

[Image: screenshot of the dialog box]

Rakhi & Anil: <Look at each other and say at the same time>

Oh come on, man! Who on earth would even think of doing something as foolish as this!!!

Rakhi:

Did you log this bug? I am going to close this as Not-Repro!

Anil:

It’s okay, Rakhi. Let’s fix it! Let’s ensure that a graceful error is thrown when someone this foolish does something this weird on the application!

Sambhavna:

Okay, now my bug! This one shook my confidence in the code totally! What on earth am I supposed to do?

[Image: screenshot of the bug]

Rakhi:<smirks>

Press ‘Proceed’ and see na. I assure you it would format your system!

Sambhavna:

I have logged this bug. Fix it! What if a customer gets this bug? Imagine what would happen to our reputation if we ship this?

Rakhi:

Oh well, no customer will get this bug! I assure you that this happens only on your machine! This block of code gets executed only for your login and Ranchoddas’s! But now that you have proved to be smart enough to find it, I will fix it!

Sambhavna:

See this one! I told it I don’t want those drivers on my machine! God knows what you guys have coded into the drivers!

[Image: screenshot of the bug]

Anil:

Uncalled for but I accept this one!

Ranchoddas:

If all these weren’t enough, I have one now. And because of this one, I reject the build. Please give a new build!

[Image: screenshot of the bug]

Rakhi & Anil: <Too stumped to say anything>

Rakhi:

That’s it! Enough is enough. I am going to set up a 1:1 with my manager. Cannot work with these loonies anymore!

Ranchoddas:

Can someone send minutes of this meeting ?

Chaturlingam:

Boss … you are the quality “Guru” .. you can take care of sending minutes.

Ritesh:

Hey guys … it’s been a long battle today, let’s chill out. It seems some tester’s day is going on and it looks like they are providing free lunch :) Let’s go.

 

Narrator:

Thus ended yet another eventful day in the lives of these team members.

In spite of all the weird bugs, the stringent processes, the long coffee breaks and the amazing histrionics, this software was shipped and was reported to have touched a user base of 1M customers!

Anil is now a Dev lead, and it is heard that he has extra long sessions with his shrink, all for his sanity

Rakhi has moved away from the team, into a totally different discipline!

All the test team members are still flourishing in the same team, and continuing to give sleepless nights to their Devs!

 

PS: By the way, did i tell you that this play was quite a success at the event?  Feel free to use it -- Raj, Microsoft, India , raj.kamal13@gmail.com

Tuesday, July 20, 2010

Using Code coverage in Black Box testing ain’t no rocket science :) Overcome your CODEPHOBIA

I deserve a few more comments here, as I travelled an extra 6 km and 30 minutes to write this blog: in the excitement of writing it I forgot to get down at my stop and ended up at a place where the driver had to tell me that it was the last stop :P

“Yeah, our developers are not doing code coverage and hence our code looks ugly” OR “We are not doing white box testing, so we really can’t measure code coverage.” Heard it before? Rocky (our tester) was in one such meeting recently, and he was the odd man out who believed otherwise. It took us some time to realize that it is the “CODEPHOBIA” of the testers, much more than the actual process, that is the major resistance when it comes to measuring code coverage.

This reminds me of one of my favorite commercials, which has this one-liner: “Jhooth bolte hai woh log joh kehte hai unhe darr nahi lagta, darr sabko lagta hai. darr se aage bado, kyunki darr ke aage jeet hai” (For people who don’t understand Hindi: “People who say they aren’t afraid of anything are plain lying. Everybody gets scared (and it’s very human), but the key is to face it, and only then do you emerge a winner.”)

When I interviewed the teams, I came to understand that testers feel that since they are no experts in .Net, Java, Ajax etc., they cannot measure and improve code coverage. I don’t completely agree with this notion, as measuring code coverage doesn’t necessarily need very strong programming knowledge. Provided you have a basic understanding of any programming language like C, Perl etc. and a strong willingness to learn, you can ramp up pretty fast to the extent required to measure and improve code coverage. Understanding code is much easier than writing it :) and all it needs is aptitude and a basic understanding of the syntax and semantics of the language in question. I am saying this based on my own experience, where I went and learned the technology when it was required to accomplish targets like measuring and improving code coverage.

Now, if you are asking, “Why are we even talking about code coverage?”

Requirement coverage is ensured by tracing the requirements to your test cases and ensuring user needs are met, but what about the traceability between the code and the tests? In other words, what happens if there is code which is not exercised by your tests and which results in unexpected behavior at the customer’s end?

In short because Test Coverage = Requirement Coverage + Code Coverage

Let’s accept that “requirements” and “code” are two different entities, and validating just one of these would be unfair to the other one.
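To make the equation a little more concrete, here is a minimal Python sketch of the requirement-coverage half: it lists the requirements that no test case traces back to. The requirement IDs, test names and the `untested_requirements` helper are hypothetical and only for illustration; the code-coverage half needs a tool, which is discussed further on.

```python
def untested_requirements(requirements, trace):
    """Return the requirement IDs that no test case traces back to.

    requirements: iterable of requirement IDs
    trace: dict mapping a test case name to the requirement IDs it verifies
    """
    tested = {req for reqs in trace.values() for req in reqs}
    return sorted(set(requirements) - tested)


# Hypothetical project data, for illustration only
requirements = ["REQ-1", "REQ-2", "REQ-3"]
trace = {
    "test_login_ok": ["REQ-1"],
    "test_login_bad_password": ["REQ-1", "REQ-2"],
}

gaps = untested_requirements(requirements, trace)
print(gaps)
```

A requirement that shows up in `gaps` has no test at all; code that no test exercises is the mirror-image problem, and that is what code coverage measures.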

Now, if we agree that code coverage can be done by black box testers and that there are no major obstacles, let’s move on to the “How” part. Where there is a will there is a way, and fortunately in this case there are many ways. We need to choose what fits us best.

Step 1: Measure your baseline and set a target. See where you stand now by measuring the BASELINE code coverage for your project (you can’t improve what you can’t measure), and then decide your realistic TARGET code coverage number. Now don’t get me wrong for highlighting the keyword “realistic”. I am no conservative, and everybody wants 100 out of 100 (no less), but practitioners will tell you it’s the same as “exhaustive testing”, which we know doesn’t come for free. So you have to decide on a number based on the criticality of your application and the trade-off between the money spent on increasing code coverage and the risk of missing a defect due to insufficient code coverage.

For example, if your tests (manual + automated) give you 40% coverage today, that becomes your baseline, and you can set a target of, say, 70-80% for future releases after weighing the investment against the ROI.

Step 2: Improve and measure (continuous)

Then you can measure the IMPROVED code coverage release over release and compare it against the baseline to detect the trend (whether there is any improvement), till you reach your target. Once you achieve your target code coverage number, you can raise the bar to take it to the next level and continuously improve it.
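Steps 1 and 2 boil down to simple arithmetic. Here is a small Python sketch, assuming the coverage tool reports covered and total lines per release; the release names and numbers are made up, and `ReleaseCoverage` and `coverage_trend` are hypothetical helpers, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class ReleaseCoverage:
    release: str
    covered: int  # lines (or blocks) exercised by the tests
    total: int    # lines (or blocks) in the instrumented code

    @property
    def percent(self) -> float:
        return 100.0 * self.covered / self.total


def coverage_trend(history, target_pct):
    """Compare every release against the first one (the baseline)."""
    baseline = history[0].percent
    return [
        {
            "release": r.release,
            "percent": round(r.percent, 1),
            "gain_vs_baseline": round(r.percent - baseline, 1),
            "target_met": r.percent >= target_pct,
        }
        for r in history
    ]


# Made-up numbers: 40% baseline, 70% target
history = [
    ReleaseCoverage("1.0", 400, 1000),
    ReleaseCoverage("1.1", 550, 1000),
    ReleaseCoverage("1.2", 720, 1000),
]
trend = coverage_trend(history, target_pct=70.0)
for row in trend:
    print(row)
```

Once a release reports `target_met`, that is the moment to raise the target, as described above.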

 

There are tools that can start such a trace from command-line options. They instrument the code and track its execution (statement coverage, path coverage, branch coverage etc.) based on the tests being run (both automated and manual), and then help you generate reports showing the impact/coverage of your tests. I have given a few links below which point to some of these tools; tool selection/adoption is at the reader’s discretion.

Is there any process which can be used to achieve this? The answer is YES. Read on.
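If “instrument and track the code execution” sounds mysterious, here is a toy Python sketch of the idea using the standard library’s `sys.settrace` hook. It only records which lines of one function ran; real coverage tools are far more thorough, and `trace_lines` and `classify` are hypothetical names used purely for illustration.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and return the line offsets (relative to its
    'def' line) that were executed inside func."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test
        if frame.f_code is code:
            if event == "line":
                executed.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return executed


def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"


# A single "black box" test with a positive input misses one line...
positive_only = trace_lines(classify, 5)
# ...and adding a negative input exercises it
with_negative = positive_only | trace_lines(classify, -1)
print(sorted(positive_only), sorted(with_negative))
```

Notice that the tester never had to write `classify`; reading it well enough to see that a negative input was missing is all it took.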

 Code Coverage process:

I have come up with a generic, iterative code coverage process which can be used for measuring and improving code coverage during test execution (black box or white box).

1. Start the trace using the code coverage tool

    Here you install, configure and start the CC tool before starting your test execution.

2. Test Execution – Phase 1

    You run your test cases, scenarios or conditions (manual or automated).

3. Measure coverage and identify areas where coverage is low

    At this point, once you have done one pass of testing (that is, run your test cases once), you can generate code coverage reports using the CC tool and IDENTIFY room for improvement where code coverage is poor.

4. Add test cases / conditions to improve code coverage

    Now you know which segments of the code were not covered by your tests, so you can either

    a) write new test cases or scenarios to cover those parts, or

    b) include test data which can exercise those missing conditions.

5. Test Execution – Phase 2

    You start again, and this time you run the new test cases / conditions with the new test data.

6. Measure coverage again

    You generate code coverage reports AGAIN and see whether there are improvements.

Loop: Keep repeating steps 2 to 6 until you reach your target, then end the trace.
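The loop above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not a real tool integration: the code under test is modelled as a set of numbered blocks, each test case as the set of blocks it exercises, and every name (`run_tests`, `measure`, `improve_until_target`, the test names) is hypothetical.

```python
def run_tests(suite):
    """Steps 2 and 5: 'execute' the suite and collect the blocks it touched."""
    covered = set()
    for blocks in suite.values():
        covered |= blocks
    return covered


def measure(covered, all_blocks):
    """Steps 3 and 6: coverage as a percentage of all blocks."""
    return 100.0 * len(covered & all_blocks) / len(all_blocks)


def improve_until_target(all_blocks, suite, candidates, target_pct):
    suite, candidates = dict(suite), dict(candidates)
    history = []
    while True:
        covered = run_tests(suite)
        pct = measure(covered, all_blocks)
        history.append(pct)
        if pct >= target_pct:
            break  # target reached: end the trace
        missing = all_blocks - covered
        useful = {n: b for n, b in candidates.items() if b & missing}
        if not useful:
            break  # nothing left that would improve coverage
        # Step 4: add the candidate test that covers the most missing blocks
        best = max(useful, key=lambda n: len(useful[n] & missing))
        suite[best] = candidates.pop(best)
    return suite, history


all_blocks = set(range(10))
suite = {"login_ok": {0, 1, 2, 3}}
candidates = {
    "bad_password": {4, 5},
    "locked_account": {5, 6, 7},
    "timeout": {8},
}
final_suite, history = improve_until_target(all_blocks, suite, candidates, 80.0)
print(history)
```

In a real project the "blocks" come from the coverage report and the candidate tests come from you reading the uncovered code; the shape of the loop stays the same.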

[Image: code coverage]

 

Pointers to a few CC tools:

http://www.codecoveragetools.com/index.php/coverage-process/code-coverage-tools-c.html

http://www.codecoveragetools.com/index.php/coverage-process/code-coverage-tools-java.html

 

If you are still with me, then you are among those who are willing to learn new things in life, and I wish you a “Happy Code Coverage”. Go celebrate it. If it helps, then why not use it? Technical knowledge shouldn’t be an excuse for not doing it.

After all it is no rocket science.

 

PS: CODEPHOBIA in my dictionary is “fear of code; not only writing even reading or trying to understand it”

Monday, July 19, 2010

Bhopal’s man-made disaster (Testing Tragedies #2: Learning from past)

Bhopal’s gas tragedy is one of the saddest examples of ‘things-went-wrong’ due to human negligence. We primarily blame the company and its management (and rightly so), but one thing we often overlook is that, besides human error, it is the system (the machines) that failed us too. If quality had been given its due importance, we could possibly have averted this. This blog is part of a series in which we try to learn from past errors in various fields and apply those lessons to the software testing world. In Churchill’s words, “Those that fail to learn from history, are doomed to repeat it.”

 

History

The Bhopal disaster or Bhopal Gas Tragedy is the world's worst industrial catastrophe. It occurred on the night of December 2-3, 1984 at the Union Carbide India Limited (UCIL) pesticide plant in Bhopal, Madhya Pradesh, India

Loss:

Government agencies estimate 15,000 deaths

Bad Quality

Attempts to reduce expenses affected the factory's employees and their conditions. Kurzman argues that "cuts ... meant less stringent quality control and thus looser safety rules. A pipe leaked? Don't replace it, employees said they were told ... MIC workers needed more training? They could do with less. Workers were forced to use English manuals, even though only a few had a grasp of the language”

Timeline, Summary

  • 21:00 Water cleaning of pipes starts.
  • 22:00 Water enters tank 610, reaction starts.
  • 22:30 Gases are emitted from the vent gas scrubber tower.
  • 22:30 First sensations due to the gases are felt—suffocation, cough, burning eyes and vomiting.
  • 1:00 Police are alerted. Residents of the area evacuate. Union Carbide director denies any leak.
  • 2:00 The first people reach hospital. Symptoms include visual impairment and blindness, respiratory difficulties, frothing at the mouth, and vomiting.
  • 2:10 The alarm is heard outside the plant.
  • 4:00 The gases are brought under control.
  • 7:00 Immediate death toll rises to 2,259.
  • 7:00 A police loudspeaker broadcasts: "Everything is normal".

 


Flaw:

During the night of December 2–3, 1984, large amounts of water entered tank 610, containing 42 tons of methyl isocyanate(MIC). The resulting exothermic reaction increased the temperature inside the tank to over 200 °C (392 °F), raising the pressure to a level the tank was not designed to withstand. This forced the emergency venting of pressure from the MIC holding tank, releasing a large volume of toxic gases into the atmosphere. The gases flooded the city of Bhopal, causing great panic as people woke up with a burning sensation in their lungs. Thousands died immediately from the effects of the gas and many were trampled in the panic

Incident Logs & My Observations:

Factors leading to the gas leak include:

  • The use of hazardous chemicals (MIC) instead of less dangerous ones
  • Plant's location near a densely populated area

      [raj]: It’s a perfect example of poor risk management. We keep saying that a test plan should have an important component covering risk identification, risk assessment, risk mitigation and risk contingency, but do we really do our part with sincerity?

      When you have alternatives, you should always evaluate the worst-case impact of each risk and then choose the alternative with the least impact on the end users, because the impact on users can in many cases matter much more than the cost-effective solution that looks like the better deal to us. In the software world, if there is a trade-off, choose the option which is slightly more time consuming or costly over the cheapest solution if the cheapest one poses a risk to the end users, as that risk can ultimately become too expensive.

  • Storing these chemicals in large tanks instead of over 200 steel drums. Large-scale storage of MIC before processing

      [raj]: Poor capacity planning. Over-utilization of resources can lead to fatigue and fatal outcomes, as happened in this case. It is crucial to understand the threshold of each resource (software, hardware or human) and how far you can stretch it before it collapses or explodes. Resources are like a rubber band in your hand: it can be stretched to meet your needs as long as you stay within its threshold, but if you keep stretching it will break and hurt your finger badly because of its rebound speed.

  • Safety systems being switched off to save money—including the MIC tank refrigeration system which alone would have prevented the disaster
  • lack of skilled operators due to the staffing policy
  • there had been a reduction of safety management due to reducing the staff, there was insufficient maintenance of the plant and there were only very loose plans for the course of action in the event of an emergency

      [raj]: Cost control through resource reduction (software, hardware or human) should NOT be done blindly. This applies even more now that economies are struggling, as it is often seen as the simplest and most obvious solution, but it can have the kind of adverse impact we witnessed in this tragedy. We all have a moral responsibility toward our users and should think twice before taking such measures. There can be other ways to save cost, but leaving critical systems running unmaintained, or handing them to people who are not trained to operate them, is nothing less than playing with the lives of our users.

  • The MIC tank alarms had not worked for four years
  • There was only one manual back-up system, compared to a four-stage system used in the US

      [raj]: This only emphasizes the need to practice security measures constantly, so that enough preventive techniques are in place to avoid such accidents. Examples in software are anti-virus tools or scans/tests performed periodically to learn about possible threats in time.

Learning

      #1 Risk management should not be done as a mere formality. It’s not something that should concern only management or a project manager.

           It should be one of the most important items raised, managed, controlled and monitored by every team throughout the project lifecycle.

      #2 Choices made by humans CHANGE the course of the future.

           When you do risk management and decide on a choice, always remember that each choice you make can decide the fate of millions of users, and hence ALWAYS think of your users while taking a call.

           Risk management is not just about you and your team; it’s also about the impact of that risk on your end users.

      #3 Over-utilization beyond a limit can backfire. Know your boundaries and use your resources diligently. Cost reduction is important, but constant overloading can make your system fail without warning.

           For example, in the software world load testing is important, but stress testing is even more so, because it tells you when your software is going to break so that you can be better prepared for failures. “Hope for the best; prepare for the worst :P”

      #4 Cost reduction done without understanding the impact on the end users can be counterproductive.

      #5 For any application which involves direct or indirect physical interaction with users, SAFETY testing is a must and needs to be done constantly to avoid accidents, e.g. in the healthcare industry, aviation etc.

      #6 Security testing is extremely critical when there are chances of sabotage in the system. In software it can be a malicious user who is trying to harm the system for his own interest or to hurt the users.

      #7 Testability should be built into every system so that the exact root cause can be found when something goes wrong.

           Even today the interested parties haven’t been able to reproduce the exact conditions that resulted in this gas tragedy.

           Only if the software has a logging mechanism that records the exact sequence of events that occurred before the failure can the root cause be determined conclusively, without leaving any room for speculation.
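As a rough illustration of learning #7, here is a minimal Python sketch that keeps the most recent events in memory so the sequence leading up to a failure can be inspected afterwards. It uses the standard `logging` module; `RecentEventsHandler`, `build_logger` and the `transfer` function are hypothetical names, and a real system would also record timestamps and persist the log.

```python
import logging
from collections import deque


class RecentEventsHandler(logging.Handler):
    """Keep only the last `capacity` formatted log records in memory."""

    def __init__(self, capacity=100):
        super().__init__()
        self.events = deque(maxlen=capacity)

    def emit(self, record):
        self.events.append(self.format(record))


def build_logger(capacity=100):
    logger = logging.getLogger("app_under_test")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = RecentEventsHandler(capacity)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler


def transfer(logger, amount, balance):
    """Toy operation that logs every step before it can fail."""
    logger.debug("transfer requested: amount=%s balance=%s", amount, balance)
    if amount > balance:
        logger.error("transfer rejected: insufficient funds")
        raise ValueError("insufficient funds")
    logger.info("transfer completed")
    return balance - amount


logger, handler = build_logger()
try:
    transfer(logger, 50, 10)
except ValueError:
    # The recorded events show exactly what happened before the failure
    for line in handler.events:
        print(line)
```

With a trail like this, root cause analysis starts from what the system actually did rather than from guesses.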

Thoughts for you

[raj]: If you think we have really learnt from the past, then read the comments below. Forget about applying the learning to other industries or fields; the very same company’s toxic waste is still lying there, waiting to explode again at any time.

In software testing, I have learnt that even if a product crashes or causes a major loss to the customer, there is no guarantee that preventive or even corrective actions will be taken. History has proved that we humans are very good at repeating our mistakes and suffering from them again and again without really fixing them. If you want to be a good tester, remember that making a mistake once is acceptable, but repeating it again and again is sheer stupidity.

Clean-up operations: Lack of political willpower has led to a stalemate on the issue of cleaning up the plant and its environs of hundreds of tonnes of toxic waste, which has been left untouched. Environmentalists have warned that the waste is a potential minefield in the heart of the city, and the resulting contamination may lead to decades of slow poisoning, and diseases affecting the nervous system, liver and kidneys in humans. According to activists, there are studies showing that the rates of cancer and other ailments are high in the region.

Tuesday, July 13, 2010

Life Saver or Life Taker ? (Therac-25) – Impact of poor testing (Testing Tragedies #1: Learning from past)

This blog is for everyone who wants to know how a software testing job touches human lives and why defects in applications such as healthcare can't be ignored.

History:

The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL). It was involved in at least six accidents between 1985 and 1987, in which patients were given massive overdoses of radiation, approximately 100 times the intended dose.

Loss: 

Four of the six patients died as a direct result of poor design, coding and testing.

 


Company’s Response

After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error

[raj]: If only the company had taken the very first of these incidents seriously, they could have saved 3 precious lives. Every critical issue that a customer finds should be given the utmost priority before it becomes much worse.

Facts

Only one person did the programming for this system, and he largely did all the testing too.

Therac-25 was tested as a whole machine rather than in separate modules.

[raj]: Yes, that was my reaction too. We left the lives of so many hundreds in the hands of ‘one’ person only. We are human, and to err is human. Plus, we humans are not so good at finding errors in our own work.

If system and integration testing are important, so is unit testing; we can't undermine the importance of any of these. They are meant to complement each other.

Incident Log & My Observations

Severity 1 Production Defect #1: 
A 40-year-old woman was receiving her 24th Therac-25 treatment. The machine stopped 5 seconds into the treatment with an error. The technician, seeing that "No Dose" had been administered (according to the computer), hit the 'P' key, thus proceeding with the dose. This was done a total of 5 times, giving the patient 13,000 to 17,000 rads. To give an idea of how much of an overdose this is: a regular treatment is around 200 rads, and 1000 rads of radiation to the entire body can be fatal. The patient died 3 months after the overdose.
Severity 1 Production Defect #4: 
The patient required only a small dose, and according to the machine that is all he received. Yet again, when the treatment was underway, an error paused the machine and the technician hit the 'P' key to proceed. An overdose was administered and the man died just 3 months later.
[raj]: Better testability could have warned the technician that the dose had already been delivered. Misleading information and lack of transparency in the system confused him, and he went on repeating the procedure again and again, which made it fatal.
Severity 1 Production Defect #2: 
A male patient required radiation treatment on his back. The machine was set to X-ray mode instead of Electron mode, so the technician used the "cursor up" key to quickly correct this mistake. However, this only made things worse, as it unknowingly triggered a software bug. While administering the first treatment, an error "Malfunction 54" flashed up, telling the technician an underdose had been administered. The technician hit the 'P' key and a 2nd dose was delivered. The patient had been given an overdose after the first treatment, and he knew something was wrong due to the burning sensation he felt in his back. As he attempted to get up, the 2nd dose was administered. The technician would have known the man was in pain if the audio and visual equipment had been working. Within weeks, this man lost the use of both legs and his left arm. Five months later he became the first fatality directly related to the Therac-25 system.
Severity 1 Production Defect #3: 
A month later, at the same hospital and with the same technician, another fatal dosage was given. The technician made the same error of quickly changing the mode from X-ray mode to Electron mode using the 'cursor up' key. This again caused "Malfunction 54". The patient this time was receiving treatment on his face. When the overdose was administered he yelled and then began to moan. The audio equipment was working this time, but the initial dose was too much for the man. He suffered severe neurological damage, fell into a coma and died only 3 weeks later.
[raj]: If the system had been designed with the understanding that a simple wrong choice can have such adverse effects, it could have warned the technician about the change he made and possibly stopped him from making that mistake.
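
[raj]: My observation on Defects #1 and #4 (do not let the operator proceed while the delivered dose is unknown) can be sketched in a few lines of Python. This is only my illustration and NOT how the Therac-25 software worked; the class, its methods and the idea of an independent dose measurement are all assumptions I made up for the example.

```python
class DoseUnknownError(Exception):
    """Raised when the operator tries to proceed without a verified dose."""


class TreatmentSession:
    """Hypothetical session that refuses to continue on unverified state."""

    def __init__(self, prescribed_rads):
        self.prescribed_rads = prescribed_rads
        self.delivered_rads = 0.0  # dose confirmed by an independent measurement
        self.dose_verified = True  # becomes False after any error

    def record_error(self):
        # After an error, the dose shown on screen can no longer be trusted.
        self.dose_verified = False

    def confirm_measured_dose(self, measured_rads):
        # Assumption: the reading comes from an independent dosimeter.
        self.delivered_rads = measured_rads
        self.dose_verified = True

    def remaining(self):
        return max(self.prescribed_rads - self.delivered_rads, 0.0)

    def proceed(self):
        # Block the "press P to proceed" path until the dose is re-measured.
        if not self.dose_verified:
            raise DoseUnknownError(
                "Delivered dose is unverified; measure it before proceeding."
            )
        return self.remaining()


# Usage: an error occurs, so "proceed" is refused until the dose is confirmed.
session = TreatmentSession(prescribed_rads=200.0)
session.record_error()
try:
    session.proceed()
except DoseUnknownError as exc:
    print(exc)

session.confirm_measured_dose(200.0)
print(session.proceed())
```

With a guard like this, pressing "proceed" after an error becomes a conscious decision backed by a measurement and not a reflex.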
 

Learning

Learning #1: Never dismiss any failure without getting to the bottom of it. Overconfidence about your quality can take you and your customer down.

Learning #2: Never depend on just one resource for the entire functionality. It’s dangerous, and it takes two to tango (certain activities can’t be done alone, like arguing, fighting, dancing, making love :))

Learning #3: Unit, integration and system testing are all equally important, and one shouldn’t undermine the importance of any of them.

Learning #4: Poor testability can be fatal. A user’s inability to validate that a software operation/task has completed can take lives, as we have seen above.

Learning #5: Don’t repeat any important function/operation/task without confirming the outcome of the previous operation. Many times we think that running the software function again in case of failure is perfectly fine, but that can be risky if the last operation resulted in corruption or left the machine in an inconsistent state.

Learning #6: For critical functions in your software, ensure there are provisions to handle silly human errors where we perform an action that we don't intend. The design should consider that humans can make mistakes, and for important tasks there should be a warning/message confirming the change (one that can alert the user and let him correct the action to what he intended).

Examples of such human mistakes:

We want to click the Yes checkbox, but the page scrolls and we click No without even noticing it.

or

We are not 100% concentrating, our brain is lost in its thoughts, and we are sometimes unaware of the action we just performed.

e.g. I bet you have wondered more than once, “Have I left the tap open after the bath?” when you had actually closed it.
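
[raj]: Learning #6 is easy to sketch in code. Below is a minimal Python example of a confirmation step before a critical change. The function name, the mode names and the confirm callback are hypothetical, made up by me just to show the idea.

```python
def apply_mode_change(current_mode, requested_mode, confirm):
    """Switch modes only if the operator confirms the exact change requested.

    `confirm` is any callable that takes a prompt string and returns True
    or False, e.g. a dialog box in a real UI (assumed here, not shown).
    """
    if requested_mode == current_mode:
        return current_mode  # nothing to change, nothing to confirm
    prompt = f"Change mode from {current_mode} to {requested_mode}?"
    if confirm(prompt):
        return requested_mode
    return current_mode  # operator declined; keep the old mode


# Usage: the second call simulates an operator who spots the slip and says no.
accepted = apply_mode_change("X-ray", "Electron", confirm=lambda prompt: True)
declined = apply_mode_change("X-ray", "Electron", confirm=lambda prompt: False)
print(accepted, declined)
```

The prompt spells out both the old and the new value, so a slip like clicking No when we meant Yes has one more chance of being noticed.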

 

Thoughts for you

If you are thinking this was a rare scenario, an example of the worst engineering, and that the machine must have been retired forever, then I want to leave you with this fact: this machine is still in use today, and there might be someone you know sitting in front of it as we speak. That’s why it is important to find defects before a life-saver turns life-taker.