Have you ever been in a situation where you had to explain why you can’t test everything, or why bugs sometimes escape your best efforts to find them? When I am talking to (most) people involved in the technical aspects of developing software, this explanation can be straightforward. They will likely understand the impossibility of complete testing, and that testing involves evaluating risk and investigating the areas of highest perceived risk at that time. However, when I am dealing with non-technical (and some technical) people, the explanation can be difficult. I think that I have come up with a pretty good, easy-to-understand metaphor that closely relates to software testing and the difficulties testers face when trying to determine how much to test and where to test. Hopefully this post will help you if you run across this situation in the future.
In the autumn of 2010 I was preparing my backyard for a 30 ft by 40 ft (9 m by 12 m) ice rink. I was raking the space where I was going to lay a big white tarp on the grass and over a surrounding wooden frame (12″ or 30 cm tall). I then fill this “pool” with 2-6 inches (5-15 cm) of water, which freezes and forms my ice rink. (For those interested, I also put up 2 ft (60 cm) tall boards, which hold the tarp in place on the frame and help keep the hockey pucks from going into the snow, but that has nothing to do with this post.)
Anyway, I was raking a large patch of yard trying to ensure that there were no rocks, sticks, toys, pointy plants, or any other debris that might be detrimental to maintaining the integrity of the tarp and thus not allow the dihydrogen monoxide to remain long enough to cool and transform into its solid state (I was trying to make sure nothing would poke a hole in the tarp and let the water run out). While I was raking I started to think about how my search for potential hazards to my tarp was very similar to a software tester’s search for bugs.
We can perform a sanity/smoke test: a visual scan to see if there are any obvious hazards sticking up out of the grass. This is shallow testing, as we do not inspect what lies below the surface. We can even use a tool called the “lawn mower” to help us in this task. This tool deals with any object that sticks up above a specified height and automatically removes it. About 99.99% of what this tool removes is non-hazardous grass, but most people feel better when we use the tool frequently because things look better in general, even though the tool is only performing a shallow, mostly cosmetic check for hazards (just like automated sanity checks). So, the tool has been used across the entire area and there are no hazards sticking up above the height of the grass. Well, none except the ones the tool missed: you know, those pesky weeds that somehow manage to lie down when the mower goes overhead. Sometimes the tool misses objects that we thought it would catch, and some manual intervention may be required to either fix the tool or deal with the missed object. Sometimes the tool alerts the operator to an unexpected hazard by sounding an alarm (making a noise or vibrating). The tool does not know whether the alarm is being sounded for a real hazard or not: it could be a piece of wire (hazard), a stick (hazard), or a piece of string (not a hazard, but it wraps itself around the blade and still causes a vibration). It is up to the operator to investigate the alarm and determine whether the cause is a real hazard.
Now that the shallow scan is complete, we still need a deeper scan to help find the missed objects. We can perform some deeper testing by looking through the grass, either manually (on our hands and knees) or by using a tool (a rake). If I choose to look through the grass manually, I can be fairly confident that the areas I have covered will be free of any important hazard (one that would damage the tarp), but the time commitment is too large (it would also make my back sore, but that is not part of the metaphor).
So, I decided to use the rake. I made a first pass fairly quickly and raked the entire area where the tarp was going to be placed. This pass simply removed the larger hazards (loose sticks) and the leaves that had fallen from the trees. I was certain that my coverage was still not good enough. I knew that there was still a very real possibility of a critical hazard for my tarp. I decided that I had better do a second, deeper pass and rake a lot more carefully. I started at one corner and worked my way along the narrow edge while I raked about 1 meter. I made many passes with the rake, paying particular attention to signs of any potential hazards. It was at this point that I realized the raking was similar in many ways to software testing. How do I know when I have raked any particular area enough to be confident that the tarp will not get punctured?
With about one-third of the lawn raked, my careful raking was giving me fantastic coverage, but I now had two new problems: (1) I was getting physically tired, and (2) I was going to run out of daylight about two-thirds of the way through if I continued at this pace. So, I decided to do what some software projects do when the deadline is approaching: I decided to increase the risk and decrease the coverage. I sped up my raking by about 50%. My coverage was noticeably lower, but I would be able to finish raking and lay down the tarp before it was dark (I would meet my deadline).
So, the first third of my lawn was covered MUCH better than the other two-thirds (not unlike some features I have tested in the past). I managed to get the tarp down before dark and fortunately there were no missed hazards that caused the tarp to be damaged.
Then I started thinking about the metaphor a bit more. I was wondering how well I would have done if I had a script that I had to follow. Start in the corner and brush the rake through the grass three times while looking for potential hazards. Take one step forward and brush three times with the rake. And so on. If I had to follow a script I wouldn’t have been able to so easily adapt to the shrinking time frame. I would have been stuck having to complete the prescribed raking or else my report would have shown incomplete raking vs. my plan. I would not have been able to brush 10 or 20 times with the rake in areas where there was more debris or areas that I suspected to have more debris.
What if “clear the lawn of hazards” had been an automated script, written before the lawn was available to be investigated? I likely would have covered some areas multiple times and missed other areas completely. The part of the lawn closest to the gate would likely have had the most coverage and the farther reaches of the lawn might have only had one or two rake strokes “planned”.
If I had brushed the rake at least once across the entire lawn, I would have had 100% lawn coverage – but that would definitely not have been enough for some portions (those that had rocks half buried or multiple hazards in a small area). My coverage stats would have looked fantastic even though my hazard detection would have been insufficient. I could also have claimed to have had 100% coverage with just the lawn mower. Had I stopped after just mowing the grass, I would have ended with no water in my “pool”.
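To make the gap between a coverage number and actual hazard detection concrete, here is a toy Python sketch. The model is an assumption made purely for illustration: the lawn is a handful of patches, each hazard has a hypothetical “depth” (the number of rake strokes needed to expose it), and the patches and depths are made up rather than measured from my yard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patch:
    """One patch of lawn in a made-up model (not real data)."""
    hazard_depth: Optional[int]  # strokes needed to expose the hazard; None = clean
    strokes: int = 0             # how many times this patch has been raked


def coverage(lawn: List[Patch]) -> float:
    """Fraction of patches raked at least once (the 'coverage stat')."""
    return sum(p.strokes > 0 for p in lawn) / len(lawn)


def hazards_total(lawn: List[Patch]) -> int:
    """Number of patches that hide a hazard."""
    return sum(p.hazard_depth is not None for p in lawn)


def hazards_found(lawn: List[Patch]) -> int:
    """A hazard counts as found only if the patch got enough strokes."""
    return sum(
        p.hazard_depth is not None and p.strokes >= p.hazard_depth
        for p in lawn
    )


def make_lawn() -> List[Patch]:
    # Hypothetical lawn: a loose stick (depth 1), a half-buried rock
    # (depth 4), and three clean patches.
    return [Patch(None), Patch(1), Patch(None), Patch(4), Patch(None)]


def rake_once_everywhere(lawn: List[Patch]) -> List[Patch]:
    """One stroke per patch: full coverage, minimal depth."""
    for p in lawn:
        p.strokes += 1
    return lawn


lawn = rake_once_everywhere(make_lawn())
print(f"coverage: {coverage(lawn):.0%}")
print(f"hazards found: {hazards_found(lawn)} of {hazards_total(lawn)}")
```

In this toy model a single stroke everywhere reports full coverage while the half-buried rock stays hidden, which is the same trap as a coverage report that says nothing about how deeply each area was examined.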
What if I had recorded my raking pattern and then been able to replay it the next year? How effective would that have been? I would spend a lot of time in last year’s problem areas and not adapt to this year’s new problem areas.
How well would I find problem areas that are just below the surface, such as an area hiding a large hazard that is not easily detectable by regular raking? Could I realistically expect to find that type of hazard before starting to fill the “pool” with water? Would the thin layer of dirt be enough to protect the tarp, or would the hazard cause a catastrophic failure? Is the raking that I have done enough to proceed to lay down the tarp and start filling the “pool”? The tarp cost over $150. How does that impact my decision? What is the likelihood of being able to fix a hole? These are the risks that I needed to consider when deciding how much coverage to aim for with the rake, and they are similar in many ways to the risks on software projects. The tester likely will not have all the information and must be able to inform business decision makers of the risks and of the coverage that has been achieved.
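One way to picture spending a limited effort where the perceived risk is highest is the small Python sketch below. It is only an illustration under stated assumptions: the zone names, the risk scores, and the stroke budget are all hypothetical, and proportional allocation is just one simple rule, not a method I actually used in the yard.

```python
from typing import Dict


def allocate_strokes(perceived_risk: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split a fixed stroke budget across zones in proportion to perceived risk.

    Every zone gets at least one stroke so that nothing is skipped entirely;
    the rest of the budget follows the (hypothetical) risk scores.
    """
    zones = list(perceived_risk)
    if budget < len(zones):
        raise ValueError("budget too small to rake every zone once")
    total_risk = sum(perceived_risk.values())
    if total_risk <= 0:
        raise ValueError("at least one zone needs a positive risk score")

    plan = {zone: 1 for zone in zones}
    remaining = budget - len(zones)

    # Ideal (possibly fractional) share of the remaining strokes per zone.
    shares = {z: remaining * perceived_risk[z] / total_risk for z in zones}
    for z in zones:
        plan[z] += int(shares[z])

    # Hand out strokes lost to rounding, largest fractional part first.
    leftover = budget - sum(plan.values())
    by_fraction = sorted(zones, key=lambda z: shares[z] - int(shares[z]), reverse=True)
    for z in by_fraction[:leftover]:
        plan[z] += 1
    return plan


# Made-up risk scores: higher means "I suspect more debris here".
risk = {"near the gate": 1, "under the tree": 5, "open grass": 2}
plan = allocate_strokes(risk, budget=19)
for zone, strokes in plan.items():
    print(f"{zone}: {strokes} strokes")
```

The numbers matter less than the shape of the decision: the budget is fixed by the deadline, the risk scores are a judgment made with incomplete information, and the resulting plan is something that can be shown to a decision maker along with what it leaves thinly covered.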
There is more that I could write about this but I feel that I have passed along the essence of what I was trying to convey. I hope that this metaphor will be of some use to you when needing to describe test coverage to non-technical people.
Postscript: If you live in warmer climates and the idea of a backyard ice rink is foreign (or even baffling) then you can think of laying down a “slip ‘n slide” (in this case, the tearing of the plastic would be secondary to the potential tearing of skin of the children sliding – so I would likely have done a more thorough search than I did for the rink).
Paul, I like your real world example for this. I tend to have this specific conversation with candidates in interviews, either because they come from someplace that taught them that they had to test 100%, or because they are naive, non-testers applying for an entry level test position. I tend to try and explain it quickly like this:
The majority of our software is not used to transfer funds, run life support systems, or send things into space. Companies that do things like write software to get a rocket into space, with people in it, and have it and its contents arrive safely at the ISS test 100% of their code (though that doesn’t mean it can’t still fail). We currently ship a major release once a year. If we wanted to test 100% of everything, we would need to release every 4 or 5 years. Instead we test the highest risk areas first, and rely on the skill and knowledge of our testers to say when they feel comfortable enough with the product based on their experience and the amount of testing they accomplished. This gives our customers a balance of new features available more often and a high degree of confidence in the product we ship.
I had to change it a year or so ago to start with “the majority” because we acquired a class III FDA regulated medical device (software) that has to (obviously) be tested differently.
Great analogy. This topic came up at the recent #MEWT and this post was mentioned. It’s a great demonstration, one I’ll share with the whole team, and certainly something to keep up my sleeve next time I’m challenged on not using test scripts for a particular project! Thanks!