Hitting the Pothole of Trying to Automate Everything

One particular client wanted to automate as much of their testing as possible, so I came up with a metaphor to describe why trying to automate almost all “regression testing” is a difficult path to success. Automation has benefits and weaknesses that need to be weighed alongside its costs.

Consider your software to be a country, and the people in your country are your users. The inhabitants drive around the country and experience the different aspects of each geographical region (use the software in different ways). As they drive they will encounter potholes in the roads. Some potholes are so large that driving down a particular street might damage the vehicle or make the street impassable. Other potholes are small and merely inconvenient because you have to drive slower or pay more attention to avoid them. There are also some potholes that don’t look too big or nasty when you hit them, but they can cause damage to your car that you are not aware of at the time.

How would you find and fix potholes in the roads? Everybody wants streets with no potholes (bug-free software) – but road crews realize that is an impossible goal given their budget and time constraints. They need to focus on finding and fixing the most dangerous and disruptive potholes.

One solution might be for the road crew to send droids to drive along roads and report back any potholes that they detect (automated scripts).

Wow, that sounds great! What could go wrong?

Problem 1 – There are more kinds of problems than automation can be programmed to recognize

Droids may miss reporting a pothole because that pothole looks different from what they were specifically programmed to search for. People also report other road problems that the droids miss, because droids cannot be programmed to detect every possible problem: missing barricades (security checking, input constraints), problems with road signs (help screens), and narrow lanes causing slow and unclear navigation (usability).

Problem 2 – Investigating reported failures takes a long time

Droids will sometimes report potholes that aren’t really potholes. False positives might be caused by a problem with the droid, a change in the road layout, or a puddle or some garbage in the street. Every pothole report from a droid needs to be investigated to confirm that it describes a legitimate pothole, and that takes quite a lot of time: a tester has to revisit the road with the reported pothole to determine whether the report is legitimate. If there are a lot of false positives, it may take longer to investigate the failures than it would have taken to have a person perform the testing instead.

Problem 3 – There are more checks to automate than can possibly be written

The road crews have decided to create a fleet of droids that drive along as many streets as possible. The road crew started with droids checking all the major highways. Next they want to start working on other major roads. Every time a significant pothole is reported on a road that is not covered by a droid, a new droid is programmed to check that road in all future road scans. A full scan may start out taking a day, but soon it takes weeks. And the goal of full road coverage always seems to recede over the horizon: the road crew thought that if they could cover the major roads then they would have covered the entire country, but there are many times more miles of secondary roads that are equally important to some drivers, and even more prone to potholes.

Problem 4 – Automation is expensive to build and maintain

The crew finds that they are so busy maintaining the droids checking the highways, and investigating pothole reports, that they don’t have time to create new droids. To continue to grow the droid network they need more people and more money. The cost of this approach is so large that the road crews have already spent far more than their allocated budget. They have asked for more funding and are waiting for approval. They hope to get the extra funding because they have invested so much already that it would be a shame to stop now: the sunk-cost fallacy in action.

Problem 5 – Some things are too difficult to automate effectively

There are things that humans are really good at, such as investigating, observing, and detecting that something is wrong. It is impossible to automate investigation. After the investigation has been performed, it is possible to create a check that redoes the steps covering one implementation of the original intent of the investigation. Attempts to automate difficult algorithms will often cost more to write and maintain than the equivalent manual checks would cost to execute over the life of the product.

Conclusion

Attempting complete automated coverage costs huge amounts of money and time, and still does not find all the problems.

A Strategic Mixed Approach

Perhaps a better solution would be to send droids to drive along the most important roads and report any potholes that they see, and also to deploy actual people to drive around and look for potholes. These people could also use other tools to discover potholes, like aerial photos, previous pothole reports, and reports of areas where recent roadwork has occurred. Many factors could influence the selection of which roads to inspect and by what means: the people who pay the highest taxes may get their roads checked, along with the most heavily used roads, roads that are frequently used by emergency vehicles, and so on. In other words, the people and droids are deployed strategically and not uniformly across all roads.

Summary

So, I don’t recommend automating all the checks that you can. Focus your automation on key areas and spend time every release actually testing your product. By running only the same automated checks over and over again, you will never find different problems that already exist in your software, such as problems in code paths not checked by the scripts as well as problems in checked paths that are missed by the automation. When testers test the software they will explore different paths and observe far more than a script ever can.

When you are planning your automation coverage, under what circumstances should you consider automating a check? (A small code sketch follows the list.)

  • When the cost is low (including ongoing maintenance costs)
  • When the item being checked is important enough
  • To act as a benchmark for future tests
  • To reproduce a failure (especially intermittent problems)
  • To confirm that a fixed failure has not returned
  • When the check is difficult for a person to perform (high volume, critical timing, etc.)
  • When there is a clear set of established rules that the software must follow (e.g.: communication protocols)
  • If the check will need to be executed often in the future (regression)
  • If you want to compare different platforms (although this can present new issues with automation)
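To make the “low cost, important check” idea concrete, here is a minimal sketch of the kind of check that meets several of the criteria above at once. Everything in it is hypothetical: pytest is assumed as the runner, and parse_price() is a toy stand-in for whatever piece of your own product carries clear, established rules.

    import re

    import pytest


    def parse_price(text: str) -> float:
        """Toy stand-in for product code under test: '$10.00' -> 10.0."""
        match = re.fullmatch(r"\$(\d+\.\d{2})", text)
        if match is None:
            raise ValueError(f"not a price: {text!r}")
        return float(match.group(1))


    def test_parse_price_known_values():
        # Cheap to write, cheap to maintain, and it pins down behaviour
        # that must not regress: a benchmark for future releases.
        assert parse_price("$10.00") == 10.0
        assert parse_price("$0.99") == 0.99


    def test_parse_price_rejects_garbage():
        # A clear, established rule the software must follow.
        with pytest.raises(ValueError):
            parse_price("ten dollars")

A check like this is low cost, acts as a benchmark, and has regression value; a check that meets none of the listed criteria is a candidate for a human instead.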

Functional Specification Blinders

During my last few years as a test manager at Alcatel-Lucent, I decided to try a slightly different approach to the development of test ideas.
Before this experiment my testers would be given a feature to test and they would start by reading the functional specification (user stories if you are in an agile environment). They would develop a list of potential test ideas based on the specification document and THEN add any other ideas they could think of.
I found that when I reviewed their list of test ideas I would often think of fairly creative test ideas that were missing from their list. Now, I would like to think that the root cause of my creativity was my highly endorsed “general awesomeness” (see my LinkedIn profile for all endorsement bombings I have received) but I am not that self-confident. I thought I would try an experiment. The experiment went very well and it changed how my team approached new testing situations.
When my team was first presented with a new feature, I would give them time to explore the feature and develop test ideas BEFORE I let them look at the functional specifications. I tried this as an experiment on one feature and found that the creativity shown by the team was noticeably higher than when they had started with the functional specifications. I have labelled this effect “Functional Specification Blinders”: once you have read the information in the functional specification, it becomes very difficult to think of testing ideas that are not included in the functional spec document.
I think the reason behind this goes back to how we were exposed to learning as children. We are asked to think of explanations of how something might work (or react, or behave, etc.) and then we are told how it works by a teacher (or other “expert”). Once we are told “the truth” we are expected to accept it and move on with our new knowledge. The problem with this approach is that often “the truth” is wrong, or at least incomplete or misleading. I have found that the same phenomenon happens when thinking about how software might be tested (or used, or implemented). By allowing my team to use their creativity before reading the functional spec, they have the freedom to think of test ideas that are more creative, unusual, and potentially more powerful.
In one Rapid Software Testing class I was teaching, during an exercise on creating models, the students were asked to think of how to factor an object (identify all of its dimensions), or in other words, to identify a thorough list of anything that we might want to test. One student discovered a published standard for the object before he had tried to think of any dimensions on his own. For the remainder of the exercise (roughly 5-10 minutes) he was completely incapable of thinking of ANY dimension that was not mentioned in the standard. I was pleased that he was able to show me a bit more evidence for my “Functional Specification Blinders” theory.
So, if you have the freedom to investigate your feature in whatever way you choose, then I encourage you to start with methods of learning about the feature other than reading the specifications. You can use any of the methods from this incomplete list:

  1. Play with the feature. If the feature is available (in any form) then interact with it.
  2. Talk to the product owner/product manager
  3. Talk to marketing
  4. Talk to customers
  5. Talk to the developers (or their manager)
  6. Talk to customer support
  7. Read marketing materials, claims, web sites, etc.
  8. Investigate competitor’s products
  9. Have discussions with other testers
  10. Talk with the architects/designers

By exploring some of these avenues BEFORE reading the specifications, you allow your creativity to blossom. The specification will still be there, waiting for you, once you have tried thinking for yourself.

Reply to Human and Machine Checking

The last couple of days have seen some posts regarding the difference between testing and checking, in addition to a further comparison of machine checking and human checking.

The posts that I have seen so far are from James Bach and Michael Bolton: Testing and Checking Refined

Iain McCowatt’s blog in response to that post: Human and Machine Checking

This is my response to Iain’s post. You may need to read the above posts before reading this post – or perhaps not.

First let me say that I have been a big fan of the distinction between testing and checking since Michael Bolton first mentioned it to me at TWST (Toronto Workshop on Software Testing) about 4 years ago (I think). I am very grateful to James, Michael, and Iain for making their stances on this topic very clear in very nice blog posts.

There is a point that I would really like to make, one that I felt had been clear all along; after talking with Michael Bolton last night, however, it is clear to me that this point is not currently clear in the minds of all (or even most) testers.

When a “tester” (for this post I will use this term to refer to someone assigned to execute one or more manual scripts) starts to execute a script, they will not behave in the same way as a machine executing an automated script would. This should be quite obvious to anyone who stops for a moment to think about it. The machine has the ability to execute commands and compare the results much faster than a human, BUT the machine can only compare what it has been programmed to compare. In his post, Iain quite nicely states in his summary:

Computers are wondrous things; they can reliably execute tasks with speed, precision and accuracy that are unthinkable in a human. But when it comes to checking, they can only answer questions that we have thought to program them to ask. When we attempt to substitute a machine check for a human check, we are throwing away the opportunity to discover information that only a human could uncover.

Iain also very eloquently mentions that humans will always be able to do more than just checking:

What a machine cannot do, and a human will struggle not to do, is to connect observations to value.  When a human is engaged in checking this connection might be mediated through a decision rule: is this output of check a good result or a bad one? In this case we might say that the human’s attempt to check has succeeded but that at the point of evaluation the tester has stepped out from checking and is now testing. Alternatively, a human might connect observations to value in a way such that the checking rule is bypassed. As intuition kicks in and the tester experiences a revelation (“That’s not right!”) the attempt to check has failed in that the rule has not been applied, but never mind: the tester has found something interesting. Again, the tester has stepped out from checking and into testing.

The part that I didn’t see Iain mention (and this is the point that I wanted to make) is that not all “testers” will notice much more than a machine. I suggest that the tester will likely only notice more than what they are asked to check (i.e.: more than a machine) IF they possess at least one of these traits:

  • engaged,
  • motivated,
  • curious,
  • experienced,
  • observant,
  • thoughtfully applying domain knowledge (I couldn’t think of a way to shrink this one down to a single word).

Some of these traits may be present and the tester still will not necessarily notice something “outside” the script – but without any of these traits being present during the script execution, I suggest there is little hope that the tester will notice anything “off script”.

I have an acquaintance who is the director of a large system test group at a Telecom company (not at my previous employer – Alcatel-Lucent). She wanted to assess the effectiveness of her manual test scripts, so she had over 1000 fault reports raised by manual testers analyzed for the trigger that made the tester raise the bug. She found that over 70% of the fault reports raised over the past year had been raised by a tester noticing that something was wrong that was NOT specified in the script. Only 30% of the faults were triggered by following the script.

To me this is incredibly important information! If I were to replace all of those tests with automated tests, then my fault finding rate would drop by 70%. If I were to outsource my testing to the cheapest bidder, then I predict that my fault finding rate would drop off dramatically because the above traits would likely not be present (or not as strong) in the off-shore test team.

As I reflect on what I have been saying about testing vs. checking over the past few years I have been assuming that when I talk about “checking” I am talking about unmotivated, disinterested manual testers with little domain knowledge OR I have been talking about machine checking. Once you introduce a good manual tester with domain knowledge, then you will find it very difficult to “turn off” the testing. To me the thought of a good tester just “checking” is absurd.

Good testers will “test” whenever they are interacting with the system – whether they are following a script, a charter, or just galumphing. Bad testers will tend to “check” when they interact with the system – either because they don’t care enough or because they don’t have the required knowledge to “test”. Machines will “check” (at least for now – who knows what the future will bring?).

Killer Interview Questions

I have been interviewing and hiring testers since 1999. I like to think that I am pretty good at finding testers who can think, communicate, and fit into the team. I have three questions that I like to ask that really help me determine if the person can think and communicate. I don’t have any questions regarding “fitting into the team”, that is just based on how well I (and any other interviewers) interact with the person during the interview. In the 13 years that I have been interviewing testers I have hired roughly 120 and only regretted the hiring of three of them:

  1. One quit just before I was going to put him on a PIP (Personal Improvement Plan). This tester was the fourth person I hired and he was the husband of a friend. He did not answer my three questions well but I hired him because of the friend connection (drat).
  2. One was a very good tester but she did not really “fit in with the team”. She transferred from another division and I only did a phone interview with her. I am quite confident that if I had been able to interview her in person then I likely would not have hired her – but that is pure speculation on my part.
  3. One was let go during a round of layoffs. This tester is the only “anomaly” in my otherwise explainable hiring record. In other words, I can explain why I mistakenly hired the other two – but I can’t explain this one. She answered the questions well and then was just a really bad tester – not willing to learn or change the way she approached problems, not a good problem solver, and seeming to prefer documentation over testing.

So, I think that my record is pretty good (a >97% success rate – and >99% if you remove the one who didn’t answer the questions and the one I didn’t interview in person).

How did I achieve such a high success rate? Well, first off, I did not search for domain experts. I often had to hire for roles that were very specific (e.g.: Physical layer tester for DSL lines – there are not very many testers with that type of experience). I found out fairly early on that, in my context, it was much easier to find a critical thinker and problem solver who could then learn the domain than to find a domain expert who would not be able to think his way out of a simple issue in the lab.

In my last 4 years at Alcatel-Lucent there were three test managers. John Hazel and I worked very well together and he is the author of one of the questions that I use. Whenever possible, John and I would interview together and we would enjoy the entire process. The other manager never really followed what John and I were doing. I’ll call the other manager Michel (because that is his real name). Michel had a position to fill and he went about hiring a new University graduate without asking for help from John or me. The tester that he hired was very “book smart” but was scared to do anything in the lab. He would ask other testers to pull cards, run cables, and reprogram SIM cards. It was not a surprise to us when he transferred to a different role within 6 months of joining Michel’s team.

One important aspect of my hiring practice is to focus my hiring on people directly out of University (or College) with a good working knowledge of computers. I have found that I am FAR more likely to find a superstar this way than by looking at people who are floating around in the industry with many years of bad habits (er, experience). These new graduates, if they have a computer science or electrical engineering degree, should be able to code tools or automation to a satisfactory level. I honestly just make that assumption, and if I find out that I am wrong, I can either move them into a manual test role or just let them go. I have never had to do this though.

A trick in the hiring process is to make sure that you are hiring people who not only want to be testers, but who will also probably be good at it. We all know that many people are just not capable of being good testers – I typically refer to this group of narrow-minded people as “software designers” :-P.

It is also very important that the testers in your organization get treated with respect and receive similar pay to the S/W designers. If this is not the case, then you have a much harder battle to fight with your management organization. It also helps a lot if the new tester is not “the” tester, but a member of a testing team (even a small team of 2 or 3 other people), so they have an opportunity to see that “test” is a viable career. If they are the only tester (or just one of two), then they can see that they have no career path, no mentor, and nowhere to go – except over to the “dark side”.

Reader: “Enough history and boring background crap. Cut to the chase, Paul. I’m about to start to skim for the questions.”

Okay. Okay. Here are the three interview questions that I like to ask. John really liked to ask other (fairly useless) background questions like “What was your biggest mistake? What was the result? What did you learn from it?” He also liked to do the resume crawl: “So, tell me, what are you most proud of from your last job?” I can’t say that it was a complete waste of time, as some interesting nuggets got exposed, but more often than not they did not really add to my decision whether or not to hire that person.

The questions that I ask have fairly long setups, but the answers are fairly simple:

Question 1: I describe a hypothetical situation to them. I let them know that the setup is fairly long but the answer I am looking for is fairly simple. I tell them they are in charge of daily automated sanity testing of new S/W builds. It takes about an hour to load the S/W onto the system and perform the test. They run the test on Friday and everything is fine, so they send out an email to the team telling them to promote this code. (If they had found an error, they would have sent an email to the owner of the failing section of code to have them fix it.) They are going to take a week off, and they ask me to perform the testing while they are away. “No problem”, I reply. When they return from vacation 10 days later they discover that the load does not work. They ask me how the testing went while they were away, and I tell them that I completely forgot to perform the tests. They quickly send an email off to the owner of the failing module and she says that there were many check-ins throughout the week, including weekends. She needs to know the first load that failed. Oh, and she is going on vacation in 4 hours. So, they have a load that worked last Friday, and now, 10 loads later (we do builds on the weekends too), they have a load that does not work. All loads are available on the server. We need to find out which load first broke so the SI (system integrator in charge of the builds) can review the many check-ins from that day and locate the problem. The question is: how would they determine which load last worked and which one first broke in four hours or less? They only have one setup (if the person asks whether they have multiple setups, I like that, but the answer is “no”).

What to look for in the answer: Analyze their thought process. If they are trying to review the check-ins, talk to designers about when they thought the code might have broken, or look at code themselves – that is a red flag to me – usually indicates that they are thinking more like a S/W designer. If they want to check the loads sequentially (from either end) – this is also a red flag – but less so than the previous one. You can reiterate at this point that you need to know quickly (restate the 4 hour time limit).
Hopefully, they will then realize that they need to do a binary (or bisection) search. I have found that about half the people get the binary search right away. It surprises me how often this situation arises in testing. For example, I last tested feature A about 4 months ago and everything was fine. Just before a major release I go to test it again and it is broken. The designers have no idea what could be wrong as no one has touched that code in months. They need to know the load where it first broke to be able to find the interaction that is causing the problems. If you don’t do a binary search, you will likely be trying an awful lot of loads before you find the answer.
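For the record, here is what the bisection looks like in code. This is only a minimal sketch: the load numbers and the is_broken() function (standing in for one install-and-test cycle) are hypothetical. With ten suspect loads, each cycle halves the suspect range (10 → 5 → 3 → 2 → 1), so at most four one-hour cycles are needed, which is exactly the four-hour budget in the question.

    def first_broken_load(loads, is_broken):
        """loads[0] is known good and loads[-1] is known broken.
        Return the first load for which is_broken() reports True."""
        good, bad = 0, len(loads) - 1
        while bad - good > 1:
            mid = (good + bad) // 2
            if is_broken(loads[mid]):  # one install-and-test cycle (~1 hour)
                bad = mid
            else:
                good = mid
        return loads[bad]

    # Example: load 0 is last Friday's good load, load 10 is today's broken
    # one. If the real break happened at load 7, this tests loads 5, 7,
    # and 6 only:
    print(first_broken_load(list(range(11)), lambda n: n >= 7))  # prints 7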

Question 2: This question involves a simplified network diagram. Your own computer (A) and another computer (B) in the next (empty) cubicle are on the same subnet, a computer (C) in the lab (a five minute walk away) is on another subnet, and the mail server is on a third subnet (I usually draw the main subnet routers (X, Y, and Z) as well). You get a phone call from your boss asking you to go into work: there is something critical she just sent to you via email. Her cell phone battery dies at that point, so you can’t talk to her anymore, and I am not accessible either (drat). You come into work even though it is a statutory holiday – there is no one else around to help you. You really need to get that email, but the problem is that you cannot access the mail server (a “mail server is not responding” message is being displayed). What would you do?

Answer to #2: Their initial response is usually quite interesting. Some people just say that they would call the “help desk” or go home and wait until the next working day. This indicates to me that they may not be willing to investigate problems fully (especially considering that was their answer during an interview). Let them know that they “need” to get the email, or at least exhaust their possibilities.
I let the problem change depending on their answer and their experience level. I let them tell me what they would try, and I give them the result. As an example, they might say they would try to ping router “X”, or workstation “B”, and I tell them “X” is alive, or no response. I also usually start with a simple “the cleaner knocked the cable out” problem and then tell them that they have now come in on another stat holiday and start the question over again. You have to be careful that you are not assessing their networking knowledge (unless required for the job) so much as their problem-solving abilities. In some cases, I have had to let applicants know that there is a “ping” command, or that they can try to telnet to “C”. Another interesting thing to analyze is what they do once they can get their mail. For instance, they try from “A” and it doesn’t work; then they try from “B” and it does work. Some people have been happy that they got their email, and they show no indication of looking for the cause of why their own computer is not functioning (I find this behaviour quite scary 🙂 ). I usually make them work it out until they discover a good probable cause, like router “Z” being down, or the mail server application being down (but sometimes they can still ping the server itself). I let them guide the answer, and the more they know about networking, the harder I make the failure.
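For interviewers (or candidates) who like to see the idea written down, here is a minimal sketch of the systematic probing the question is fishing for. It assumes a Unix-like machine with ping on the PATH, and the host names are placeholders taken from the whiteboard diagram, not real addresses.

    import subprocess


    def reachable(host: str) -> bool:
        """Send a single ping and report whether the host answered."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            capture_output=True,
        )
        return result.returncode == 0


    # Walk outward from your own desk (A) toward the mail server,
    # localizing the failure between two points, the same bisection
    # habit the build question rewards.
    for host in ["B", "X", "Y", "Z", "mailserver"]:
        print(host, "alive" if reachable(host) else "no response")

The point is not the code; it is the habit of probing one hop at a time until the failure is pinned down.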

Question 3: I would like to take credit for this last question but this one is by John Hazel.

The question presents a brief (and flawed) functional specification to the interviewee. The FS describes the billing rules of a small business called “Al’s Delivery Service”. I verbally describe Al’s business as a small operation (with only one office) with dedicated clients and little to no competition. He delivers only single bricks to any customer (the size, shape, and approximate weight of a whiteboard eraser). The actual composition of the brick can be anything you like (I usually imply an illegal substance, to explain the dedicated customers and the unreasonably high rates charged by Al).

What I write on the board:

Al’s Delivery Service
1. If the delivery is < 10 km from Al’s office the base cost of the delivery is $10.00
2. If the delivery is > 10 km from Al’s office the base cost of the delivery is $25.00
3. If there are any stairs involved in the delivery path, there is an additional $10.00 charged
4. If the delivery occurs on the weekend the cost of the delivery is a flat rate $50.00 (verbally clarify: regardless of distance and/or stairs)

I let them know right away that there are two parts to the question:
Al had a friend of his write the software that does the billing to the above spec. He has asked you, as a professional tester, to test the math in the billing software before he starts using it. You can stress that you do not want them testing the UI – just the math portion.

3a. What are the test scenarios that you would suggest to Al to verify that the math is correct in the typical situations? Sometimes I say that the UI is limited to two radio buttons for <10 km or >10 km and two check boxes: one for stairs and one for weekend. There is a “go” button. That is it, so they cannot get into “I would enter negative 1 for the distance”, etc.
3b. What corrections, clarifications, and/or limitations would you make to the existing FS that will help Al’s business succeed?

Answer to question 3a: I am only looking for the following answer for the scenarios:

< 10 km
> 10 km
< 10 km w/stairs
> 10 km w/stairs
weekend (either on its own, or with the above 4 conditions)

So essentially I am looking for either 5 or 8 scenarios as the “correct” answer, as long as they have an acceptable reason for going with 5 or 8. My preferred answer is 8, but I accept 5 as it was John’s original answer (though he claimed knowledge of the implementation). I am also willing to listen to any plausible context for different answers.
Frequently I have to put limitations on the presented user interface to prevent the literally countless possible inputs. I tell them that they have two radio buttons for “<10 km” and “>10 km”, and two check boxes (weekend and stairs). If they are being a bit more thorough, then I let them know that the default is <10 km with both check boxes unchecked. To be clear, the radio buttons are mutually exclusive, and the check boxes are independent.
What I am looking for in the answer (besides either of the two answers above) is the thought process that they display coming up with the list. Some people fumble around and have no clear thought process, while others come across as much more organized and clear in their method. Honestly, I have not seen much difference in the overall effectiveness of either type, but I have only ever hired one person who messed up this question (both parts), and they were mentioned at the beginning of this post – they quit after 4 months.
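For readers who want the scenarios pinned down precisely, here is a minimal sketch in code. The compute_cost() function is my hypothetical reading of the spec, not Al’s actual software; note that the two radio buttons conveniently sidestep the “exactly 10 km” gap, which is precisely the kind of thing part 3b should expose.

    def compute_cost(over_10km: bool, stairs: bool, weekend: bool) -> float:
        """Billing rules as written in the FS (hypothetical reading)."""
        if weekend:
            return 50.00                       # rule 4: flat rate
        cost = 25.00 if over_10km else 10.00   # rules 1 and 2
        if stairs:
            cost += 10.00                      # rule 3
        return cost


    def test_all_eight_scenarios():
        expected = {
            (False, False, False): 10.00,  # < 10 km
            (True,  False, False): 25.00,  # > 10 km
            (False, True,  False): 20.00,  # < 10 km w/stairs
            (True,  True,  False): 35.00,  # > 10 km w/stairs
            (False, False, True):  50.00,  # weekend, regardless of the rest
            (True,  False, True):  50.00,
            (False, True,  True):  50.00,
            (True,  True,  True):  50.00,
        }
        for (over, stairs, weekend), cost in expected.items():
            assert compute_cost(over, stairs, weekend) == cost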

The second part of the question provides much more interesting information for the hiring decision.
Once again, I am looking for some particular answers, yet there are MANY others that I have heard and appreciated as very valid (although sometimes obscure). The “main” answers are:

What about = 10 km (more of a design issue than a FS issue as it would be almost impossible to measure that far exactly)
What, if any, is the maximum distance?
What, if any, is the maximum number of stairs?
What about statutory holidays?

A small sample of other possible answers:
What time do weekends/holidays start?
What about evenings?
What about the minimum number of stairs?
What if the delivery starts on a Friday, but is a long distance and ends on the weekend?
Offer a discount to repeat customers.
Charge by kilometre traveled (similar to main point 2)
Have more divisions in the distance (5, 10, 15, 20, 25 km) to smooth the $10 to $25 jump.
Charge by flight of stairs (similar to main point 3)
What about parking?
Does he do international deliveries? What about custom charges?
There are many other possibilities – you can easily use your own judgment. Until they get the main 4 points, I usually allow them to sweat it out with occasional small hints (normally just pointing out which of the FS lines needs clarification/limiting/fixing). After they get the 4 main points, I let them finish at their own pace.

Well, those are the three main questions that I ask in my interviews. I hope they serve you as well as they have served me over the past 13 years.

How to Describe Test Coverage to non-Testers

Have you ever been in a situation where you had to explain why you can’t test everything, or why bugs sometimes escaped your best efforts to find them? When I am talking to (most) people involved in the technical aspects of developing software, this explanation can be straightforward. They will likely understand the impossibility of complete testing, and that testing involves the evaluation of risk and investigating the areas of the highest perceived risk at that time. However, when I am dealing with non-technical (and some technical) people, the explanation can be difficult. I think that I have come up with a pretty good, easy-to-understand metaphor that closely relates to SW testing and the difficulties testers face when trying to determine how much to test and where to test. Hopefully the following post will assist you if you run across this situation in the future.

In the autumn of 2010 I was preparing my backyard for a 30 ft by 40 ft (9 m by 12 m) ice rink. I was raking the space where I was going to place a big white tarp onto the grass and over a surrounding wooden frame (12″ or 30 cm tall). I then fill this “pool” with 2-6 inches (5-15 cm) of water, which freezes and forms my ice rink. (For those interested, I also put up 2 ft (60 cm) tall boards which hold the tarp in place on the frame and help keep the hockey pucks from going into the snow – but that has nothing to do with this post.)

Anyway, I was raking a large patch of yard trying to ensure that there were no rocks, sticks, toys, pointy plants, or any other debris that might be detrimental to maintaining the integrity of the tarp and thus not allow the dihydrogen monoxide to remain long enough to cool and transform into its solid state (I was trying to make sure nothing would poke a hole in the tarp and let the water run out). While I was raking I started to think about how my search for potential hazards to my tarp was very similar to a software tester’s search for bugs.

We can perform a sanity/smoke test: a visual scan to see if there are any obvious hazards sticking up out of the grass. This is shallow testing, as we do not inspect what lies below the surface. We can even use a tool called the “lawn mower” to help us in this task. This tool will deal with any object that is sticking up above a specified height and automatically remove it. About 99.99% of what this tool removes is non-hazardous grass, but most people feel better when we use the tool frequently because things look better in general, even though the tool is only performing a shallow cosmetic check for hazards. The tool is mostly cosmetic (just like automated sanity checks).
So, the tool has been utilized across the entire area and there are no hazards sticking up above the height of the grass. Well, none except the ones the tool missed – you know, those pesky weeds that manage to somehow lay down when the mower goes overhead. Sometimes the tool misses objects that we thought it would catch – some manual intervention may be required to either fix the tool or deal with the missed object. Sometimes when the tool is being used it alerts the operator to an unexpected hazard by sounding an alarm (making a noise or vibrating). The tool does not know if the alarm is being sounded for a real hazard or not: a piece of wire (hazard), a stick (hazard), or a piece of string (not a hazard, but it wraps itself around the blade and still causes a vibration). It is up to the operator to investigate the alarm and determine whether the cause is a real hazard or not.

Now that the shallow scan is complete, we still need a deeper manual scan to help find the missed objects. We can perform some deeper testing by looking through the grass – either manually (on our hands and knees) or by using a tool (a rake). If I choose to look through the grass manually I can be fairly confident that the areas I have covered will be free of any important hazard (those that will damage the tarp), but the problem here is that the time commitment is too large (it would also make my back sore, but that is not part of the metaphor). So, I decided to use the rake.
I made a first pass fairly quickly and raked the entire area where the tarp was going to be placed. This pass simply removed the larger hazards (loose sticks) and the leaves that had fallen from the trees. I was certain that my coverage was still not good enough. I knew that there was still a very real possibility of a critical hazard for my tarp. I decided that I had better do a second, deeper pass and rake a lot more carefully. I started at one corner and worked my way along the narrow edge while I raked about 1 meter. I made many passes with the rake and paid particular attention to signs of any potential hazards. It was at this point that I realized the raking was similar in many ways to SW testing. How do I know when I have raked any particular area enough to be confident that the tarp will not get punctured?
With about one-third of the lawn raked, my raking was giving me fantastic coverage, but I now had two new problems: 1. I was getting physically tired, and 2. I was going to run out of daylight about two-thirds of the way through if I continued at this pace. So, I decided to do what some software projects do when the deadline is approaching: I decided to increase the risk and decrease the coverage. I sped up my raking by about 50%. My coverage was noticeably lower, but I would be able to finish raking and lay down the tarp before dark (I would meet my deadline).

So, the first third of my lawn was covered MUCH better than the other two-thirds (not unlike some features I have tested in the past). I managed to get the tarp down before dark and fortunately there were no missed hazards that caused the tarp to be damaged.

Then I started thinking about the metaphor a bit more. I wondered how well I would have done if I had a script to follow. Start in the corner and brush the rake through the grass three times while looking for potential hazards. Take one step forward and brush three times with the rake. And so on. If I had to follow a script I wouldn’t have been able to adapt so easily to the shrinking time frame. I would have been stuck having to complete the prescribed raking, or else my report would have shown incomplete raking vs. my plan. I would not have been able to brush 10 or 20 times with the rake in areas where there was more debris, or areas that I suspected to have more debris.

What if “clear the lawn of hazards” had been an automated script – written before the lawn was available to be investigated? I likely would have covered some areas multiple times and missed other areas completely. The part of the lawn closest to the gate would likely have had the most coverage, and the farther reaches of the lawn might have only had one or two rake strokes “planned”.

If I had brushed the rake at least once across the entire lawn, I would have had 100% lawn coverage – but that would definitely not have been enough for some portions (those that had rocks half buried or multiple hazards in a small area). My coverage stats would have looked fantastic even though my hazard detection would have been insufficient. I could also have claimed to have had 100% coverage with just the lawn mower. Had I stopped after just mowing the grass, I would have ended with no water in my “pool”.

What if I had recorded my raking pattern and then been able to replay it the next year? How effective would that have been? I would have spent a lot of time in last year’s problem areas while not adapting to this year’s new problem areas.

How well would I find problem areas that are just below the surface – an area hiding a large hazard not easily detectable by regular raking? Would I realistically expect to find that type of hazard before starting to fill the “pool” with water? Would the thin layer of dirt be enough to protect the tarp, or would the hazard cause a catastrophic failure? Is the raking that I have done enough to proceed to lay down the tarp and start filling the “pool”? The tarp cost over $150 – how does that impact my decision? What is the likelihood of being able to fix a hole? These are the risk areas that I needed to consider when deciding how much rake coverage was enough – similar in many ways to the risk areas for software projects. The tester likely will not have all the information and must be able to inform the business decision makers of the risks and of the coverage they have performed.

There is more that I could write about this but I feel that I have passed along the essence of what I was trying to convey. I hope that this metaphor will be of some use to you when needing to describe test coverage to non-technical people.

Postscript: If you live in warmer climates and the idea of a backyard ice rink is foreign (or even baffling) then you can think of laying down a “slip ‘n slide” (in this case, the tearing of the plastic would be secondary to the potential tearing of skin of the children sliding – so I would likely have done a more thorough search than I did for the rink).

Reinventing yourself – Remain useful

There are some jobs in the world that have routines that do not change much (they involve repetitive steps without much modification) and some jobs that require frequent new approaches and more thought (changes to what is done on a daily basis). Unfortunately, a lot of people/companies place testing into the first category and not into the second category – where it belongs.

Some jobs that do not require much change to your actions: assembly line worker, grocery store cashier, gas station attendant, and TSA guard. Yes, every one of these jobs involves some thought and some modification to routine in certain situations (e.g.: TSA guards and cashiers need to react to unruly customers), but for the most part these jobs continue, day after day, without too much modification to their routine. It is also important to note that these jobs involve a lot of “checking” by following set procedures (scripts) and not too much “thinking” about what is happening in the surrounding environment.

Good stand-up comics are continuously developing new material to keep their audiences laughing and coming back. How boring would it be to always hear the same jokes each time you heard a particular comedian? Would that comedian serve his purpose of entertaining you?

Why do many testers continuously use the same tests over and over and over again? Sure, those tests (checks) may have found important bugs once. Sometimes they still manage to stumble across some issues, but mostly those tests (checks) are washed up. That does not necessarily mean that those checks need to be executed every release, and it DEFINITELY doesn’t mean that new tests are not required. Some tests (checks) are important enough to warrant being executed each release, but those checks will not find new bugs in the existing code.

As a group, testers need to work towards moving the profession of software testing into the category of jobs that require thought. We need to be advocates for creating new tests to supplement or replace the old ones. We need to make our jobs more interesting and engaging by creating a situation where a higher level of thinking is required and where mindless testing is frowned upon.

Testers also need to work on frequent self-improvement to be able to develop new ideas and new approaches to their testing challenges. Attending conferences and peer workshops, and reading blogs and articles, are all wonderful ways to become a better thinking tester.

If more software testers push toward better testing, then the big ship might slowly start to turn away from bad testing. In general, testers will gain more respect, be more useful to their projects, and be able to enjoy their jobs a lot more.

A Guide to Peer Conference Facilitation

Peer Conference Facilitation

Have you ever wondered what it takes to be a facilitator of LAWST-style peer conferences? Are you interested in knowing how the facilitators keep track of so many threads? How they decide who gets to speak next? Well, in this post I hope to shed some light on the “secrets” of peer conference facilitation.

I can cover the administrative aspects of facilitating in this post, but I cannot cover the methods that I use to control the room, keep people from getting out of control, handle difficult participants, use tone & humour, or the full details of the way I keep track of threads and the stack. This is simply not the correct forum to convey that information. I have found the only way to teach those elements is by personally mentoring someone during an actual conference. If you are very interested in being mentored in facilitation and you have attended at least one (preferably more) LAWST-style conference, then please contact me directly via email. I have mentored four people so far: Eric Proegler, Nick Wolf, Simon Schrijver, and Raymond Rivest. I am confident that any of these four could facilitate a LAWST-style peer conference with up to 20 people (for Eric, up to 25 people).

The majority of the conferences that I facilitate are LAWST-style peer conferences. There is a very strict set of facilitation rules for these workshops – mostly implemented to keep bad experiences/situations from occurring again. Many of these “bad experiences” have involved my friends (James Bach and Doug Hoffman, to name two). As a result of their influences we have developed an awesome set of rules that creates a wonderful situation for learning from and sharing with our peers. I will go through a typical peer conference chronologically and identify the important aspects as we go.

The Opening of the Conference

Introductions

At the beginning of the conference the facilitator should welcome everyone and introduce the content owner and the facilities prime, with a brief explanation of those roles as well as their own role as facilitator. I typically summarize the role of the facilitator as the “air traffic controller”: no one is allowed to speak without first getting permission from the facilitator (in the same way planes can’t take off without permission from air traffic control). So, essentially, the facilitator controls who is allowed to address the room at any particular time. The content owner attempts to keep the meeting “on theme”. They do this by selecting the order of presenters, by asking “focusing questions” (sometimes, immediately following an experience report, the content owner will ask some questions to help relate the story back to the theme), and by identifying discussions that have moved too far away from the theme of the meeting. Finally, the facilities prime is the one who looks after the room and explains where the restrooms/water/snacks are located, any rules regarding escorts (if applicable), food (if applicable), and technical issues. Note that any of these roles can be split among more than one person – although there should only be one “active” facilitator at any time.

Except for administration points (location of restrooms, escorts, etc.) there should not be any other discussion at this point of the meeting.

The facilitator should then review the schedule. I like to start at 9:00am and then follow a cycle of 1 hour of meeting followed by a 15 minute break. This allows for an hour break for lunch (from 12:30-1:30) and finishing at 5:00pm. You can adjust your schedule however you like, but I strongly recommend at least one 15 minute break for every hour of meeting because the breaks allow networking and discussions that are very valuable to attendees.

The IP Agreement

The next part of the meeting is quite dry but very necessary: you need to read the Intellectual Property agreement. I also project the IP agreement onto the screen and mention where it can be downloaded. Briefly, the IP agreement states that whatever is presented or comes out of discussions can be used by anyone at the meeting (with proper attribution). If something interesting comes out of a discussion, then it is not necessary to attribute it to any one person; all attendees are to be listed as contributing. The full IP agreement used at WOPR conferences is located here. Once the IP agreement has been read out loud, every participant must agree to it by saying out loud “I agree.” If someone does not agree to the IP agreement, then they should be asked to leave the conference. You should not need to worry: in the over 35 conferences that I have facilitated, no one has ever refused to agree to the IP agreement.

Check In

The next item on the agenda is the “check in”. Everyone at the workshop has a minute or two to introduce themselves (where they work, what they do, years of experience, how many LAWST-style conferences they have attended, etc.). They can also mention specific information they hope to get out of the conference. This is also an opportunity to mention anything that might be “burning in their mind” (this idea was suggested by Scott Barber) which may be causing distractions during the conference. I have heard items such as “I am waiting to hear back on a new job I applied for”; “My child is quite sick at home”; “I just sold my company yesterday”; “I have an interview tomorrow at 10:00am, so I will have to sneak out at that time”. By mentioning it here, the other participants are aware of why that person might be behaving differently.

Have the content owner check in last so that, after their check-in, they can immediately review the theme of the workshop.

Review The Rules For Questions During Presentations

I introduced the use of “Just in Time Facilitation” at LAWST-style conferences, which means that I do not explain all the rules at the beginning of the meeting. Instead I explain the rules for each element of the meeting as it arrives. This has reduced the boring opening by about 20 minutes and allows the participants to hear the rules just before they are required (for better retention).

At this time of the workshop I review that only “clarifying questions” are allowed during the time the speaker is telling their story. These are questions that are needed to help you understand the story. (e.g.: What does that acronym stand for? What year did this happen? Etc.) Questions that start with “Why” are rarely clarifying questions. “Open Season” questions will be allowed once the story has been told by the presenter.

At this time I have not yet handed out the K-Cards and I will not until the first presenter has finished telling their story. It is not time to explain their use so I do not distribute them.

During The Presentation

If someone has a clarifying question, they can simply raise their hand. I normally allow the speaker to recognize clarifying questions if they notice the raised hand before me. If the question asked is not a clarifying question then the facilitator must stop the question and ask the participant to keep that question for “open season”.

After The Story Has Been Told

Once the presenter has finished telling their story then the K-Cards can be distributed. I sometimes hand them out during the last portion of the presentation or ask someone to help give out the cards while I explain their use. Regardless, the cards need to be handed out in a manner that does not disturb the presenter.

There are currently five colours of K-Cards (I strongly recommend using bright/neon colours):

  • Green: Please place my name on the “new thread” stack
  • Yellow: Please place my name on the “same thread” stack
  • Red (or pink): Oooh! Oooh! I must speak now! Please put me at the very top of the stack. If this card is used too often by a participant then you should take the card from them. This card should also be used for important issues that impact someone’s ability to understand/participate (e.g.: I can’t hear you).
  • Blue (or purple): I feel this thread is becoming a “Rat Hole”. The facilitator and/or content owner are the only two who can actually kill a thread because they feel it is a rat hole. A rat hole is: a discussion that is going nowhere; a discussion that is only engaging two or three people in the room; or the start of a discussion that has previously proven to be unproductive (e.g.: What is the definition of a test case?).
  • Orange (new): I agree with what the speaker is saying. You can consider this as a “Like” or a “+1” card. This card is special because it is the only card not intended to notify the facilitator but instead to show support for the speaker. This is a new card (suggested by Eric Proegler) and time will tell if it has the staying power to permanently join the K-Card family.

Open Season

Once the cards have been explained to the group then the facilitator opens the floor for “open season”. Participants who have a question to ask must hold up a green card. The facilitator writes down the names of those people and selects one of them to ask the first question. If during the ensuing discussion someone in the room wants to add a comment or ask a question on that “thread” then they must hold up a yellow card.

Managing The Threads

The facilitator works through each person who wants to comment on the current thread before going back to select another new thread. This means that the queue is not a FIFO (first-in-first-out) queue but instead is primarily based on following threads.

If there are multiple people on the same stack, then the facilitator decides who speaks next – not necessarily based on the order the cards were held up. I advocate choosing the person who has spoken the least to speak next, leaving the heavy talkers on the stack for a while longer. By doing this I try to equalize the “air time” of everyone.

Be aware that some participants will try to jump to the top of the stack by timing when they hold up their yellow card. If their comment/question is not on the current thread, then you must stop the discussion and ask them to hold up a green card instead. If it happens more than once, then I suggest explaining the methodology to them again at a break.

The same-thread stack can go multiple levels deep. If someone raises a yellow card during the discussion of another yellow card comment/question, then a deeper-level “same thread” should be created. It is important that the facilitator follow the discussion carefully and decide which thread to place the new person on based on what was being said when they raised their yellow card. The deepest I have been is six levels; I recommend “capping” the discussion at four or five levels. Once the stack is back to the top of the original “same thread”, open the discussion to more yellow cards again. Stack management is one of the more difficult aspects of facilitating.

The facilitator should “review the stack” about every three to five speakers to let the group know who is on each level of the stack and who is on the “new thread” list. Doing this helps new participants know that their card has not been forgotten – or if it has then they can let the facilitator know about it.
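For the programmers in the audience, this bookkeeping can be modelled as a small data structure. The sketch below is only an illustration of the idea: real facilitation happens on a clipboard, the class name is invented, and choosing who speaks within a level is the facilitator’s judgment call (favouring quieter participants), not the simple first-in order used here.

    class OpenSeasonStack:
        """Toy model of K-Card thread tracking."""

        def __init__(self):
            self.new_threads = []  # green cards, each starting a new thread
            self.levels = []       # yellow cards: one list per nesting level

        def green(self, name):
            self.new_threads.append(name)

        def yellow(self, name):
            if not self.levels:        # first yellow opens level one
                self.levels.append([])
            self.levels[-1].append(name)

        def go_deeper(self):
            # A yellow raised during another yellow's discussion nests a
            # new level; in practice, cap the depth at four or five.
            self.levels.append([])

        def next_speaker(self):
            # Drain the deepest same-thread level first; only when every
            # level is empty does the next new thread begin.
            while self.levels:
                if self.levels[-1]:
                    return self.levels[-1].pop(0)
                self.levels.pop()
            return self.new_threads.pop(0) if self.new_threads else None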

The discussion is allowed to continue until the stack is empty. I try to have the final question come from a “strong” participant so that the presentation ends on a high note. This is not easy to control as often someone holds up another card while your “strong closer” is talking.

The first presentation is typically the longest of the peer conference. This is often because there is a lot of discussion “on theme” but not necessarily related to the presentation. We have had two opening presentations that lasted the entire first day. That is OK, but not ideal. One way I help shorten the first open season discussion is to ask that questions that are “on theme” but not specifically related to the presentation be held for subsequent presentations.

Don’t forget about taking the breaks. You will often need to break up “open season” with breaks, so you should try to time the start of the break with a logical place in the discussion – which is not always possible. It is better not to break up a presentation, so ask presenters how long they need to tell their story before they start. Try to keep the breaks within 10 minutes of the scheduled time.

Check Out

You should try to reach the end of a discussion with about 15 minutes to go before the stop time of the conference in order to do the “check out” (stopping 10 to 30 minutes before the end is a good range). Sometimes the discussion may need to carry over to the next day, or it may need to be time-boxed (ended at a specific time regardless of the stack size).

At check out, the participants are asked to share some thoughts about how the day went and any particular “takeaways” they collected during the day. They can also share areas where they would like to see more discussion the next day (if there is a next day).

On subsequent days the meeting continues as normal, but you start with a check in. This check in allows the participants to share what they did the night before and any revelations they had upon reflecting on the first day.

There is a lot more to discuss but this is a good start and should be a good reference guide for those who are being asked to facilitate.

The History of K-Cards – a Revolution in Peer Conference Facilitation

Before I can talk about the K-Cards I feel it is necessary to explain how I became a facilitator of peer conferences. If you just want to read about the K-Cards, skip down to that section. In this post, when I refer to a peer conference I am talking about conferences based on LAWST (the Los Altos Workshop on Software Testing).

How I Became a Facilitator

In September of 2004 I was very excited to be attending my third peer conference, WOPR3 (Workshop on Performance and Reliability). I have to admit that one of the main reasons for my excitement toward WOPR3 was that Cem Kaner was going to be the facilitator and I wished to meet this testing legend in person. I had previously attended WOPRs 1 & 2 where I met wonderful, passionate people, learned a lot about performance testing, and thoroughly enjoyed myself. The first two WOPRs were facilitated by James Bach and it was absolutely fantastic to spend three days with him and watch him “work the room.”

Three days before WOPR3 was scheduled to begin we received an email from Cem apologetically informing us that, due to Hurricane Ivan, he was unable to leave Florida and consequently would be unable to attend/facilitate WOPR. Although I am sure that being in Florida with Hurricane Ivan passing directly overhead was a significant event in Cem’s life, Ivan was to become the single event that would shape my testing career more than any other (except perhaps my first meeting with Ross Collard at a testing conference in Ottawa, Canada in 2000, which was the first step in moving my testing “in the right direction”).

Ross Collard, co-founder of WOPR and the content owner of WOPR3, contacted the three other people on the invitee list who had any LAWST-style conference experience. All three of us had attended WOPRs 1 & 2 – but that was it for our experience. I volunteered to facilitate as I felt that it was something that I would be able to do. After some discussion it was decided that we would start with me as the facilitator; then, when I started to struggle, one of the other two would take over – and we would just cycle through the three of us, as required.

I ended up facilitating the entire 3-day workshop, and immediately afterwards Ross asked me to facilitate WOPR4. Since that time I have facilitated over 35 peer conferences/workshops, including all subsequent WOPRs and all CAST conferences (technically not LAWST-style, but very much adapted from it), in addition to many others.

One of the best compliments on my facilitation came during a meeting in 2007 with Cem Kaner and Michael Kelly just before WOC2 (Workshop on Open Certification). During the conversation Cem, seemingly out of the blue, looked at me and asked, “So, what does it feel like to be the most sought after facilitator of peer conferences?” Up until that point I had thought that I was “good” but I had no idea that I had achieved “most sought after” status.

I had never received any formal training in facilitation. I watched James Bach facilitate the first two WOPRs and, later, Scott Barber facilitate two WTSTs (Workshop on Teaching Software Testing). It appeared that I just had a knack for facilitating peer conferences.

OK. That’s enough about me and how I began facilitating.

 

How K-Cards Were Invented

One of the elements that made WOPR among the most “successful” peer conferences was its use of feedback in the first few years. The feedback came in the form of surveys sent after each WOPR, in which participants were asked to provide comments on all aspects of the workshop so we could improve the complete WOPR experience. A lot of excellent improvements have come out of those surveys, and I thank Ross Collard for his huge efforts in creating them, compiling the feedback, and encouraging the organizers to implement improvements.

During WOPR4 one of the “improvements” I felt I had made was the use of hand signs. For example, pointing at the centre of my left hand with my right index finger meant I wanted to talk on the “same thread”. I had signs for “new thread”, “same thread”, and “remove me from the thread list”. I felt that the signs were VERY simple and they definitely helped me keep control of the thread stack without interrupting the meeting.

Just before WOPR5 (which I was hosting in September 2005 at my Alcatel-Lucent office in Ottawa, Canada) I was reviewing some of the feedback from WOPR4. I was shocked to read that one of the participants had found my hand signs “too confusing”. I immediately started thinking that this “anonymous feedback” person must be a few bricks short of a full load. How could someone think my signs were too confusing? I was at a loss as to how to make the situation “less confusing” so I did the logical thing: I complained to my wife, Karen. I wasn’t really seeking her advice; I just wanted to vent my frustrations about the participant who made the comment.

In response to my rant, my wife asked a simple question – a question that has changed the face of peer conferences more than anything else I am aware of in the past 10 years. She said, “Why don’t you just have them hold up coloured cards? A different colour for each action.” I immediately recognized the simple elegance of the solution.

On September 8, 2005 K-Cards were introduced for the first time. The name was suggested by Scott Barber, a co-founder of WOPR. I had told him that my wife had said she did not want credit nor did she feel credit was necessary for the idea. Scott felt that we had to credit Karen for the idea, so he suggested the name “K-Cards” to attribute the idea to her without specifically identifying her as the creator.

The cards were an immediate success. The feedback from WOPR5 was overwhelmingly positive. Scott Barber used the K-Cards for the second time at a conference he facilitated in January 2006. Since then the K-Cards have been used in well over 50 peer workshops and conferences. Now, only 7 years since their introduction, I find it hard to imagine a facilitated conference without them.

Much to my surprise, as of September 2012 the original four cards and their colours are still being used:

  • Green: Please place me on the new thread list
  • Yellow: Please place me on the same thread list
  • Red (or pink): Oooh, oooh, I must speak now (or an important admin issue, e.g., “I can’t hear”)
  • Blue (or purple): I feel this discussion is becoming (or has become) a rat hole. (This one is not used at larger conferences.)

There was a very brief period (at two CAST conferences) where the yellow and green cards were mistakenly switched, but we have since managed to fix that.

 

A New K-Card Colour is Born (Maybe)

In September 2012, at WOPR19, we decided to try a new K-Card. Eric Proegler, one of the WOPR organizers, suggested the new card.

We tried an orange card, which was to be held up as a sign of agreement with what someone was saying. Eric thought of it as a “Like” or a “+1” card. Although I can’t say that the orange card was a raving success on its first use, there were some good aspects to it. I personally found it a little distracting when the card was held up, as it is the only card that is not a signal to the facilitator but rather a sign of agreement to the speaker. The participants also didn’t use the cards as much as I thought they would. Despite the limited success of the card, I think we will try it again at WOPR20. Perhaps it would be more applicable at larger conferences like Let’s Test or CAST. Only time will tell.

Bad Metrics

I have talked about and against bad metrics during my Rapid Software Testing courses and at conferences for a couple of years now. The “Lightning Talk” metrics rant that I did at CAST 2011 is available on YouTube, although it is very “rough”. I felt that it was time I cleaned up my thoughts on this subject, and so I have created my first blog post.

Earlier this year I was very flattered to be the keynote speaker at the KWSQA conference, where my topic was “Bad Metrics”. I felt that this was not enough for a keynote so I added: “and what you can do about it”. For this post I will just cover the bad-metrics aspect and leave the discussion of alternatives for a later post. I don’t want to run out of blog ideas too quickly. 🙂

Characteristics of Bad Metrics

I have identified four characteristics of “bad metrics”, in addition to “needing to serve a clear and useful purpose”. I do not expand much on that last point because many organizations feel that their metrics already provide clear and useful information with which they can make decisions about the quality of their product; simply stating that their metrics do not meet this description is of little use. The problem is that they are typically unaware that their metrics are not only unclear, but are probably also causing highly undesirable behaviour within their teams.

1. Comparing elements of varying sizes as if they are equal

What is a test case? If you ask 10 testers you will likely receive 10 different answers. How long does it take to execute a test case? Again, this question does not have a simple answer. The effort could vary from less than a minute to over a week – yet when tracking progress many companies count test case completion without differentiating the effort required. If you work at a company that still counts test cases and their execution rate, and you are looking to instigate change, then you can ask your management this:

“How many containers do you need for all of your possessions?”

They will likely answer that it depends on the size of the container – and then you can say “Exactly!” I have heard the argument “we are tracking the average execution time”, but that has issues, too. The easy stuff tends to happen first and goes quickly. Then the slower, harder stuff comes along, and before you realize what is happening the test team is holding up the release.

Heck, I’ll admit it. I used to track test case completion. I created pretty charts that showed my progress against my plan. All the quick and easy tests (including automated tests, which counted the same as manual tests) were executed first – giving the illusion that we were well ahead of schedule. Unfortunately, when the harder, slower tests were the only tests left, the executives felt that we had started to fall behind.
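To put rough numbers on that illusion, here is a back-of-the-envelope calculation (the suite sizes and times are invented for illustration):

    # Hypothetical suite: 80 quick checks (5 minutes each) run first,
    # then 20 slow scenarios (4 hours each).
    quick, slow = 80, 20
    quick_min, slow_min = 5, 240

    total_effort = quick * quick_min + slow * slow_min    # 5,200 minutes

    # Progress after finishing only the quick tests:
    by_count = quick / (quick + slow)                     # 0.80
    by_effort = (quick * quick_min) / total_effort        # ~0.077

    print(f"By test-case count: {by_count:.0%} complete")   # 80% complete
    print(f"By actual effort:   {by_effort:.1%} complete")  # 7.7% complete

The chart says the team is nearly done; the calendar will soon say otherwise.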

2. Counting the number of anything per person and/or team

On the surface this one sounds like it isn’t a problem. Why not measure how many bugs each tester (or team) raises? You will be able to identify your top performers, right? Wrong! Metrics like “number of bugs raised per tester” or “number of test cases executed per tester” cause competition, which decreases teamwork within teams and information sharing between them.

Imagine the situation where the amount of my annual raise depends on my bug count (regardless of whether I’m being measured on sheer volume of bugs, or hard to find bugs, or severity 1 bugs). One day I will be happily testing and I will find an area that has a large number of bugs. Will I help the company by immediately mentioning to my manager that I found this treasure chest of bugs? Perhaps, but more likely I would mention it only after I have spent a few days investigating and writing bug reports thus securing myself a nice raise. Wouldn’t the company benefit more if the tester shared the information sooner rather than later?

What if I had a cool test technique that frequently found interesting bugs? Would I share that technique with others and potentially let them find bugs that I would otherwise find myself? Or would I keep the technique to myself to help me outperform my peers?

If members of a team are measured on the number of test cases they each execute per week, then some testers will probably execute the quick and easy tests to pad their numbers instead of the tests that make the most sense to execute.

Will testers take the extra time required to investigate strange behaviour if that means they will fall behind in their execution rate? Likely not, thus leaving more bugs undiscovered in the product.

3. Easy to game or circumvent the desired intention

Making a metric into a target will cause unwanted behaviours. If you have a target of a “95% pass rate” before you can ship, then your teams will achieve that pass rate no matter how much ineffective testing they have to perform to meet the target. I used to think that the pass rate was a good metric until I had a discussion with James Bach about 8 years ago. We had a 10-minute discussion in which I tried my hardest to defend the validity of the coveted “pass rate” metric. Here is a brief summary of the conversation:

If you had a pass rate of 99.9% but the one failure caused database corruption and occurred fairly regularly, would you ship the product? (“No”, I answered.) OK. What if you had an 80% pass rate but all the failures were minor and your customers could easily accept them in the product (obscure features, corner cases, etc.) – would you ship the product? (“Probably”, I answered.) So, what difference does it make what the pass rate is? (“We use it as a comparison of the general health of the product”, I attempted.) That is nonsense; the previous questions still apply. If the pass rate dropped but all the new failures are minor inconveniences, what difference does it make? If the pass rate climbed by 5% by fixing minor bugs, but a new failure was found that caused the product to crash, is the product getting better? Why not just look at the actual list of bugs and your product coverage and ignore the pass rate?

I really felt that the metric was good before that conversation. Now I am invited to talk at conferences to help spread the word about bad metrics.
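If it helps to see that argument with numbers, here is a small illustration (the builds and their counts are invented) of why the same pass-rate arithmetic can hide completely different risks:

    # Two hypothetical builds, each with 1,000 checks.
    build_a = {"passed": 999, "failures": [("database corruption", "critical")]}
    build_b = {"passed": 800, "failures": [("obscure corner case", "minor")] * 200}

    for name, build in (("A", build_a), ("B", build_b)):
        total = build["passed"] + len(build["failures"])
        rate = build["passed"] / total
        severities = {sev for _, sev in build["failures"]}
        print(f"Build {name}: pass rate {rate:.1%}, severities: {severities}")

    # Build A: pass rate 99.9%, severities: {'critical'}  -> do not ship
    # Build B: pass rate 80.0%, severities: {'minor'}     -> probably ship

The number by itself points you toward the wrong decision in both cases; the failure list is what actually matters.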

4. Single numbers that summarize too much information for executives (out of context)

Some companies require 100% code coverage and/or 100% requirements coverage before they ship a product. There can be some very useful information gathered by verifying that you have the coverage you are expecting. These metrics are very similar to testing in general: we cannot prove coverage (prove there are no bugs) but we can show a lack of coverage (find a bug). These metrics may help the test team identify holes in their coverage, but they cannot show the absence of holes. For example, a single test may touch multiple requirements but only a small portion of requirement “A”. As long as that test is executed, the requirements coverage will show that requirement “A” has been tested, but if no other tests have been executed against requirement “A” it is actually not being tested very well at all. This fact is hidden when the coverage is summarized into a single number taken out of context.

If a product is tested to 100% code coverage, that only means each line of code was executed at least once. That can provide useful information to the test team, much in the way that designers find information in their code compiling without warnings. There is some merit in the test team seeing that they have executed every line of code, but the code also needs to be executed with extreme values, different state models, and varying hardware configurations (to name only a few variants).
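A tiny, contrived example makes the point; the function and the checks below are invented for illustration:

    def days_in_month(month, year):
        if month == 2:
            return 28          # bug: leap years are ignored entirely
        if month in (4, 6, 9, 11):
            return 30
        return 31

    # These three checks execute every line, so a coverage tool reports 100%:
    assert days_in_month(2, 2023) == 28
    assert days_in_month(4, 2023) == 30
    assert days_in_month(1, 2023) == 31

    # Yet days_in_month(2, 2024) still wrongly returns 28, and inputs like
    # month=0 or month=13 were never tried - the "100%" says nothing about them.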

When executives see “100% code coverage, 100% requirement coverage, and a 99% pass rate” they will likely feel pretty good about shipping the product. The message they think they are seeing is that the product was very thoroughly tested, but that may not be the case.

Coverage metrics can be useful to the test team to show them areas that may have been missed completely but will not replace other means of determining the coverage of their testing.

Summary

I hope this post will help some people explain to their management just why some (most) metrics are misleading and can cause unwanted and unexpected actions by their test teams.

Instead of using traditional metrics why not look at:

  • the actual list of open defects
  • an assessment of the coverage
  • progress against planned effort (not test cases but actual effort)

There will be more to come in a future blog on the alternatives to the typical bad metrics commonly used today.
