Bad Metrics
I have talked about, and against, bad metrics during my Rapid Software Testing courses and at conferences for a couple of years now. The “Lightning Talk” metrics rant that I did at CAST 2011 is available on YouTube, although it is very rough. I felt that it was time I cleaned up my thoughts on this subject, and so I have created my first blog post.
Earlier this year I was very flattered to be the keynote speaker at the KWSQA conference, and my topic was “Bad Metrics”. I felt that this was not enough material for a keynote, so I added “and what you can do about it”. For this post I will just cover the bad metrics aspect and leave the discussion of alternatives for a later post. I don’t want to run out of blog ideas too quickly. 🙂
Characteristics of Bad Metrics:
I have identified four characteristics of “bad metrics”, in addition to “needing to serve a clear and useful purpose”. I do not expand much on that point because many organizations feel that their metrics already provide clear and useful information with which they can make decisions about the quality of their product, and simply telling them otherwise is of little use. The real problem is that they are typically unaware that their metrics are not only unclear, but are probably also causing highly undesirable behaviour within their teams.
1. Comparing elements of varying sizes as if they are equal
What is a test case? If you ask 10 testers, you will likely receive 10 different answers. How long does it take to execute a test case? Again, that question has no simple answer: the effort could vary from less than a minute to over a week. Yet when tracking progress, many companies track test case completion without differentiating the effort required. If you work at a company that still counts test cases and test case execution rates, and you are looking to instigate change, you can ask your management this:
“How many containers do you need for all of your possessions?”
They will likely answer that it depends on the size of the container – and then you can say “Exactly!” I have heard the argument “we are tracking the average execution time”, but that has issues too. The easy stuff tends to happen first and goes quickly; then the slower, harder stuff comes along, and before you realize what is happening the test team is holding up the release.
Heck, I’ll admit it: I used to track test case completion. I created pretty charts that showed my progress against my plan. All the quick and easy tests (including automated tests, which counted the same as manual tests) were executed first, giving the illusion that we were well ahead of schedule. Unfortunately, once the harder, slower tests were the only tests left, the executives felt that we had started to fall behind.
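To make this concrete, here is a small, purely hypothetical sketch (the test names and effort estimates are invented for illustration) of how “percent of test cases completed” can paint a much rosier picture than “percent of effort completed”:

```python
# Hypothetical, made-up numbers: each tuple is (test name, estimated effort in hours, completed?).
tests = [
    ("smoke-login", 0.5, True),
    ("smoke-search", 0.5, True),
    ("automated-regression-suite", 1.0, True),      # counted the same as a manual test
    ("upgrade-from-previous-release", 16.0, False),
    ("failover-under-load", 40.0, False),
]

completed = [t for t in tests if t[2]]
by_count = 100 * len(completed) / len(tests)
by_effort = 100 * sum(t[1] for t in completed) / sum(t[1] for t in tests)

print(f"Progress by test case count: {by_count:.0f}%")   # 60% -- looks well ahead of schedule
print(f"Progress by estimated effort: {by_effort:.0f}%") # ~3% -- the hard work has barely started
```

Same test run, two very different stories, depending on whether the containers are all assumed to be the same size.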
2. Counting the number of anything per person and/or team
On the surface this one sounds like it isn’t a problem. Why not measure how many bugs each tester (or team) raises? You will be able to identify your top performers, right? Wrong! Metrics like “number of bugs raised per tester” or “number of test cases executed per tester” cause competition, which reduces teamwork within teams and information sharing between them.
Imagine a situation where the size of my annual raise depends on my bug count (regardless of whether I am measured on sheer volume of bugs, hard-to-find bugs, or severity 1 bugs). One day I will be happily testing and will find an area that has a large number of bugs. Will I help the company by immediately mentioning to my manager that I have found this treasure chest of bugs? Perhaps, but more likely I would mention it only after I had spent a few days investigating and writing bug reports, thus securing myself a nice raise. Wouldn’t the company benefit more if the tester shared the information sooner rather than later?
What if I had a cool test technique that frequently found interesting bugs? Would I share that technique with others and potentially let them find bugs that I would otherwise find myself? Or would I keep the technique to myself to help me outperform my peers?
If members of a team are measured on the number of test cases they each execute per week, then some testers will probably choose to execute the quick and easy tests to pad their numbers instead of the tests that make the most sense to execute.
Will testers take the extra time required to investigate strange behaviour if that means falling behind in their execution rate? Likely not, thus leaving more bugs undiscovered in the product.
3. Easy to game or circumvent the desired intention
Making a metric into a target will cause unwanted behaviours. If you have a target of a “95% pass rate” before you can ship, then your teams will achieve that pass rate no matter how much ineffective testing they have to perform to meet it. I used to think that looking at pass rate was a good metric until I had a discussion with James Bach about 8 years ago. We had a 10-minute discussion in which I tried my hardest to defend the validity of the coveted “pass rate” metric. Here is a brief summary of the conversation:
If you had a pass rate of 99.9% but the one failure caused database corruption and occurred fairly regularly, would you ship the product? (“No”, I answered.) OK, what if you had an 80% pass rate but all the failures were minor and your customers could easily accept them in the product (obscure features, corner cases, etc.), would you ship the product? (“Probably”, I answered.) So what difference does it make what the pass rate is? (“We use it as a comparison of the general health of the product”, I attempted.) That is nonsense; the previous questions still apply. If the pass rate dropped but all the new failures were minor inconveniences, what difference would it make? If the pass rate climbed by 5% because minor bugs were fixed, but a new failure was found that caused the product to crash, is the product getting better? Why not just look at the actual list of bugs and your product coverage, and ignore the pass rate?
I really felt that the metric was good before that conversation. Now I am invited to talk at conferences to help spread the word about bad metrics.
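If a small illustration helps make James’ point, here is a hypothetical sketch (the counts, failure descriptions, and severities are invented) showing how two very different situations can hide behind a single pass-rate number:

```python
# Hypothetical results illustrating why a raw pass rate says little on its own.
# Each failure carries a severity; the pass-rate calculation ignores it entirely.
results_a = {"passed": 999, "failed": [("database corruption on save", "critical")]}
results_b = {"passed": 80,  "failed": [("tooltip typo", "minor")] * 20}

def pass_rate(results):
    total = results["passed"] + len(results["failed"])
    return 100 * results["passed"] / total

print(f"Run A: {pass_rate(results_a):.1f}% pass rate")  # 99.9% -- but would you ship it?
print(f"Run B: {pass_rate(results_b):.1f}% pass rate")  # 80.0% -- yet every failure is minor
```

The list of actual failures tells you something; the percentage on its own does not.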
4. Single numbers that summarize too much information for executives (out of context)
Some companies require 100% code coverage and/or 100% requirements coverage before they ship a product. There can be some very useful information gathered by verifying that you have the coverage you are expecting. In this respect these metrics are very similar to testing in general: we cannot prove coverage (prove there are no bugs), but we can show a lack of coverage (find a bug). These metrics may help the test team identify holes in their coverage, but they cannot show the absence of holes. For example, a single test may touch multiple requirements but only a small portion of requirement “A”. As long as that test is executed, the requirements coverage will show that requirement “A” has been tested, but if no other tests have been executed against requirement “A”, it has actually not been tested very well at all. This fact is hidden when the coverage is summarized into a single number taken out of context.
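As a hypothetical sketch of that point (the requirement IDs and test names are invented, and this is not how any particular tool computes it), one broad test can push the summary number to 100% while the per-requirement detail tells a different story:

```python
# Hypothetical traceability data: which tests touch which requirements.
tests_to_requirements = {
    "test_checkout_end_to_end": ["REQ-A", "REQ-B", "REQ-C"],  # touches REQ-A only superficially
    "test_b_boundaries": ["REQ-B"],
    "test_c_error_handling": ["REQ-C"],
}
all_requirements = {"REQ-A", "REQ-B", "REQ-C"}

# The single summary number an executive sees:
covered = {req for reqs in tests_to_requirements.values() for req in reqs}
print(f"Requirements coverage: {100 * len(covered & all_requirements) / len(all_requirements):.0f}%")  # 100%

# The detail it hides: how many tests actually exercise each requirement.
for req in sorted(all_requirements):
    count = sum(req in reqs for reqs in tests_to_requirements.values())
    print(f"{req}: {count} test(s)")  # REQ-A is "covered" by only the one broad end-to-end test
```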
If a product is tested to 100% code coverage, that only means that each line of code was executed at least once. That can provide useful information to the test team, much in the way that designers find information in their code compiling without warnings. There is some merit in the test team seeing that they have executed every line of code, but the code also needs to be executed with extreme values, different state models, and varying H/W configurations (to name only a few variants).
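As a hypothetical illustration (the function and test below are invented for this post), a single test can give a function 100% line coverage while an “extreme” input still breaks it:

```python
# One test executes every line of this function, so a coverage tool reports 100%.
def average_response_time(total_ms, request_count):
    return total_ms / request_count

def test_average_response_time():
    assert average_response_time(100, 4) == 25  # every line executed -> 100% line coverage

# Yet average_response_time(0, 0) still raises ZeroDivisionError.
# The coverage number says nothing about extreme values like this.
```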
When executives see “100% code coverage, 100% requirements coverage, and a 99% pass rate” they will likely feel pretty good about shipping the product. The message they think they are seeing is that the product was very thoroughly tested, but that may not be the case.
Coverage metrics can be useful to the test team by showing them areas that may have been missed completely, but they will not replace other means of determining the coverage of their testing.
Summary
I hope this post will help some people explain to their management just why some (most) metrics are misleading and can cause unwanted and unexpected actions by their test teams.
Instead of using traditional metrics why not look at:
- the actual list of open defects
- an assessment of the coverage
- progress against planned effort (not test cases but actual effort)
There will be more to come in a future blog on the alternatives to the typical bad metrics commonly used today.
Hey Paul,
Very much enjoyed your post and the video! I’m not really against metrics (as a concept), but I am on my toes when people suggest dangerous metrics. I think management/clients can ask for what they want, but I am there to advocate against the dangerous ones. Ultimately, it’s their choice to ask for what they want, but we need people to educate them about the dangers and problems.
Looking forward to the follow-up on this. I’ve had my own ideas for it, and it seems you also have something more to share. A very hot topic for testers all around!
Best regards,
Jari
This is a great post. I wish I had had this four or five years ago; it would have made my life so much easier.
I am pretty lucky at this point. My company doesn’t use metrics for too much anymore. The one we use, “% comfort”, is expressed in conversations so it can be questioned and explained (rather than simply plastered on an exec’s dashboard). We derive % comfort from the individual tester’s/test manager’s experience. It has (so far) worked out really well for us.
Hi Paul – great stuff! You are right, one has to be very careful about what one measures as it can drive undesired behaviour. I wrote a blog on a similar subject the day after you! Here it is – much shorter – but hopefully of some use! Cheers, Tony.
http://doogleonline.blogspot.co.uk/2012/11/how-to-measure-effectiveness-of-your.html
Nice post.
I especially agree with bullet 2 since I have experienced this first hand. It led to hastily written, poor-quality bug reports, less communication, and also people testing outside their area (running other people’s areas) so they could find the bugs first.
Looking forward to the next blog.
BR Andreas
Very good stuff, Paul. All of those can be temptingly easy to use, and the trap isn’t always obvious. Guilty as charged on using the pass rate to track testing progress (though it was early on in my testing career, so I’m claiming ignorance as a defence). A few blog posts from Mr. Bach changed that thinking, and my manager was receptive to the fact that tests run/passed/failed wasn’t a particularly good indicator of what state the software was in.
These days my team (dev, test, BA) reviews the outstanding bugs near the end of the test window while I review the test effort using/updating a mind map “test plan” (which is an interesting topic in its own right).
Hi Paul
Happy to see you writing! And I liked the container question. I think I am going to steal it and use it from time to time – of course attributing it to you.
I think the root of the problem is one of people’s weaknesses – the wish to believe anything that is quantified. Metrics provide handy quantifications, and data validity is not questioned.
It might be a good idea to stack some copies of ‘Measuring and Managing Performance in Organizations’ for emergency handouts.