Let's say you don't know anything about software testing. It makes sense that, to figure out status, you should be able to reduce the work to a set of basic primitives - say, test cases. Then you can count the total number of test cases and the number remaining, do some simple arithmetic, and come up with the percentage complete of the project, right?
A little more arithmetic, and we can come up with a predicted end date for the project.
Likewise, if we have another primitive - say, a bug report - and a field in the bug tracker to indicate whether the bug was discovered pre- or post-release, we can calculate our defect detection percentage - DDP.
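If that sounds abstract, the arithmetic is something like the sketch below; every number and variable name here is invented for illustration, not drawn from any real project.

```python
# A minimal sketch of the "simple arithmetic" described above.
# All figures are made up for illustration.
from datetime import date, timedelta

total_test_cases = 400
test_cases_remaining = 120
percent_complete = 100 * (total_test_cases - test_cases_remaining) / total_test_cases  # 70.0

# "A little more arithmetic" for a predicted end date,
# assuming the team keeps executing cases at the recent pace.
cases_per_day = 20
predicted_end = date.today() + timedelta(days=test_cases_remaining / cases_per_day)

# Defect Detection Percentage: bugs found before release,
# divided by all bugs found before and after release.
bugs_pre_release = 190
bugs_post_release = 10
ddp = 100 * bugs_pre_release / (bugs_pre_release + bugs_post_release)  # 95.0

print(f"{percent_complete:.0f}% complete, projected done {predicted_end}, DDP {ddp:.0f}%")
```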
With just a few of these "hard" measurements in place, we can manage our testing - even whole development projects - quantitatively, accomplishing the goals of CMMI level four, possibly five.
No, really, you can. I interviewed the CMMI product manager, J. Michael Phillips, in December of 2009, and he said so. The core metrics he recommended were something like time, features, productivity (source lines of code), and defects. If you want advice on how to perform this kind of best practice, you need only look to the latest issue of Professional Tester Magazine.
What follows is a true story.
The Great CD Debacle
Not too many years ago I was hired into a remote office of a Fortune 200 company we'll call BigCo. BigCo has many businesses, but the purchasing organization created a sort of yellow pages for construction materials - big, thick, multi-volume books that listed every product you could think of, and how to buy it. BigCo bought this much smaller product company, which I will call TinyDiv, in order to get technology to create and compile electronic catalogs onto a CD. The general idea was to ship compact discs instead of its thick, multi-volume sales catalogs. (Yes, folks, this was before the web was popular.)
The project was quantitatively managed, at least in that the director of software development had goals for budget and schedule.
Then Microsoft did something funny; they released Windows 98. (It might have been NT; it was over a decade ago.) Now the outsourced testing company was happy to put Windows 98 on the list of operating systems to be tested - but doing so would have destroyed the schedule and the budget, costing the executives a few bonuses.
The Director of Development took a pen, marked a line through Windows 98, and said "I will have our team QA that operating system."
Only, he didn't tell anyone that they, specifically, were responsible for testing it.
The CD did not just fail to install under the newest windows.
It did not just cause a blue screen.
It corrupted the hard drive such that the machine was unable to boot.
The result? The software was recalled. Within a year, corporate HQ moved the work the CD-product team had been doing to other offices. By the time I was hired, all that remained was the original TinyDiv, still working on its original product, which worked and made money.
In my first week, two of my teammates took me upstairs to show me an entire floor of the Comerica building in downtown Grand Rapids, empty. "The CD people used to work here," they said. "If you see anything you want, take it. The lease is up in a month and it's going to the scrap yard."
How do I know about the project? Because I overheard people talking to the lawyers from corporate who wanted to sue the QA company, and I heard the developers saying things the lawyers did not want to hear.
But gosh, what a great defect detection percentage! Look at how many bugs QA found versus how many were found in the field! We only found one serious bug in the field! (Well, we might have found more, but we had to recall the product.)
Hopefully, it's obvious from the story above that DDP just doesn't make much sense; in order to measure quantitatively, you've got to be comparing apples to apples. At the very least, you'd have to factor in some kind of defect severity, possibly including how often we expect users to encounter the defect. And what these really are, are guesses that we'll plug into a formula. Even with severity, it's unlikely that a "sev one" is exactly five times as bad as a "sev five", or that five "sev fives" equal a "sev one" - but a simplistic formula will come to that conclusion. And, just as obviously, in the example from Professional Tester Magazine (on page seven), if more bug reports are good, then we're likely to get more bug reports. Some of those could have been handled by talking directly to the developers, or we may get a different bug report for every typo out of a list of 100 on the help screen. And YES, I once knew a developer who was evaluated by the number of change controls he put in, so in order to move fifteen files that were essentially one change, he put in - you guessed it - fifteen change control requests.
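To make the severity problem concrete, here is a hypothetical weighting of the kind a simplistic formula implies; the weights are invented, and the point is the conclusion the formula reaches, not that anyone should adopt it.

```python
# A hypothetical severity weighting, where severity 1 is the worst.
# The weights are made up to mirror the "five sev fives equal a sev one" logic.
WEIGHTS = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

def weighted_defect_score(severities):
    """Add up the weights for a list of defect severities."""
    return sum(WEIGHTS[sev] for sev in severities)

# Five trivial typos on a help screen ...
print(weighted_defect_score([5, 5, 5, 5, 5]))  # 5
# ... score exactly the same as one data-corrupting crash.
print(weighted_defect_score([1]))              # 5
```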
But where does this desire for metrics come from?
1) Lack of trust.
The manager says, "I need a week," and his stakeholders do not believe him. So he pulls out metrics. Or perhaps he needs three weeks and his stakeholders want it in one. Or perhaps he thinks the test team is doing well and wants bigger raises: "Look, we found more bugs this year." Or, just perhaps, the outsourcing company wants to try to prove its value.
2) Desire for control.
Simplistic measures promise to make management easy. After all, all a manager should have to do is look at a spreadsheet every once in a while, and if he sees green, everything is fine. If there's yellow or red, call the direct reports, demand status, ask what they are doing about it, and check back in a few days. The problem is, the measures can't deliver on that promise. The organization might be spiraling out of control, but the report is all green. (Anyone who worked at a big organization in the 1990s, consider my servant the Gantt chart. That was all about creating the illusion of control now, wasn't it?)
3) Lack of understanding.
Have you ever wondered what the purpose of grades is in school? The teacher doesn't need them; he knows how well the students are doing. The grade "B+" is actually a lossy abstraction - it lumps the student who has mastered the material but never does homework in with the one who tries hard but always misses the harder problems. It turns out that grades are a benefit to the parents, the administration, and the college entrance people, who aren't in the classroom and need some idea of what is going on. In our home school, we don't give our students grades; we actually know how well they are doing.
A few alternatives I have tried and had success with:
1) If you don't know what's going on in your organization - find out - by actually being involved. The Scrum and XP folks suggest that customers attend the daily stand-up meeting, or that the customer be embedded in the team, both of which I have had success with. Another option is management by walking around.
2) Note my concern is the use of naive metrics as a sum total strategy to figure out "how we are doing". Thus, you look at your metrics for the week, and if things are green, you breathe a sigh of relief and go play golf - or, if they aren't, you call your direct reports, yell and scream, then check back again next week. I do not have a problem with digging into the numbers as an investigatory process, as part of a balanced breakfast.
3) Likewise, I expect that individuals are using metrics every day, in order to figure out dynamics and make plans. These numbers are part of a one-time problem-solving strategy, often thrown away after the fact. DDP used one time - say, comparing this year's numbers to last year's - as part of a balanced breakfast, might not be that terrible. It's when a measure is used repeatedly that the act of observing tends to skew behavior, and we begin to see dysfunction.
4) Earlier I mentioned dysfunction. Keep in mind, you'll tend to get exactly what you measure. If you measure test cases, you'll likely get lots of test cases, and even some productivity - at first. But eventually the team will realize that test cases and productivity are two different things, and find the shortest possible way to get you the test cases. By exploiting this difference, each individual test case will likely provide less value - thus, the assumption that "counting test cases" is roughly equivalent to productivity becomes less and less valid over time. There's a gentleman named Robert Austin, who earned his PhD at Carnegie Mellon studying dysfunction. He concluded that since projects are multi-dimensional, any single metric (even a handful) is likely to leave things un-measured, and teams are likely to steal from the "Peters" that go unmeasured to pay the "Paul" that is measured. The classic example is that if you are measured by time, features, AND defects, you take on technical debt. Austin's book, Measuring and Managing Performance in Organizations, is a classic in its field. His recommendation is to pick a small percentage of projects and do a thorough after-action review, or retrospective, that takes everything into account, and try to take home real lessons from that.
5) When evaluating quality, consider qualitative metrics and rules of thumb, as opposed to hard numbers. This can be as simple as a thumbs up or down "should we ship?" decision crowd-sourced from the team - or at least using that as input for a decision maker (a minimal sketch follows this list). For a detailed analysis of software engineering metrics, consider the classic paper by Kaner and Bond, which ultimately recommends qualitative approaches.
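Here is that minimal sketch of what crowd-sourced "should we ship?" input might look like; the names, votes, and notes are all invented for illustration.

```python
# A toy "should we ship?" poll: each tester gives a thumbs up (True)
# or thumbs down (False), plus a short note. All data here is invented.
votes = {
    "alice": (True,  "exploratory charters on checkout all came back clean"),
    "bob":   (False, "install on a fresh machine still corrupts the config"),
    "cara":  (True,  "performance acceptable on the reference hardware"),
}

thumbs_up = sum(1 for ok, _ in votes.values() if ok)
print(f"{thumbs_up} of {len(votes)} thumbs up")
for name, (ok, note) in votes.items():
    print(f"  {'+' if ok else '-'} {name}: {note}")

# The tally and the notes go to the decision maker as input,
# not as an automatic ship/no-ship verdict.
```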
Putting it together
That leaves us with a small handful of tools - get actively involved with the project, manage by walking around, conduct detailed retrospectives, or use metrics as part of a balanced breakfast to inform, not to convince, evaluate, or control. But what if you really want to use metrics for lightweight control with integrity - say, you have an organizational mandate?
Well, first, you could get a better job. No, really. I'm serious. I'm reluctant to offer advice on how to make a bad idea work. That said ...
I do think organizations can use metrics in a mature and sophisticated way. To do that, I would introduce the metrics in context, as part of a story, including the limits and weaknesses of the approach - for example, when a particular idea should not work, how to figure out why it worked in this instance, and the complex dance of experience and research we used to validate our opinion.
When I see these naive metrics, I smell a theory and ideology that has never been tested; yet another Pied Piper of Hamelin, telling people what they want to hear.
We can, and should, demand better.