Published

Tuesday July 13th 2010 12am

Quality Vs. Quantity - I

Did you know that the first vending machines were not actually mechanized?

Oh, yes, you had the familiar interface of a vending machine. The customer would put money in a slot and pull a lever indicating his purchase.

The difference was that behind the 'wall' of the vending machine there was another person, who would look at the lever, make change, and insert a sandwich into a little slot.

In other words, the art of pretending machines can do more than they are capable of is not new at all. We've actually been faking it for hundreds of years!

There's even a term for this - the mechanical turk.  That term comes from an apparent chess-playing machine in the 18th century that -- surprise, surprise -- was actually driven by a human being.

The whole idea of the mechanical turk brings me to ask the question - why do we care? What would make a machine doing the work of a human so impressive that removing the /appearance/ of a human from a vending machine seems to be a good thing? Does it even make sense that someone could build fame and reputation by creating a fake machine playing a game children learn in grammar school?

Maybe. It would seem that machines are, well ... dumb. Making them capable of solving puzzles, even simplistic puzzles with well-defined rules, was not really possible until the 19th century. Even today, the closest thing we have to artificial intelligence is a game that plays 20 questions, or a Google search result that realizes you spelled a word wrong in your search and suggests a correction.

Both the Google search and the twenty questions game are interesting because they are trained. The Google search might know the dictionary, sure, but it also knows what everyone else has searched for, and can recognize that "Shakespere" is linguistically close to "Shakespeare", yet ten times as many people have searched for the latter term.
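Here is a rough sketch of that idea in Python: suggest a correction when a query is a small edit away from a term that other people search for far more often. The search counts, the `suggest` function, and its `max_distance` and `min_ratio` thresholds are all made up for illustration; this is not a description of how Google actually does it.

```python
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

# Hypothetical search log; the counts are invented for this example.
SEARCH_COUNTS = Counter({"shakespeare": 1000, "shakespere": 100, "shakira": 400})

def suggest(query, counts=SEARCH_COUNTS, max_distance=2, min_ratio=5):
    """Return a much more popular near-miss of the query, or None."""
    query = query.lower()
    own_count = max(counts.get(query, 0), 1)
    candidates = [
        (count, term)
        for term, count in counts.items()
        if term != query
        and edit_distance(query, term) <= max_distance
        and count >= min_ratio * own_count
    ]
    if not candidates:
        return None
    return max(candidates)[1]  # the most-searched candidate wins

print(suggest("Shakespere"))   # a close, far more popular term exists
print(suggest("shakespeare"))  # nothing more popular nearby
```

Notice that nothing in the code "knows" who Shakespeare was. All of the judgement comes from the people who typed the searches.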

This sort of using humans in a massive way to simulate artificial intelligence isn't a new idea. Amazon even has a sort of marketplace for it.  Ironically enough, the marketplace is named "Amazon Mechanical Turk".

That marketplace was started when Amazon wanted to be able to drop off problems that were just a little bit too complex for a computer -- say voice transcription. The idea being that you could drop off a digital audio file one day, along with ten dollars, and pick up a text file the next.

A real, live, 21st century version of that vending machine.

I've used Mechanical Turk a bit; it's a fascinating sort of reverse eBay. Because the work can be done by anyone who knows English and has an internet connection, customers can pay incredibly low rates for these small pieces of work.

If you've noticed how close this is to the rhetoric that senior management doesn't care about testing as much as they "just want testing to be done", well, you ain't the only one. And yes, crowd sourced vendors for testing do exist -- I'm partial to the folks at uTest -- I think they can be part of a "Balanced Breakfast" for our industry. But more about that in another post.

Some of the offers on Mechanical Turk are clearly scams; they want you to fill out a survey with your "real email address" to "prevent fakes", but filling out the survey switches your phone service or signs you up for CreditProtect or some other low-value service.

What really fascinates me is not the scams on Mechanical Turk, nor the transcription - but instead some of the crowd sourced work. Some of it is trying to gauge human reaction to certain key words on a scale from one to ten -- say, to write software to predict how people will respond to a paragraph or article. Others ask you to look at pictures and try to figure out "which of the people in these thirty pictures are" sad, mad, happy or wearing earrings.

It doesn't take a genius to figure out what the customers are doing with Mechanical Turk: They are training some sort of artificial intelligence to recognize facial expressions, and the way to do it is to give the software millions of examples. In order to do the analysis, the computer doesn't need to "think" as we humans understand it; instead, it can look at any picture and compare it to other pictures that do or do not have that attribute, ultimately making a best guess.

Next you'll have a human evaluating the guesses, and, eventually, you've got a piece of software that can come to the same conclusions as a human being would for facial recognition.

Just don't ask it to understand body language.
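To make the "compare it to other pictures and make a best guess" step concrete, here is a minimal sketch using a nearest-neighbor vote, which is about the simplest technique that fits the description (real systems use something far more elaborate). The two features, `mouth_curve` and `brow_raise`, the example values, and the `best_guess` function are all hypothetical; they stand in for whatever a real system would extract from a picture and for the labels the Mechanical Turk workers supply.

```python
from collections import Counter
import math

# Hypothetical human-labeled examples: (mouth_curve, brow_raise) -> label.
# In the scenario above, the labels are what the paid workers provide.
LABELED = [
    ((0.9, 0.5), "happy"),
    ((0.8, 0.6), "happy"),
    ((0.7, 0.4), "happy"),
    ((-0.8, 0.2), "sad"),
    ((-0.7, 0.3), "sad"),
    ((-0.6, 0.1), "sad"),
]

def best_guess(features, examples=LABELED, k=3):
    """Label a new picture by majority vote among its k closest labeled examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(best_guess((0.75, 0.5)))   # lands among the "happy" examples
print(best_guess((-0.65, 0.2)))  # lands among the "sad" examples
```

The software never decides what "happy" means. It only measures which human-labeled examples a new picture sits closest to.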

The model here -- training a computer to do something by playing "which one of these is not like the other" with a huge space of examples -- is one I find fascinating. The folks in artificial intelligence have a name for it, a "Neural Network." I've even given a fair amount of thought to trying to train some software to do this -- for example, login, add to cart, checkout, and some standard errors on web input forms are the kinds of things where it might be possible to train an AI to do a modest amount of mediocre-to-good testing. I just don't think that is the most valuable idea I could be pursuing right now. Instead of that, let me ask another question:

Start with an untrained neural net, the kind you need a human being to "add examples" for. When you ask the human beings to add examples -- what are they actually doing? What is that thing the people are doing that is so hard for computers, but so trivially easy that you can pay a penny a question, and feel confident that bored teenagers trying to earn money for a CD will answer it just as correctly as workers in a developing nation or prisoners on work release?

What is the thing those people are actually doing?

Well, I've used one word, over and over, to describe the behavior, and I think it's about right: Evaluation. By evaluation, I mean some sort of squishy, hard-to-define conclusion based on a bunch of sensory data. On one end of evaluation you have decisions that are based on values and could have multiple right answers -- but on the other you have things like "is this man smiling?" that most people, even in most cultures, can agree on the correct answer for. In many cases, people can come to the same conclusions without being able to explain how they reached those conclusions. Two more controversial examples of this might be "is this hazing?" or "is this pornography?"

Anytime someone says "I know it when I see it", but can't describe a formal rule set for it, they are asserting their divine-spark right to evaluate. And I'm okay with that.

The human race has a rich tradition of evaluation; there are even schools designed to provide students with tools to do this sort of fuzzy, subjective, hard-to-describe decision making. One term for them is "the liberal arts."

The seven schools of the liberal arts encompass grammar, logic, rhetoric, arithmetic, geometry, music and astronomy. The thing is, they don't think of math as a formula to plug in, but as a tool that can be used in thinking - like Eratosthenes used when he determined the circumference of the world using the length of shadows and locations on the earth. (If you want to give your children a great example of the liberal arts in action, consider The Librarian Who Measured The Earth, which tells this story in wonderful picture book form. Or just buy it for yourself. Yes, it is that good.)
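For the curious, the arithmetic behind that measurement is short enough to sketch. The reasoning: if the sun is directly overhead in one city, the angle of a shadow in a second city due north is the same fraction of a full circle as the distance between the cities is of the whole circumference. The figures below (an angle of 7.2 degrees and a distance of 5,000 stadia) are the ones traditionally attributed to Eratosthenes, not something taken from this article, and the function names are mine.

```python
import math

def sun_angle_degrees(stick_height, shadow_length):
    """Angle of the sun away from vertical, from a vertical stick and its shadow."""
    return math.degrees(math.atan2(shadow_length, stick_height))

def circumference(distance_between_cities, angle_degrees):
    """Scale the distance up by however many times the angle fits into 360 degrees."""
    return distance_between_cities * 360.0 / angle_degrees

# Traditionally cited figures (an assumption, not from the article):
# a 7.2 degree shadow angle at Alexandria, 5,000 stadia from Syene.
estimate = circumference(5000, 7.2)
print(round(estimate))  # about 250,000 stadia
```

The formula is one line. The thinking, realizing that shadows and distances could be combined this way at all, is the part no formula supplies.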

This idea of looking at value and using human judgement is, well, one way of describing the word 'Quality'.

The Romans knew a little bit about this -- they gave us two root words. "Qual" is one, for Quality, and "Quant" is the other, from which we get Quantity. A Quantity is a measurement that results in a number.

"Six feet" is a quantity. "About long enough that I should get a hair cut" is a qualitative measurement -- it requires judgement. You know, that thing we humans are good at and computers not so much.

Academics also get this, for example when we talk about research. Two terms academics use to discuss research are qualitative and quantitative. (Hey, check it out, there's those fancy Latin root words again).

Quantitative research yields hard numbers, generally by repeatable experiment with only one thing changing at a time. Qualitative research is based more on reactions. If you think numbers are good, it is possible to survey a large number of people and ask them to, say, evaluate their experience on a scale from one to ten. Still, for some reason I always get more value out of the comments than the numbers.

Another way to look at quality is to consider it sort of the average of a bunch of attributes. These attributes are each themselves hard to measure and usually end in "ity." Security, Reliability, Scalability, and Usability are four easy attributes, each themselves hard to measure.

Pulling a meaningful "overall" score out of these is surprisingly hard. Say you rated each of those four from zero to a hundred. If the other three were all one hundred, but usability is zero -- does the application "average" to seventy-five? Probably not. Worse, there may be other factors, like internationalization, that we leave off as the quality police, but matter to the end customer.  For a meaningful overall score, we would have to weigh and consider those -- assuming our customers agreed with us, and each other, about how those things should be weighed.
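The example above is easy to run as a sketch. The scores are the ones from the paragraph; the weights in the second half are invented purely to show how much the "overall" number depends on whose weighting you pick.

```python
from statistics import mean

# Scores from the example above: three perfect attributes, one at zero.
scores = {"security": 100, "reliability": 100, "scalability": 100, "usability": 0}

def average_score(scores):
    """Plain arithmetic mean: every attribute counts equally."""
    return mean(scores.values())

def weakest_attribute(scores):
    """The attribute a single average tends to hide."""
    return min(scores.items(), key=lambda item: item[1])

def weighted_score(scores, weights):
    """Weighted mean: someone has to choose the weights, and defend them."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

print(average_score(scores))      # 75
print(weakest_attribute(scores))  # ('usability', 0)

# Hypothetical weights for a customer who cares most about usability.
weights = {"security": 1, "reliability": 1, "scalability": 1, "usability": 3}
print(weighted_score(scores, weights))  # 50.0
```

Same application, same four scores, and the headline number moves from 75 to 50 depending on a choice of weights that is itself a judgement call.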

Given all that, when it comes to measuring quality, the best we can do is probably some sort of dashboard, or balanced scorecard.

And yet ...

And yet ...

Every time we start to talk about 'Measuring Quality', I start to get the heebie-jeebies. After doing interviews, focus groups, and lots of writing and reading on the subject, I kept coming back to people asking for a number. Instead of evaluation, they wanted a metric.

The overwhelming goal of the leaders and decision makers I was interviewing was to take Quality and sort of 'morph' it into the world of Quantity. If possible, really, they wanted to eliminate that first Latin root.

Me? I would rather celebrate evaluation. Yes, it is another aid in decision making -- and a big one, at that.

More to come.


