Did you know that the first vending machines were not actually
mechanized?
Oh, yes, you had the familiar interface of a vending machine. The
customer would put money in a slot and pull a lever indicating his
purchase.
The difference was that behind the 'wall' of the vending machine
there was another person, who would look at the lever, make change, and
insert a sandwich into a little slot.
In other words, the art of pretending machines can do more than they are
capable of is not new at all. We've actually been faking it for
hundreds of years!
There's even a term for this - the mechanical turk. That term comes from an apparent chess-playing machine in the 18th
century that -- surprise, surprise -- was actually driven by a human
being.
The whole idea of the mechanical turk brings me to ask the question -
why do we care? What would make a machine doing the work of a human so
impressive that removing the /appearance/ of a human from a vending
machine seems to be a good thing? Does it even make sense that someone
could build fame and reputation by creating a fake machine playing a
game children learn in grammar school?
Maybe. It would seem that machines are, well ... dumb. Making them
capable of solving puzzles, even simplistic puzzles with well-defined
rules, was not really possible until the 19th century. Even today, the
closest thing we have to artificial intelligence is a game that plays 20
questions, or a Google search result that realizes you spelled a word
wrong in your search and suggests a correction.
Both the Google search and the twenty questions game are interesting
because they are trained. The Google search might know the dictionary,
sure, but it also knows what everyone else has searched on, and can
recognize that "Shakespere" is linguistically close to "Shakespeare",
yet ten times as many people have searched for the latter search term.
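To make that concrete, here is a minimal sketch of the idea in Python. It is not how Google actually does it; the search counts, the thresholds, and the function names are all invented for illustration. It simply suggests a term that looks almost the same as what you typed and that many more people have searched for.

```python
from difflib import SequenceMatcher

# Hypothetical search counts, invented for illustration only.
SEARCH_COUNTS = {
    "Shakespeare": 1000,
    "Shakespere": 100,
    "Shakers": 40,
}


def similarity(a, b):
    """Return a 0..1 closeness score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest(query, counts=SEARCH_COUNTS, min_similarity=0.8, min_ratio=5):
    """Suggest a similar term that is searched far more often, else None."""
    query_count = counts.get(query, 0)
    best = None
    for term, count in counts.items():
        if term == query:
            continue
        # Skip terms that do not look like the query.
        if similarity(query, term) < min_similarity:
            continue
        # Skip terms that are not clearly more popular than the query.
        if count < min_ratio * max(query_count, 1):
            continue
        if best is None or count > counts[best]:
            best = term
    return best


print(suggest("Shakespere"))   # the popular, near-identical spelling
print(suggest("Shakespeare"))  # nothing better to suggest
```

Notice that nothing in there "knows" who Shakespeare was. The humans who typed the searches did all the knowing; the code just counts.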
This sort of using humans in a massive way to simulate artificial
intelligence isn't a new idea. Amazon even has a sort of marketplace for it. Ironically enough, the marketplace is named "Amazon Mechanical Turk".
That marketplace was started when Amazon wanted to be able to drop off problems that were just
a little bit too complex for a computer -- say, voice transcription. The
idea being that you could drop off a digital audio file one day, along
with ten dollars, and pick up a text file the next.
A real, live, 21st century version of that vending machine.
I've used Mechanical Turk a bit; it's a fascinating sort of reverse
eBay. Because the work can be done by anyone who knows English and has
an internet connection, customers can pay incredibly low rates for these
small pieces of work.
If you've noticed how close this is to the rhetoric that senior
management doesn't care about testing as much as they "just want testing
to be done", well, you ain't the only one. And yes, crowd sourced
vendors for testing do exist -- I'm partial to the folks at uTest -- I
think they can be part of a "Balanced Breakfast" for our industry. But
more about that in another post.
Some of the offers on Mechanical Turk are clearly scams; they want you
to fill out a survey with your "real email address" to "prevent fakes",
but filling out the survey switches your phone service or signs you
up for CreditProtect or some other low-value service.
What really fascinates me is not the scams on Mechanical Turk, nor the
transcription - but instead some of the crowd sourced work. Some of it
is trying to gauge human reaction to certain key words on a scale from
one to ten -- say, to write software to predict how people will respond
to a paragraph or article. Others ask you to look at pictures and try to
figure out "which of the people in these thirty pictures are" sad, mad,
happy or wearing earrings.
It doesn't take a genius to figure out what the customers are doing with
Mechanical Turk: They are training some sort of artificial intelligence
to recognize facial expressions, and the way to do it is to give the
software millions of examples. In order to do the analysis, the computer
doesn't need to "think" as we humans understand it; instead, it can
look at any picture and compare it to other pictures that do or do not
have that attribute, ultimately making a best guess.
Next you'll have a human evaluating the guesses, and, eventually, you've
got a piece of software that can come to the same conclusions as a
human being would for facial recognition.
Just don't ask it to understand body language.
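The "compare it to labeled pictures and make a best guess" step can be sketched in a few lines of Python. This is a toy under loud assumptions: real systems extract features from actual images, while the two-number "features" and labels below are made up. It is also a simple nearest-neighbor vote rather than a neural network, chosen only because it is the shortest way to show the idea.

```python
from collections import Counter
import math

# Hypothetical labeled examples: (feature vector, label supplied by a human).
# The numbers are invented; they stand in for whatever a real system
# would measure in a picture.
EXAMPLES = [
    ((0.9, 0.8), "smiling"),
    ((0.8, 0.9), "smiling"),
    ((0.7, 0.7), "smiling"),
    ((0.1, 0.2), "not smiling"),
    ((0.2, 0.1), "not smiling"),
    ((0.3, 0.3), "not smiling"),
]


def best_guess(features, examples=EXAMPLES, k=3):
    """Label a new picture by majority vote of its k closest examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(best_guess((0.85, 0.75)))
print(best_guess((0.15, 0.25)))
```

The code never decides what a smile is. Every label in EXAMPLES came from a person, which is exactly the work being bought for a penny a question.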
The model here -- training a computer to do something by playing "which
one of these is not like the other" with a huge space of examples -- is
one I find fascinating. The folks in artificial intelligence have a name
for it, a "Neural Network." I've even given a fair amount of thought to
trying to train some software to do this -- for example, login, add to
cart, checkout, and some standard errors on web input forms are the
kinds of things where it might be possible to train an AI to do a modest
amount of mediocre-to-good testing. I just don't think that is the most
valuable idea I could be pursuing right now. Instead of that, let me
ask another question:
Start with an untrained neural net, the kind you need a human being to
"add examples" for. When you ask the human beings to add examples --
what are they actually doing? What is that thing the people are doing
that is so hard for computers, but so trivially easy that you can pay a
penny a question, and feel confident that bored teenagers trying to earn
money for a CD will answer it just as correctly as workers in a
developing nation or prisoners on work release?
What is the thing those people are actually doing?
Well, I've used one word, over and over, to describe the behavior, and I
think it's about right: Evaluation. By evaluation, I mean some sort of
squishy, hard-to-define conclusion based on a bunch of sensory data. On
one end of evaluation you have decisions that are based on values and
could have multiple right answers -- but on the other you have things like "is
this man smiling?" that most people, even in most cultures, can agree on
the correct answer for. In many cases, people can come to the same
conclusions without being able to explain how they reached those
conclusions. Two more controversial examples of this might be "is this
hazing?" or "is this pornography?"
Anytime someone says "I know it when I see it", but can't describe a
formal rule set for it, they are asserting their divine-spark right to
evaluate. And I'm okay with that.
The human race has a rich tradition of evaluation; there are even
schools designed to provide students with tools to do this sort of fuzzy,
subjective, hard-to-describe decision making. One term for them is "the
liberal arts."
The seven schools of the liberal arts encompass grammar, logic,
rhetoric, arithmetic, geometry, music and astronomy. The thing is, they
don't think of math as a formula, to plug in, but a tool that can be
used in thinking - like Eratosthenes used when he determined the circumference of the world using the
length of shadows and locations on the earth. (If you want to give your
children a great example of the liberal arts in action, consider The Librarian Who Measured The Earth, which tells this story in wonderful picture book form. Or just buy it for yourself. Yes, it is that good.)
This idea of looking at value and using human judgement is, well, one
way of describing the word 'Quality'.
The Romans knew a little bit about this -- they gave us two root words.
"Qual" is one, for Quality, and "Quant", is the other, from which we get
Quantity. A Quantity is a measurement that results in a number.
"Six feet" is a quantity. "About long enough that I should get a
haircut" is a qualitative measurement -- it requires judgement. You know,
that thing we humans are good at and computers not so much.
Academics also get this, for example when we talk about research. Two
terms academics use to discuss research are qualitative and
quantitative. (Hey, check it out, there's those fancy Latin root words
again).
Quantitative research yields hard numbers, generally by repeatable
experiment with only one thing changing at a time. Qualitative research
is based more on reactions. If you think numbers are good, it is
possible to survey a large number of people and ask them to, say,
evaluate their experience on a scale from one to ten. Still, for some reason I
always get more value out of the comments than the numbers.
Another way to look at quality is to consider it sort of the average of a
bunch of attributes. These attributes are each themselves hard to
measure and usually end in "ity." Security, Reliability, Scalability,
and Usability are four easy examples.
Pulling a meaningful "overall" score out of these is surprisingly hard. Say you
rated each of those four from zero to a hundred. If the other three were
all one hundred, but usability is zero -- does the application
"average" to seventy-five? Probably not. Worse, there may be other
factors, like internationalization, that we leave off as the quality
police, but matter to the end customer. For a meaningful overall score, we would have to weigh and
consider those -- assuming our customers agreed with us, and each other, about how those things should be weighed.
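Here is the arithmetic from that example as a small Python sketch. The scores are the ones above; the weights and the "weakest link" rule are assumptions added purely to show how much the "overall" number depends on choices somebody has to make.

```python
from statistics import mean

# Scores from the example above, each on a zero-to-one-hundred scale.
SCORES = {
    "security": 100,
    "reliability": 100,
    "scalability": 100,
    "usability": 0,
}

# Hypothetical weights: one customer who cares mostly about usability.
WEIGHTS = {
    "security": 1,
    "reliability": 1,
    "scalability": 1,
    "usability": 7,
}


def simple_average(scores):
    """Treat every attribute as equally important."""
    return mean(scores.values())


def weighted_average(scores, weights):
    """Weigh each attribute by how much someone says it matters."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def weakest_link(scores):
    """Assume the product is only as good as its worst attribute."""
    return min(scores.values())


print("simple average:  ", simple_average(SCORES))
print("weighted average:", weighted_average(SCORES, WEIGHTS))
print("weakest link:    ", weakest_link(SCORES))
```

Same application, same four scores, three different "overall" numbers: 75, 30 and 0. Picking which formula and which weights to use is itself an evaluation, not a measurement.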
Given all that, when it comes to measuring quality, the best we can do
is probably some sort of dashboard, or balanced scorecard.
And yet ...
And yet ...
Every time we start to talk about 'Measuring Quality', I start to get
the heebie-jeebies. After doing interviews, focus groups, and lots of
writing and reading on the subject, I kept coming back to people asking
for a number. Instead of evaluation, they wanted a metric.
The overwhelming goal of the leaders and decision makers I was
interviewing was to take Quality and sort of 'morph' it into the world
of Quantity. If possible, really, they wanted to eliminate that first
Latin root.
Me? I would rather celebrate evaluation. Yes, it is another aid in decision
making -- and a big one, at that.
More to come.