Tuesday March 13th 2012 8am
Quality in the Cloud: The new Role of TestOps
Test and QA
Software Test Professionals Conference
The term "TestOps" is a new one, but it captures a new way of thinking about how we test and ensure the quality of cloud services. It is a portmanteau of "Test" and "Ops", so let's start with "Test", a term I expect this readership to know all about but will briefly define here for clarity. When we test, we apply a series of actions to the system and check each against its expected result, so as to assess the soundness of the system for use by our customers -- or, looked at the other way, the risk of exposing the system to end users.
"Ops", which is short for "operations", generally owns the operation and maintenance of the servers, network, and support systems in the data center where your service is deployed and running. One of Ops' most powerful tools is monitoring - the observation of data such as server health, network load, and user traffic to assess overall data center health. And by health we mean soundness for customer use and insight into potential risks to end users, just like our testing.
In this way we as testers have the same goals as Ops, so can we also make use of the data signal continuously being emitted by our systems in production to do our jobs? Recall that to test, we apply a series of actions. The results of those actions are test results, which have traditionally been the signal by which we assess quality. But instead of applying synthetic actions and trying to predict how real users and complex production deployments will act, why don't we use the signal from these real users and real deployments as part of our quality strategy? And rather than rely only on simple monitors like "server up" or "free disk space", we can measure complex use scenarios in real environments.
For example, if we are interested in system performance, why attempt to replicate every combination of machine, OS, and browser in a lab, only to shoot synthetic traffic at them? Instead, be like Microsoft Hotmail and measure the actual time it takes real users to send and receive emails. This data can be collected for millions of users with no PII (personally identifiable information) while still including key environment data such as OS or browser type. With data like that, systems can be tuned to perform better wherever trouble spots are identified, whether with specific operating systems, browsers, or even locations around the world.
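To make the idea concrete, here is a minimal sketch of how timing samples from real users could be grouped by environment to surface trouble spots. It is not how Hotmail does it; the record fields (`os`, `browser`, `send_ms`), the sample values, and the `slowest_segments` helper are all hypothetical and chosen only for illustration. Note that the records carry no user identifiers.

```python
from collections import defaultdict
from statistics import median


def slowest_segments(samples, min_samples=3):
    """Group timings by (os, browser) and rank segments by median latency.

    Segments with fewer than `min_samples` observations are skipped so a
    single slow request does not dominate the ranking.
    """
    by_segment = defaultdict(list)
    for sample in samples:
        key = (sample["os"], sample["browser"])
        by_segment[key].append(sample["send_ms"])

    ranked = [
        (segment, median(times), len(times))
        for segment, times in by_segment.items()
        if len(times) >= min_samples
    ]
    # Slowest segment first.
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical, anonymized samples: environment data only, no PII.
SAMPLES = [
    {"os": "Windows 7", "browser": "IE9", "send_ms": 400},
    {"os": "Windows 7", "browser": "IE9", "send_ms": 420},
    {"os": "Windows 7", "browser": "IE9", "send_ms": 450},
    {"os": "Windows 7", "browser": "Firefox", "send_ms": 300},
    {"os": "Windows 7", "browser": "Firefox", "send_ms": 310},
    {"os": "Windows 7", "browser": "Firefox", "send_ms": 330},
    {"os": "OS X", "browser": "Safari", "send_ms": 900},
    {"os": "OS X", "browser": "Safari", "send_ms": 950},
    {"os": "OS X", "browser": "Safari", "send_ms": 1000},
    {"os": "Linux", "browser": "Chrome", "send_ms": 5000},  # too few samples
]

if __name__ == "__main__":
    for segment, median_ms, count in slowest_segments(SAMPLES):
        print(segment, median_ms, count)
```

The median is used here rather than the mean so that a few outliers do not hide or exaggerate a segment's typical experience; a real system would likely track higher percentiles as well.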
Another example is Bing. How can a tester anticipate every query? And even if they could, delivering relevance is the key to a good search engine -- how would they know what each user would find relevant? So instead, Bing can anonymously mine data on what users search for and what they do or do not click on, and use it to refine its algorithms and deliver better results.
Or say you simply want to find functional bugs prior to deployment. Microsoft.com is actually a platform hosting several "applications" that comprise the Microsoft.com site you see. When they have a new build of the platform they first deploy it to a single server to take production traffic, and then compare key metrics including error rates between this build and the current one in production.
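The comparison step described above can be sketched as a simple statistical check. This is an illustration under stated assumptions, not Microsoft.com's actual method: the function name `canary_is_worse`, the counts in the usage example, and the choice of a one-sided two-proportion z-test are all mine. The default threshold of 2.33 corresponds to roughly a 1% one-sided significance level.

```python
import math


def canary_is_worse(canary_errors, canary_requests,
                    base_errors, base_requests,
                    z_threshold=2.33):
    """Return True if the new build's error rate is significantly higher.

    Uses a one-sided two-proportion z-test: the single server running the
    new build (the "canary") is compared against the current production
    build (the baseline).
    """
    if canary_requests == 0 or base_requests == 0:
        raise ValueError("both builds need at least one request")

    canary_rate = canary_errors / canary_requests
    base_rate = base_errors / base_requests

    # Pooled error rate under the assumption that both builds behave the same.
    pooled = (canary_errors + base_errors) / (canary_requests + base_requests)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / canary_requests + 1 / base_requests)
    )
    if std_err == 0:
        return False  # no errors at all (or all errors) on both builds

    z_score = (canary_rate - base_rate) / std_err
    return z_score > z_threshold


if __name__ == "__main__":
    # Hypothetical counts: 0.5% errors on the new build vs 0.2% in production.
    print(canary_is_worse(50, 10_000, 200, 100_000))
    # Hypothetical counts: 0.21% vs 0.2%, a difference within the noise.
    print(canary_is_worse(21, 10_000, 200, 100_000))
```

A significance test is used instead of a raw comparison because a single server sees far less traffic than the rest of production, so small differences in error rate are often just noise.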
And those are just the Microsoft examples. Google is well known for its "1% launches." And Facebook Ops and Engineering work closely together to expose metrics from the server level through the application level to engineers charged with assessing system quality.
Of course, these kinds of Testing in Production have costs of their own. It is still true that bugs are more expensive when found late in the production cycle. But also keep in mind that some bugs may be prohibitively expensive (or impossible) to find before you go to production. While the "test results signal" remains part of our test strategy (although possibly in a diminished capacity), we should move what we can of our quality strategy into production and make use of the "big data signal". This is what I call TestOps.
Seth Eliot, Senior Test Manager, Microsoft - As a Senior Test Manager at Microsoft, Seth leads a team that solves exabyte storage and data processing challenges for Bing. Previously he was Test Manager for the Microsoft Experimentation Platform (http://exp-platform.com), which enables developers to innovate by testing new ideas quickly with real users “in production”. Seth ruminates on Testing in Production (TiP), software processes, cloud computing, and other topics at his blog at http://bit.ly/seth_qa. Prior to Microsoft, Seth applied his experience in delivering high-quality software services at Amazon.com, where he led the Digital QA team to release Amazon MP3 download, Amazon Video on Demand Streaming, and support systems for Kindle.
Come see Seth at the Software Test Professionals Conference in New Orleans from March 26-29. Seth will lead session 503: A to Z Testing in Production: TiP Methodologies, Techniques, and Examples, part of the Performance Testing track.