When you roll out a new deployment, how do you roll? With a big bang? A blue/green deployment? Or do you prefer a Canary Release?
There’s a lot to be said for the Canary Release strategy of testing new software releases on a limited subset of users. It reduces the risk of an embarrassing and potentially costly public failure of your application to a practical minimum. It allows you to test your new deployment in a real-world environment and under a real-world load. It allows a rapid (and generally painless) rollback. And if there’s a failure of genuinely catastrophic proportions, only a small subset of your users will even notice the problem.
But when you use Canary Release, are you getting everything you can out of the process? A full-featured suite of analytics and monitoring tools is — or should be — an indispensable part of any Canary Release strategy.
The Canary Release Pattern
In a Canary Release, you initially release the new version of your software on a limited number of servers, and make it available to a small subset of your users. You monitor it for bugs and performance problems, and after you’ve taken care of those, you release it to all of your users.
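One common way to pick that small subset is to hash each user ID into a fixed bucket, so the same user always lands on the same version. The following is a minimal sketch of the idea, not a description of any particular load balancer or feature-flag product; the function names, the salt value, and the 5% default are all assumptions made for illustration.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "example-release") -> bool:
    """Return True if this user falls inside the canary subset.

    `salt` is a hypothetical per-release string; changing it reshuffles
    which users are selected for the next canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto buckets 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=5.0 selects buckets 0..499, i.e. roughly 5% of users.
    return bucket < percent * 100


def route(user_id: str, percent: float = 5.0) -> str:
    """Name the (hypothetical) server pool that should serve this user."""
    return "canary" if in_canary(user_id, percent) else "stable"
```

Because the assignment depends only on the user ID and the salt, a user does not bounce between versions from one request to the next, and raising `percent` only adds users to the canary without removing anyone already in it.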
The strategy is named after the practice of taking canaries into coal mines to test the quality of the air; if the canary stopped singing (or died), it meant that the air was going bad. In this case, the “canary” is your initial subset of users; their exposure to your new release allows you to detect and fix the bugs, so your general body of users won’t have to deal with them.
Ideally, in a strategy such as this, you want to get as much useful information as possible out of your initial sample, so that you can detect not only the obvious errors and performance issues, but also problems which may not be so obvious, or which may be relatively slow to develop. This is where good analytic tools can make a difference.
Using Analytics to Support a Canary Release
In fact, the Canary Release strategy needs at least some analytics in order to work at all. Without any analytics, you would have to rely on extremely coarse-grained sources of information, such as end-user bug reports and obvious crashes at the server end, which are very likely to miss the problems that you actually need to find.
Such problems, however, generally will show up in error logs and performance logs. Error statistics will tell you whether the number, type, and concentration (in time or space) of errors are outside the expected range. Even if they can’t identify the specific problem, such statistics can suggest the general direction in which the problem lies.
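To make "out of the expected range" concrete, one option is to compare the canary's error rate with the error rate of the servers still running the old version. The sketch below uses a standard two-proportion z-test for that comparison; this is an illustrative choice rather than something any specific analytics tool prescribes, and the function names and the threshold of 3.0 are assumptions.

```python
import math


def error_rate_z(canary_errors: int, canary_requests: int,
                 base_errors: int, base_requests: int) -> float:
    """Two-proportion z-score: how far the canary error rate sits above baseline."""
    p_canary = canary_errors / canary_requests
    p_base = base_errors / base_requests
    # Pooled error rate under the assumption that both versions behave the same.
    pooled = (canary_errors + base_errors) / (canary_requests + base_requests)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / canary_requests + 1 / base_requests))
    if std_err == 0:
        return 0.0  # no errors at all (or only errors) on both sides
    return (p_canary - p_base) / std_err


def canary_looks_unhealthy(canary_errors: int, canary_requests: int,
                           base_errors: int, base_requests: int,
                           threshold: float = 3.0) -> bool:
    """Flag the canary when its error rate is well above the baseline rate."""
    z = error_rate_z(canary_errors, canary_requests, base_errors, base_requests)
    return z > threshold
```

A check like this only says that something is wrong, not what; it is the kind of signal that tells you it is time to go and look at the individual log entries.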
And since error logs also contain records of individual errors, you can at least in theory pinpoint any errors which are likely to be the result of newly-introduced bugs, or of failed attempts to eliminate known bugs.
The problem with identifying individual errors in the log is that any given error is likely to be a very small needle in a very large haystack. Analytics tools which incorporate intelligent searches and such features as pattern analysis and detection of unusual events allow you to identify likely signs of a significant error in seconds. Without such tools, the equivalent search might take hours, whether it uses brute force or carefully-crafted regex terms. Even being forced by necessity to do a line-by-line visual scan of an error log, however, is better than having no error log at all.
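A very small version of this kind of pattern analysis can be sketched in a few lines: strip the variable parts out of each log line so that similar errors collapse into one template, then list the templates that appear on the canary servers but never on the old version. This is only a rough illustration of the idea, far simpler than what a real log analytics product does; the function names and the two normalization rules are assumptions.

```python
import re
from collections import Counter

_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUMBER = re.compile(r"\b\d+\b")


def template(line: str) -> str:
    """Replace hex values and numbers with placeholders so similar lines group together."""
    line = _HEX.sub("<hex>", line)
    return _NUMBER.sub("<n>", line)


def new_error_patterns(canary_lines, baseline_lines, min_count=1):
    """Return (template, count) pairs seen in the canary log but not in the baseline log."""
    known = {template(line) for line in baseline_lines}
    counts = Counter(template(line) for line in canary_lines)
    return [(tpl, n) for tpl, n in counts.most_common()
            if tpl not in known and n >= min_count]
```

For example, given a baseline log that contains only timeout messages, a canary log that also contains a previously unseen "null pointer" message would be reduced to that single new template and its count.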
Logs that monitor such things as performance, load, and load distribution can also be useful in the Canary Release strategy. Bugs which don’t produce clearly identifiable errors may show up in the form of performance degradation or excessive traffic. Design problems may also leave identifiable traces in performance logs; poor design can cause traffic jams, or lead to excessive demands on databases and other resources.
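Performance degradation of this kind is often easier to see in the slow tail of response times than in the average. As a hedged sketch, the code below compares the 95th-percentile latency of the canary against the old version using a simple nearest-rank percentile; the function names, the choice of the 95th percentile, and the 20% tolerance are assumptions for illustration.

```python
import math


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile of a non-empty collection of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_regressed(canary_ms, baseline_ms, pct: int = 95,
                      tolerance: float = 1.2) -> bool:
    """True when the canary's tail latency exceeds the baseline's by more than `tolerance`."""
    return percentile(canary_ms, pct) > tolerance * percentile(baseline_ms, pct)
```

The same comparison can be applied to other measurements drawn from performance logs, such as database query counts per request, as long as both sets of samples cover comparable traffic.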
You can enhance the value of your analytics, and of the Canary Release itself, if you put together an in-depth demographic profile of the user subset assigned to the release. The criteria that you use in choosing the subset, of course, depend on your needs and priorities, as well as the nature of the release. The subset may consist of in-house users, of a random selection from the general user base, or of users carefully chosen to represent either the general user base or specific types of user.
In any of these cases, however, it should be possible to assemble a profile of the users in the subset. If you know how those users make use of your software (which features they access most frequently, how often they use the major features and at what times of day, how this use is reflected in server loads, etc.), and if you understand how these patterns of use compare to those of your general user base, then extrapolating from Canary Release analytics should be fairly straightforward, as long as your analytic tools are capable of distilling out the information that you need.
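One simple form this extrapolation can take is to measure a metric separately for each type of user in the canary, then reweight those figures by how common each type is in the general user base. The sketch below assumes you have already split users into named segments and measured a per-segment rate; the segment names, the function name, and the numbers in the usage note are hypothetical.

```python
def extrapolate_rate(canary_rate_by_segment: dict, population_share: dict) -> float:
    """Estimate a whole-user-base rate from per-segment rates measured on the canary.

    `population_share` maps each segment to its weight in the general user
    base; the weights are normalized here, so they need not sum to 1.
    """
    missing = set(population_share) - set(canary_rate_by_segment)
    if missing:
        raise ValueError(f"no canary data for segments: {sorted(missing)}")
    total = sum(population_share.values())
    return sum(canary_rate_by_segment[segment] * share / total
               for segment, share in population_share.items())
```

If, say, heavy users in the canary see errors on 4% of requests and casual users on 1%, and the general user base is 20% heavy and 80% casual, the estimate comes out at about 1.6%, even if the canary itself was made up mostly of heavy users.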
So yes, Canary Release can be one of the most rewarding deployment strategies — when you take full advantage of what it has to offer by making intelligent use of first-rate analytic tools. Then the canary will really sing!
About the Author
Michael Churchman started as a scriptwriter, editor, and producer during the anything-goes early years of the game industry. He spent much of the 90s in the high-pressure bundled software industry, where the move from waterfall to faster release was well under way, and near-continuous release cycles and automated deployment were already de facto standards. During that time he developed a semi-automated system for managing localization in over fifteen languages. For the past ten years, he has been involved in the analysis of software development processes and related engineering management issues.