Everything old is new again; that's a universal law! In our industry, we have talked about Service Level Agreements (SLAs) for a long time. At Sumo, we are proud to continue to be the only Machine Data Analytics Platform as a Service that commits to a query performance SLA. SLAs are typically legal instruments: contracts that specify the conditions under which there are remedies if the service provider does not adhere to the negotiated properties of the service, typically around availability or performance. Not surprisingly, then, the service levels agreed upon in a contract are necessarily defined with a significant amount of buffer. For service operations, they are at best the lower bar to hold a service to. To provide a world-class customer experience, we need to look not at SLAs, but at Service Level Objectives.
Applications are becoming more complex and are at the same time driving rapidly increasing numbers of digital businesses. Along the way, the monitoring discipline has undergone quite the renaissance over the last decade. From #monitoringhate we have thankfully evolved to #monitoringlove <3. For a long time, we have known that monitoring cannot rely on a single type of data, and modern monitoring platforms have embraced the notion of Observability, along with its three pillars of logs, metrics, and traces. However, as much as I have always been aligned with the wide-angle philosophy behind O11y, the reality is that monitoring approaches continue to be a mere means to an end. And the end, of course, is site reliability. A reliable site is available and performant, and those of us running reliable sites know that the only way to success is to define clear, yes, Service Level Objectives.
In my mind, Service Level Objectives provide the right level of abstraction to drive the highest level of reliability, both technically and culturally. The keyword here, of course, is "objectives!" Objectives encapsulate goals, intents, and purposes. Objectives are essential performance tools, but they also act as connective tissue in any organization, driving alignment and priorities. Management By Objectives has long been a mainstream organizational technique (see also Objectives & Key Results). Service Level Objectives afford us an operational construct to agree on what we believe a reliable service has to look like. They also enable us to observe, orient, decide, and act on achieving reliability in a continuous loop, with clear guidance in the chaos of nebulous priorities that I am sure you all know too well.
An old saying goes, "you can't improve what you can't measure." Service Level Objectives are not measurements; instead, they define target levels for observed Service Level Indicators, or SLIs, the final acronym in this modern monitoring trifecta. SLIs are, put simply, the metrics we measure and compare to the target levels set by the objective. A lot has been written about what we should measure, and the various definitions of Golden Signals all make the right amount of sense. We use them daily to observe the lower conceptual layers of our infrastructure: microservices. However, we have realized that in the end, none of our customers care whether or not microservice X is achieving its service level objectives independently of the rest of the system. This is why internally we have adopted a system of Customer-Centric SLIs, supported by sub-SLIs, that allows us to understand how customers are experiencing the reliability of our service in the ways they care about: log and metrics ingestion, alerting, dashboard performance, and so forth.
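To make the SLI/SLO distinction concrete, here is a minimal sketch: the SLI is the thing we measure, and the SLO is the target we compare it against. The function name and the 99.9% target are illustrative assumptions, not Sumo Logic's actual values.

```python
# Sketch: an SLI is a measurement; the SLO is the target it is compared to.
# The 99.9% target below is a hypothetical example, not a real commitment.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return successful_requests / total_requests

SLO_TARGET = 0.999  # objective: 99.9% of requests succeed

sli = availability_sli(successful_requests=99_950, total_requests=100_000)
print(f"SLI = {sli:.4f}, SLO met: {sli >= SLO_TARGET}")  # SLI = 0.9995, SLO met: True
```

The same shape works for any Golden Signal: measure an indicator, then check it against the objective.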
As we discussed in the keynote at our recent user conference, Illuminate, we track these customer-centric indicators not in aggregate but on a per-customer basis. We think this is key: just as you as a customer couldn't care less about the properties of a given microservice in our platform, you don't really get any value from knowing that other customers experience good service levels. Instead, all that matters to you is the service level you are experiencing. This is our operational practice, and we have built, and continue to build, processes around this approach.
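A per-customer SLI is easy to sketch: group the raw request outcomes by customer before computing the indicator, instead of pooling them. The event records and customer names below are hypothetical; a real pipeline would derive them from logs or metrics.

```python
from collections import defaultdict

# Hypothetical request outcomes tagged with the customer they belong to.
events = [
    {"customer": "acme",   "ok": True},
    {"customer": "acme",   "ok": True},
    {"customer": "acme",   "ok": False},
    {"customer": "globex", "ok": True},
    {"customer": "globex", "ok": True},
]

def per_customer_sli(events):
    """Availability SLI computed separately for each customer."""
    ok, total = defaultdict(int), defaultdict(int)
    for e in events:
        total[e["customer"]] += 1
        ok[e["customer"]] += int(e["ok"])
    return {c: ok[c] / total[c] for c in total}

print(per_customer_sli(events))
# acme's failure drags down only acme's SLI; globex stays at 1.0.
# An aggregate SLI (4/5 = 0.8) would hide which customer is having a bad day.
```

This is the point of the per-customer view: the aggregate number can look healthy, or unhealthy, without telling you anything about the experience of a specific customer.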
And this leads us to the ultimate benefit of the Reliability By Objective methodology. Independent of what level of granularity you choose to measure your service reliability at, as soon as the objectives are established and indicators are being collected, alerting takes care of itself. Error budgets, how much they have depleted, and the trajectory of depletion allow you to alert on things that are relevant to the experience of your customers, saving you from waking up every single time a request results in an error or a CPU runs at 100% for 10 seconds. Sufficient sleep is the cornerstone of all high-performing teams!
We are excited about, and look forward to, talking with you in person about our own hard-earned experiences running reliable sites. And we want to learn from your experiences, so that together we can further evolve our platform to support the Objective-Driven Reliability paradigm!