A quarter has passed since we launched our Reliability Management capabilities, which help developers define, monitor and manage Service Level Objectives (SLOs) to drive great digital experiences. Reducing alert fatigue and balancing innovation with reliability are common outcomes that customers expect from Reliability Management.
If you are new to SLOs, these insights from our customers capture common practices among peer developers. Where possible, we showcase best practices that you can adopt on your own journey. If you have practiced SLOs for a while, some of these insights might surprise you!
1. Most customers are early in their SLO journey
75% of organizations have five or fewer SLOs in place, indicating that they are in the early stages of SLO adoption. At the other end of the spectrum, 12.5% of customers have more than 50 SLOs; these customers have automated SLO management through our APIs.
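For teams heading toward automation, here is a minimal sketch of what API-driven SLO creation can look like. The endpoint path and payload fields below are illustrative assumptions, not the exact Sumo Logic schema; consult the SLO API reference for the real contract.

```python
import os
import requests

# Minimal sketch of creating an SLO programmatically. Field names and the
# payload structure are assumptions for illustration; see the Sumo Logic
# SLO API reference for the actual schema.
API_URL = "https://api.sumologic.com/api/v1/slos"  # assumed endpoint
AUTH = (os.environ["SUMO_ACCESS_ID"], os.environ["SUMO_ACCESS_KEY"])

slo = {
    "name": "checkout-latency",
    "description": "95th percentile checkout latency under 500 ms",
    "signalType": "Latency",          # assumed field name
    "compliance": {                   # assumed structure
        "complianceType": "Rolling",
        "size": "30d",
        "target": 99.0,
    },
    # Indicator (the underlying log/metric query) omitted for brevity.
}

response = requests.post(API_URL, json=slo, auth=AUTH)
response.raise_for_status()
print("Created SLO:", response.json().get("id"))
```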
2. SLO adoption by EU customers lags behind NAM and APAC
Customers in APAC, in particular Australia and India, have created 17 SLOs on average, while NAM customers have created 12 on average. In contrast, and surprisingly, EU customers seem to be lagging behind with just two SLOs per organization on average.
Anecdotally, many Sumo Logic customers in Australia and India are born-in-the-cloud, modern app companies, while other regions feature a mix of modern and modernizing applications. At first glance, this suggests that SLO adoption is most prevalent in modern app contexts; more data and contextual analysis will be required to confirm this hypothesis.
3. Latency, availability and errors account for 90% of all SLOs
Over 90% of SLOs are based on the latency, availability and error indicators. We have always recommended latency and errors (or a combined error-free latency) as best practice signals for SLOs, as they are good indicators of great user experiences. It is great to see that reflected in the data.
Somewhat surprising is the relative importance of availability as an SLO signal. Simply put, an unavailable app has effectively infinite latency, so a latency SLO would capture availability as well. We suspect customers want redundancy in their SLO strategy.
This is similar to redundancy for monitors: some Sumo Logic customers, including Sumo Logic itself, run more monitors than strictly necessary to account for the potential unreliability of any single monitor. Customers use the "other" category for non-observability indicators; for example, one customer uses SLOs to assess the reliability of their security operations.
4. Median SLO targets vary by golden signal type
Error SLOs feature the most aggressive median target, 99.95%. Latency and other signals are less aggressive, with a median target of 99%. This seems reasonable: errors impact user experience more directly, since users will tolerate a slow response more readily than a failed one.
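To make those medians concrete, here is a quick back-of-the-envelope sketch in Python (illustrative only, not a Sumo Logic feature) of the error budget each target implies over a 30-day period.

```python
# Error budget implied by an SLO target over a compliance period.
def error_budget_minutes(target_pct: float, period_days: int = 30) -> float:
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_pct / 100)

print(error_budget_minutes(99.95))  # 21.6 minutes of budget per 30 days
print(error_budget_minutes(99.0))   # 432.0 minutes (7.2 hours) per 30 days
```

In other words, a 99.95% error target leaves roughly 21 minutes of budget in a month, while a 99% latency target leaves over seven hours.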
5. Logs dominate SLOs
73% of SLOs are defined on logs, which underscores the importance of logs for top-level application observability.
6. Request-based SLOs are more common than window-based SLOs
62% of SLOs use request-based evaluation. Request-based SLOs are easier to understand and set up than window-based SLOs, so we expected an overwhelming preference for them; we are surprised not to see one.
It is possible that customers find that window-based SLOs, despite their more complex configuration, lessen the impact of a few particularly bad stretches on their overall reliability.
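To illustrate the difference, here is a simplified sketch of the two evaluation styles (hypothetical helper functions, not Sumo Logic's evaluation engine): request-based SLIs count individual good requests, while window-based SLIs count good time windows, so one catastrophic minute costs the SLI no more than a barely-failing one.

```python
from typing import Iterable, List

def request_based_sli(request_ok: Iterable[bool]) -> float:
    """Fraction of individual requests that were good over the period."""
    outcomes = list(request_ok)
    return sum(outcomes) / len(outcomes)

def window_based_sli(windows: List[List[bool]], threshold: float = 0.95) -> float:
    """Fraction of fixed time windows (e.g., one minute each) that were good.

    A window is good if at least `threshold` of its requests succeeded.
    A terrible window counts the same as a mildly bad one, which caps the
    damage any single bad stretch can do to the SLI.
    """
    good = sum(1 for w in windows if w and sum(w) / len(w) >= threshold)
    return good / len(windows)
```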
7. Most SLOs use rolling compliance periods
75% of SLOs use rolling compliance periods, compared to 25% using calendar periods. We speculate that calendar compliance is favored by customers that offer Service Level Agreements (SLAs); anecdotally, these are less common for modern apps.
Some customers have also mentioned aligning compliance periods with sprint boundaries used by their development teams. Such a practice, when adopted more broadly, would result in greater prevalence of calendar compliance periods.
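For intuition, the difference between the two period types comes down to how the evaluation window is anchored. A minimal sketch (illustrative only):

```python
from datetime import datetime, timedelta

def rolling_window(now: datetime, days: int = 30):
    """Rolling: the window always covers the trailing N days."""
    return now - timedelta(days=days), now

def calendar_week_window(now: datetime):
    """Calendar: the window snaps to a boundary (here, Monday 00:00),
    so compliance resets at fixed points, e.g., sprint starts."""
    start_of_week = (now - timedelta(days=now.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return start_of_week, now
```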
8. Most common rolling compliance periods are one, seven and 30 days
The most common durations for rolling compliance periods are seven and 30 days, followed by one day. It is possible that these are the result of anchoring, as our user interface offers day, week and month as drop-down choices.
9. Most common calendar compliance period is one week
The most common duration for calendar compliance is one week, with 98% of such SLOs using this value. We suspect this is the result of anchoring, as our user interface offers the calendar week as a drop-down choice.
10. Less than six percent of SLOs use monitoring
Customers set up monitors for only about 6% of SLOs. In other words, customers most often consult dashboards to assess SLO performance rather than opting for alerts. This surprised us, as we assumed teams would prefer to be informed automatically when crossing SLO thresholds. The preference for dashboards implies that teams actively monitor their SLOs on an ongoing basis.
Many customers use SLOs as a planning tool to balance innovation with reliability — and planning activities are best served via dashboards. We will continue to watch this indicator for shifts in customer behavior.
11. SLO alerts appear to be high fidelity
When customers do set up monitors for SLOs, these monitors trigger only about twice in 30 days. In other words, SLO alerts fire rarely, which implies that customers are already doing a great job with reliability. Perhaps this is also a sign that those using this capability in Sumo Logic were already further along in their reliability journey.
In addition, SLO-based monitoring has the potential to streamline alerting significantly. We will continue to watch this indicator for signs of progress in reliability management across our customer base.
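One way SLO-based monitoring streamlines alerting is burn-rate alerting, a pattern popularized by the Google SRE Workbook (a generic sketch below, not a Sumo Logic configuration): instead of paging on every error spike, you page only when the error budget is being consumed faster than the compliance period can sustain.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent, relative to a uniform
    spend over the full compliance period. 1.0 = exactly sustainable."""
    budget = 1.0 - slo_target  # e.g., 0.001 for a 99.9% target
    return observed_error_rate / budget

# Example: 1% of requests failing against a 99.9% SLO burns budget at
# 10x the sustainable rate, which is page-worthy; a 0.05% error rate
# (0.5x) is within budget and need not wake anyone up.
print(burn_rate(0.01, 0.999))    # 10.0
print(burn_rate(0.0005, 0.999))  # 0.5
```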
Next Steps
Learn more about how SLOs, and more generally reliability management, can improve your decision making. Get started today using this microlesson.