December 20, 2022 Bashyam Anant

11 unique insights into SLOs and reliability management

A quarter has passed since we launched our Reliability Management capabilities that help developers focus on defining, monitoring and managing Service Level Objectives (SLOs) to drive great digital experiences. Reducing alert fatigue and balancing innovation with reliability are common outcomes that customers expect from Reliability Management.

If you are new to SLOs, these insights from our customers capture common practices among peer developers. Where possible, we showcase best practices that you can adopt on your own journey. If you have practiced SLOs for a while, some of these insights might surprise you!

1. Most customers are early in their SLO journey

75% of organizations have five or fewer SLOs in place, indicating that they are in the early stages of SLO adoption. 12.5% of customers have more than 50 SLOs. These customers have automated the SLO management process through APIs.

2. SLO adoption by EU customers lags behind NAM and APAC

Customers in APAC, particularly in Australia and India, have 17 SLOs on average, while NAM customers have 12 SLOs on average. In contrast, and surprisingly, EU customers lag behind with just two SLOs per organization on average.

Anecdotally, many Sumo Logic customers in Australia and India are born-in-the-cloud, modern app companies, while other regions feature a mix of modern and modernizing applications. At first glance, this suggests that SLO adoption is most prevalent in modern app contexts, although more data and contextual analysis will be required to confirm this hypothesis.

3. Latency, availability and errors account for 90% of all SLOs

Over 90% of SLOs are based on the latency, availability and error indicators. We have always recommended latency and errors (or a combined error-free latency) as best practice signals for SLOs, as they are good indicators of great user experiences. It is great to see that reflected in the data.

Somewhat surprising is the relative importance of availability as an SLO signal. Simply put, an unavailable app has effectively infinite latency, so a latency SLO would capture availability as well. We suspect customers want redundancy in their SLO strategy.

This is similar to redundancy for monitors: some Sumo Logic customers, including Sumo Logic itself, have more monitors than they strictly need, to account for the potential unreliability of particular monitors. Customers use the "other" category for non-observability indicators; for example, we have a customer that uses SLOs for assessing the reliability of their security operations.
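
To make the combined error-free latency signal concrete, here is a minimal Python sketch. It is an illustration only, not how Sumo Logic computes SLIs: the `Request` record, its field names, the 500 ms threshold and the "status below 500 means success" rule are all assumptions. It also shows why a latency SLO implicitly covers availability, since a request that never gets a response is counted as bad.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    # Hypothetical record shape, for illustration only.
    status: int                  # HTTP-style status code
    latency_ms: Optional[float]  # None when the app never responded


def is_good(req: Request, threshold_ms: float = 500.0) -> bool:
    """A request is good only if it succeeded AND answered within the threshold."""
    if req.latency_ms is None:
        # Unavailable app: treat latency as infinite, so the request is bad.
        return False
    return req.status < 500 and req.latency_ms <= threshold_ms


def error_free_latency_sli(requests: Sequence[Request],
                           threshold_ms: float = 500.0) -> float:
    """Fraction of requests that were both error-free and fast enough."""
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(is_good(r, threshold_ms) for r in requests)
    return good / len(requests)


sample = [
    Request(200, 120.0),  # fast and successful: good
    Request(200, 900.0),  # successful but too slow: bad
    Request(503, 50.0),   # fast but failed: bad
    Request(0, None),     # no response at all: bad
]
print(error_free_latency_sli(sample))  # 0.25
```

Only one of the four sample requests is both fast and error-free, so the indicator is 0.25.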

4. Median SLO targets vary by golden signal type

Error SLOs feature the most aggressive target, with a median of 99.95%. Latency and other signals are less aggressive, with a median target of 99%. This seems reasonable: users are likely to tolerate a slow response more readily than a failed one, so errors impact user experience more than latency does.
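
To get a feel for the gap between these two median targets, the short Python sketch below converts a target into an error budget. It assumes a 30-day compliance period and a time-based reading of the target (minutes of allowed non-compliance); both are assumptions made for illustration, and the function name is hypothetical.

```python
def allowed_bad_minutes(target_percent: float, period_days: int = 30) -> float:
    """Minutes of non-compliance a target tolerates over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


for target in (99.95, 99.0):
    print(f"{target}% -> {allowed_bad_minutes(target):.1f} minutes")
# 99.95% -> 21.6 minutes
# 99.0% -> 432.0 minutes
```

Under these assumptions, a 99.95% target leaves about 21.6 minutes of budget in 30 days, while a 99% target leaves 432 minutes (7.2 hours), a twentyfold difference.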

5. Logs dominate SLOs

73% of SLOs are defined on logs, which underscores the importance of logs for top-level application observability.

6. Request-based SLOs are more common than window-based SLOs

62% of SLOs use request-based evaluation. Request-based SLOs are easier to understand and set up than window-based SLOs, so we are surprised that customers do not show the overwhelming preference for them that we expected.

It is possible that customers find that window-based SLOs lessen the impact of particularly bad days on their measured reliability, despite being more complex to configure.
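
The difference between the two evaluation styles can be sketched in a few lines of Python. This uses generic definitions rather than Sumo Logic's exact implementation: request-based evaluation divides good requests by total requests over the whole period, while window-based evaluation first judges each window against a threshold and then counts good windows. The data shape, the one-hour window size and the 99% per-window threshold are assumptions.

```python
from typing import Sequence, Tuple

# Each entry is (good_requests, total_requests) for one evaluation window,
# for example one hour. This data shape is an assumption for illustration.
Window = Tuple[int, int]


def request_based_sli(windows: Sequence[Window]) -> float:
    """Good requests divided by all requests across the whole period."""
    good = sum(g for g, _ in windows)
    total = sum(t for _, t in windows)
    return good / total


def window_based_sli(windows: Sequence[Window],
                     window_threshold: float = 0.99) -> float:
    """Fraction of windows whose own success ratio meets the threshold.

    Windows without traffic are skipped rather than counted either way.
    """
    evaluated = [(g, t) for g, t in windows if t > 0]
    good_windows = sum(1 for g, t in evaluated if g / t >= window_threshold)
    return good_windows / len(evaluated)


# 23 healthy hours plus one hour with a severe, high-traffic incident.
day = [(1000, 1000)] * 23 + [(5000, 10000)]
print(f"{request_based_sli(day):.3f}")  # 0.848
print(f"{window_based_sli(day):.3f}")   # 0.958
```

In this made-up day, the single bad hour carries a lot of traffic, so it drags the request-based result down to about 84.8%. The window-based result only loses one window out of 24, which is one way a burst of failures can weigh less under window-based evaluation.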

7. Most SLOs use rolling compliance periods

75% of SLOs use rolling compliance periods, compared to 25% that use calendar compliance periods. We speculate that calendar compliance is used by customers that offer Service Level Agreements (SLAs). Anecdotally, these are less common for modern apps.

Some customers have also mentioned aligning compliance periods with sprint boundaries used by their development teams. Such a practice, when adopted more broadly, would result in greater prevalence of calendar compliance periods.
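
The distinction between rolling and calendar compliance comes down to where the evaluation period starts. The Python sketch below illustrates it with hypothetical helper functions; the convention that a calendar week starts on Monday at 00:00 is an assumption, and a product may use a different week start or time zone handling.

```python
from datetime import datetime, timedelta


def rolling_period_start(now: datetime, days: int) -> datetime:
    """A rolling period always looks back a fixed number of days from now."""
    return now - timedelta(days=days)


def calendar_week_start(now: datetime) -> datetime:
    """A calendar week resets at a fixed boundary.

    Assumes weeks start on Monday at 00:00.
    """
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=now.weekday())


now = datetime(2022, 12, 20, 15, 30)  # a Tuesday afternoon
print(rolling_period_start(now, 7))   # 2022-12-13 15:30:00
print(calendar_week_start(now))       # 2022-12-19 00:00:00
```

With a rolling period, an incident keeps counting against the SLO until it is a full seven days old. With a calendar period, the budget resets at the boundary, which is why it lines up naturally with SLA reporting or sprint cadences.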

8. Most common rolling compliance periods are one, seven and 30 days

The most common durations for rolling compliance are seven and 30 days, followed by one day. These may be the result of anchoring, as our user interface offers day, week and month as drop-down choices.

9. Most common calendar compliance period is one week

The most common duration for calendar compliance is one week, with 98% of SLOs using this value. We suspect this is the result of anchoring, as our user interface offers calendar week as a drop-down choice.

10. Less than six percent of SLOs use monitoring

Surprisingly, customers set up monitors for only about 6% of SLOs. In other words, most often customers consult dashboards to assess SLO performance and don’t opt to get alerted on it. We’re surprised by this as we assumed that teams would prefer to get informed automatically when reaching various SLO thresholds. The desire to track this directly in the dashboard implies that teams are actively monitoring their SLOs on an ongoing basis.

Many customers use SLOs as a planning tool to balance innovation with reliability — and planning activities are best served via dashboards. We will continue to watch this indicator for shifts in customer behavior.

11. SLO alerts appear to be high fidelity

When customers do set up monitors for SLOs, these monitors trigger only two times in 30 days. In other words, SLO alerts trigger rarely, which suggests that customers are doing a great job with reliability already. Perhaps this is also a sign that those using this capability in Sumo Logic were already more established in their reliability journey.

In addition, SLO-based monitoring has the potential to streamline alerts significantly. We will continue to watch this indicator as a sign of enhanced progress in reliability management across our customer base.

Next Steps

Learn more about how SLOs, and more generally reliability management, can improve your decision making. Get started today using this microlesson.

Bashyam Anant

Sr Director, Advanced Analytics

As a general manager and innovation leader, Bashyam has driven 30+ software products ($1B ARR, 7 patents) from concept to market leadership. At Sumo Logic, Bashyam leads AI experiences and platform products, which feature petabyte-scale machine learning to drive insights and action for cybersecurity and application reliability outcomes. In a prior strategy consulting career, Bashyam advised leadership teams at Boeing, HP and Motorola on new market and product assessments leading to multimillion dollar businesses.
