What is an error budget?
An error budget is how much downtime a system can afford without upsetting customers, or, in other words, the margin of error permitted by a service level objective (SLO).
Key takeaways
- In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes.
- While SRE teams usually track an error budget, they don’t make decisions regarding how it should be spent.
- Error budgets can be measured in relation to availability or uptime, both of which are defined by a company’s Service Level Objective (SLO).
- To use uptime effectively in error budgets, you'll need to translate SLA/SLO targets into real numbers that development teams can use.
Why do tech teams need and use error budgets?
A site reliability engineering (SRE) team comprises software engineers who improve the reliability of their systems and software in production. While SRE teams usually track an error budget, they don’t make decisions regarding how it should be spent. Instead, SRE teams work with development teams to set error budgets and policies.
The key stakeholders involved in creating the error budget are:
- Product owners, including product managers, business analysts, and product leads, who speak on behalf of the customer to the development team to communicate customer needs and the user journey.
- SRE and operations teams, including DevOps, ITSM, problem management, and infrastructure engineers, who use software to manage a service, solve problems, and automate operations tasks.
- Engineers who work on the product.
- Customers, since the SLOs are non-legally binding promises that the service provider makes to them.
What is the purpose of an error budget?
There is a delicate balance between releasing new features and maintaining an acceptable level of availability to customers. An error budget tracks whether a company is meeting the promises it has made for a system or service and prevents it from pursuing too much innovation at the expense of that system or service’s reliability.
In practice, an error budget serves as a data point for deciding when to accelerate innovation or implement freezes. When a company exceeds its error budget, SRE teams can pause innovation to eliminate persistent causes of errors from the system.
Error budgets and SLO
Error budgets can be measured in relation to availability or uptime, both of which are defined by a company’s Service Level Objective (SLO). In other words, an error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.
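The relationship above is plain arithmetic. Here is a minimal sketch in Python; the function name `error_budget` is a hypothetical helper chosen for illustration, not part of any particular tool.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget, in percent, for an SLO given in percent."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("SLO must be between 0 and 100 percent")
    # The budget is whatever the SLO leaves over: 100% minus the target.
    return 100.0 - slo_percent


# A 99.9% SLO leaves a 0.1% error budget.
print(f"{error_budget(99.9):.1f}%")
```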
Error budgets and SLI
Service-level indicators (SLIs) are the measures that indicate whether an SLO is being met. An SLI ranges from 0% to 100%, where 0% means nothing works and 100% means nothing is broken. As part of operationalizing SLOs, SRE teams translate SLI percentages into days and hours for software engineers.
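The text does not say how an SLI is calculated. One common approach, assumed here purely for illustration, is the share of successful events out of all events. The function names and the request counts below are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return an availability SLI in percent (assumed: good events / total events)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return 100.0 * good_events / total_events


def slo_met(sli_percent: float, slo_percent: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


# Hypothetical numbers: 9,985 successful requests out of 10,000.
sli = availability_sli(9_985, 10_000)
print(f"SLI: {sli:.2f}%  meets 99% SLO: {slo_met(sli, 99.0)}")
```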
Error budgets and SLA
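Some downtime is inevitable, which is why service-level agreements (SLAs) should never promise 100% uptime. An SLO is usually set tighter than the SLA, so a missed SLO acts as an early warning; the terms of a company’s SLA kick in when availability falls below the level the SLA promises.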
Suppose a company has an SLA of 98% and an SLO of 99% availability. The error budget would be 1%, and 1% of a 28-day window (672 hours) is 6.72 hours of downtime. If the SLI dips below 99% during that 28-day window, then the service has used up its entire error budget and is no longer meeting the SLO.
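The 28-day calculation can be reproduced in a few lines of Python. This is a sketch under the numbers given above; `allowed_downtime` and `remaining_budget` are hypothetical helper names, and the two hours of downtime in the usage lines are made up for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Total downtime the SLO permits over the given window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


def remaining_budget(slo_percent: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Budget left in the window; a negative value means it is overspent."""
    return allowed_downtime(slo_percent, window) - downtime_so_far


window = timedelta(days=28)
budget = allowed_downtime(99.0, window)
print(f"Budget: {budget.total_seconds() / 3600:.2f} hours")

# Hypothetical: the service has been down for 2 hours so far in this window.
left = remaining_budget(99.0, window, timedelta(hours=2))
print(f"Remaining: {left.total_seconds() / 3600:.2f} hours")
```

Using `timedelta` keeps the units explicit, so the same helpers work for a 7-day or 30-day window without changing the arithmetic.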
If availability is above the number promised by the SLA/SLO, an SRE team can release new features and take risks. But if it’s below the target, releases halt until the target numbers are back on track.
What happens if you’ve spent or are close to spending your error budget?
When an error budget is close to being spent, SRE teams work with the development team to implement alerts and policies to minimize the impact failures and outages have on customers. This alerting policy is what makes error budgets and SLOs actionable.
If a team burns through its entire error budget, then contingency policies can come into effect to prevent further customer impact. For example, the team might go into code red and freeze all new releases until the number of errors is adequately reduced.
If there are simply too many errors, then the SRE team may have to do a system rollback to give developers enough time to deal with the errors gradually and release the changes over time.
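The escalation described in this section can be expressed as a small policy function. The sketch below is illustrative only: the 80% warning threshold, the `Action` names, and `policy_action` are assumptions, and a real policy would be agreed between the SRE and development teams.

```python
from enum import Enum


class Action(Enum):
    RELEASE_NORMALLY = "release normally"
    ALERT_AND_REVIEW = "alert and review with the development team"
    FREEZE_RELEASES = "code red: freeze new releases"


def policy_action(consumed_fraction: float, warn_threshold: float = 0.8) -> Action:
    """Map the fraction of the error budget already spent to a policy action.

    consumed_fraction: 0.0 means nothing spent, 1.0 means fully spent.
    warn_threshold: assumed cutoff for "close to being spent".
    """
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    if consumed_fraction >= 1.0:
        return Action.FREEZE_RELEASES
    if consumed_fraction >= warn_threshold:
        return Action.ALERT_AND_REVIEW
    return Action.RELEASE_NORMALLY


for spent in (0.3, 0.85, 1.1):
    print(f"{spent:.0%} spent -> {policy_action(spent).value}")
```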
How to use an error budget in your organization
Most DevSecOps teams monitor the uptime of applications and systems on a monthly basis. If uptime is above the SLA/SLO number, then engineering teams can take greater risks. This means more feature releases, more experiments, etc. If uptime is below the SLA/SLO number, then teams need to consider this and slow down the release schedule until uptime is back on track.
To use uptime effectively in error budgets, you'll need to translate SLA/SLO targets into real numbers that development teams can use.
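As one way of turning targets into numbers a team can plan around, the sketch below converts a few example availability targets into allowed downtime per month. The 30-day month, the list of targets, and the name `downtime_minutes` are assumptions made for illustration.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day month


def downtime_minutes(slo_percent: float,
                     window_minutes: float = MINUTES_PER_30_DAYS) -> float:
    """Minutes of downtime an availability target allows within the window."""
    return (100.0 - slo_percent) / 100.0 * window_minutes


# Example targets only; substitute the targets from your own SLA/SLO.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime -> {downtime_minutes(target):.2f} minutes per 30 days")
```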
How Sumo Logic can help
Businesses are focused on achieving their goals, which is why they value robust observability platforms, like Sumo Logic, to help them measure their objectives and ensure they’re on track to meeting their KPIs, deadlines, and long-term strategies.
Try Sumo Logic’s free trial today to see how we can help you reach your goals and maintain quality assurance.
FAQs
How can error budget policies be effectively implemented within a development team?
To effectively implement error budget policies within a development team, clearly define service level objectives (SLOs) and service level indicators (SLIs) that align with the team's goals. From there, consider the following best practices:
- Establish a structured process for tracking, monitoring, and reporting on error budgets, error rates, and reliability improvements.
- Encourage cross-functional collaboration between the development team, site reliability engineers (the SRE team), and product owners to balance new feature development against system reliability.
- Regularly review error budget consumption and the remaining error budget to make informed decisions and address any SLO violations promptly.
- Continuously evaluate and adjust error budget policies to meet reliability goals, customer experience standards, and availability targets.
How often should error budgets be reviewed and recalibrated?
Error budgets should ideally be reviewed and recalibrated regularly, typically aligned with the frequency of service level objective (SLO) reviews. This ensures that error budgets remain relevant to performance metrics and organizational goals. Depending on the specific needs of the system and the criticality of the services being provided, error budgets may be reevaluated monthly, quarterly or annually to ensure they accurately reflect the acceptable level of errors that can occur without compromising reliability.