November 7, 2024 Bashyam Anant

Reduce alert noise, automate incident response and keep coding with AI-driven alerting

Noisy monitors can lead to alert fatigue, which frustrates engineers and hinders innovation. With our patent-pending anomaly detection capabilities built on the power of AI, you can eliminate 60-90% of alerts. As a unique differentiator, Sumo Logic’s alerts can also trigger one or more playbooks to drive auto-diagnosis or remediation and accelerate time to recovery for application incidents.

Faster issue remediation means engineers can spend more time developing and releasing software. The combination of next-generation anomaly detection and automation is part of our AI-driven alerting capabilities, which are available to all Sumo Logic customers and will change how you troubleshoot.

Alerting challenges

The noisiest 5% of monitors on logs and metrics used by Sumo Logic trigger seven times per day, as shown in the graph below. Anecdotally, half of these noisy alerts trigger after regular work hours. These stats imply that the volume of alerts from modern applications can overwhelm on-call teams.

Top 5% of noisiest monitors trigger seven times per day

A substantial portion of these alerts is often irrelevant or insignificant, contributing to alert fatigue among operators. The image below shows a monitor for a Sumo Logic customer that, over three days, generated two false positives (i.e., false alarms) and one false negative (i.e., an alert that was not triggered when it should have been). False positives are a distraction that pulls engineers away from their focused work, while false negatives hide genuine problems that developers actually need to act on.

Customer monitor with two false positives and one false negative

For AI-driven alerting, we were convinced we could leverage real-time AI and ML to drive up our accuracy and keep developers focused on the work they do best.

While reducing false positives helps developers with alert fatigue, when incidents do occur, you want to resolve them quickly. AI-driven alerting also features playbooks for automating incident diagnosis and, if necessary, recovery.

Why are monitors noisy?

First-generation anomaly detection, such as Sumo Logic outlier, learns dynamic baselines from recent data points and can avoid the problem of finding optimum static thresholds. However, these techniques can still generate false positives because they:

  • Don't adjust for seasonality, especially longer-range periodicity such as weekly or weekday/weekend periodicity. As a result, some monitors based on first-generation anomaly detection might trip on weekends because they do not factor in the expected weekend dip for most business apps, and a weekend is a particularly annoying time for a false alarm.
  • Require tuning of many parameters based on periodic assessment of false positives and false negatives. During our customer previews, we found that many customers' on-call engineers spend a lot of time tweaking monitor thresholds or the parameters of an alerting feature.
  • Cannot support contextual and dynamic thresholds. For example, for some signals in some contexts, you may want to wait for ten minutes of sustained degradation before triggering an alert, while in others you may want to trigger an alert even for a single anomalous data point.
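
To make the seasonality problem concrete, here is a minimal Python sketch of a trailing-window baseline. It is an illustration only, not Sumo Logic's outlier implementation; the function name `rolling_outliers`, the window size, the threshold `k` and the sample data are all assumptions.

```python
from statistics import mean, stdev

def rolling_outliers(values, window=5, k=3.0):
    """Flag points more than k standard deviations from the trailing-window mean."""
    flagged = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(values[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Hypothetical data: two weeks of daily request counts, Monday first.
# Traffic dips every weekend, which is normal for this made-up business app.
daily = [100, 102, 98, 101, 99, 40, 42,
         101, 99, 100, 102, 98, 41, 39]

# The baseline only knows about the last five weekdays, so each Saturday
# (indexes 5 and 12) looks like an anomaly even though the dip is expected.
weekend_false_alarms = rolling_outliers(daily)
```

Because the baseline has no memory of previous weekends, it flags both Saturdays, which is the kind of weekend false alarm described in the first bullet.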

The benefits of AI-driven alerting

AI-driven alerting addresses challenges with first-generation anomaly detection through the following strategies:

  • Model-driven anomaly detection: AI-driven alerts use 60 days of historical data (when available) to train and test an ML model so that hourly, daily and weekly (especially, weekday/weekend) seasonality are factored into baselines. An anomaly is an unusual datapoint compared to the baseline or expected value.

  • AutoML: AI-driven alerts embed an AutoML framework in which the analytics tune themselves based on model performance on training datasets. Simply put, AutoML supports a “set it and forget it” experience with minimal user intervention.

  • Model contextual and dynamic thresholds: AI-driven alerts have a sensitivity setting (low sensitivity for signals that are expected to be noisy and high sensitivity for critical indicators). Additionally, the user can configure the incident detector based on context. For example, in the Cluster detector, the user can specify how many data points in a detection window of, say, five minutes need to be unusual before triggering an alert.
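
The ideas in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ML model or AutoML framework that AI-driven alerting uses: the baseline here is just a per-weekday mean and standard deviation, `width` stands in for the sensitivity setting (a narrower band means higher sensitivity), and `cluster_alert` mimics the idea of requiring several unusual points within a detection window. All function names, parameters and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def seasonal_bounds(history, period=7, width=3.0):
    """Per-slot (for example, per-weekday) expected band: mean +/- width * stdev."""
    slots = defaultdict(list)
    for i, value in enumerate(history):
        slots[i % period].append(value)
    bounds = {}
    for slot, vals in slots.items():
        mu = mean(vals)
        sigma = stdev(vals) if len(vals) > 1 else 0.0
        bounds[slot] = (mu - width * sigma, mu + width * sigma)
    return bounds

def is_unusual(index, value, bounds, period=7):
    """A point is unusual if it falls outside the band for its seasonal slot."""
    low, high = bounds[index % period]
    return value < low or value > high

def cluster_alert(flags, window=5, min_unusual=3):
    """True if any `window` consecutive points contain at least `min_unusual` unusual ones."""
    last_start = max(1, len(flags) - window + 1)
    return any(sum(flags[i:i + window]) >= min_unusual for i in range(last_start))

# Hypothetical history: four weeks of daily counts, Monday first, with weekend dips.
history = [100, 102, 98, 101, 99, 40, 42,
           101, 99, 100, 102, 98, 41, 39,
           99, 101, 102, 98, 100, 42, 40,
           102, 98, 101, 99, 101, 39, 41]
bounds = seasonal_bounds(history)

saturday_dip_flagged = is_unusual(33, 41, bounds)   # index 33 is a Saturday
monday_drop_flagged = is_unusual(28, 60, bounds)    # index 28 is a Monday
```

With the weekday/weekend pattern in the baseline, a Saturday value of 41 falls inside its expected band, while a Monday value of 60 does not. Passing a list of such flags to `cluster_alert` then decides whether enough points in the window were unusual to raise an alert.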

AI-driven alerting case study

One of our preview customers for AI-driven alerting is a B2C modern application company that had many first-generation Sumo Logic outlier monitors that were noisy, primarily because they missed the weekday/weekend periodicity of their signals.

AI-driven alerting successfully modeled the periodicity in the data, as indicated by the blue line in the chart below; the red lines are the upper and lower bounds predicted by the ML model. With AI-driven alerting, false alarms were successfully mitigated while alerts were still triggered on genuine issues.

AI-driven monitor successfully models weekend/weekday periodicity

Incident response automation

When incidents are detected correctly via anomaly detection or otherwise, you want to resolve them quickly to minimize customer impact and lost revenue. Recovery time for production incidents is about 30 minutes, which is driven largely by the ad hoc nature of reading through text playbooks, contacting subject matter experts, collecting diagnostics, forming hypotheses and taking action. What if diagnosis and/or recovery time could be reduced to five minutes through automation?

Sumo Logic Automation Service is now integrated with monitors. Any logs or metrics monitor can be associated with one or more playbooks authored by subject matter experts in the Automation Service. When the monitor triggers, the playbooks execute and can cut minutes or even hours from the response.


Here is an example of an auto-diagnosis playbook, in response to a site-down alert, where the customer is running six log searches and one metrics search in parallel, collating the results and alerting an on-call user with a summary of the incident. In some cases, the root cause might be part of the summary; in other cases, the playbook helps eliminate known root causes so that the on-call engineer can begin an ad hoc investigation. Either way, this auto-diagnosis playbook reduced the recovery time.
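
The fan-out-and-collate pattern in that playbook can be sketched in Python as follows. This is a rough illustration only: real playbooks are authored in the Automation Service, and nothing here uses its API. The search names, the canned result and the `run_search` stand-in are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical diagnostics: six log searches and one metrics search.
SEARCHES = [
    "log:deploy_events", "log:http_5xx", "log:db_errors",
    "log:oom_kills", "log:dns_failures", "log:cert_expiry",
    "metric:cpu_saturation",
]

# Canned results standing in for real search output (an assumption for the demo).
CANNED = {"log:http_5xx": "5xx rate spiked on checkout service"}

def run_search(name):
    """Stand-in for a log or metrics search; returns (name, finding or None)."""
    return name, CANNED.get(name)

def auto_diagnose(search_names):
    """Run all searches in parallel, then collate them into one summary."""
    with ThreadPoolExecutor(max_workers=len(search_names)) as pool:
        results = dict(pool.map(run_search, search_names))
    return {
        # Searches that found something: candidate root causes.
        "findings": {k: v for k, v in results.items() if v is not None},
        # Searches that came back clean: known causes that can be eliminated.
        "ruled_out": [k for k, v in results.items() if v is None],
    }

summary = auto_diagnose(SEARCHES)
```

The summary carries both halves of the value described above: `findings` may point at the root cause, and `ruled_out` lists the known causes the on-call engineer no longer needs to check.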

Auto-diagnosis playbook example

While these examples relate to application incidents, AI-driven alerting is also relevant for security alerts: it cuts noise and automates incident response through playbooks. Many Cloud SIEM customers use playbooks already; with this release, playbooks can be attached to any logs- or metrics-based security monitor. With Flex Licensing, our aim is to cover 100% of your logs and use cases.

AI-driven alerting reduces alert noise and accelerates incident diagnosis and recovery time, changing how you troubleshoot and secure your applications and infrastructure. Learn more about AI and log analytics, or start your free trial and test it for yourself.

Bashyam Anant

Sr Director, Advanced Analytics

As a general manager and innovation leader, Bashyam has driven 30+ software products ($1B ARR, 7 patents) from concept to market leadership. At Sumo Logic, Bashyam leads AI experiences and platform products, which feature petabyte-scale machine learning to drive insights and action for cybersecurity and application reliability outcomes. In a prior strategy consulting career, Bashyam advised leadership teams at Boeing, HP and Motorola on new market and product assessments leading to multimillion dollar businesses.
