Understanding Sustain and Cooldown Periods

Server Scout's alerting system uses sustain and cooldown periods to ensure you receive meaningful notifications whilst avoiding alert fatigue. Understanding how these periods work will help you configure more effective monitoring for your infrastructure.

The Alert State Machine

Server Scout alerts follow a three-state system that determines when notifications are sent:

  1. OK - The metric is within normal parameters
  2. Pending - A threshold has been breached, but we're waiting to confirm it's not just a brief spike
  3. Firing - The breach has been sustained long enough to warrant notification

When a metric first crosses your defined threshold, the alert immediately transitions from OK to Pending. However, no notification is sent at this stage. Instead, Server Scout evaluates the readings that follow over what's called the sustain period.

For an alert to progress from Pending to Firing, at least 80% of the readings taken during the sustain period must breach the threshold. This approach prevents false alarms from temporary spikes whilst still catching genuine issues quickly.

For example, if your sustain period is 5 minutes and Server Scout checks every minute, you'll have 5 readings. At least 4 of these readings must breach the threshold for the alert to fire.
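The arithmetic behind the 80% rule can be sketched in a few lines of Python. This is an illustration only, not Server Scout's actual implementation: the function names `required_breaches` and `should_fire` are hypothetical, and the sketch assumes that the required count is rounded up and that a reading breaches when it is strictly above the threshold.

```python
def required_breaches(readings: int, percent: int = 80) -> int:
    """Smallest number of breaching readings that reaches `percent` of `readings`.

    Integer arithmetic is used so the result is rounded up without float error.
    """
    return (readings * percent + 99) // 100


def should_fire(readings: list[float], threshold: float, percent: int = 80) -> bool:
    """Return True if enough readings in the sustain window breach the threshold."""
    if not readings:
        return False  # assumption: no readings means nothing to fire on
    breaches = sum(1 for value in readings if value > threshold)
    return breaches >= required_breaches(len(readings), percent)


# 5-minute sustain period with 1-minute checks gives 5 readings.
print(required_breaches(5))                              # 4
print(should_fire([95, 92, 70, 96, 91], threshold=90))   # 4 of 5 breach -> True
print(should_fire([95, 60, 70, 96, 91], threshold=90))   # 3 of 5 breach -> False
```

Under the same rounding assumption, a 3-minute sustain period with 1-minute checks would need all 3 readings to breach, because 80% of 3 rounds up to 3.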

Understanding Sustain Periods

The sustain period serves as a buffer against noisy metrics and brief anomalies. Without this mechanism, you might receive alerts for:

  • Temporary CPU spikes during scheduled backups
  • Brief network hiccups that resolve themselves
  • Momentary disk I/O bursts from log rotation

# Example: CPU usage briefly spikes to 95% but returns to 30%
# Without sustain period: Alert fires immediately
# With 3-minute sustain: No alert if spike lasts less than ~2.4 minutes

The 80% threshold means that even during a genuine incident, brief moments where the metric dips below the threshold won't immediately resolve the alert. This prevents flapping between Pending and Firing states during intermittent issues.
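To make the state transitions concrete, here is a minimal Python sketch of a three-state tracker fed one reading at a time. It is a toy model built on assumptions, not Server Scout's code: the class name `SustainTracker`, the sliding window of recent readings, and the rule that a Firing alert only resolves once no breach remains in the window are all hypothetical choices made for illustration.

```python
from collections import deque


class SustainTracker:
    """Toy three-state tracker: OK -> Pending -> Firing (illustrative only)."""

    def __init__(self, threshold: float, window_size: int, percent: int = 80):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # True means the reading breached
        self.needed = (window_size * percent + 99) // 100  # 80% rounded up
        self.state = "OK"

    def observe(self, value: float) -> str:
        self.window.append(value > self.threshold)
        breaches = sum(self.window)
        window_full = len(self.window) == self.window.maxlen
        if breaches == 0:
            self.state = "OK"        # nothing in the window breaches
        elif window_full and breaches >= self.needed:
            self.state = "Firing"    # breach sustained across the window
        elif self.state != "Firing":
            self.state = "Pending"   # breach seen, not yet confirmed
        # Otherwise the alert is already Firing and stays there,
        # so a brief dip below the threshold does not resolve it.
        return self.state


def run(readings, threshold=90, window_size=5):
    tracker = SustainTracker(threshold, window_size)
    return [tracker.observe(value) for value in readings]


spike = run([30, 95, 30, 30, 30, 30, 30])      # one brief CPU spike
incident = run([95, 96, 85, 97, 98, 85])       # sustained load with short dips

print(spike)     # never reaches Firing, ends back at OK
print(incident)  # Pending for four readings, then Firing, and still Firing after the dip
```

In this model the single spike keeps the alert in Pending until the breaching reading leaves the window, whilst the sustained incident reaches Firing on the fifth reading despite one reading dipping to 85%.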

Cooldown Periods Explained

Once an alert has fired and sent a notification, the cooldown period determines the minimum time that must pass before another notification for the same alert can be sent. During cooldown:

  • The alert remains in the Firing state if the condition persists
  • No additional notifications are sent
  • The alert can still resolve to OK if the metric returns to normal

Cooldown periods are essential for preventing alert fatigue—imagine receiving notifications every minute about the same disk space issue that you're already working to resolve.
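The cooldown behaviour can be pictured as a simple gate in front of the notification sender. The Python sketch below is a hedged illustration: `CooldownGate` and `should_notify` are hypothetical names, and it assumes that a new notification is allowed as soon as the full cooldown has elapsed since the last one was sent.

```python
from datetime import datetime, timedelta
from typing import Optional


class CooldownGate:
    """Allows at most one notification per cooldown period (illustrative only)."""

    def __init__(self, cooldown: timedelta):
        self.cooldown = cooldown
        self.last_sent: Optional[datetime] = None

    def should_notify(self, now: datetime) -> bool:
        # Suppress the notification if the previous one is still within cooldown.
        if self.last_sent is not None and now - self.last_sent < self.cooldown:
            return False
        self.last_sent = now
        return True


gate = CooldownGate(cooldown=timedelta(minutes=15))
start = datetime(2024, 1, 1, 12, 0)

print(gate.should_notify(start))                          # True: first notification
print(gate.should_notify(start + timedelta(minutes=5)))   # False: still in cooldown
print(gate.should_notify(start + timedelta(minutes=15)))  # True: cooldown has elapsed
```

The gate only controls notifications. The alert's state is tracked separately, which is why an alert can resolve to OK whilst its cooldown is still running.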

Choosing Appropriate Values

Selecting the right sustain and cooldown periods depends on your metric type and operational requirements:

Critical System Metrics

For CPU, memory, and disk space alerts:

  • Sustain period: 3-5 minutes
  • Cooldown period: 15-30 minutes

# High CPU sustained for 3+ minutes indicates a real issue
# 15-minute cooldown gives you time to investigate without spam

Network and Connectivity

For network latency or service availability:

  • Sustain period: 2-3 minutes
  • Cooldown period: 10-15 minutes

Network issues can be more transient, so shorter sustain periods help catch genuine connectivity problems whilst the cooldown prevents notification storms during unstable conditions.

Less Critical Metrics

For informational metrics like load average or swap usage:

  • Sustain period: 5-10 minutes
  • Cooldown period: 30-60 minutes

These metrics often fluctuate naturally and rarely require immediate action, so longer periods reduce noise whilst ensuring you're still informed of persistent issues.
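If you track your alert settings in a script or spreadsheet, the ranges above can be captured as a small lookup table and used to sanity-check a proposed configuration. The Python sketch below is purely illustrative: the category names, the `PeriodRange` type, and the `within_recommendation` helper are hypothetical and are not part of Server Scout. Only the minute ranges come from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodRange:
    sustain_minutes: tuple[int, int]   # (minimum, maximum)
    cooldown_minutes: tuple[int, int]  # (minimum, maximum)


# Ranges taken from the recommendations in this article.
RECOMMENDED = {
    "critical": PeriodRange((3, 5), (15, 30)),        # CPU, memory, disk space
    "network": PeriodRange((2, 3), (10, 15)),         # latency, availability
    "informational": PeriodRange((5, 10), (30, 60)),  # load average, swap usage
}


def within_recommendation(category: str, sustain: int, cooldown: int) -> bool:
    """Check whether the given periods (in minutes) fall inside the suggested ranges."""
    ranges = RECOMMENDED[category]
    sustain_ok = ranges.sustain_minutes[0] <= sustain <= ranges.sustain_minutes[1]
    cooldown_ok = ranges.cooldown_minutes[0] <= cooldown <= ranges.cooldown_minutes[1]
    return sustain_ok and cooldown_ok


print(within_recommendation("critical", sustain=3, cooldown=15))  # True
print(within_recommendation("network", sustain=5, cooldown=10))   # False: sustain too long
```

Values outside these ranges are not wrong in themselves. The ranges are starting points to adjust against your own history of false positives and genuine incidents.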

Best Practices

  1. Start conservative: Begin with longer periods and adjust based on your experience with false positives and genuine incidents
  2. Consider your response time: Set cooldown periods based on how quickly you can typically investigate and resolve issues
  3. Account for metric volatility: Noisy metrics benefit from longer sustain periods, whilst stable metrics can use shorter ones
  4. Test your settings: Use Server Scout's alert history to review whether your periods are catching real issues without creating noise

Remember, the goal is to be alerted to genuine problems quickly enough to take action, whilst avoiding the alert fatigue that leads to ignored notifications.

Frequently Asked Questions

How do I set up sustain and cooldown periods in Server Scout?

Server Scout uses sustain and cooldown periods in its three-state alerting system (OK, Pending, Firing). The sustain period determines how long a threshold breach must persist before an alert fires, whilst the cooldown period sets the minimum time between repeat notifications for the same alert.

Why is my Server Scout alert not firing immediately when a threshold is breached?

When a metric crosses your threshold, the alert enters the Pending state for the duration of the sustain period. For the alert to fire, at least 80% of readings during this period must breach the threshold. This prevents false alarms from temporary spikes whilst still catching genuine issues.

How does the 80% threshold work in sustain periods?

During the sustain period, at least 80% of metric readings must breach your threshold for an alert to fire. For example, with a 5-minute sustain period and 1-minute checks, at least 4 out of 5 readings must breach the threshold before you receive a notification.

What are the recommended sustain and cooldown periods for CPU alerts?

For critical system metrics like CPU, memory, and disk space, use a sustain period of 3-5 minutes and a cooldown period of 15-30 minutes. This catches real issues whilst giving you time to investigate without notification spam.

What happens during the cooldown period after an alert fires?

During cooldown, the alert remains in the Firing state if the condition persists, but no additional notifications are sent. The alert can still resolve to OK if the metric returns to normal. This prevents alert fatigue from repeated notifications about the same issue.

How long should sustain periods be for network monitoring alerts?

For network latency or service availability alerts, use shorter periods: 2-3 minutes for sustain and 10-15 minutes for cooldown. Network issues can be more transient, so shorter sustain periods help catch genuine connectivity problems quickly.

What sustain and cooldown periods work best for non-critical metrics?

For less critical metrics like load average or swap usage, use longer periods: 5-10 minutes for sustain and 30-60 minutes for cooldown. These metrics fluctuate naturally and rarely need immediate action, so longer periods reduce noise whilst keeping you informed.
