Server Scout's alerting system uses sustain and cooldown periods to ensure you receive meaningful notifications whilst avoiding alert fatigue. Understanding how these periods work will help you configure more effective monitoring for your infrastructure.
The Alert State Machine
Server Scout alerts follow a three-state system that determines when notifications are sent:
- OK - The metric is within normal parameters
- Pending - A threshold has been breached, but we're waiting to confirm it's not just a brief spike
- Firing - The breach has been sustained long enough to warrant notification
When a metric first crosses your defined threshold, the alert immediately transitions from OK to Pending. However, no notification is sent at this stage. Instead, Server Scout begins monitoring the metric more closely during what's called the sustain period.
For an alert to progress from Pending to Firing, at least 80% of the readings taken during the sustain period must breach the threshold. This approach prevents false alarms from temporary spikes whilst still catching genuine issues quickly.
For example, if your sustain period is 5 minutes and Server Scout checks every minute, you'll have 5 readings. At least 4 of these readings must breach the threshold for the alert to fire.
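The arithmetic in that example can be sketched in a few lines of Python. This is an illustration only, not Server Scout's implementation: the function names are hypothetical, a breach is assumed to mean a reading above the threshold, and rounding up when 80% is not a whole number of readings is also an assumption.

```python
def required_breaches(total_readings: int, percent: int = 80) -> int:
    """Smallest count of breaching readings that reaches `percent` of the total.

    Integer arithmetic is used so the result is rounded up without
    floating-point surprises (rounding up is an assumption).
    """
    return (total_readings * percent + 99) // 100


def sustain_met(readings: list[float], threshold: float, percent: int = 80) -> bool:
    """True if enough readings in the sustain window breach the threshold."""
    if not readings:
        return False  # no data, nothing to confirm
    # Assumption: a breach means the reading is strictly above the threshold.
    breaches = sum(1 for value in readings if value > threshold)
    return breaches >= required_breaches(len(readings), percent)


if __name__ == "__main__":
    # Five one-minute readings against a 90% CPU threshold.
    print(required_breaches(5))                      # 4 of 5 readings needed
    print(sustain_met([95, 96, 70, 97, 98], 90))     # four breaches
    print(sustain_met([95, 96, 70, 60, 98], 90))     # three breaches
```

With five readings, four breaches satisfy the rule and three do not, which matches the example above.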
Understanding Sustain Periods
The sustain period serves as a buffer against noisy metrics and brief anomalies. Without this mechanism, you might receive alerts for:
- Temporary CPU spikes during scheduled backups
- Brief network hiccups that resolve themselves
- Momentary disk I/O bursts from log rotation
# Example: CPU usage briefly spikes to 95% but returns to 30%
# Without sustain period: Alert fires immediately
# With 3-minute sustain: No alert if the spike lasts less than ~2.4 minutes (80% of 3 minutes)
The 80% threshold means that even during a genuine incident, brief moments where the metric dips below the threshold won't immediately resolve the alert. This prevents flapping between Pending and Firing states during intermittent issues.
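To make the three states concrete, here is a minimal simulation of the behaviour described above. It is a sketch under stated assumptions rather than Server Scout's actual logic: the `State` enum and `simulate` function are hypothetical, readings are assumed to arrive once per check, a breach is assumed to mean a value above the threshold, and the alert is assumed to return to OK only once no reading in the sustain window breaches.

```python
from collections import deque
from enum import Enum


class State(Enum):
    OK = "ok"
    PENDING = "pending"
    FIRING = "firing"


def simulate(readings, threshold, window_size, percent=80):
    """Return the alert state after each reading.

    `window_size` is the number of readings in one sustain period.
    """
    window = deque(maxlen=window_size)  # keeps only the latest readings
    required = (window_size * percent + 99) // 100  # 80% rule, rounded up
    state = State.OK
    history = []
    for value in readings:
        window.append(value)
        breaches = sum(1 for v in window if v > threshold)
        if breaches == 0:
            state = State.OK        # assumption: resolve when the window is clean
        elif breaches >= required:
            state = State.FIRING    # breach sustained long enough
        elif state is not State.FIRING:
            state = State.PENDING   # breached, but not yet confirmed
        # Otherwise the alert is already Firing and a brief dip does not resolve it.
        history.append(state)
    return history


if __name__ == "__main__":
    # A one-reading spike with a 3-reading sustain window never fires.
    for step in simulate([30, 95, 30, 30, 30, 30], threshold=90, window_size=3):
        print(step.name)
```

In this sketch the single 95% reading moves the alert to Pending and then back to OK without ever reaching Firing, whereas a run such as `[95, 95, 95, 30, 95]` reaches Firing on the third reading and stays there through the brief dip.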
Cooldown Periods Explained
Once an alert has fired and sent a notification, the cooldown period determines the minimum time that must pass before another notification for the same alert can be sent. During cooldown:
- The alert remains in the Firing state if the condition persists
- No additional notifications are sent
- The alert can still resolve to OK if the metric returns to normal
Cooldown periods are essential for preventing alert fatigue—imagine receiving notifications every minute about the same disk space issue that you're already working to resolve.
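The cooldown check itself amounts to a time comparison. The following sketch assumes a hypothetical `should_notify` helper and that the time of the last notification is tracked per alert; it is not taken from Server Scout's code.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_notify(now: datetime,
                  last_notified: Optional[datetime],
                  cooldown: timedelta) -> bool:
    """True if a firing alert is allowed to send a notification at `now`."""
    if last_notified is None:
        return True  # first notification for this alert
    # Suppress further notifications until the cooldown has elapsed.
    return now - last_notified >= cooldown


if __name__ == "__main__":
    fired_at = datetime(2024, 1, 1, 12, 0)
    cooldown = timedelta(minutes=15)
    print(should_notify(fired_at + timedelta(minutes=5), fired_at, cooldown))
    print(should_notify(fired_at + timedelta(minutes=15), fired_at, cooldown))
```

Five minutes after the first notification the helper returns `False`; once the full 15 minutes have passed it returns `True` again, provided the alert is still firing.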
Choosing Appropriate Values
Selecting the right sustain and cooldown periods depends on your metric type and operational requirements:
Critical System Metrics
For CPU, memory, and disk space alerts:
- Sustain period: 3-5 minutes
- Cooldown period: 15-30 minutes
# High CPU sustained for 3+ minutes indicates a real issue
# 15-minute cooldown gives you time to investigate without spam
Network and Connectivity
For network latency or service availability:
- Sustain period: 2-3 minutes
- Cooldown period: 10-15 minutes
Network issues can be more transient, so shorter sustain periods help catch genuine connectivity problems whilst the cooldown prevents notification storms during unstable conditions.
Less Critical Metrics
For informational metrics like load average or swap usage:
- Sustain period: 5-10 minutes
- Cooldown period: 30-60 minutes
These metrics often fluctuate naturally and rarely require immediate action, so longer periods reduce noise whilst ensuring you're still informed of persistent issues.
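The suggested ranges above can be collected into a small lookup table, which makes it easy to check a proposed setting against them. The category keys and the `outside_recommended` helper are hypothetical names chosen for this sketch; only the minute ranges come from this article.

```python
# Suggested ranges in minutes, as (minimum, maximum), taken from the sections above.
RECOMMENDED = {
    "critical":      {"sustain": (3, 5),  "cooldown": (15, 30)},
    "network":       {"sustain": (2, 3),  "cooldown": (10, 15)},
    "informational": {"sustain": (5, 10), "cooldown": (30, 60)},
}


def outside_recommended(category: str,
                        sustain_minutes: int,
                        cooldown_minutes: int) -> list[str]:
    """Return the names of settings that fall outside the suggested range."""
    ranges = RECOMMENDED[category]
    chosen = {"sustain": sustain_minutes, "cooldown": cooldown_minutes}
    return [
        name
        for name, value in chosen.items()
        if not ranges[name][0] <= value <= ranges[name][1]
    ]


if __name__ == "__main__":
    print(outside_recommended("critical", 3, 15))
    print(outside_recommended("network", 5, 10))
```

A value outside a range is not necessarily wrong; the ranges are starting points, and a result such as `["sustain"]` is simply a prompt to confirm the choice is deliberate.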
Best Practices
- Start conservative: Begin with longer periods and adjust based on your experience with false positives and genuine incidents
- Consider your response time: Set cooldown periods based on how quickly you can typically investigate and resolve issues
- Account for metric volatility: Noisy metrics benefit from longer sustain periods, whilst stable metrics can use shorter ones
- Test your settings: Use Server Scout's alert history to review whether your periods are catching real issues without creating noise
Remember, the goal is to be alerted to genuine problems quickly enough to take action, whilst avoiding the alert fatigue that leads to ignored notifications.
Frequently Asked Questions
How do I set up sustain and cooldown periods in Server Scout?
Why is my Server Scout alert not firing immediately when a threshold is breached?
How does the 80% threshold work in sustain periods?
What are the recommended sustain and cooldown periods for CPU alerts?
What happens during the cooldown period after an alert fires?
How long should sustain periods be for network monitoring alerts?
What sustain and cooldown periods work best for non-critical metrics?