Setting effective alerts is the difference between sleeping soundly and receiving a cascade of false alarms at 3 AM. This guide will help you configure thresholds that catch real problems whilst avoiding alert fatigue.
Understanding Server Scout's Alert System
Server Scout allows you to set warning and critical thresholds on any metric, with optional sustain periods and cooldown timers. The key principle is that every alert you receive should be actionable, and no actionable condition should go unnoticed.
Alerts are configured per metric with these parameters:
- Warning threshold: Early indication of a potential problem
- Critical threshold: Immediate attention required
- Sustain period: How long the condition must persist before alerting
- Cooldown period: Time to wait before re-alerting on the same condition
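To make the interaction between these four parameters concrete, here is a minimal sketch of how a threshold/sustain/cooldown evaluator could work. This is illustrative only — the class and method names are invented and do not reflect Server Scout's internal implementation:

```python
import time

class AlertRule:
    """Illustrative sketch of warning/critical thresholds with sustain and cooldown."""

    def __init__(self, warning, critical, sustain_s=0, cooldown_s=1800):
        self.warning = warning
        self.critical = critical
        self.sustain_s = sustain_s
        self.cooldown_s = cooldown_s
        self.breach_started = None   # when the metric first crossed the warning threshold
        self.last_fired = None       # when we last sent a notification

    def evaluate(self, value, now=None):
        """Return 'critical', 'warning', or None for one metric sample."""
        now = time.time() if now is None else now
        if value < self.warning:
            self.breach_started = None           # condition cleared; reset sustain clock
            return None
        if self.breach_started is None:
            self.breach_started = now            # start the sustain clock
        if now - self.breach_started < self.sustain_s:
            return None                          # breached, but not sustained long enough
        if self.last_fired and now - self.last_fired < self.cooldown_s:
            return None                          # suppressed by cooldown
        self.last_fired = now
        return "critical" if value >= self.critical else "warning"
```

Note that in this simple sketch the cooldown also suppresses escalation from warning to critical; a production system might treat those as separate states.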
Critical Metrics Worth Alerting On
CPU Metrics
| Metric | Warning | Critical | Sustain | Why Alert |
|---|---|---|---|---|
| cpu_percent | 85% | 95% | 5 minutes | Server becoming unresponsive |
| cpu_iowait | 15% | 30% | 3 minutes | Disk bottleneck masquerading as CPU issue |
| cpu_steal | 10% | 20% | 5 minutes | Hypervisor contention on VMs |
Why these thresholds work: Brief CPU spikes during deployments or cron jobs are normal. The 5-minute sustain period filters out these legitimate peaks whilst catching sustained load issues. High cpu_iowait often indicates disk problems rather than CPU problems, making it a valuable early warning for I/O bottlenecks.
Memory Metrics
| Metric | Warning | Critical | Sustain | Why Alert |
|---|---|---|---|---|
| mem_percent | 85% | 95% | 3 minutes | Potential memory exhaustion |
Important caveat: High memory usage is often normal on Linux due to aggressive caching. A server showing 90% mem_percent with adequate mem_available_mb is healthy. Consider the guidance in our Memory Metrics Explained article when interpreting these alerts.
Disk Metrics
| Metric | Warning | Critical | Sustain | Why Alert |
|---|---|---|---|---|
| disk_percent | 80% | 90% | None | Prevent cascading failures from disk exhaustion |
No sustain period needed: Disk usage changes gradually, and running out of space causes immediate, severe problems. This is arguably the most universally useful alert you can set.
For servers with multiple mounts, configure per-partition alerts for critical paths like /var, /tmp, and /home. A full /tmp can break applications even if the root partition has space.
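A standalone check along these lines can be sketched with the standard library. The mount list and thresholds below are example values, not Server Scout configuration syntax:

```python
import shutil

# Example per-mount thresholds: path -> (warning %, critical %)
WATCHED_MOUNTS = {"/": (80, 90), "/var": (80, 90), "/tmp": (80, 90)}

def classify(pct, warn, crit):
    """Map a usage percentage to an alert level, or None if healthy."""
    if pct >= crit:
        return "critical"
    if pct >= warn:
        return "warning"
    return None

def check_disks(mounts=WATCHED_MOUNTS):
    """Return only the mounts currently breaching a threshold."""
    results = {}
    for path, (warn, crit) in mounts.items():
        usage = shutil.disk_usage(path)              # total, used, free in bytes
        pct = usage.used / usage.total * 100
        level = classify(pct, warn, crit)
        if level:
            results[path] = (level, round(pct, 1))
    return results
```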
System Health Metrics
| Metric | Warning | Critical | Sustain | Why Alert |
|---|---|---|---|---|
| failed_units | 1 | 1 | None | Any failed systemd service needs investigation |
| oom_kills | 1 | N/A | None | Process killed due to memory exhaustion |
Network Error Metrics
| Metric | Warning | Critical | Sustain | Why Alert |
|---|---|---|---|---|
| net_rx_errors + net_tx_errors | 1 | N/A | None | Hardware or configuration problems |
Network errors are rarely transient. Any sustained error rate indicates problems with cables, network cards, or configuration. See our Network Metrics Explained article for detailed troubleshooting.
Built-in Alerts
Server Scout automatically alerts when a server goes offline (agent stops reporting). The default 60-second sustain period allows for brief network hiccups whilst catching genuine outages quickly.
Common Threshold Mistakes
1. No Sustain Periods on Volatile Metrics
Setting cpu_percent alerts without sustain periods generates dozens of false positives. A 10-second CPU spike during log rotation isn't worth waking up for—a 10-minute spike is.
Wrong: CPU alert at 80% with no sustain period
Right: CPU alert at 85% with 5-minute sustain period
2. Disk Alerts Set Too High
Setting disk alerts at 95% leaves almost no time to react. Modern applications can fill the remaining 5% in minutes.
Wrong: Disk alert at 95%
Right: Disk alert at 80% warning, 90% critical
3. Focusing Only on Overall CPU Percentage
cpu_percent tells you the server is busy, but not why. Including cpu_iowait and cpu_steal helps identify root causes:
- High cpu_percent + high cpu_iowait = disk problem, not CPU problem
- High cpu_percent + high cpu_steal = hypervisor contention
4. Misunderstanding Linux Memory Usage
Linux uses "free" memory for caching, so 90% memory usage might be perfectly healthy. The key metric is mem_available_mb—memory that can be reclaimed if needed.
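That interpretation rule can be expressed in a few lines. This is a sketch of the reasoning only; the field names mirror the article's metrics, and the 512 MB floor is an example value to tune per server:

```python
def memory_pressure(mem_percent, mem_available_mb, min_available_mb=512):
    """Return True only for genuine pressure, ignoring cache-inflated usage.

    High mem_percent alone is normal on Linux; the alert should require
    that reclaimable memory (mem_available_mb) is also scarce.
    """
    return mem_percent >= 90 and mem_available_mb < min_available_mb
```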
5. One Size Fits All Thresholds
A build server legitimately hitting 100% CPU for hours is normal. A web server hitting 90% CPU for 10 minutes indicates a problem. Start with global defaults, then customise per server type.
6. Alerting on Cumulative Counters
Metrics like page_faults, context_switches, and disk_io_read_bytes are cumulative counters. The dashboard shows rates (per-second), but the absolute numbers grow continuously. Alert on sudden rate changes, not absolute values.
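The conversion from a cumulative counter to a per-second rate, as the dashboard performs it, looks roughly like this (an illustrative helper, not Server Scout code):

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of a monotonically growing counter.

    Returns None when the counter reset (e.g. reboot) or timestamps are invalid,
    since a raw subtraction would produce a nonsense negative rate.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0 or curr_value < prev_value:
        return None
    return (curr_value - prev_value) / elapsed
```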
Sustain Periods: Your Shield Against False Positives
Sustain periods prevent alerts from firing on brief, normal fluctuations. Here are recommended sustain periods by metric type:
| Metric Type | Recommended Sustain | Reason |
|---|---|---|
| CPU metrics | 3-5 minutes | Filter out deployment spikes, cron jobs |
| Memory metrics | 2-3 minutes | Allow for brief allocation spikes |
| Disk space | None | Changes slowly, immediate action needed |
| Load average | 5-10 minutes | Already averaged over time |
| Service failures | None | Any failure needs immediate investigation |
| Network errors | None | Errors indicate real problems |
Cooldown Periods: Preventing Alert Fatigue
After an alert fires, the cooldown period prevents re-notification for the same condition. Without cooldown, a metric oscillating around the threshold generates endless notifications.
Recommended cooldown: 30-60 minutes for most alerts. This gives you time to investigate and resolve the issue without being bombarded by duplicate notifications.
Server-Specific Threshold Strategies
Web Servers
- Lower CPU thresholds (80% warning)
- Higher memory tolerance (90% if plenty of cache)
- Strict disk monitoring on /var/log
Database Servers
- Higher memory thresholds (databases should use most available RAM)
- Very strict cpu_iowait monitoring
- Per-mount alerts on data directories
Build Servers
- Higher CPU thresholds (95%+ normal during builds)
- Strict disk monitoring on /tmp and build directories
- Monitor processes_total for runaway builds
Mail Servers
- Strict disk monitoring on mail spool directories
- Monitor tcp_connections for connection floods
- Alert on any failed_units (mail services are critical)
Setting Up Your First Alert Set
Start with this minimal, high-value alert configuration:
- disk_percent: 80% warning, 90% critical, no sustain
- cpu_percent: 85% warning, 95% critical, 5-minute sustain
- Server offline: Use the built-in alert with 60-second sustain
- failed_units: 1 warning/critical, no sustain
- oom_kills: 1 warning, no sustain
This covers the most common failure modes: disk exhaustion, CPU overload, service failures, memory exhaustion, and complete outages.
Advanced Alerting Patterns
Composite Conditions
Consider alerting when multiple related metrics exceed thresholds simultaneously:
- High cpu_percent AND high cpu_iowait = disk bottleneck
- High mem_percent AND low mem_available_mb = genuine memory pressure
- High load_15m AND normal cpu_percent = I/O or uninterruptible sleep issues
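The patterns above amount to a small decision table. Here is a hedged sketch; every threshold is a placeholder to tune for your environment, and the function name is invented:

```python
def diagnose(m):
    """Map one snapshot of metrics (a dict) to a probable root cause, or None."""
    if m["cpu_percent"] > 85 and m["cpu_iowait"] > 15:
        return "disk bottleneck"                     # CPU busy waiting on I/O
    if m["mem_percent"] > 90 and m["mem_available_mb"] < 512:
        return "memory pressure"                     # little reclaimable memory left
    if m["load_15m"] > 4 and m["cpu_percent"] < 50:
        return "I/O or uninterruptible sleep"        # load without CPU consumption
    return None
```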
Rate-of-Change Alerts
Sometimes the rate of change matters more than absolute values:
- disk_percent increasing by 10% in one hour
- processes_total doubling in 30 minutes
- Network error rates spiking above baseline
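The first bullet can be sketched as a sliding-window check: keep an hour of samples and alert when the oldest and newest differ by more than 10 points. The class below is an illustration, with assumed names and thresholds:

```python
from collections import deque

class GrowthWatcher:
    """Alert when disk_percent rises more than max_delta within window_s seconds."""

    def __init__(self, window_s=3600, max_delta=10.0):
        self.window_s = window_s
        self.max_delta = max_delta
        self.samples = deque()          # (timestamp, disk_percent) pairs

    def observe(self, ts, pct):
        """Record one sample; return True when the growth rate warrants an alert."""
        self.samples.append((ts, pct))
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()      # discard samples older than the window
        oldest_pct = self.samples[0][1]
        return (pct - oldest_pct) > self.max_delta
```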
Time-Based Thresholds
Different thresholds for different times:
- Stricter CPU limits during business hours
- Relaxed thresholds during known maintenance windows
- Higher disk usage tolerance during backup periods
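Selecting a threshold by time of day is straightforward to express. The business-hours window, weekday rule, and values below are assumptions to adapt, not product behaviour:

```python
from datetime import datetime, time as dtime

def cpu_warning_threshold(now: datetime, maintenance=False):
    """Pick a CPU warning threshold for the given moment."""
    if maintenance:
        return 98    # relaxed during known maintenance windows
    business_hours = dtime(9, 0) <= now.time() <= dtime(18, 0)
    weekday = now.weekday() < 5
    if business_hours and weekday:
        return 80    # stricter while users are active
    return 90        # overnight / weekend default
```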
Testing Your Alert Configuration
Before deploying alerts to production:
- Review historical data: Check how often your thresholds would have triggered over the past month
- Test with synthetic load: Use tools like stress to verify alerts fire as expected
- Start with warnings only: Deploy warning thresholds first, then add critical alerts once you're confident
- Document your decisions: Record why you chose specific thresholds for future reference
Next Steps
Once you've configured basic alerts, explore our detailed metric explanations:
- CPU Metrics Explained for understanding processor performance
- Memory Metrics Explained for Linux memory management
- Disk Metrics Explained for storage monitoring
- Network Metrics Explained for connectivity issues
Remember: the best alert system is one that tells you about problems before they impact users, without crying wolf so often that you ignore genuine issues. Start conservative, monitor your alert frequency, and adjust thresholds based on your actual operational experience.