How to Set Thresholds and Alerts Using Metrics

Setting effective alerts is the difference between sleeping soundly and receiving a cascade of false alarms at 3 AM. This guide will help you configure thresholds that catch real problems whilst avoiding alert fatigue.

Understanding Server Scout's Alert System

Server Scout allows you to set warning and critical thresholds on any metric, with optional sustain periods and cooldown timers. The key principle is that every alert you receive should be actionable, and no actionable condition should go unnoticed.

Alerts are configured per metric with these parameters:

  • Warning threshold: Early indication of a potential problem
  • Critical threshold: Immediate attention required
  • Sustain period: How long the condition must persist before alerting
  • Cooldown period: Time to wait before re-alerting on the same condition
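As a rough sketch, these four parameters can be modelled as a small Python class. The `AlertRule` type, field names, and defaults below are illustrative assumptions, not Server Scout's actual configuration schema:

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    # Hypothetical model of one alert rule; names are illustrative.
    metric: str                   # e.g. "cpu_percent"
    warning: float                # warning threshold
    critical: float               # critical threshold
    sustain_seconds: int = 0      # condition must persist this long
    cooldown_seconds: int = 1800  # wait before re-alerting

    def severity(self, value: float) -> str:
        """Classify a single reading against the two thresholds."""
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"

rule = AlertRule("cpu_percent", warning=85, critical=95, sustain_seconds=300)
print(rule.severity(90))  # warning
```

The sustain and cooldown fields only matter over time, which is why they are plain durations here; the later sections show how each is applied.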

Critical Metrics Worth Alerting On

CPU Metrics

Metric       Warning  Critical  Sustain    Why Alert
cpu_percent  85%      95%       5 minutes  Server becoming unresponsive
cpu_iowait   15%      30%       3 minutes  Disk bottleneck masquerading as CPU issue
cpu_steal    10%      20%       5 minutes  Hypervisor contention on VMs

Why these thresholds work: Brief CPU spikes during deployments or cron jobs are normal. The 5-minute sustain period filters out these legitimate peaks whilst catching sustained load issues. High cpu_iowait often indicates disk problems rather than CPU problems, making it a valuable early warning for I/O bottlenecks.

Memory Metrics

Metric       Warning  Critical  Sustain    Why Alert
mem_percent  85%      95%       3 minutes  Potential memory exhaustion

Important caveat: High memory usage is often normal on Linux due to aggressive caching. A server showing 90% mem_percent with adequate mem_available_mb is healthy. Consider the guidance in our Memory Metrics Explained article when interpreting these alerts.
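To illustrate the caveat, a memory check should combine both metrics rather than trusting mem_percent alone. The `memory_pressure` helper and the 512 MB floor below are illustrative assumptions, not Server Scout defaults:

```python
def memory_pressure(mem_percent: float, mem_available_mb: float,
                    min_available_mb: float = 512) -> bool:
    """Flag memory pressure only when reclaimable memory is genuinely low.

    A high mem_percent alone is often just Linux page cache; the
    512 MB floor here is an illustrative assumption.
    """
    return mem_percent >= 95 or mem_available_mb < min_available_mb

# 90% used but 4 GB still reclaimable: healthy caching behaviour
print(memory_pressure(90, 4096))  # False
# 96% used with only 256 MB reclaimable: genuine pressure
print(memory_pressure(96, 256))   # True
```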

Disk Metrics

Metric        Warning  Critical  Sustain  Why Alert
disk_percent  80%      90%       None     Prevent cascading failures from disk exhaustion

No sustain period needed: Disk usage changes gradually, and running out of space causes immediate, severe problems. This is arguably the most universally useful alert you can set.

For servers with multiple mounts, configure per-partition alerts for critical paths like /var, /tmp, and /home. A full /tmp can break applications even if the root partition has space.
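As a quick way to sanity-check per-mount usage outside the dashboard, Python's standard library can report usage for each path separately. The mount list and thresholds below mirror the table above but are otherwise illustrative:

```python
import shutil

def disk_percent(path: str) -> float:
    """Percentage of the filesystem containing *path* that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

# Check each critical mount separately; a full /tmp can break
# applications even when the root partition has space.
for mount in ("/", "/tmp"):
    pct = disk_percent(mount)
    if pct >= 90:
        print(f"CRITICAL {mount}: {pct:.1f}% used")
    elif pct >= 80:
        print(f"WARNING  {mount}: {pct:.1f}% used")
```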

System Health Metrics

Metric        Warning  Critical  Sustain  Why Alert
failed_units  1        1         None     Any failed systemd service needs investigation
oom_kills     1        N/A       None     Process killed due to memory exhaustion

Network Error Metrics

Metric                         Warning  Critical  Sustain  Why Alert
net_rx_errors + net_tx_errors  1        N/A       None     Hardware or configuration problems

Network errors are rarely transient. Any sustained error rate indicates problems with cables, network cards, or configuration. See our Network Metrics Explained article for detailed troubleshooting.

Built-in Alerts

Server Scout automatically alerts when a server goes offline (agent stops reporting). The default 60-second sustain period allows for brief network hiccups whilst catching genuine outages quickly.

Common Threshold Mistakes

1. No Sustain Periods on Volatile Metrics

Setting cpu_percent alerts without sustain periods generates dozens of false positives. A 10-second CPU spike during log rotation isn't worth waking up for—a 10-minute spike is.

Wrong: CPU alert at 80% with no sustain period
Right: CPU alert at 85% with 5-minute sustain period

2. Disk Alerts Set Too High

Setting disk alerts at 95% leaves almost no time to react. Modern applications can fill the remaining 5% in minutes.

Wrong: Disk alert at 95%
Right: Disk alert at 80% warning, 90% critical

3. Focusing Only on Overall CPU Percentage

cpu_percent tells you the server is busy, but not why. Including cpu_iowait and cpu_steal helps identify root causes:

  • High cpu_percent + high cpu_iowait = disk problem, not CPU problem
  • High cpu_percent + high cpu_steal = hypervisor contention

4. Misunderstanding Linux Memory Usage

Linux uses "free" memory for caching, so 90% memory usage might be perfectly healthy. The key metric is mem_available_mb—memory that can be reclaimed if needed.

5. One Size Fits All Thresholds

A build server legitimately hitting 100% CPU for hours is normal. A web server hitting 90% CPU for 10 minutes indicates a problem. Start with global defaults, then customise per server type.

6. Alerting on Cumulative Counters

Metrics like page_faults, context_switches, and disk_io_read_bytes are cumulative counters. The dashboard shows rates (per-second), but the absolute numbers grow continuously. Alert on sudden rate changes, not absolute values.
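The rate conversion amounts to differencing two samples, with a guard for counter resets. A minimal sketch, with an illustrative `counter_rate` helper:

```python
def counter_rate(prev_value: int, curr_value: int,
                 interval_seconds: float):
    """Convert two cumulative counter samples into a per-second rate.

    Counters only grow (barring resets), so the meaningful signal is
    the rate, not the absolute value.
    """
    delta = curr_value - prev_value
    if delta < 0:  # counter reset, e.g. after a reboot
        return None
    return delta / interval_seconds

# disk_io_read_bytes sampled 60 seconds apart
print(counter_rate(1_000_000, 7_000_000, 60))  # 100000.0 bytes/s
```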

Sustain Periods: Your Shield Against False Positives

Sustain periods prevent alerts from firing on brief, normal fluctuations. Here are recommended sustain periods by metric type:

Metric Type       Recommended Sustain  Reason
CPU metrics       3-5 minutes          Filter out deployment spikes, cron jobs
Memory metrics    2-3 minutes          Allow for brief allocation spikes
Disk space        None                 Changes slowly, immediate action needed
Load average      5-10 minutes         Already averaged over time
Service failures  None                 Any failure needs immediate investigation
Network errors    None                 Errors indicate real problems
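The sustain logic itself amounts to remembering when a breach began and only firing once it has lasted long enough. A minimal sketch, where the `SustainTracker` class is illustrative rather than Server Scout's implementation:

```python
class SustainTracker:
    """Fire only after a condition holds continuously for sustain_seconds."""

    def __init__(self, sustain_seconds: float):
        self.sustain = sustain_seconds
        self.breach_start = None  # timestamp the current breach began

    def update(self, breached: bool, now: float) -> bool:
        if not breached:
            self.breach_start = None  # condition cleared, reset the clock
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.sustain

tracker = SustainTracker(300)       # 5-minute sustain, as for cpu_percent
print(tracker.update(True, 0))      # False: breach just started
print(tracker.update(True, 120))    # False: only 2 minutes in
print(tracker.update(True, 300))    # True: sustained for 5 minutes
```

A brief spike that clears before the sustain window elapses resets the clock entirely, which is exactly how deployment and cron spikes get filtered out.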

Cooldown Periods: Preventing Alert Fatigue

After an alert fires, the cooldown period prevents re-notification for the same condition. Without cooldown, a metric oscillating around the threshold generates endless notifications.

Recommended cooldown: 30-60 minutes for most alerts. This gives you time to investigate and resolve the issue without being bombarded by duplicate notifications.
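The cooldown check is the mirror image of the sustain check: record when an alert last fired and suppress repeats until the window expires. An illustrative sketch:

```python
class Cooldown:
    """Suppress duplicate notifications within cooldown_seconds."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown = cooldown_seconds
        self.last_fired = None  # timestamp of the last notification

    def should_notify(self, now: float) -> bool:
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False  # still inside the cooldown window
        self.last_fired = now
        return True

cd = Cooldown(1800)             # 30-minute cooldown
print(cd.should_notify(0))      # True: first alert fires
print(cd.should_notify(600))    # False: 10 minutes in, suppressed
print(cd.should_notify(1900))   # True: cooldown elapsed
```

A metric oscillating around its threshold now produces one notification per cooldown window instead of one per oscillation.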

Server-Specific Threshold Strategies

Web Servers

  • Lower CPU thresholds (80% warning)
  • Higher memory tolerance (90% if plenty of cache)
  • Strict disk monitoring on /var/log

Database Servers

  • Higher memory thresholds (databases should use most available RAM)
  • Very strict cpu_iowait monitoring
  • Per-mount alerts on data directories

Build Servers

  • Higher CPU thresholds (95%+ normal during builds)
  • Strict disk monitoring on /tmp and build directories
  • Monitor processes_total for runaway builds

Mail Servers

  • Strict disk monitoring on mail spool directories
  • Monitor tcp_connections for connection floods
  • Alert on any failed_units (mail services are critical)
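One way to manage per-role customisation is to layer role overrides on top of global defaults. The role names and values below are illustrative assumptions drawn from the strategies above, not Server Scout presets:

```python
# Global (warning, critical) defaults, overridden per server role.
DEFAULTS = {
    "cpu_percent":  (85, 95),
    "mem_percent":  (85, 95),
    "disk_percent": (80, 90),
}
ROLE_OVERRIDES = {
    "web":      {"cpu_percent": (80, 95)},   # lower CPU warning
    "database": {"mem_percent": (92, 97)},   # databases should use RAM
    "build":    {"cpu_percent": (95, 99)},   # full CPU is normal
}

def thresholds_for(role: str, metric: str):
    """Return (warning, critical) for a role, falling back to defaults."""
    return ROLE_OVERRIDES.get(role, {}).get(metric, DEFAULTS[metric])

print(thresholds_for("build", "cpu_percent"))  # (95, 99)
print(thresholds_for("web", "disk_percent"))   # (80, 90)
```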

Setting Up Your First Alert Set

Start with this minimal, high-value alert configuration:

  1. disk_percent: 80% warning, 90% critical, no sustain
  2. cpu_percent: 85% warning, 95% critical, 5-minute sustain
  3. Server offline: Use the built-in alert with 60-second sustain
  4. failed_units: 1 warning/critical, no sustain
  5. oom_kills: 1 warning, no sustain

This covers the most common failure modes: disk exhaustion, CPU overload, service failures, memory exhaustion, and complete outages.

Advanced Alerting Patterns

Composite Conditions

Consider alerting when multiple related metrics exceed thresholds simultaneously:

  • High cpu_percent AND high cpu_iowait = disk bottleneck
  • High mem_percent AND low mem_available_mb = genuine memory pressure
  • High load_15m AND normal cpu_percent = I/O or uninterruptible sleep issues
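These composite checks can be sketched as a simple root-cause classifier. The thresholds reuse the CPU table earlier in this guide, but the `classify_load` helper itself is illustrative:

```python
def classify_load(cpu_percent: float, cpu_iowait: float,
                  cpu_steal: float) -> str:
    """Rough root-cause hint from related CPU metrics (illustrative)."""
    if cpu_percent > 85 and cpu_iowait > 15:
        return "disk bottleneck"        # I/O wait masquerading as CPU load
    if cpu_percent > 85 and cpu_steal > 10:
        return "hypervisor contention"  # the VM host is overcommitted
    if cpu_percent > 85:
        return "cpu saturation"
    return "ok"

print(classify_load(92, 25, 1))   # disk bottleneck
print(classify_load(92, 2, 15))   # hypervisor contention
```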

Rate-of-Change Alerts

Sometimes the rate of change matters more than absolute values:

  • disk_percent increasing by 10% in one hour
  • processes_total doubling in 30 minutes
  • Network error rates spiking above baseline
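A rate-of-change check compares the newest sample against the oldest sample inside a time window. The `rate_of_change_alert` helper below is an illustrative sketch, not a built-in:

```python
def rate_of_change_alert(samples, window_seconds, threshold):
    """Alert when a metric rises by *threshold* within the window.

    samples: list of (timestamp, value) pairs, oldest first.
    """
    newest_ts = samples[-1][0]
    recent = [v for t, v in samples if t >= newest_ts - window_seconds]
    return recent[-1] - recent[0] >= threshold

# disk_percent rising more than 10 points within one hour (3600 s)
samples = [(0, 60.0), (1800, 66.0), (3600, 72.0)]
print(rate_of_change_alert(samples, 3600, 10))  # True: +12 in an hour
```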

Time-Based Thresholds

Different thresholds for different times:

  • Stricter CPU limits during business hours
  • Relaxed thresholds during known maintenance windows
  • Higher disk usage tolerance during backup periods

Testing Your Alert Configuration

Before deploying alerts to production:

  1. Review historical data: Check how often your thresholds would have triggered over the past month
  2. Test with synthetic load: Use tools like stress to verify alerts fire as expected
  3. Start with warnings only: Deploy warning thresholds first, then add critical alerts once you're confident
  4. Document your decisions: Record why you chose specific thresholds for future reference

Next Steps

Once you've configured basic alerts, explore our detailed metric explanations, such as the Memory Metrics Explained and Network Metrics Explained articles referenced above.

Remember: the best alert system is one that tells you about problems before they impact users, without crying wolf so often that you ignore genuine issues. Start conservative, monitor your alert frequency, and adjust thresholds based on your actual operational experience.


Frequently Asked Questions

How do I set up metric alert thresholds in Server Scout?

Server Scout allows you to configure threshold-based alerts on any collected metric through the dashboard. Set a warning threshold and a critical threshold for each metric you want to monitor. When a metric crosses a threshold, the system sends notifications via your configured channels. Start with the recommended healthy ranges from the metrics reference and adjust based on your workload.

What are good starting thresholds for common metrics?

Recommended starting thresholds: CPU above 85% sustained (warning), memory above 85% (warning) and 95% (critical), disk above 80% (warning) and 90% (critical), load average above CPU core count (warning), swap usage above 500MB (warning), failed_units above 0 (critical), and oom_kills above 0 (critical). Adjust these based on your specific workload and response time requirements.

How do I avoid alert fatigue from too many notifications?

Set thresholds based on your server's actual baseline rather than generic values. Use the dashboard's historical data to understand normal ranges for your workload. Configure different thresholds for different server roles. Set alerts only on actionable metrics where you can take specific corrective action. Use warning thresholds for early notice and critical thresholds for immediate response.

Can I set alerts on cumulative counter metrics like network bytes?

Yes, the dashboard converts cumulative counters to per-second rates, and you can set thresholds on these rates. For example, alert when network throughput exceeds a certain MB/s or when disk I/O write rate exceeds normal levels. Rate-based alerting on counters is effective for detecting traffic spikes, DDoS attacks, or unusual disk activity.

What metrics should I alert on first?

Start with the most critical metrics: disk_percent (running out of disk space causes outages), mem_percent or mem_available_mb (memory exhaustion triggers OOM kills), failed_units (service failures), and oom_kills (severe memory issues). Once these baseline alerts are stable, add CPU, load average, and network thresholds. Prioritise alerts that indicate conditions requiring immediate human intervention.
