Setting Effective Alert Thresholds

Understanding Server Scout's Default Alert Conditions

When you add a new server to Server Scout, default alert conditions are automatically created to get you started:

  • CPU usage: Warning at 80%, Critical at 90%
  • Memory usage: Warning at 80%, Critical at 90%
  • Disk usage: Warning at 80%, Critical at 90%
  • Server offline detection: Immediate alerts when the agent stops reporting

These defaults are a reasonable starting point, but workloads differ from server to server. Tuning the thresholds to each server's normal behaviour reduces false positives and makes it more likely that an alert points to a genuine issue.

Establish Baselines Before Adjusting

Before modifying any thresholds, observe your servers' normal operating patterns for at least a week. This baseline period reveals crucial insights about your infrastructure's behaviour.

A database server that consistently operates at 70% memory utilisation has little headroom below the default 80% warning level, so routine peaks are likely to trip it; such a server needs higher thresholds. Conversely, a lightly-loaded web server that normally runs at 20% CPU might benefit from lower thresholds to catch unusual activity earlier.

Use Server Scout's historical charts to identify:

  • Peak usage periods (backup windows, batch processing times)
  • Normal operational ranges for each metric
  • Regular patterns that might trigger false alerts
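
One way to turn a week of observations into candidate thresholds is sketched below in Python. It is a hypothetical heuristic for illustration, not a Server Scout feature: the function names, the 10-point margin, and the 95% ceiling are all assumptions, and the samples are assumed to be usage percentages exported or copied from the historical charts.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_thresholds(samples, margin=10.0, ceiling=95.0):
    """Suggest (warning, critical) percentages from baseline samples.

    Assumed heuristic: place the warning level `margin` points above
    the 95th percentile of normal usage, and the critical level 10
    points above that, without exceeding `ceiling`.
    """
    typical_peak = percentile(samples, 95)
    warning = min(typical_peak + margin, ceiling - 5.0)
    critical = min(warning + 10.0, ceiling)
    return warning, critical


# Memory readings from a database server that idles around 70%.
memory_samples = [68, 70, 71, 69, 72, 70, 73, 71, 70, 74]
print(suggest_thresholds(memory_samples))  # (84.0, 94.0)
```

Treat the output as a starting point to review against the charts rather than a value to apply blindly.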

Leverage Sustain Periods to Prevent False Alarms

Sustain periods are your first line of defence against alert fatigue. Setting a sustain period of 60-300 seconds ensures that brief, normal spikes don't trigger unnecessary notifications.

Consider these common scenarios:

  • Cron jobs: Scheduled tasks often cause temporary CPU or I/O spikes
  • Application deployments: Brief periods of high resource usage during updates
  • Backup operations: Temporary increases in disk I/O and CPU usage

A 5-minute sustain period for CPU alerts typically strikes the right balance between catching genuine issues and avoiding noise from routine operations.
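
The effect of a sustain period can be illustrated with a small Python sketch. It models the general idea only (a threshold must be exceeded continuously for the whole period before an alert fires); the function name, the one-sample-per-minute data, and the strict "greater than" comparison are assumptions, not a description of Server Scout's internals.

```python
def first_alert_time(samples, threshold, sustain_seconds):
    """Return the timestamp at which an alert would fire, or None.

    `samples` is a chronological list of (timestamp_seconds, value).
    The alert fires once the value has stayed above `threshold`
    continuously for at least `sustain_seconds`.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= sustain_seconds:
                return timestamp
        else:
            # Any reading back under the threshold resets the clock.
            breach_started = None
    return None


# CPU readings once a minute: a two-minute cron spike, then a
# sustained problem starting at t=240.
cpu = [(0, 40), (60, 92), (120, 94), (180, 45), (240, 91),
       (300, 93), (360, 95), (420, 96), (480, 97), (540, 98)]

print(first_alert_time(cpu, threshold=85, sustain_seconds=300))  # 540
print(first_alert_time(cpu, threshold=85, sustain_seconds=60))   # 120
```

With a 300-second sustain period the cron spike is ignored and only the sustained load raises an alert; with 60 seconds the spike itself would have notified someone.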

Choose Severity Levels Strategically

Server Scout's two severity levels serve distinct purposes:

Warning alerts are for "investigate when convenient" situations. These might include:

  • Disk usage reaching 80% (you have time to clean up or expand storage)
  • CPU consistently above 85% (performance may be degraded but service continues)

Critical alerts demand immediate attention:

  • Disk usage at 90% (risk of service failure)
  • Memory usage at 95% (potential for application crashes)
  • Server offline (service unavailable)
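
The relationship between a reading and the two severity levels can be sketched in a few lines of Python. This is an illustration only: the function name `classify` and the decision to treat a reading equal to a threshold as a breach are assumptions, not documented Server Scout behaviour.

```python
def classify(value, warning, critical):
    """Map a usage percentage to a severity label.

    Assumption: a reading equal to a threshold counts as breaching it.
    """
    if warning >= critical:
        raise ValueError("warning threshold must be below critical")
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# Disk thresholds from the examples above: warning 80%, critical 90%.
for reading in (72, 84, 93):
    print(reading, classify(reading, warning=80, critical=90))
```

Keeping the warning level strictly below the critical level guarantees that every critical condition has already passed through the "investigate when convenient" stage.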

Implement Per-Server Overrides for Special Cases

Global defaults work well for most servers, but some systems require special consideration. Server Scout's per-server condition overrides handle these exceptions elegantly.

Build servers that regularly hit 95% CPU during compilation need higher CPU thresholds than standard web servers. Database servers with large buffer pools may normally operate at 90% memory usage. File servers might need different disk usage thresholds for different mount points.

Create per-server conditions for these outliers whilst maintaining sensible global defaults for your standard infrastructure.
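
Conceptually, an override replaces only the settings it names and inherits everything else from the global defaults. The Python sketch below models that layering; the server names, dictionary layout, and override values are hypothetical and do not reflect how Server Scout stores its conditions.

```python
# Global defaults, using the recommended guideline values.
GLOBAL_DEFAULTS = {
    "cpu": {"warning": 85, "critical": 95, "sustain_seconds": 300},
    "memory": {"warning": 85, "critical": 95, "sustain_seconds": 120},
    "disk": {"warning": 80, "critical": 90, "sustain_seconds": 300},
}

# Hypothetical outliers: only the differing settings are listed.
SERVER_OVERRIDES = {
    "build-01": {"cpu": {"warning": 97, "critical": 99}},
    "db-01": {"memory": {"warning": 93, "critical": 97}},
}


def effective_conditions(server_name):
    """Merge a server's overrides over the global defaults, per metric."""
    overrides = SERVER_OVERRIDES.get(server_name, {})
    return {
        metric: {**defaults, **overrides.get(metric, {})}
        for metric, defaults in GLOBAL_DEFAULTS.items()
    }


print(effective_conditions("db-01")["memory"])
# {'warning': 93, 'critical': 97, 'sustain_seconds': 120}
print(effective_conditions("web-01") == GLOBAL_DEFAULTS)  # True
```

Listing only the differences keeps the exceptions visible at a glance and means a later change to a global default still reaches every server that has not overridden that setting.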

Configure Cooldown Periods Appropriately

Cooldown periods prevent notification spam for ongoing issues. Once an alert triggers, the cooldown period determines how long Server Scout waits before sending another notification for the same condition.

A 30-60 minute cooldown works well for most metrics. This gives you time to investigate and address the issue without being bombarded with repeated alerts, whilst ensuring you're reminded if the problem persists.
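
The following Python sketch shows how a cooldown thins out notifications for a condition that stays in breach. It is a simplified model under stated assumptions (the function name is hypothetical, and the condition is assumed to be re-evaluated every five minutes), not Server Scout's actual scheduling logic.

```python
def notification_times(breach_timestamps, cooldown_seconds):
    """Return the timestamps at which a notification would be sent.

    `breach_timestamps` lists, in order, each evaluation (in seconds)
    at which the condition was still in breach. A notification is
    sent only if at least `cooldown_seconds` have passed since the
    previous one.
    """
    sent = []
    for timestamp in breach_timestamps:
        if not sent or timestamp - sent[-1] >= cooldown_seconds:
            sent.append(timestamp)
    return sent


# Condition in breach for two hours, evaluated every 5 minutes.
evaluations = list(range(0, 7201, 300))

print(len(evaluations))                        # 25
print(notification_times(evaluations, 1800))   # [0, 1800, 3600, 5400, 7200]
```

With a 30-minute cooldown, 25 breaching evaluations produce five notifications: the initial alert plus a reminder every half hour while the problem persists.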

Recommended Threshold Guidelines

Based on common server behaviours, these thresholds work well for most environments:

Disk usage: Warning at 80%, Critical at 90%, with a 5-minute sustain period

  • Disk space fills gradually, providing time for cleanup

CPU usage: Warning at 85%, Critical at 95%, with a 5-minute sustain period

  • Accommodates normal spikes whilst catching sustained high usage

Memory usage: Warning at 85%, Critical at 95%, with a 2-minute sustain period

  • Linux systems naturally use available RAM for caching, so high usage is often normal
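
To see how much of a high memory figure is reclaimable cache, compare two calculations from /proc/meminfo: one based on MemFree (which counts cache as used) and one based on MemAvailable (the kernel's estimate of memory obtainable without swapping). The Python sketch below does this for a sample snippet; the function name and the sample numbers are illustrative, and which of the two figures Server Scout reports is not stated in this article.

```python
def memory_usage_percent(meminfo_text):
    """Return (including_cache, excluding_cache) usage percentages.

    `meminfo_text` is the content of /proc/meminfo, whose lines look
    like "MemTotal:       16000000 kB".
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    including_cache = 100 * (total - fields["MemFree"]) / total
    excluding_cache = 100 * (total - fields["MemAvailable"]) / total
    return including_cache, excluding_cache


# Illustrative values, not taken from a real server.
sample = """MemTotal:       16000000 kB
MemFree:         1600000 kB
MemAvailable:    8000000 kB
"""

print(memory_usage_percent(sample))  # (90.0, 50.0)
```

On a Linux host the same function can be fed the live file with `memory_usage_percent(open("/proc/meminfo").read())`. A large gap between the two numbers suggests the server is caching heavily rather than running short of memory.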

Test Your Alert Configuration

Use Server Scout's test notification feature to verify your alerts reach the intended recipients. Test both email and webhook notifications to ensure your escalation procedures work correctly.

Regular testing ensures that when a genuine issue occurs, your team receives notifications promptly through the expected channels.

Well-configured alerts transform Server Scout from a monitoring tool into a proactive guardian of your infrastructure, providing early warning of issues whilst respecting your team's time and attention.

Frequently Asked Questions

What are Server Scout's default alert thresholds for new servers?

Server Scout automatically creates default alert conditions when you add a new server: CPU, memory, and disk usage each alert at 80% (warning) and 90% (critical), and an immediate alert fires when the server goes offline.

How long should I observe servers before adjusting alert thresholds?

You should observe your servers' normal operating patterns for at least a week before modifying thresholds. This baseline period reveals crucial insights about peak usage periods, normal operational ranges, and regular patterns that might trigger false alerts.

What is a sustain period and how does it prevent false alarms?

A sustain period is the time a threshold must be exceeded before triggering an alert, preventing brief normal spikes from causing unnecessary notifications. Setting sustain periods of 60-300 seconds helps avoid alerts from routine operations like cron jobs, deployments, or backup operations.

When should I use warning vs critical alert levels?

Warning alerts are for "investigate when convenient" situations, such as disk usage at 80% or CPU consistently above 85%. Critical alerts demand immediate attention: disk usage at 90%, memory usage at 95%, or a server going offline, all of which risk service failure.

How do I set different thresholds for special server types?

Use Server Scout's per-server condition overrides for systems that need special consideration. Build servers that regularly hit 95% CPU during compilation need higher CPU thresholds, whilst database servers with large buffer pools may normally operate at 90% memory usage.

What are the recommended alert threshold settings for most servers?

For most environments: disk usage at 80% warning and 90% critical with a 5-minute sustain period, CPU usage at 85% warning and 95% critical with a 5-minute sustain period, and memory usage at 85% warning and 95% critical with a 2-minute sustain period. These accommodate normal variations whilst catching genuine issues.

How do cooldown periods work in Server Scout alerts?

Cooldown periods prevent notification spam for ongoing issues by determining how long Server Scout waits before sending another notification for the same condition. A 30-60 minute cooldown works well for most metrics, giving you time to investigate without being bombarded with repeated alerts.
