Monitoring OOM Kills and Systemd Failures

Out-of-Memory (OOM) kills and systemd service failures are critical events that can significantly impact server stability and application availability. Unlike typical resource monitoring that tracks CPU, memory, and disk usage percentages, these events often occur between monitoring intervals and can be missed by traditional metrics. Server Scout provides specific monitoring capabilities for both scenarios to ensure you're alerted when these critical system events occur.

Understanding OOM Kills

An OOM kill occurs when the Linux kernel runs out of available memory and forcibly terminates one or more processes to free up system resources. This is the kernel's last resort to prevent complete system failure. The challenge with OOM kills is that they often happen quickly—memory usage spikes, the kernel kills processes, and by the next monitoring check, memory usage appears normal again.

Server Scout addresses this through the oom_kills_delta metric, which monitors the kernel's cumulative OOM kill counter (the oom_kill field in /proc/vmstat). This metric tracks increases in the counter between monitoring intervals, ensuring no OOM event goes unnoticed.
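You can inspect this counter yourself. A minimal sketch, assuming a Linux kernel (4.13 or later) that exposes the oom_kill field in /proc/vmstat; the function name is illustrative, not part of Server Scout:

```shell
# Print the kernel's cumulative OOM-kill counter.
# Accepts an alternate file path for testing; defaults to /proc/vmstat.
oom_count() {
  awk '$1 == "oom_kill" { print $2 }' "${1:-/proc/vmstat}"
}
```

On a live system, calling oom_count with no argument prints a single integer: the total number of OOM kills since boot.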

Setting Up OOM Kill Monitoring

The oom_kills_delta metric comes with a sensible default alert configuration that triggers whenever the delta value exceeds zero—meaning any OOM kill has occurred since the last check.

To configure OOM kill alerts:

  1. Navigate to your server's alert configuration in Server Scout
  2. Locate the "OOM Kills Delta" metric
  3. Set the threshold to 0 (alerts on any increase)
  4. Configure your notification preferences
# You can verify OOM kills manually on your server with:
grep -i "killed process" /var/log/kern.log
# Or, on systems that log to the systemd journal:
journalctl -k | grep -i "out of memory"

When an OOM kill occurs, you'll receive an immediate notification. These events are also logged in Server Scout's notification history, allowing you to correlate OOM kills with other system events and identify patterns.
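The counter comparison performed each interval can be sketched in shell. This is an illustrative reimplementation, not Server Scout's actual agent code; the function name and file arguments are assumptions made for the example:

```shell
# Report how much the OOM-kill counter grew since the previous sample.
# $1: a vmstat-style file (normally /proc/vmstat)
# $2: a state file holding the previous counter value
oom_kills_delta() {
  curr=$(awk '$1 == "oom_kill" { print $2 }' "$1")
  prev=$(cat "$2" 2>/dev/null || echo 0)
  echo "$curr" > "$2"
  echo $((curr - prev))
}
```

An alerting wrapper would simply compare the returned delta against the configured threshold (0 by default) and send a notification when it is exceeded.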

Monitoring Systemd Service Failures

Systemd manages most services on modern Linux distributions, and service failures can occur without immediately affecting overall system performance. A failed database connection service, backup job, or monitoring agent might not show up in CPU or memory graphs but could indicate serious problems.

The systemd_failed metric counts the number of systemd units currently in a failed state. This provides visibility into service health beyond resource consumption.
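The same count can be reproduced on the command line. A small sketch, assuming the --plain and --no-legend options of systemctl list-units; the helper name is illustrative:

```shell
# Count failed units: pipe the output of
#   systemctl list-units --failed --plain --no-legend
# into this helper, which counts the non-empty lines it receives.
count_failed_units() {
  grep -c .
}
```

For example: systemctl list-units --failed --plain --no-legend | count_failed_units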

Configuring Systemd Failure Alerts

Unlike OOM kills, where any occurrence warrants attention, systemd failures require more nuanced alerting. Some systems might normally have one or two non-critical failed units, while others should have zero failures.

To set up systemd failure monitoring:

  1. First, check your baseline failed unit count:
systemctl list-units --failed --no-pager
  2. In Server Scout, configure the systemd_failed metric threshold based on your baseline
  3. For most production servers, alerting when more than 2 units are failed provides good coverage without excessive noise
  4. For critical systems, consider setting the threshold to 0
# Monitor failed units manually:
systemctl --failed
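When the count is non-zero, you will want the unit names so each can be investigated with systemctl status and journalctl -u. A hedged helper, assuming --plain --no-legend output where the unit name is the first field of each line:

```shell
# Extract unit names from `systemctl list-units --failed --plain --no-legend`
# output supplied on stdin (first whitespace-separated field per line).
failed_unit_names() {
  awk 'NF { print $1 }'
}
```

For example: systemctl list-units --failed --plain --no-legend | failed_unit_names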

Why These Metrics Matter

Traditional monitoring focuses on resource utilization percentages, but OOM kills and service failures represent discrete events that can have outsized impacts:

  • OOM kills might completely resolve memory pressure before your next monitoring sample, making the event invisible in memory usage graphs
  • Failed services might consume minimal resources while being completely non-functional
  • Both events can indicate underlying issues that resource monitoring alone cannot detect

Practical Implementation Advice

For comprehensive server monitoring, implement these alerts alongside your standard resource monitoring:

  1. Layer your alerting: Use resource threshold alerts for gradual issues and event-based alerts for discrete problems
  2. Tune thresholds appropriately: Start with conservative settings and adjust based on your environment's normal behavior
  3. Review alert history: Regular analysis of OOM kills and service failures can reveal trends indicating capacity planning needs or recurring issues
  4. Correlate events: When investigating performance issues, check for recent OOM kills or service failures that might be contributing factors
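For the correlation step, a quick way to pull OOM-related lines out of a kernel log file; the default path and function name are illustrative, and journal-based systems would use journalctl -k instead:

```shell
# Print OOM-related lines from a kernel log file (default /var/log/kern.log).
oom_log_lines() {
  grep -iE 'out of memory|killed process' "${1:-/var/log/kern.log}"
}
```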

Viewing Historical Data

Server Scout maintains a complete history of these events in the notification log. This historical view allows you to:

  • Identify patterns in OOM kills that might indicate memory leaks or capacity issues
  • Track service reliability over time
  • Correlate system events with performance degradation

By monitoring both OOM kills and systemd failures alongside traditional resource metrics, you'll have comprehensive visibility into your server's health and be alerted to critical events that might otherwise go unnoticed until they cause user-visible problems.

Frequently Asked Questions

How do I set up OOM kill monitoring in Server Scout

Navigate to your server's alert configuration, locate the 'OOM Kills Delta' metric, set the threshold to 0 to alert on any OOM kill, and configure your notification preferences. The oom_kills_delta metric automatically tracks increases in the kernel's OOM kill counter from /proc/vmstat between monitoring intervals.

What is an OOM kill and why should I monitor it

An OOM kill occurs when the Linux kernel runs out of memory and forcibly terminates processes to prevent system failure. These events often happen quickly between monitoring intervals, making memory usage appear normal again by the next check. Monitoring OOM kills ensures you're alerted to critical memory events that traditional percentage-based monitoring might miss.

How do I configure systemd failure alerts properly

First check your baseline failed unit count using 'systemctl list-units --failed --no-pager', then configure the systemd_failed metric threshold based on your baseline. For most production servers, alerting when more than 2 units are failed provides good coverage, while critical systems should consider a threshold of 0.

Why don't regular monitoring alerts catch these issues

Traditional monitoring focuses on resource utilization percentages sampled at intervals. OOM kills might completely resolve memory pressure before the next sample, making them invisible in memory graphs. Failed services might consume minimal resources while being completely non-functional, so they don't show up in CPU or memory metrics.

What should I do when I get an OOM kill alert

Check the kernel logs using 'grep -i "killed process" /var/log/kern.log' (or 'journalctl -k' on systems that log to the journal) to identify which processes were terminated. Review Server Scout's notification history to correlate the OOM kill with other system events and identify patterns. This can help determine whether you need more memory, have a memory leak, or need to optimize resource usage.

Can I see historical data for OOM kills and systemd failures

Yes, Server Scout maintains a complete history of these events in the notification log. This allows you to identify patterns in OOM kills that might indicate memory leaks or capacity issues, track service reliability over time, and correlate system events with performance degradation for better troubleshooting.

What threshold should I set for systemd failure alerts

The threshold depends on your environment's normal behavior. Some systems might normally have 1-2 non-critical failed units, while others should have zero failures. Check your baseline with 'systemctl --failed' first, then set alerts accordingly. Most production servers work well with a threshold of 2, while critical systems should consider 0.
