Your production database server just killed its own MySQL process. The OOM killer struck at 2:47 AM, taking down your primary instance and triggering a cascade of connection failures. The logs show a simple message: "Out of memory: Kill process 15234 (mysqld) score 856 or sacrifice child".
What the logs don't show is that this kill was preventable. The OOM killer doesn't appear from nowhere - it leaves breadcrumbs scattered throughout /proc that most monitoring tools never check.
The False Security of Available Memory
Most sysadmins watch free -h and consider anything above 20% available memory safe territory. But available memory alone doesn't predict OOM kills. The kernel also tracks memory pressure - a measure of how hard the system is working to reclaim pages - and that's where the early warnings appear.
The kernel exposes this as pressure stall information (PSI) in /proc/pressure/memory. This file shows what percentage of time tasks spend stalled waiting on memory reclaim:
cat /proc/pressure/memory
some avg10=2.50 avg60=1.80 avg300=0.95 total=145234
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
The some line records time when at least one task stalled waiting on memory; full records time when every non-idle task stalled at once. When the some averages consistently stay above 10%, you're in dangerous territory. Any non-zero full value means processes are already fully stalled waiting on memory reclaim.
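A lightweight check can parse these two lines directly. The sketch below is a minimal example, assuming the format shown above; the 5%/10% thresholds mirror the guidance in this article, not any kernel default, and the function names are my own.

```python
# Thresholds follow this article's guidance; tune them for your workload.
SOME_WARN = 5.0
SOME_CRIT = 10.0

def parse_psi(text):
    """Parse /proc/pressure/memory text into {'some': {...}, 'full': {...}}."""
    metrics = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)
        fields = dict(kv.split("=") for kv in rest.split())
        metrics[kind] = {k: float(v) for k, v in fields.items()}
    return metrics

def classify(metrics):
    """Map PSI averages to ok / warning / critical."""
    if metrics["full"]["avg10"] > 0:
        return "critical"          # tasks already fully stalled
    some10 = metrics["some"]["avg10"]
    if some10 >= SOME_CRIT:
        return "critical"
    if some10 >= SOME_WARN:
        return "warning"
    return "ok"

sample = """some avg10=2.50 avg60=1.80 avg300=0.95 total=145234
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"""
print(classify(parse_psi(sample)))  # ok
```

In production you would read the real file with open("/proc/pressure/memory").read() on a kernel built with PSI support (CONFIG_PSI).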
Server Scout's memory monitoring tracks these pressure metrics alongside traditional memory usage, giving you the complete picture of system health.
Swap Activity Patterns That Predict Kills
Swap usage itself isn't the problem - swap activity patterns are. A system with 2GB of swap that's been stable for weeks suddenly showing frequent page-outs is screaming for attention.
Check /proc/vmstat for these key counters:
grep -E "(pswpin|pswpout|pgmajfault)" /proc/vmstat
Rapidly increasing pswpout values indicate the system is pushing pages to swap faster than normal. More importantly, rising pgmajfault counts mean processes are page-faulting on swapped-out pages - a sure sign that working sets no longer fit in RAM. Note that these counters are cumulative since boot, so it's their rate of change that matters, not their absolute values.
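Because the counters only grow, detecting a pattern means sampling twice and comparing. Here's a minimal sketch of that delta calculation; the sample values are invented for illustration, and in practice the two snapshots would come from reading /proc/vmstat some interval apart.

```python
def read_vmstat_counters(text, names=("pswpin", "pswpout", "pgmajfault")):
    """Pull the named cumulative counters out of /proc/vmstat text."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key in names:
            counters[key] = int(value)
    return counters

def per_second_rates(before, after, interval_secs):
    """Counters are cumulative since boot, so only deltas are meaningful."""
    return {k: (after[k] - before[k]) / interval_secs for k in before}

# Two snapshots taken 60 seconds apart (illustrative values, not a real host).
t0 = read_vmstat_counters("pswpin 1000\npswpout 5000\npgmajfault 20000\n")
t1 = read_vmstat_counters("pswpin 1010\npswpout 8000\npgmajfault 21200\n")
print(per_second_rates(t0, t1, interval_secs=60))
```

A sustained pswpout rate of dozens of pages per second on a previously quiet box is exactly the pattern worth alerting on.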
Understanding memory accounting beyond basic tools helps you interpret these patterns correctly.
The Allocation Failure Cascade
Before the OOM killer activates, the kernel logs allocation failures to its ring buffer - visible via dmesg, and typically persisted to /var/log/kern.log on Debian-based systems or /var/log/messages on Red Hat-based ones. These warnings appear long before any process gets killed:
kern.log: page allocation failure: order:3, mode:0x10c0d0
These failures indicate the kernel can't find contiguous memory blocks, even when plenty of memory appears available. It's often the last warning you'll get before OOM conditions develop.
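Scanning for these messages is straightforward to automate. This is a minimal sketch that matches the warning format shown above against a batch of log lines; the sample lines are hypothetical.

```python
import re

# Matches the "page allocation failure" warning format shown above.
ALLOC_FAIL = re.compile(r"page allocation failure: order:(\d+)")

def high_order_failures(log_lines, min_order=3):
    """Return (order, line) pairs for failures at or above min_order."""
    hits = []
    for line in log_lines:
        m = ALLOC_FAIL.search(line)
        if m and int(m.group(1)) >= min_order:
            hits.append((int(m.group(1)), line.strip()))
    return hits

# Hypothetical log lines for illustration.
lines = [
    "kern.log: page allocation failure: order:3, mode:0x10c0d0",
    "kern.log: normal startup message",
]
print(high_order_failures(lines))
```

Feeding it the output of dmesg (or tailing the log file) and alerting on any hit gives you the warning well before the first kill.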
Memory Fragmentation Signals
Check /proc/buddyinfo to see memory fragmentation levels:
cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal     48     18      8      2      1      0      0      0      0      0      0
When higher-order pages (the rightmost columns) show mostly zeros, your system has plenty of total memory but it's fragmented. Large allocations will fail, triggering OOM conditions even with gigabytes technically free.
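Turning that output into an alert means checking whether any free blocks survive at the orders you care about. A minimal sketch, assuming the /proc/buddyinfo layout shown above (function names are my own):

```python
def parse_buddyinfo(text):
    """Map (node, zone) to free-block counts by order (leftmost = order 0)."""
    zones = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # e.g. ['Node', '0,', 'zone', 'Normal', '48', '18', ...]
        node, zone = parts[1].rstrip(","), parts[3]
        zones[(node, zone)] = [int(n) for n in parts[4:]]
    return zones

def fragmented(counts, min_order=3):
    """True when no free blocks remain at min_order or above."""
    return sum(counts[min_order:]) == 0

sample = """Node 0, zone DMA 1 1 1 0 2 1 1 0 1 1 3
Node 0, zone Normal 48 18 8 2 1 0 0 0 0 0 0"""
zones = parse_buddyinfo(sample)
# Nothing free above order 4 in the Normal zone of this sample.
print(fragmented(zones[("0", "Normal")], min_order=5))  # True
```

An order-3 allocation (like the failure logged earlier) needs a contiguous 8-page block, so a zone where fragmented(counts, 3) holds will fail such requests no matter how much total memory is free.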
The Linux kernel documentation explains memory management concepts in detail.
Building Effective Early Warning
Combine these signals into a composite memory pressure score:
- PSI memory pressure above 5% (warning) or 10% (critical)
- Major page faults increasing by more than 20% per hour
- Buddy allocator showing fragmentation in order-3+ pages
- Any allocation failure messages in kernel logs
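The four checks above can be folded into one function. This is a sketch under the thresholds listed in this article - the signature and cutoffs are assumptions to tune, not a fixed formula:

```python
def memory_pressure_level(psi_some_avg10, majfault_growth_pct_per_hour,
                          free_order3_plus_blocks, alloc_failures_seen):
    """Combine the four signals listed above into ok / warning / critical.

    Thresholds mirror this article's guidance; tune them per workload.
    """
    # Allocation failures or critical PSI mean an OOM kill may be imminent.
    if alloc_failures_seen or psi_some_avg10 >= 10.0:
        return "critical"
    # Any single early-warning signal earns a warning.
    if (psi_some_avg10 >= 5.0
            or majfault_growth_pct_per_hour > 20.0
            or free_order3_plus_blocks == 0):
        return "warning"
    return "ok"

# A healthy host: low PSI, stable faults, order-3+ blocks available, no failures.
print(memory_pressure_level(2.5, 5.0, 12, False))  # ok
```

Feeding this from the PSI, vmstat, buddyinfo, and log checks described earlier gives you a single signal your alerting system can act on.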
Monitoring tools that track these patterns help you catch problems before they become outages.
The OOM killer's job is to save your system when memory runs out. Your job is ensuring it never needs to step in. These /proc indicators give you the visibility to stay ahead of memory pressure and keep your services running smoothly.
Server Scout monitors these memory pressure indicators alongside traditional metrics, giving you the early warning system you need to prevent OOM kills before they happen.