
Container Memory Pressure Hidden in cgroups: The Pod Restart Investigation Standard Tools Couldn't Solve

By Server Scout

Last month, a mid-sized hosting company running 200+ Kubernetes pods faced an impossible problem. Their containers kept restarting every 2-3 hours, but kubectl logs showed nothing. Health checks passed. Application metrics looked normal. Even the pod events showed clean restarts with exit code 0.

The restarts weren't random. They followed a pattern tied to memory allocation, but memory usage graphs showed pods consuming only 60-70% of their limits. Traditional monitoring missed the real culprit: subtle memory pressure signals buried in cgroups v2 that triggered container runtime restarts before any OOMKilled events appeared.

The Mystery: Healthy Pods That Keep Dying

Their investigation started with the usual suspects. kubectl describe pods showed normal resource allocation. Application logs contained no errors. Pod metrics in Prometheus indicated healthy memory and CPU usage well within limits.

But the restart pattern was too consistent to ignore. Pods handling customer data processing would restart precisely when memory allocation reached certain thresholds - not the configured limits, but something lower and invisible to standard tooling.

The breakthrough came from examining cgroups v2 directly on the worker nodes.

Initial Investigation - When Standard Tools Fail

Traditional Kubernetes troubleshooting relies on kubectl logs, pod events, and metrics scraped from the kubelet. All of these sources showed clean container behaviour. No OOM events, no resource limit violations, no application errors.

This is where most teams give up or blame "Kubernetes being weird." But the hosting company's sysadmin noticed something crucial: the restarts correlated with memory allocation spikes, not memory usage spikes.

Discovering the cgroups Memory Pressure Signal

While kubectl and container metrics show memory usage, they don't expose memory pressure indicators that cgroups v2 tracks internally. The memory.pressure file contains PSI (Pressure Stall Information) metrics that reveal when containers experience memory allocation stress even when total usage remains low.

Here's what they found in the memory.pressure file under each affected pod's cgroup directory (below /sys/fs/cgroup on cgroups v2 nodes):

some avg10=12.34 avg60=8.91 avg300=3.45 total=892847
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" values indicated intervals where at least one task in the cgroup stalled waiting for memory allocation, even though total memory consumption stayed within limits. This pressure triggered container restarts as a protective measure.
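PSI lines like the two above follow a fixed key=value format, so they're easy to collect programmatically. A minimal parsing sketch in Python (the field names come from the kernel's PSI format; the sample line is the one shown above):

```python
def parse_psi_line(line: str) -> dict:
    """Parse one PSI line, e.g. 'some avg10=12.34 avg60=8.91 avg300=3.45 total=892847'."""
    kind, *fields = line.split()
    stats = {}
    for field in fields:
        key, value = field.split("=")
        stats[key] = float(value)
    return {"kind": kind, **stats}

psi = parse_psi_line("some avg10=12.34 avg60=8.91 avg300=3.45 total=892847")
print(psi["avg10"])  # 12.34
```

In practice you would read the two lines from the pod's memory.pressure file and feed each through this function, keeping the "some" record as the early-warning signal.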

Reading cgroups v2 Memory Statistics

The key insight was understanding that memory.current vs memory.high thresholds create a grey zone where containers experience allocation pressure without hitting hard limits. When applications request memory faster than the kernel can allocate it cleanly, pressure builds up.

This pressure doesn't show up in application logs because the allocation eventually succeeds. It doesn't trigger OOM events because total usage never exceeds limits. But the container runtime sees the pressure signals and preemptively restarts pods to prevent performance degradation.

The Root Cause: Subtle Memory Pressure Below OOM Thresholds

The hosting company discovered their pods were hitting memory allocation bottlenecks during garbage collection cycles. Java applications would request large contiguous memory blocks for heap cleanup, creating temporary pressure spikes that lasted milliseconds but triggered PSI metrics.

Kubernetes health checks couldn't detect this because they ran between GC cycles when memory pressure had already subsided. Application logging missed it because the memory allocations ultimately succeeded - just not smoothly.

Why Application Logs Miss This Condition

Application-level monitoring focuses on successful operations and obvious failures. Memory allocation pressure exists in the kernel space between request and fulfilment. By the time an application would log memory issues, either the allocation has succeeded (no log entry) or failed completely (OOM event).

The cgroups PSI metrics capture this intermediate state where allocation stress occurs without complete failure.

Building Container Health Forensics

Their solution involved monitoring cgroups v2 directly from the worker nodes, tracking memory pressure metrics alongside traditional usage metrics. Server Scout's monitoring approach proved ideal for this kind of system-level investigation that goes beyond standard container metrics.

The key was correlating PSI pressure spikes with pod restart times, revealing the invisible connection between kernel-level memory stress and container lifecycle events.
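The correlation itself can be sketched simply: sample PSI periodically, then check whether each restart fell inside a pressure spike window. A minimal illustration with synthetic data; the spike threshold and time window here are assumptions for the example, not values from the investigation:

```python
from datetime import datetime, timedelta

def restarts_near_spikes(psi_samples, restarts, threshold=10.0, window=timedelta(minutes=2)):
    """psi_samples: list of (timestamp, some_avg10) readings; restarts: list of timestamps.
    Returns the restarts that occurred within `window` of a PSI sample above `threshold`."""
    spikes = [ts for ts, avg10 in psi_samples if avg10 > threshold]
    return [r for r in restarts
            if any(abs(r - s) <= window for s in spikes)]

# Synthetic example: one pressure spike at 12:05, restarts at 12:06 and 12:30.
t0 = datetime(2024, 1, 1, 12, 0)
samples = [(t0, 1.2), (t0 + timedelta(minutes=5), 14.7), (t0 + timedelta(minutes=10), 0.8)]
restarts = [t0 + timedelta(minutes=6), t0 + timedelta(minutes=30)]
print(restarts_near_spikes(samples, restarts))  # only the 12:06 restart matches
```

Run over real node-level samples and the restart timestamps from pod events, this kind of join is what surfaced the connection between kernel-level stress and container lifecycle events.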

Integration with Kubernetes Events

By combining cgroups pressure data with Kubernetes event streams, they built a complete picture of why containers restart. This forensic approach revealed that 80% of "mysterious" restarts correlated with memory.pressure spikes above baseline thresholds.

Prevention and Long-term Monitoring

Once they understood the cgroups memory pressure signals, the hosting company adjusted their pod memory configurations and implemented unified infrastructure monitoring that tracked both traditional metrics and PSI indicators.
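One common way to smooth allocation in a setup like this is to set memory requests equal to limits (giving the pod a Guaranteed QoS class) and pin the JVM heap so it doesn't grow in bursts. A hedged sketch; the names and values below are illustrative, not the hosting company's actual configuration:

```yaml
resources:
  requests:
    memory: "2Gi"   # equal to the limit -> Guaranteed QoS, more predictable allocation
  limits:
    memory: "2Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-Xms1g -Xmx1g"   # fixed heap size avoids bursty growth during GC cycles
```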

They also refined their customer resource isolation strategies to prevent memory pressure cascades between neighbouring pods.

The solution wasn't increasing memory limits - it was smoothing memory allocation patterns and monitoring the pressure signals that predict restarts before they happen. Understanding cgroups v2 memory management turned an impossible debugging problem into a predictable, manageable monitoring challenge.

Modern container orchestration creates new categories of system behaviour that traditional monitoring doesn't capture. The most interesting problems often live in these gaps between application awareness and kernel reality, waiting for someone curious enough to read the /proc and /sys filesystems directly.

FAQ

Why don't standard Kubernetes monitoring tools show memory pressure?

Most tools focus on memory usage metrics from the kubelet API, which doesn't expose cgroups v2 PSI (Pressure Stall Information) data. Memory pressure exists at the kernel level between allocation request and fulfilment, invisible to application-level monitoring.

How can you tell if pod restarts are caused by memory pressure vs other issues?

Check the memory.pressure file in each pod's cgroup directory (under /sys/fs/cgroup) on worker nodes during restart events. If the "some" PSI values show spikes correlating with restart times, memory allocation pressure is likely the cause rather than resource exhaustion or application errors.

Do memory pressure restarts show up as OOMKilled events in Kubernetes?

No, memory pressure restarts happen before OOM conditions occur. The container runtime restarts pods when it detects allocation stress, preventing memory exhaustion. These appear as clean exits (code 0) rather than OOMKilled events in pod status.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial