
How cgroups v2 Memory Accounting Broke Our Container Monitoring

By Server Scout

TechStack Hosting runs 400 Docker containers across 50 Ubuntu 22.04 hosts. Their monitoring showed container memory usage hovering around 60-70% of allocated limits. Yet every few days, the OOM killer would terminate containers seemingly at random.

The pattern made no sense. Containers with 2GB limits showing 1.2GB usage would die whilst containers at 90% usage stayed healthy. The monitoring dashboard painted a picture of stability right up until processes started disappearing.

The Memory Reporting Mismatch

The problem surfaced during a routine update cycle. TechStack had migrated from Ubuntu 20.04 (systemd 245, cgroups v1) to Ubuntu 22.04 (systemd 249, cgroups v2 unified hierarchy). Their monitoring agents continued reading Docker's /sys/fs/cgroup/memory/docker/[containerid]/memory.usage_in_bytes file, but this file no longer existed.

Docker automatically fell back to estimating memory usage from the new cgroups v2 structure. The problem? cgroups v2 fundamentally changed how memory gets accounted.

Container Metrics vs Kernel Reality

Under cgroups v1, memory.usage_in_bytes included all memory charged to the cgroup: anonymous pages, page cache, kernel memory, and swap. The kernel's OOM killer used the same accounting.

With cgroups v2, memory.current reports only the memory that counts toward the limit. Page cache that can be reclaimed under pressure doesn't appear in this figure, even though it consumes physical RAM.

The containers weren't misbehaving. The monitoring was blind to 30-40% of their actual memory footprint.

Understanding cgroups v2 Memory Accounting

TechStack's containers were running database replicas and file processing jobs. These workloads generate substantial page cache activity that doesn't appear in memory.current readings.

A typical container showed:

  • memory.current: 1.2GB (what monitoring reported)
  • memory.max: 2.0GB (the configured limit)
  • Actual kernel memory pressure: 1.8GB+ (including reclaimable cache)
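One hedged way to surface the hidden cache figure is to read the file line from memory.stat alongside memory.current. A minimal sketch, assuming a systemd-managed scope under cgroups v2; the docker-abc123 scope name is a hypothetical placeholder:

```shell
#!/bin/sh
# Sketch: show naive limit utilisation next to the page cache figure.
# The docker-abc123 scope name is a hypothetical placeholder; memory.max
# is assumed to hold a numeric limit (not the string "max").
CG=/sys/fs/cgroup/system.slice/docker-abc123.scope

pct() {  # pct <bytes> <limit-bytes> -> integer percent of limit
    echo $(( $1 * 100 / $2 ))
}

if [ -r "$CG/memory.current" ]; then
    cur=$(cat "$CG/memory.current")
    max=$(cat "$CG/memory.max")
    file=$(awk '$1 == "file" {print $2}' "$CG/memory.stat")
    echo "memory.current: $(pct "$cur" "$max")% of limit"
    echo "page cache (memory.stat file): $(( ${file:-0} / 1048576 )) MiB"
fi
```

On the figures above, pct 1258291200 2147483648 reports 58%: exactly the comfortable-looking number the dashboards showed.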

memory.current vs memory.usage_in_bytes

The difference isn't academic. Under memory pressure, the kernel attempts to reclaim page cache before invoking the OOM killer. If that reclamation fails or takes too long, processes die despite appearing to have headroom.

Page Cache Attribution Changes

In cgroups v2, page cache gets more nuanced treatment. Cache that's "easily reclaimable" doesn't count toward limits until the kernel actually needs to reclaim it. This creates a reporting gap between what monitoring sees and what triggers OOM conditions.

The article "Docker Memory Limits vs Host Reporting: Why cgroups v2 Changes Everything" covers the technical details, but the practical impact is clear: traditional container memory monitoring becomes misleading.

Debugging the Discrepancy

TechStack's breakthrough came from comparing multiple data sources simultaneously. Server Scout's monitoring agent was already collecting host-level memory metrics from /proc/meminfo, which showed the full picture.

Comparing /sys/fs/cgroup Memory Files

Running find /sys/fs/cgroup -name "memory.*" -exec grep -H . {} \; on affected containers revealed the gap:

/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.current:1258291200
/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max:2147483648
/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.events:oom_kill 3

The memory.events file showed OOM kills despite apparent headroom in memory.current.
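That same check can be swept across a whole host to flag every affected container. A minimal sketch, assuming Docker scopes live under system.slice on a cgroups v2 hierarchy:

```shell
#!/bin/sh
# Sketch: list container scopes whose memory.events records OOM kills.
# Assumes systemd-managed Docker scopes under system.slice (cgroups v2).
oom_kills() {  # oom_kills <memory.events-file> -> oom_kill count
    awk '$1 == "oom_kill" {print $2}' "$1"
}

for ev in /sys/fs/cgroup/system.slice/docker-*.scope/memory.events; do
    [ -r "$ev" ] || continue
    kills=$(oom_kills "$ev")
    if [ "${kills:-0}" -gt 0 ]; then
        echo "$(dirname "$ev"): $kills OOM kill(s) despite reported headroom"
    fi
done
```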

Using systemd-cgtop for Real-Time Validation

The systemd-cgtop tool displays live cgroups v2 memory consumption, including non-counted cache. Comparing its output against Docker stats revealed consistent 25-35% underreporting across database containers.

Referencing the Linux kernel cgroup documentation confirmed the behavioural changes. The kernel's memory accounting philosophy shifted from "charge everything" to "charge what matters for limits".

Production Fix and Monitoring Adjustments

Updating Alert Thresholds

TechStack lowered their container memory alert thresholds from 85% to 60% of allocated limits. This accounts for the invisible page cache that could trigger OOM conditions.

For database containers specifically, they implemented custom memory tracking through /proc/[pid]/status files, parsing VmRSS values directly rather than relying on cgroup reporting.
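That per-process approach can be sketched as a walk over the scope's cgroup.procs file, summing VmRSS from each process's /proc status file. The docker-abc123 scope name is a hypothetical placeholder:

```shell
#!/bin/sh
# Sketch: sum resident set size across a container's processes directly
# from /proc, sidestepping cgroup accounting entirely.
# The docker-abc123 scope name is a hypothetical placeholder.
vmrss_kib() {  # vmrss_kib <status-file> -> VmRSS in KiB (empty if absent)
    awk '$1 == "VmRSS:" {print $2}' "$1" 2>/dev/null
}

CG=/sys/fs/cgroup/system.slice/docker-abc123.scope
total=0
if [ -r "$CG/cgroup.procs" ]; then
    while read -r pid; do
        kib=$(vmrss_kib "/proc/$pid/status")
        total=$(( total + ${kib:-0} ))   # kernel threads report no VmRSS
    done < "$CG/cgroup.procs"
    echo "summed VmRSS: $(( total / 1024 )) MiB"
fi
```

Note that summed VmRSS overcounts shared pages mapped by several processes, so it is an upper bound rather than an exact figure.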

Adding Kernel Memory Metrics

The solution involved monitoring multiple metrics simultaneously:

  • Container memory.current for application memory
  • Host /proc/meminfo for overall memory pressure
  • Per-container memory.events for OOM kill tracking
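A single collection pass over those three sources might look like the sketch below. The awk field names match /proc/meminfo and the cgroups v2 files; scope discovery assumes systemd-managed Docker under system.slice:

```shell
#!/bin/sh
# Sketch: one collection pass over the three metric sources listed above.
# Assumes Linux with cgroups v2 and Docker scopes under system.slice.
host_avail_kib() {  # MemAvailable from /proc/meminfo, in KiB
    awk '$1 == "MemAvailable:" {print $2}' /proc/meminfo
}

avail=$(host_avail_kib)
echo "host MemAvailable: $(( ${avail:-0} / 1024 )) MiB"

for cg in /sys/fs/cgroup/system.slice/docker-*.scope; do
    [ -r "$cg/memory.current" ] || continue
    printf '%s current=%s oom_kills=%s\n' "$cg" \
        "$(cat "$cg/memory.current")" \
        "$(awk '$1 == "oom_kill" {print $2}' "$cg/memory.events")"
done
```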

Server Scout's lightweight monitoring approach made it practical to collect all three metric types without adding overhead. The bash agent reads cgroups v2 files directly, providing accurate resource reporting regardless of Docker's interpretation.

Six months later, TechStack hasn't experienced an unexpected OOM kill. Their monitoring now reflects kernel reality rather than container abstractions. The lesson: cgroups v2 requires monitoring tools that understand its fundamental changes to memory accounting.

Modern container environments need monitoring that bridges the gap between application metrics and kernel behaviour. Tools that assume cgroups v1 semantics will consistently underreport memory usage, creating blind spots exactly where you need visibility most.

FAQ

How do I check if my system is using cgroups v2?

Run mount | grep cgroup - if you see a single cgroup2 mount at /sys/fs/cgroup, you're using cgroups v2. Systems with both cgroup and cgroup2 mounts are running in hybrid mode.
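An equivalent check reads the filesystem type directly (this is the detection method the systemd documentation suggests; assumes GNU stat):

```shell
# Prints cgroup2fs on a pure cgroups v2 host; tmpfs indicates v1 or hybrid.
stat -fc %T /sys/fs/cgroup
```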

Can I revert to cgroups v1 to fix monitoring issues?

Yes: add systemd.unified_cgroup_hierarchy=false to your kernel boot parameters. However, updating your monitoring to handle cgroups v2 properly is the better long-term solution, as cgroups v1 is deprecated.

Why doesn't Docker stats show the same memory usage as system monitoring tools?

Docker stats reads from cgroups v2's memory.current file, which excludes reclaimable page cache. System tools like htop read from /proc files that include all memory types, creating apparent discrepancies.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial