
Debugging Memory Leaks in Production: When valgrind Isn't an Option and ps Shows Steady Growth

· Server Scout

Your PostgreSQL process has grown from 2GB to 8GB over the past fortnight, and you're not entirely sure why. The database is handling the same workload, query patterns haven't changed significantly, and yet ps aux shows steady memory growth that doesn't correlate with connection count or cache usage.

This is the classic production memory leak scenario. Unlike development environments where you can attach debuggers or run valgrind, production systems need different approaches.

Reading the Memory Maps

Start with /proc/[pid]/smaps rather than the basic memory counters everyone checks first. This gives you a breakdown of memory regions by type:

awk '/^Size:/ { sum += $2 } /^Rss:/ { rss_sum += $2 } END { print "Virtual:", sum, "KB", "Resident:", rss_sum, "KB" }' /proc/$(pidof -s postgres)/smaps

More importantly, look for anonymous memory regions that are growing. These often indicate heap fragmentation or unreleased allocations:

awk '/\[heap\]/ { h = 1 } h && /^(Size|Rss):/ { print } h && /^VmFlags:/ { exit }' /proc/$(pidof -s postgres)/smaps

If you see the heap VMA (virtual memory area) expanding but RSS (resident set size) not shrinking during quiet periods, you've likely found your culprit.
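Sampling this by hand gets tedious, so the heap-region extraction can be wrapped in a small helper and run from cron or a loop. This is a sketch; parse_heap is a name introduced here, and it assumes a single postgres PID (hence pidof -s):

```shell
# parse_heap: print the [heap] region's Size and Rss lines from an smaps stream.
parse_heap() {
  awk '/\[heap\]/ { h = 1 }                                # start at the heap region header
       h && /^(Size|Rss):/ { printf "%s %s kB\n", $1, $2 } # the two fields we care about
       h && /^VmFlags:/ { exit }'                          # VmFlags closes the heap entry
}

# Usage against a live process:
#   parse_heap < /proc/$(pidof -s postgres)/smaps
```

Log a timestamped line of this output every hour: a genuine heap leak shows up as Size climbing while Rss refuses to fall during quiet periods.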

Tracking Allocator Behaviour

Glibc's malloc implementation holds onto freed memory in hopes of reusing it. This creates false positives when hunting memory leaks. Check if your application links against an alternative allocator:

ldd $(which postgres) | grep -E 'jemalloc|tcmalloc'

For applications using glibc malloc, the MALLOC_TRIM_THRESHOLD_ environment variable controls when freed memory is returned to the system. Set it lower and restart (when possible) to see whether apparent leaks are actually allocator behaviour.
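As a concrete sketch (the 128 KB value and the binary name are illustrative, not from any particular deployment), the threshold is set in bytes in the process environment:

```shell
# Ask glibc to return freed blocks to the kernel once more than 128 KB sits
# free at the top of the heap. The tunable is read when malloc initialises,
# so the process must be restarted; ./my-server stands in for your binary.
MALLOC_TRIM_THRESHOLD_=131072 ./my-server
```

If RSS now drops back after load spikes, what looked like a leak was the allocator retaining freed memory.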

The pmap Approach

While ps shows totals, pmap reveals memory layout changes over time:

pmap -x $(pidof -s postgres) | tail -1

Run this every few hours and compare the totals. More usefully, diff the full output between quiet periods. Growing anonymous regions suggest genuine leaks rather than legitimate cache growth.
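A minimal way to automate that comparison (the function name and snapshot paths are illustrative):

```shell
# diff_anon: show pmap lines that changed between two snapshots, keeping only
# anonymous regions, which are the usual home of leaked heap allocations.
diff_anon() {
  diff "$1" "$2" | grep 'anon'
}

# Usage sketch, a few hours apart during quiet periods:
#   pmap -x $(pidof -s postgres) > /var/tmp/snap-0200.txt
#   pmap -x $(pidof -s postgres) > /var/tmp/snap-0800.txt
#   diff_anon /var/tmp/snap-0200.txt /var/tmp/snap-0800.txt
```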

The Linux kernel documentation explains these memory statistics in detail, particularly the difference between virtual and physical memory accounting.

Application-Specific Debugging

Most applications provide internal memory statistics. PostgreSQL has pg_stat_activity and shared buffer metrics. Redis offers INFO memory. Apache exposes per-worker status via mod_status.

Cross-reference these internal counters with system-level memory usage. If application memory stays constant but system memory grows, you're looking at a leak in shared libraries or kernel resources.
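For example, Redis reports its own byte count under used_memory in INFO memory; a small helper (the function name is mine, not a Redis command) converts it for comparison against the kernel's VmRSS figure:

```shell
# used_memory_kb: pull used_memory (bytes) out of `redis-cli INFO memory`
# output and convert to KB. The numeric coercion in awk also discards the
# carriage return that redis-cli appends to each line.
used_memory_kb() {
  awk -F: '/^used_memory:/ { printf "%d\n", $2 / 1024; exit }'
}

# Usage sketch:
#   internal=$(redis-cli INFO memory | used_memory_kb)
#   system=$(awk '/^VmRSS:/ { print $2 }' /proc/$(pidof -s redis-server)/status)
#   echo "internal: ${internal} KB, kernel RSS: ${system} KB"
```

A kernel RSS far above the internal figure points at fragmentation or a leak outside the allocations Redis tracks.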

Long-Term Tracking

Memory leaks often correlate with specific operations or time patterns. Server Scout's memory monitoring tracks these trends over weeks rather than hours, making it easier to spot gradual leaks that daily checks miss.

The key is establishing baseline memory usage during known-quiet periods, then measuring deviation rather than absolute values.
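That deviation check is easy to script. A sketch, with a helper of my own naming; the 20% threshold and baseline file path are illustrative:

```shell
# deviation_pct CURRENT BASELINE: integer percentage growth over the baseline.
deviation_pct() {
  echo $(( ($1 - $2) * 100 / $2 ))
}

# Usage sketch from cron:
#   baseline=$(cat /var/tmp/rss-baseline-kb)   # recorded in a known-quiet period
#   rss=$(awk '/^VmRSS:/ { print $2 }' /proc/$(pidof -s postgres)/status)
#   [ "$(deviation_pct "$rss" "$baseline")" -gt 20 ] && echo "RSS ${rss} KB exceeds baseline"
```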

If you're dealing with intermittent memory growth that's hard to catch manually, Server Scout's free trial includes historical memory tracking to help identify patterns across longer timeframes.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial