Kubernetes Network Congestion Hidden Below the Dashboard: Why CNI Plugins Saturate Before Pod Metrics Show Problems

· Server Scout

Your Kubernetes cluster shows healthy pod metrics, normal CPU usage, and plenty of available memory. But applications are timing out, inter-pod communication is dropping packets, and users are complaining about intermittent connectivity issues. The problem isn't in your pods - it's in the network layer between them, where CNI plugins create bottlenecks that standard container monitoring never reveals.

The Hidden Layer: Why CNI Metrics Matter More Than Pod Metrics

Kubernetes dashboards focus on application-level metrics: request latency, error rates, and resource consumption. But between your pods and the physical network interface lies a complex stack of kernel networking components, iptables rules, and CNI plugin logic that processes every packet your containers send or receive.

When a pod in your cluster sends data to another pod, that traffic doesn't travel directly between containers. It flows through the CNI plugin's virtual interfaces, gets processed by netfilter rules, passes through kernel network queues, and potentially crosses multiple network namespaces before reaching its destination. Each step consumes CPU cycles and buffer space that standard monitoring tools never measure.

Flannel creates additional iptables rules for every pod subnet. Calico maintains extensive routing tables that kernel networking code must traverse for each packet. Weave, when its encryption mode is enabled, adds crypto processing overhead to every tunneled packet. None of these costs appear in your container metrics, but they accumulate into system-level bottlenecks that cause packet drops and connection timeouts.

Decoding softnet_stat: The Kernel's Network Truth

The /proc/net/softnet_stat file exposes per-CPU network packet processing statistics that reveal CNI plugin performance impact. Each line represents one CPU core, and the columns show packet processing counters that directly correlate with network bottlenecks.

Reading the Critical Columns

Column 1 shows the total packets processed by each CPU's softirq handler. Column 2 counts packets dropped because the per-CPU backlog queue overflowed - the net.core.netdev_max_backlog limit was reached before the kernel could drain incoming packets. Column 3 records time_squeeze events, where the handler exhausted its netdev_budget (or its netdev_budget_usecs time slice) while packets still remained in the queue. Note that the kernel prints all of these values in hexadecimal.

In a healthy system, columns 2 and 3 should remain near zero. But CNI plugins that create complex iptables rules or virtual interface hierarchies can push packet processing times beyond kernel limits, causing drops that never appear in application monitoring.
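A short parser can turn those lines into named counters. A minimal sketch, assuming the common layout where the first three columns are processed/dropped/time_squeeze (the kernel prints every value in hexadecimal, and later kernels append extra columns that this sketch ignores):

```python
# Parse /proc/net/softnet_stat: one line per CPU core, all values in hex.
# Only the first three (stable) columns are extracted here.

def parse_softnet_stat(text):
    """Return a list with one dict of counters per CPU core."""
    cpus = []
    for line in text.strip().splitlines():
        fields = [int(f, 16) for f in line.split()]
        cpus.append({
            "processed": fields[0],     # packets handled by this core's softirq
            "dropped": fields[1],       # backlog-queue overflow drops
            "time_squeeze": fields[2],  # budget exhausted with packets pending
        })
    return cpus

# Example against a captured line (0x272d packets processed, 1 squeeze).
sample = ("0000272d 00000000 00000001 00000000 00000000 00000000 "
          "00000000 00000000 00000000 00000000 00000000\n")
stats = parse_softnet_stat(sample)
```

In production the `text` argument would simply be the contents of `open("/proc/net/softnet_stat").read()`.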

Establishing Baseline Thresholds

Baseline your softnet_stat counters during normal traffic periods to establish what packet processing rates your CNI configuration can sustain. Track the ratio between column 1 (processed) and column 2 (dropped) across all CPU cores. A drop rate above 0.1% typically indicates approaching saturation, while squeeze events above 1% of total packets suggest CPU cores can't keep up with network interrupt processing.
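Those thresholds can be checked mechanically by diffing two snapshots taken an interval apart. A sketch, assuming per-CPU dicts like a softnet_stat parser would produce; the 0.1% and 1% limits are the illustrative baselines from the text, not kernel constants:

```python
# Compare two softnet_stat snapshots and flag CPUs whose drop or squeeze
# rate over the interval crosses the (assumed, tunable) thresholds.

DROP_RATE_LIMIT = 0.001    # 0.1% of processed packets
SQUEEZE_RATE_LIMIT = 0.01  # 1% of processed packets

def saturation_alerts(before, after):
    """before/after: per-CPU dicts with processed/dropped/time_squeeze."""
    alerts = []
    for cpu, (b, a) in enumerate(zip(before, after)):
        processed = a["processed"] - b["processed"]
        if processed <= 0:
            continue  # idle core this interval
        drop_rate = (a["dropped"] - b["dropped"]) / processed
        squeeze_rate = (a["time_squeeze"] - b["time_squeeze"]) / processed
        if drop_rate > DROP_RATE_LIMIT:
            alerts.append((cpu, "drops", drop_rate))
        if squeeze_rate > SQUEEZE_RATE_LIMIT:
            alerts.append((cpu, "squeezes", squeeze_rate))
    return alerts
```

Because the counters are cumulative since boot, rates must always be computed on deltas, never on raw values.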

The net.core.netdev_budget kernel parameter controls how many packets each CPU may process per softirq cycle (bounded in time by net.core.netdev_budget_usecs on kernels 4.12 and later). CNI plugins with heavy iptables rules may require raising it from the default of 300, but larger budgets let network processing monopolise a CPU for longer, delaying other work on that core.
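A monitoring agent can record the budget in force alongside its counters so alerts carry context. A hedged sketch; the fallback default matters because netdev_budget_usecs does not exist on pre-4.12 kernels:

```python
# Read a sysctl value from /proc/sys, falling back to a default when the
# kernel does not expose it (e.g. netdev_budget_usecs on older kernels).
# The `root` parameter exists only to make the function testable.

def read_sysctl(name, default, root="/proc/sys"):
    path = root + "/" + name.replace(".", "/")
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (FileNotFoundError, PermissionError, ValueError, IndexError):
        return default
```

For example, `read_sysctl("net.core.netdev_budget", 300)` reports 300 on a stock kernel unless an operator has tuned it.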

CNI Plugin Bottlenecks Standard Tools Never See

Different CNI plugins create distinct performance signatures in kernel network statistics. Understanding these patterns helps isolate whether network problems originate from CNI overhead or underlying infrastructure issues.

Flannel vs Calico Performance Signatures

Flannel's VXLAN overlay adds per-packet encapsulation work that shows up as elevated processing counts in softnet_stat and higher softirq CPU time. Each pod-to-pod communication requires VXLAN header addition and removal, plus UDP socket processing for tunnel traffic. This overhead scales with pod density and inter-pod communication patterns.

Calico's BGP-based approach creates different bottlenecks. Instead of encapsulation overhead, Calico generates extensive iptables rules that kernel netfilter must evaluate for each packet. As pod count increases, rule evaluation time grows linearly, eventually causing netdev_budget exhaustion during traffic spikes.

Both plugins can saturate kernel network processing while container monitoring shows normal application performance, creating a gap between perceived and actual network health.

Building Detection Before Degradation

Effective CNI monitoring requires tracking system-level network statistics alongside application metrics. Kernel counters provide earlier warning of approaching bottlenecks than application timeout rates or error logs.

Automated softnet_stat Monitoring

Parse /proc/net/softnet_stat every 30 seconds and calculate drop rates, squeeze percentages, and per-CPU processing imbalances. Alert when drop rates exceed baseline thresholds or when individual CPU cores show significantly higher processing loads than others - indicating poor interrupt distribution across the system.
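The per-CPU imbalance check can be as simple as comparing each core's processed-packet delta against the mean across cores. A sketch; the 2x deviation factor is an assumption to tune against your own cluster, not a kernel-derived threshold:

```python
# Flag CPU cores whose processed-packet count over an interval sits far
# above the per-core mean, suggesting network interrupts are concentrating
# on a few processors. IMBALANCE_FACTOR is an illustrative tuning knob.

IMBALANCE_FACTOR = 2.0

def imbalanced_cpus(per_cpu_processed):
    if not per_cpu_processed:
        return []
    mean = sum(per_cpu_processed) / len(per_cpu_processed)
    if mean == 0:
        return []  # no traffic this interval
    return [cpu for cpu, count in enumerate(per_cpu_processed)
            if count > IMBALANCE_FACTOR * mean]
```

A hotspot flagged here is usually the cue to inspect IRQ affinity or RPS/RSS settings rather than the CNI plugin itself.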

Correlate softnet_stat anomalies with pod deployment events, traffic pattern changes, and CNI configuration modifications. Network saturation often correlates with scaling events that increase iptables rule complexity or pod communication density beyond CNI plugin capacity.

The /proc/interrupts file shows network interrupt distribution across CPU cores, revealing whether CNI plugin overhead concentrates on specific processors. Unbalanced interrupt handling creates hotspots that cause packet drops while other cores remain underutilised.
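A small parser over /proc/interrupts makes that distribution visible per NIC queue. A sketch, assuming the standard layout (a CPU header row, then "IRQ: count-per-CPU ... controller device" lines); the device pattern, "eth0" here, is whatever your NIC's interrupt lines are named:

```python
# Sum per-CPU interrupt counts for /proc/interrupts lines whose device
# name matches a substring (e.g. "eth0" for that NIC's rx/tx queues).

def nic_interrupts(text, pattern):
    lines = text.splitlines()
    ncpus = len(lines[0].split())       # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        if pattern not in line:
            continue
        fields = line.split()
        for i, count in enumerate(fields[1:1 + ncpus]):  # skip "NN:" label
            if count.isdigit():
                totals[i] += int(count)
    return totals

# Example against a captured (simplified, hypothetical) snippet.
sample = ("           CPU0       CPU1\n"
          " 24:     120000        300   PCI-MSI  eth0-rx-0\n"
          " 25:        200     110000   PCI-MSI  eth0-rx-1\n"
          " 26:       5000       5000   PCI-MSI  nvme0q1\n")
totals = nic_interrupts(sample, "eth0")
```

A heavily skewed `totals` list across cores is the interrupt-distribution hotspot the text describes.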

System-level monitoring reveals the network bottlenecks that application-focused tools miss. By tracking kernel network statistics alongside pod metrics, administrators can detect CNI plugin saturation before it impacts application performance. The network layer between containers and infrastructure deserves the same monitoring attention as the applications it supports; understanding these patterns prevents the cascade failures that occur when network bottlenecks compound across container infrastructure.

For teams managing container infrastructure at scale, Server Scout's network monitoring features provide the system-level visibility needed to detect CNI bottlenecks before they impact pod communication, with the lightweight agent overhead that production Kubernetes environments require.

FAQ

Why don't standard Kubernetes monitoring tools show CNI plugin bottlenecks?

Most monitoring focuses on application-level metrics within containers, missing the kernel network stack where CNI plugins operate. Packet processing overhead, iptables rule evaluation, and network interrupt handling occur below the container abstraction layer where standard tools collect data.

How do I determine which CNI plugin is causing network saturation?

Compare softnet_stat drop rates before and after pod deployments, correlate processing spikes with CNI configuration changes, and analyse /proc/interrupts to identify network interrupt patterns. Different CNI plugins create distinct signatures in kernel network statistics.

What's the performance impact of monitoring softnet_stat frequently?

Reading /proc/net/softnet_stat is a simple file system operation with minimal overhead - typically under 0.1ms per read. The file contains pre-calculated kernel counters, so monitoring every 30 seconds adds negligible system load compared to the network processing it helps optimise.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial