CPU Metrics Explained

Understanding CPU Metrics in Linux Server Monitoring

CPU metrics are fundamental to understanding server performance, yet they're often misinterpreted. Server Scout collects 11 CPU metrics: the utilisation breakdown and temperature every 5 seconds, and static details such as core count and processor model daily. This article explains each metric and how to use it for effective performance analysis.

How Server Scout Collects CPU Data

Server Scout's agent reads CPU statistics from /proc/stat, which exposes cumulative counters for each CPU time category, measured in clock ticks (commonly called "jiffies"). Every 5 seconds, the agent calculates the delta between consecutive readings and converts each category's share of the total into a percentage, giving you near real-time insight into CPU utilisation patterns.

This fast-tier collection (every 5 seconds) ensures you capture short-lived CPU spikes that slower monitoring intervals might miss entirely.
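The delta calculation described above can be sketched as follows. This is a minimal illustration, not Server Scout's agent code: the function names `parse_cpu_line`, `cpu_percentages` and `sample` are hypothetical, and the field order is that of the aggregate `cpu` line in /proc/stat as documented in proc(5).

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# The trailing guest fields are ignored because guest time is already
# included in the user and nice counters.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the raw counters from the aggregate 'cpu' line as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Very old kernels expose fewer fields; pad the rest with zero.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(prev, curr):
    """Convert the delta between two readings into percentages."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return the breakdown."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return cpu_percentages(first, second)
```

On a Linux host, calling `sample()` blocks for the interval and returns a dict such as `{"user": ..., "idle": ...}` whose values sum to roughly 100.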

The CPU Time Breakdown

Linux divides CPU time into several categories. Understanding these categories is crucial for diagnosing performance issues effectively.

| Metric | Description | Typical Range | What High Values Indicate |
| --- | --- | --- | --- |
| cpu_percent | Overall CPU utilisation (100% minus idle time) | Varies by workload | Sustained >85% warrants investigation |
| cpu_user | Time spent executing user-space application code | 20-60% under load | Applications are compute-bound |
| cpu_system | Time spent in kernel space (system calls, drivers) | 5-20% typically | I/O bottlenecks or excessive system calls |
| cpu_iowait | Time CPU was idle but waiting for I/O operations | <5% ideally | Disk or network I/O bottlenecks |
| cpu_steal | Time stolen by hypervisor for other VMs | <2% on good hosts | Overcommitted virtualisation host |
| cpu_nice | Time running low-priority (niced) processes | Usually 0% | Background jobs or batch processing |
| cpu_irq | Time handling hardware interrupts | <2% normally | Busy network cards or storage controllers |
| cpu_softirq | Time handling software interrupts | <5% normally | High network throughput or timer activity |

Core CPU Metrics Explained

cpu_percent - Overall Utilisation

This is your primary CPU health indicator, calculated as 100% minus idle time. It represents the total percentage of time your CPU cores are actively working on something.

Normal behaviour: Highly variable depending on workload. A web server might average 20-40% with spikes to 80%, while a batch processing server might sustain 90%+ during jobs.

Investigation threshold: Sustained utilisation above 85% typically indicates either insufficient CPU capacity or inefficient application behaviour.
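One way to tell "sustained" utilisation apart from a momentary spike is to require that every sample in a rolling window exceeds the limit. The sketch below is an illustration only: the class name `SustainedThreshold` and the default window of 60 samples (5 minutes at one sample every 5 seconds) are assumptions, not Server Scout settings.

```python
from collections import deque


class SustainedThreshold:
    """Flag utilisation only when it stays above a limit for a full window."""

    def __init__(self, limit=85.0, window=60):
        # window=60 samples at 5-second intervals covers 5 minutes
        # (an assumed choice, tune it to your workload).
        self.limit = limit
        self.samples = deque(maxlen=window)

    def add(self, cpu_percent):
        """Record a sample; return True once a full window exceeds the limit."""
        self.samples.append(cpu_percent)
        full = len(self.samples) == self.samples.maxlen
        return full and all(s > self.limit for s in self.samples)
```

Feeding each new cpu_percent reading to `add()` returns True only after the whole window has stayed above the limit, so a single dip resets the condition.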

cpu_user - Application Processing Time

User CPU time represents the percentage spent executing your applications' code - web servers, databases, calculation tasks, and user programs.

High user CPU typically indicates:

  • Applications performing intensive calculations
  • Efficient, compute-bound workloads
  • Well-optimised applications that don't require excessive system calls

This is generally "good" CPU usage - your applications are doing productive work.

cpu_system - Kernel Processing Time

System CPU time represents the percentage spent in kernel space, handling system calls, device drivers, and core operating system functions.

Sustained system CPU above 20% may indicate:

  • Inefficient I/O patterns (many small reads/writes instead of larger operations)
  • Applications making excessive system calls
  • Driver issues or hardware problems
  • Memory pressure causing frequent page management

cpu_iowait - The Most Misunderstood Metric

Critical misconception: Many administrators read iowait as time the CPU spent busy waiting for I/O. This is incorrect.

Reality: iowait is CPU time that was idle (not busy) but during which the system had outstanding I/O operations. It's a subset of idle time, not busy time.

Think of it this way: if your CPU has nothing to do but your application is waiting for disk reads, that idle time gets classified as iowait instead of regular idle.

High iowait (>10% sustained) indicates:

  • Disk I/O bottlenecks (slow storage, insufficient IOPS)
  • Network I/O bottlenecks
  • Applications blocked on file operations
  • Storage subsystem cannot keep up with demand

Key insight: A server showing 50% iowait is I/O bound, not CPU bound, because the CPU had spare capacity during that time. The solution is faster storage, not more CPU cores.
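A small worked example makes the point concrete. The counter deltas below are invented for illustration, and the helper name `summarise` is hypothetical; the sketch simply keeps iowait separate from both busy time and plain idle time.

```python
def summarise(deltas):
    """Split one interval's counter deltas into busy, iowait and idle percentages."""
    total = sum(deltas.values())
    iowait = 100.0 * deltas["iowait"] / total
    idle = 100.0 * deltas["idle"] / total
    busy = 100.0 - idle - iowait  # everything that was neither idle nor iowait
    return {"busy": busy, "iowait": iowait, "idle": idle}


# Hypothetical interval of 1000 ticks: the CPU executed code for only 200 of them.
example = {"user": 150, "system": 50, "iowait": 500, "idle": 300}
print(summarise(example))
# → {'busy': 20.0, 'iowait': 50.0, 'idle': 30.0}
```

Here the processor was actually executing instructions for only 20% of the interval, so adding cores would not help; the 50% iowait shows where the time went.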

cpu_steal - Virtualisation Overhead

CPU steal only applies to virtualised environments (cloud VMs, VPS hosting). It represents CPU time that your VM should have received but was "stolen" by the hypervisor to service other VMs on the same physical host.

Sustained steal >5% indicates:

  • The physical host is overcommitted
  • "Noisy neighbour" VMs consuming excessive resources
  • Your VM is being starved of CPU cycles

Important: This isn't something you can fix from within your VM. It requires action from your hosting provider (migrating to a less contended host) or upgrading to a dedicated/less shared hosting tier.

cpu_nice - Low Priority Processing

Nice CPU represents time spent running user-space processes whose priority has been lowered with a positive nice value (for example, via the nice command). The scheduler favours higher-priority tasks over these processes whenever CPU time is contested.

Common sources:

  • Backup compression jobs
  • Batch processing tasks
  • Background maintenance scripts
  • Processes started with nice -n 10 command

High nice CPU is generally benign - these are background tasks designed not to interfere with interactive performance.

cpu_irq - Hardware Interrupt Handling

Hardware interrupts occur when devices need immediate CPU attention - network cards receiving packets, storage controllers completing operations, timers firing.

Typically <2% on most systems. Higher values may indicate:

  • Very busy network interfaces
  • High-throughput storage operations
  • Hardware issues causing excessive interrupts
  • Poorly configured interrupt balancing

cpu_softirq - Software Interrupt Processing

Software interrupts handle deferred work from hardware interrupts, particularly network packet processing and timer management.

Typically <5% on most systems. Spikes often correlate with:

  • High network throughput (packet processing)
  • Many concurrent connections
  • Timer-heavy applications
  • Network-intensive workloads

Additional CPU Context Metrics

| Metric | Collection Frequency | Purpose |
| --- | --- | --- |
| cpu_temp | Every 5 seconds | Processor temperature in Celsius |
| cpu_cores | Daily | Number of logical CPU cores |
| cpu_model | Daily | Processor model identification |

cpu_temp - Temperature Monitoring

Processor temperature monitoring helps identify thermal throttling issues. Sustained values above 80°C may indicate:

  • Inadequate cooling
  • Dust accumulation in heat sinks
  • Failing fans
  • Thermal throttling reducing performance

Note: Many cloud VMs don't expose temperature sensors, so this value may be null in virtualised environments.
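As a rough sketch of how a temperature reading can come back null, the function below scans the Linux thermal sysfs directory and returns None when no sensor is readable. The function name `read_cpu_temp` is hypothetical, and taking the hottest zone is a simplifying assumption: which thermal zone corresponds to the CPU varies by hardware, and this is not necessarily how Server Scout selects its value.

```python
from pathlib import Path


def read_cpu_temp(base="/sys/class/thermal"):
    """Return the hottest thermal zone reading in Celsius, or None if unavailable."""
    root = Path(base)
    if not root.is_dir():
        return None  # typical on VMs without thermal sensors
    readings = []
    for temp_file in root.glob("thermal_zone*/temp"):
        try:
            # The kernel reports these values in millidegrees Celsius.
            readings.append(int(temp_file.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # sensor present but unreadable; skip it
    return max(readings) if readings else None
```

On a bare-metal host this typically returns a float such as a value in the 30-90 range; on most cloud VMs it returns None.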

Reading CPU Charts and Diagnostic Patterns

Server Scout's dashboard displays CPU metrics as stacked area charts, where all the components sum to approximately 100% of total CPU time. Here are common patterns and their interpretations:

High User + Low System = Compute-Bound Applications

  • Pattern: 60-80% user, 5-15% system, low iowait
  • Interpretation: Applications efficiently using CPU for calculations
  • Action: Scale horizontally or upgrade to faster processors

High System + High iowait = I/O-Bound Workload

  • Pattern: 20-40% system, 10-30% iowait, moderate user
  • Interpretation: Applications bottlenecked by storage performance
  • Action: Optimise I/O patterns, upgrade storage, increase buffer sizes

High Steal = Overcommitted Host

  • Pattern: Variable steal >5%, inconsistent performance
  • Interpretation: Virtualisation host cannot provide consistent resources
  • Action: Contact hosting provider or migrate to dedicated resources

High Softirq = Network-Heavy Load

  • Pattern: 5-15% softirq, correlates with network throughput spikes
  • Interpretation: High network packet processing overhead
  • Action: Optimise network configuration, consider interrupt balancing
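The four patterns above can be expressed as simple rules over one CPU breakdown. This is only a sketch: the function name `diagnose` is hypothetical, the cut-offs are taken from the pattern ranges listed above, and "low iowait" is assumed to mean below 5%.

```python
def diagnose(m):
    """Return labels for the diagnostic patterns matched by one CPU breakdown.

    `m` maps metric names (user, system, iowait, steal, softirq) to percentages.
    """
    findings = []
    if m.get("steal", 0) > 5:
        findings.append("overcommitted host")
    if m.get("system", 0) >= 20 and m.get("iowait", 0) >= 10:
        findings.append("I/O-bound workload")
    # "Low iowait" is assumed to mean under 5% here.
    if m.get("user", 0) >= 60 and m.get("system", 0) <= 15 and m.get("iowait", 0) < 5:
        findings.append("compute-bound applications")
    if m.get("softirq", 0) >= 5:
        findings.append("network-heavy load")
    return findings
```

A breakdown can match more than one pattern, which is why the function returns a list rather than a single label. Treat the result as a starting point for investigation rather than a verdict.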

Monitoring Best Practices

  1. Monitor trends, not snapshots: CPU metrics naturally fluctuate. Focus on sustained patterns rather than momentary spikes.
  2. Context matters: A 90% CPU spike during a scheduled backup is normal; the same spike during low traffic hours warrants investigation.
  3. Correlate with other metrics: High CPU often correlates with increased memory usage, network activity, or disk I/O. Server Scout's dashboard helps identify these relationships.
  4. Set appropriate thresholds: Generic alerts like "CPU >80%" often generate false positives. Establish baselines specific to your workload patterns.
  5. Consider load averages: CPU percentage shows current utilisation; load averages (collected every 5 minutes by Server Scout) show sustained demand over time.
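As one possible way to derive a workload-specific threshold from a baseline, the sketch below uses the mean of past cpu_percent samples plus a multiple of their standard deviation. The function name `baseline_threshold`, the multiplier `k` and the `floor` value are all assumptions chosen for illustration; other approaches, such as percentiles, are equally valid.

```python
import statistics


def baseline_threshold(history, k=3.0, floor=50.0):
    """Suggest an alert threshold (percent) from past cpu_percent samples.

    Uses mean + k standard deviations, clamped between `floor` and 100.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    mean = statistics.mean(history)
    spread = statistics.pstdev(history)
    return min(100.0, max(floor, mean + k * spread))
```

A quiet server that idles around 20% ends up with the floor value, while a batch server that routinely runs at 80-90% is clamped to 100, meaning a plain utilisation alert is not useful for it and other signals should be used instead.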

Common Troubleshooting Scenarios

Scenario 1: High overall CPU but low user time

  • Likely cause: I/O bottleneck causing high system and iowait
  • Investigation: Check disk I/O metrics and storage performance

Scenario 2: Intermittent performance issues with CPU steal

  • Likely cause: Noisy neighbour or overcommitted virtualisation
  • Investigation: Monitor steal patterns; contact hosting provider if sustained

Scenario 3: High CPU with normal application load

  • Likely cause: Inefficient code, memory pressure, or resource contention
  • Investigation: Profile applications, check memory metrics, review recent changes

Understanding these CPU metrics enables proactive performance management and faster problem resolution. Server Scout's 5-second collection interval ensures you capture the full picture of your server's CPU behaviour, from brief spikes to sustained utilisation patterns.


Frequently Asked Questions

What is a healthy CPU usage percentage for a Linux server?

Healthy CPU usage depends on the server's role. Sustained cpu_percent above 85% warrants investigation for most workloads, while brief spikes to 100% during peak processing are normal. The key is whether the server can still respond to requests within acceptable timeframes. Monitor trends rather than reacting to momentary spikes.

What does high CPU iowait mean?

High cpu_iowait means the CPU was idle while I/O operations were outstanding. Contrary to a common misconception, iowait is a subset of idle time, not busy time. Sustained iowait above 10% typically points to disk or network I/O bottlenecks rather than a CPU problem. Investigate storage performance or consider faster disks.

What is CPU steal time and why does it matter?

CPU steal (cpu_steal) measures time the hypervisor took from your virtual machine to serve other VMs on the same physical host. Sustained steal above 5% means your host is overcommitted and your VM is being resource-starved. This is not fixable from inside the VM. Contact your hosting provider or migrate to a less contended host.

How do CPU breakdown percentages relate to each other?

The CPU breakdown percentages (user, system, iowait, steal, nice, irq, softirq, and idle) sum to approximately 100%. A stacked CPU chart shows how total CPU time is divided. High user + low system means applications are compute-bound. High system + high iowait indicates I/O-bound workloads. High softirq points to network-heavy processing.

Why is cpu_temp showing as null on my server?

CPU temperature (cpu_temp) is read from /sys/class/thermal or /sys/class/hwmon. Virtual machines typically lack virtual thermal sensors, so cpu_temp will be null on most cloud VMs and VPS instances. This is normal and expected. Temperature monitoring is most relevant on physical (bare-metal) servers where cooling issues can cause throttling.
