Process and System Metrics Explained

Understanding your server's process and system metrics is crucial for maintaining optimal performance and diagnosing issues before they become critical. Server Scout's agent collects several key indicators that reveal how your system is managing processes, context switching, and file descriptors. These metrics work together to paint a comprehensive picture of your server's operational health.

Core Process State Metrics

Linux processes exist in various states at any given moment, and tracking these states helps identify performance bottlenecks and resource contention issues.

Running Processes

The processes_running metric shows the number of processes currently executing on a CPU core. This value comes from the procs_running field in /proc/stat and represents processes in the "R" state that are either actively using CPU time or waiting in the run queue for their turn.

On a lightly loaded server, you'll typically see 1-4 running processes. This baseline includes essential system processes and any active workloads. However, when processes_running consistently exceeds your CPU core count by a factor of two or more, it indicates CPU contention. Your processes are competing for processing time, and some are waiting longer than optimal in the run queue.

For example, on a 4-core server, sustained values above 8 running processes suggest your CPU is becoming a bottleneck. Users may experience slower response times, and batch jobs will take longer to complete.
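As a quick sanity check outside the dashboard, the same counter can be read straight from /proc/stat. A minimal shell sketch comparing it to the core count (nproc is assumed to be available, as it is on most Linux systems):

```shell
#!/bin/sh
# Read procs_running from /proc/stat; this is the value behind processes_running.
running=$(awk '/^procs_running/ {print $2}' /proc/stat)
cores=$(nproc)
echo "running=$running cores=$cores"
# Sustained values above twice the core count suggest CPU contention
if [ "$running" -gt $((cores * 2)) ]; then
    echo "possible CPU contention"
fi
```

Note that the reading script itself counts as a running process, so a one-off sample on an idle machine will rarely show 0.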

Blocked Processes

The processes_blocked metric tracks processes waiting for I/O operations to complete. These processes are in an uninterruptible sleep state (the "D" state), typically waiting for disk reads, network file system responses, or other storage I/O. The metric comes from procs_blocked in /proc/stat.

Under normal circumstances, you should see 0-2 blocked processes. Brief spikes are entirely normal as processes perform routine I/O operations. However, sustained high values indicate I/O bottlenecks somewhere in your system.

Common causes of elevated blocked processes include:

  • Slow disk subsystems struggling with heavy read/write operations
  • Network file systems (NFS, CIFS) experiencing latency or connectivity issues
  • Database servers waiting for disk-bound queries
  • Backup operations saturating storage bandwidth

When investigating high blocked process counts, examine your disk I/O metrics (disk_io_read_bytes, disk_io_write_bytes) and network activity to identify the bottleneck source.
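To confirm the raw value on the host itself, the counter can be read from the same /proc/stat file; the optional iostat cross-check below assumes the sysstat package is installed:

```shell
#!/bin/sh
# Read procs_blocked from /proc/stat; this is the value behind processes_blocked.
blocked=$(awk '/^procs_blocked/ {print $2}' /proc/stat)
echo "blocked=$blocked"
# Optional cross-check with extended per-device I/O stats (requires sysstat)
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 3
fi
```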

Zombie Processes

Zombies, tracked by processes_zombie, are processes that have completed execution but remain in the process table because their parent process hasn't collected their exit status. While zombies don't consume CPU time or memory, they do occupy process ID slots and can indicate application bugs.

A healthy system should show 0 zombie processes most of the time. During normal operation, you might see transient zombies that appear and disappear quickly as processes are created and destroyed. This is perfectly normal.

However, persistent or growing zombie counts indicate a problem with parent processes that aren't properly calling wait() or similar system calls to clean up their children. This typically points to:

  • Poorly written applications with improper child process handling
  • Parent processes that have crashed or become unresponsive
  • Signal handling issues in daemon processes

To investigate zombie accumulation, use ps aux | grep Z to identify zombie processes and their parent PIDs, then examine why the parent isn't performing proper cleanup.
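Be aware that ps aux | grep Z can over-match, since it catches any line containing a capital Z (usernames, command names). A slightly stricter sketch filters on the STAT column instead:

```shell
#!/bin/sh
# List zombie ("Z" state) processes with their parent PIDs.
# Keeps the header row, then only rows whose STAT column starts with Z.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```

The PPID column tells you which parent process is failing to reap its children.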

Total Process Count

The processes_total metric provides important context by showing the overall number of processes on your system. This value comes from /proc/loadavg and represents all processes regardless of their current state.
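The fourth field of /proc/loadavg is a running/total pair (for example 2/1043); a minimal sketch to extract the total:

```shell
#!/bin/sh
# The fourth field of /proc/loadavg looks like "2/1043": running/total.
# Strictly, the kernel counts schedulable entities here, so threads are included.
total=$(awk '{split($4, a, "/"); print a[2]}' /proc/loadavg)
echo "total=$total"
```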

Process counts vary dramatically based on your server's role and applications. A typical baseline might be:

  • Minimal Linux server: 50-100 processes
  • Web server with PHP-FPM: 200-500 processes
  • Database server: 100-200 processes
  • Containerised environment: highly variable

Understanding your baseline total is crucial for interpreting other process metrics. A server normally running 500 processes will have different expectations for running and blocked processes compared to one running 50.

System Activity Indicators

Beyond process states, Server Scout monitors system-level activity that reflects how efficiently your kernel is managing resources.

Context Switching

The context_switches metric measures how frequently your CPU switches between processes or threads. This cumulative counter from /proc/stat (the ctxt field) appears as a rate in Server Scout's dashboard, showing switches per second.
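Because ctxt is cumulative since boot, a per-second rate is obtained by sampling it twice and differencing; a minimal sketch:

```shell
#!/bin/sh
# Sample the cumulative ctxt counter twice, one second apart,
# to estimate context switches per second.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```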

Context switching is a fundamental part of multitasking systems, but excessive switching can impact performance. Normal rates vary enormously based on workload characteristics:

Workload Type                    Typical Context Switch Rate
Low-activity server              1,000-5,000/sec
Web server                       10,000-50,000/sec
Database server                  5,000-25,000/sec
High-concurrency applications    50,000-100,000+/sec

More important than absolute values are sudden changes in context switch rates. A dramatic increase often indicates:

  • Runaway process creation (fork bombs or application bugs)
  • Increased application concurrency or user load
  • System resource contention forcing more frequent scheduling decisions
  • Changes in workload patterns or deployed applications

File Descriptor Usage

The open_fds metric tracks system-wide open file descriptors from /proc/sys/fs/file-nr. In Linux, file descriptors represent not just open files, but also network sockets, pipes, and other I/O resources.

Understanding your file descriptor usage helps prevent resource exhaustion. Most Linux systems have default limits between 65,536 and 1,048,576 file descriptors. You should investigate when usage exceeds 80% of your system's limit.
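The three fields of /proc/sys/fs/file-nr are allocated descriptors, allocated-but-unused descriptors, and the system-wide maximum; a sketch of the 80% check:

```shell
#!/bin/sh
# /proc/sys/fs/file-nr: allocated, allocated-but-unused, system-wide maximum.
read -r allocated unused max < /proc/sys/fs/file-nr
pct=$((allocated * 100 / max))
echo "open file descriptors: $allocated of $max (${pct}%)"
if [ "$pct" -ge 80 ]; then
    echo "warning: above the 80% investigation threshold"
fi
```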

File descriptor leaks are a common application issue. Symptoms include:

  • Steadily growing open_fds count without corresponding workload increases
  • Applications eventually failing with "too many open files" errors
  • Network connection failures as socket creation fails

Interpreting Metrics in Context

These process and system metrics work together to provide insights into your server's behaviour. Understanding their relationships helps you diagnose issues more effectively.

Process State Relationships

The relationship between running and blocked processes reveals your system's current bottleneck:

Active processes ≈ processes_running + processes_blocked

When running processes are high but blocked processes remain low, you're experiencing CPU pressure. Conversely, high blocked processes with moderate running processes suggest an I/O bottleneck.

The total process count provides the denominator for these calculations. On a system with 500 total processes, 20 running processes means only 4% of processes are active at once. On a system with 50 total processes, those same 20 running processes represent 40% of the total, a much higher relative load.

Workload-Specific Baselines

Different server roles have characteristic process patterns:

Web Servers typically show high process counts due to worker processes (Apache prefork, PHP-FPM pools, Nginx worker processes). Context switching rates are often elevated due to request handling concurrency.

Database Servers usually have fewer total processes but may show higher blocked process counts during heavy query loads. Context switching patterns often correlate with transaction rates.

Application Servers running Java or .NET applications might show fewer processes but higher thread activity, reflected in context switching rates.

Troubleshooting Common Issues

Persistent High Running Processes

When processes_running remains consistently high:

  1. Check CPU utilisation metrics to confirm CPU pressure
  2. Identify CPU-intensive processes using system tools
  3. Consider whether increased capacity or workload optimisation is needed
  4. Look for runaway processes or infinite loops
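Step 2 can be done with ps; the --sort option used below comes from procps-ng, which ships with most distributions:

```shell
#!/bin/sh
# List the top CPU consumers: header row plus the ten busiest processes.
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 11
```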

Growing Zombie Count

For accumulating zombie processes:

  1. Use ps aux | grep Z to identify zombie processes
  2. Note the parent process IDs (PPID column)
  3. Investigate why parent processes aren't cleaning up children
  4. Consider restarting problematic parent processes after identifying the root cause

File Descriptor Leaks

When open_fds grows steadily:

  1. Identify processes with high file descriptor usage: lsof | awk '{print $2}' | sort | uniq -c | sort -nr
  2. Check application logs for file handling errors
  3. Review application code for proper file/socket cleanup
  4. Monitor the trend to determine leak severity and timeline
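On a busy server the lsof pipeline in step 1 can be slow; counting entries under /proc/<pid>/fd is a cheaper alternative sketch (descriptors of other users' processes are only visible as root):

```shell
#!/bin/sh
# Count open descriptors per process by listing /proc/<pid>/fd.
# Unreadable fd directories (other users, without root) are silently skipped.
for dir in /proc/[0-9]*; do
    pid=${dir#/proc/}
    count=$(ls "$dir/fd" 2>/dev/null | wc -l)
    [ "$count" -gt 0 ] && echo "$count $pid $(cat "$dir/comm" 2>/dev/null)"
done | sort -rn | head -n 10
```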

Server Scout's 5-minute collection interval for process metrics provides the right balance between granularity and system impact. The 30-second collection for context switches and file descriptors offers more responsive monitoring for these faster-changing indicators.

Understanding these metrics helps you maintain optimal server performance and quickly identify issues before they impact users. Combined with Server Scout's other monitoring data, these process and system metrics form a comprehensive view of your server's operational health.


Frequently Asked Questions

What are zombie processes and are they dangerous?

Zombie processes (processes_zombie) are child processes that have finished execution but whose parent has not yet read their exit status. They consume no CPU or memory but occupy a process table entry. A few zombies are harmless and transient. Persistent or growing zombie counts indicate a bug in the parent application that is not properly waiting for child processes. Large numbers can exhaust the process table.

What does a high open file descriptor count indicate?

High open_fds relative to the system limit (ulimit) indicates the system is approaching its file descriptor cap. File descriptors are used for open files, network sockets, pipes, and other I/O resources. When the limit is reached, processes cannot open new files or connections, causing errors. Keep open_fds below 80% of the ulimit and investigate if it grows steadily over time.

What are context switches and why do they matter?

Context switches (context_switches) occur when the kernel saves one process's state and loads another's. They are a normal part of multitasking. Excessive context switching wastes CPU cycles on overhead rather than useful work. High rates may indicate too many competing processes, inefficient thread usage, or lock contention. The dashboard shows per-second rates from the cumulative counter.

What does processes_blocked mean?

Blocked processes (processes_blocked) are in uninterruptible sleep, typically waiting for I/O operations to complete. One or two blocked processes is normal during disk operations. Consistently high blocked process counts indicate I/O bottlenecks. These processes contribute to load average without consuming CPU, which is why you may see high load with low CPU usage on I/O-bound systems.

How many total processes is normal for a Linux server?

Normal process counts (processes_total) vary widely by server role. A minimal server might run 50-100 processes, while a busy web or application server could have 500 or more. The absolute number matters less than sudden changes. A sharp increase might indicate a fork bomb or runaway process spawning. Monitor the trend and set alerts based on your server's established baseline.
