Diagnosing Unexpected Metric Values

When Server Scout displays metric values that seem unusual or incorrect, the cause is often the way Linux reports data, or normal system behaviour that looks concerning at first glance. This guide helps you diagnose and understand these common scenarios.

Understanding Memory Usage

One of the most frequent sources of confusion is memory reporting. If Server Scout shows high memory usage but your applications seem to have plenty of RAM available, this is likely normal Linux behaviour.

Linux uses available RAM to cache disk data, which dramatically improves system performance. The "used" memory metric includes this cache, making it appear as though your system is consuming far more memory than it actually needs for running processes.

What to check:

  • Focus on the "available" memory metric rather than "used" memory
  • Available memory accounts for cache that can be immediately reclaimed
  • Memory pressure only becomes a concern when available memory drops significantly
  • If available memory is above 10-15% of total RAM, your system likely has adequate memory
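
To cross-check the available-memory figure outside Server Scout, the sketch below reads the kernel's own MemAvailable estimate from /proc/meminfo (present on Linux 3.14 and later). The function name mem_available_pct is hypothetical, and Server Scout may calculate its "available" metric differently, so treat the result as a sanity check only.

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo by default; pass another path to try it on sample data.
# (mem_available_pct is an illustrative name, not part of Server Scout.)
mem_available_pct() {
    LC_ALL=C awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%.1f\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}

# Example: mem_available_pct
```

A result comfortably above the 10-15% range mentioned above suggests the high "used" figure is mostly reclaimable cache.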

High CPU Usage on Responsive Servers

If Server Scout reports 100% CPU usage but your server feels responsive, don't panic immediately. This often indicates brief, intensive tasks rather than system overload.

Common causes:

  • Batch jobs or cron tasks using multiple cores efficiently
  • Background maintenance tasks (database optimisation, log rotation)
  • Short-lived compilation or backup processes

Diagnostic steps:

  1. Check the top processes in Server Scout's process monitoring
  2. Look for patterns - does this occur at specific times?
  3. Use htop or top to identify which processes are consuming CPU
  4. Consider if this aligns with scheduled maintenance tasks
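
For step 3, a non-interactive alternative to htop or top is sketched below. It assumes the procps-ng version of ps found on most Linux distributions (the --sort option is not available in every ps implementation), and top_cpu_processes is a hypothetical helper name. Note that ps reports %CPU averaged over each process's lifetime, so a long-running process with a recent spike may rank lower than it does in top.

```shell
# Show the processes with the highest lifetime-average CPU usage.
# $1: number of processes to list (default 5); one extra line is the header.
# Assumes procps-ng ps; top_cpu_processes is an illustrative name.
top_cpu_processes() {
    ps -eo pid,user,pcpu,etime,comm --sort=-pcpu | head -n "$(( ${1:-5} + 1 ))"
}

# Example: top_cpu_processes 10
```

The ELAPSED column helps with step 2: a process that started at the same moment as the spike often points to a cron job or scheduled task.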

CPU Steal Time on Virtual Machines

Some CPU steal time on cloud instances is normal. It measures the time your virtual CPU was ready to run but had to wait because the hypervisor was running other virtual machines on the same physical hardware.

When to be concerned:

  • Steal time consistently exceeds 10%
  • Applications experience noticeable performance degradation
  • Response times increase during peak steal periods
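
To compare against the 10% figure above without relying on a dashboard, steal time can be derived from two snapshots of /proc/stat. The sketch below is one way to do that; cpu_steal_pct is a hypothetical helper name, and it assumes the standard field order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Print steal time as a percentage of all CPU time elapsed between two
# snapshots of /proc/stat.
# $1: file holding the earlier snapshot, $2: file holding the later one.
# Fields 2-9 of the "cpu" line are user..steal; field 9 is steal. The guest
# fields are left out because they are already counted in user and nice.
cpu_steal_pct() {
    LC_ALL=C awk '
        /^cpu / {
            t = 0
            for (i = 2; i <= 9; i++) t += $i
            if (NR == FNR) { t1 = t; s1 = $9 } else { t2 = t; s2 = $9 }
        }
        END { if (t2 > t1) printf "%.1f\n", (s2 - s1) * 100 / (t2 - t1) }
    ' "$1" "$2"
}

# Example (two samples taken 10 seconds apart):
#   cat /proc/stat > /tmp/stat.a; sleep 10; cat /proc/stat > /tmp/stat.b
#   cpu_steal_pct /tmp/stat.a /tmp/stat.b
```

A single short sample can be misleading, so repeat the measurement at different times of day before drawing conclusions.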

Solutions:

  • Consider upgrading to a larger instance type
  • Move to dedicated instances if steal time regularly impacts performance
  • Monitor patterns - steal time often correlates with peak usage hours

Disk Usage Discrepancies

When disk usage doesn't match your expectations, several factors could be at play:

Check mount point breakdowns: Server Scout provides per-mount statistics. A filesystem might be fuller than expected due to:

  • Log files growing unexpectedly
  • Temporary files not being cleaned up
  • Database files expanding

Hidden space consumption: Files deleted whilst still open by processes continue consuming disk space until the process releases the file handle or restarts.

Diagnostic commands:

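# Find processes holding deleted files open (run as root to see every process)
lsof +L1

# Identify the largest top-level directories (errors from unreadable paths are hidden)
du -sh /* 2>/dev/null | sort -rh | head -n 10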

Load Average vs CPU Percentage

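High load averages with moderate CPU usage often confuse administrators. On Linux, load average counts processes in uninterruptible sleep (typically waiting for disk I/O) as well as processes running or waiting for a CPU, so it is not a measure of CPU usage alone.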

Common scenarios:

  • Servers with slow or heavily utilised storage
  • Applications blocked on network storage, such as slow or unresponsive NFS mounts
  • Database servers performing disk-heavy operations

A system with fast CPUs but slow storage can easily show load averages of 5-10 whilst CPU usage remains at 30-40%.

Investigation steps:

  1. Monitor disk I/O metrics in Server Scout
  2. Check network utilisation if applicable
  3. Identify processes in 'D' state (uninterruptible sleep) using ps -eo pid,stat,comm | awk '$2 ~ /^D/'
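
The ps STAT column often carries suffixes (for example D+, Ds or D<), so matching on its first character is more reliable than matching an exact string. The filter below is a minimal sketch of that idea; d_state_only is a hypothetical helper name, and it expects input in the column order produced by ps -eo pid,stat,comm.

```shell
# Keep the header row plus any process whose state begins with D
# (uninterruptible sleep). Expects "PID STAT COMMAND" columns on stdin.
# d_state_only is an illustrative name.
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example: ps -eo pid,stat,comm | d_state_only
```

Processes enter and leave D state quickly, so run this a few times during a high-load period; the same command names appearing repeatedly are the likely I/O-bound culprits.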

Null or Missing Metrics

When Server Scout shows null values for certain metrics, several factors might be responsible:

Hardware support:

  • Temperature sensors may not be available on virtual machines
  • Network statistics might be unavailable for certain interface types

Software requirements:

  • Some metrics require specific commands to be installed
  • Permissions issues might prevent data collection

Configuration:

  • Optional metrics may be disabled in your Server Scout configuration
  • Certain cloud providers restrict access to hardware-level information

Troubleshooting steps:

  1. Check if the metric is supported on your platform
  2. Verify required packages are installed (sensors, smartctl, etc.)
  3. Review Server Scout logs for permission errors
  4. Consider whether the metric applies to your environment (e.g., hardware sensors on VMs)
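
Step 2 can be checked quickly from a shell. The sketch below reports which of the named commands cannot be found on the current PATH; missing_commands is a hypothetical helper name, and the package that provides each command varies by distribution. Tools such as smartctl are often installed under /usr/sbin, so run the check as the same user the monitoring agent runs as.

```shell
# Print each named command that is NOT found on PATH (one per line).
# Prints nothing when every command is available.
# missing_commands is an illustrative name.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example: missing_commands sensors smartctl
```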

Best Practices

When investigating unusual metrics:

  • Compare current values with historical trends
  • Consider the context of your server's workload
  • Cross-reference multiple related metrics
  • Don't rely on single data points - look for patterns

Understanding these common scenarios will help you make informed decisions about your infrastructure rather than reacting to misleading metric values.

Frequently Asked Questions

How do I set up Server Scout to monitor my server metrics properly?

Server Scout automatically monitors key metrics once installed, but some hardware-specific metrics may require additional packages like 'sensors' or 'smartctl'. Check that required permissions are granted and review Server Scout logs for any collection errors during initial setup.

Why does Server Scout show high memory usage when my server has plenty of RAM?

Linux uses available RAM to cache disk data for performance, which appears as 'used' memory. Focus on the 'available' memory metric instead - this accounts for cache that can be immediately reclaimed. Memory pressure only becomes concerning when available memory drops below 10-15% of total RAM.

How does CPU steal time work on virtual machines?

CPU steal time indicates when the hypervisor allocates CPU resources to other virtual machines on the same physical hardware. This is normal on cloud instances. Only be concerned when steal time consistently exceeds 10% or causes noticeable performance degradation in your applications.

Server Scout shows 100% CPU usage but my server feels responsive - is this normal?

Yes, brief 100% CPU usage on responsive servers often indicates efficient use of multiple cores by batch jobs, maintenance tasks, or compilation processes. Check Server Scout's process monitoring and look for patterns - this commonly occurs during scheduled maintenance windows.

What should I do when Server Scout shows high load average but low CPU usage?

High load with moderate CPU usage typically means processes are waiting for I/O operations rather than CPU processing. This is common with slow or heavily utilised storage, including network filesystems. Monitor disk I/O metrics in Server Scout and check for processes in uninterruptible sleep state.

Why are some metrics showing as null or missing in Server Scout?

Missing metrics often occur because hardware sensors aren't available on virtual machines, required software packages aren't installed, or permissions prevent data collection. Verify your platform supports the metric and check that necessary commands like 'sensors' are installed.

How do I troubleshoot disk usage discrepancies in Server Scout?

Check Server Scout's per-mount statistics to identify which filesystems are full. Common causes include growing log files, uncleaned temporary files, or files deleted while still open by processes. Use 'lsof +L1' to find processes with deleted files consuming space.
