Understanding the Server Health Summary

The Server Health Summary provides a quick, at-a-glance view of your server's overall status on the Server Scout detail page. This intelligent summary automatically evaluates key system metrics and alerts you to potential issues that require attention, helping you maintain optimal server performance.

How the Health Summary Works

The health summary operates on a simple principle: when everything is running smoothly, you'll see the reassuring "All systems normal" message. However, when Server Scout detects issues that could impact your server's performance or reliability, it will display specific alerts with clear descriptions of what needs attention.

This summary is generated from the latest metrics snapshot and updates in real time, so you always have current information about your server's health.
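The behaviour described above comes down to one decision: list any detected issues, or fall back to the all-clear message when there are none. The sketch below is illustrative only. The function name `render_summary` and its one-issue-per-argument input are assumptions for this example, not Server Scout's actual implementation.

```shell
#!/bin/sh
# Hypothetical sketch: print each detected issue, or the all-clear message.
# render_summary is an illustrative name, not part of Server Scout.
render_summary() {
    if [ "$#" -eq 0 ]; then
        echo "All systems normal"
        return 0
    fi
    for issue in "$@"; do
        echo "ALERT: $issue"
    done
}

# Example: render_summary "reboot required" "high IO wait"
```

Calling the function with no arguments prints the all-clear message; each argument passed in is printed as its own alert line.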

Understanding Health Issues

Reboot Required

When you see a "reboot required" alert, it indicates that your system has a pending OS reboot, typically after installing security updates or kernel patches. Whilst your server continues to run normally, the reboot ensures all updates take effect properly.

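# Check if reboot is required on Ubuntu/Debian
# (the file exists only while a reboot is pending)
ls /var/run/reboot-required

# Check on CentOS/RHEL (needs-restarting is provided by the yum-utils package;
# it exits with status 1 when a reboot is needed)
needs-restarting -r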
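If you want a single check that works on either distribution family, the two commands above can be wrapped in one function. This is a minimal sketch, not how Server Scout itself detects pending reboots. The name `reboot_pending` and the optional flag-file argument are assumptions added for illustration and testing.

```shell
#!/bin/sh
# Sketch: return 0 when a reboot appears to be pending, 1 otherwise.
# The flag-file path can be overridden via the first argument (hypothetical
# convenience for testing; the default is the Debian/Ubuntu location).
reboot_pending() {
    flag_file="${1:-/var/run/reboot-required}"

    # Debian/Ubuntu: the flag file exists while a reboot is pending
    if [ -f "$flag_file" ]; then
        return 0
    fi

    # CentOS/RHEL: needs-restarting -r exits with 1 when a reboot is needed
    if command -v needs-restarting >/dev/null 2>&1; then
        needs-restarting -r >/dev/null 2>&1
        if [ "$?" -eq 1 ]; then
            return 0
        fi
    fi

    return 1
}

# Example: reboot_pending && echo "Reboot required"
```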

High CPU Temperature

Server Scout monitors your CPU temperature and raises an alert when it exceeds 85 degrees Celsius. Elevated temperatures can lead to thermal throttling, reduced performance, and potential hardware damage.

High CPU temperatures often indicate:

  • Inadequate cooling or ventilation
  • Dust accumulation in cooling systems
  • Failing thermal paste or cooling components
  • Excessive CPU load over extended periods
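To compare a reading against the 85 degree threshold yourself, you can use the Linux thermal sysfs interface, which reports temperatures in millidegrees Celsius. The sketch below assumes your kernel exposes `/sys/class/thermal/thermal_zone*/temp`. Not every zone is a CPU sensor, and this is not necessarily the source Server Scout reads from. The function name `temp_status` is illustrative.

```shell
#!/bin/sh
# Sketch: compare a reading in millidegrees Celsius against a limit given in
# degrees Celsius (default 85). Prints "high" or "ok".
temp_status() {
    millideg="$1"
    limit_c="${2:-85}"
    if [ "$millideg" -gt $((limit_c * 1000)) ]; then
        echo "high"
    else
        echo "ok"
    fi
}

# Report every thermal zone the kernel exposes; unreadable zones are skipped.
for zone in /sys/class/thermal/thermal_zone*/temp; do
    [ -r "$zone" ] || continue
    value=$(cat "$zone" 2>/dev/null) || continue
    [ -n "$value" ] || continue
    echo "$zone: $(temp_status "$value")"
done
```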

Failed Systemd Units

When more than 10 systemd units are in a failed state, the health summary will flag this as a concern. Failed units can indicate service crashes, configuration issues, or dependency problems that may affect system functionality.

# View failed systemd units
systemctl --failed

# Check specific unit status
systemctl status unit-name
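To see how close a server is to the 10-unit threshold, you can count the lines that `systemctl --failed` prints. This is a rough sketch of that comparison, not Server Scout's own logic. The function name `failed_units_status` and its "flagged"/"ok" output are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: read one failed unit per line on stdin and print "flagged" when the
# count is greater than the limit (default 10), otherwise "ok".
failed_units_status() {
    limit="${1:-10}"
    # grep -c . counts the non-empty lines; it prints 0 when there are none
    count=$(grep -c .)
    if [ "$count" -gt "$limit" ]; then
        echo "flagged"
    else
        echo "ok"
    fi
}

# Usage on a live system (--no-legend drops the header and footer,
# --plain drops the status markers, leaving one unit per line):
#   systemctl --failed --no-legend --plain | failed_units_status
```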

Agent Integrity Status

The agent integrity check ensures your Server Scout monitoring agent hasn't been compromised. You'll see one of three states:

  • Verified: Checksums match expected values - your agent is authentic and unmodified
  • Unverified: Indicates an older agent version that may need updating
  • Tampered: Checksums don't match, suggesting the agent files have been modified

If you see "tampered" status, investigate immediately as this could indicate a security issue.
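When investigating, you can compare a file's checksum against a known-good value by hand. The sketch below uses SHA-256 purely as an assumption, since this article does not state which algorithm Server Scout uses, where the agent files live, or where expected checksums are published. You supply both the file path and the expected digest, and the function name `check_integrity` is illustrative.

```shell
#!/bin/sh
# Sketch: compare a file's SHA-256 digest with an expected value supplied by
# the caller. Prints "verified" on a match, "tampered" otherwise.
# SHA-256 is an assumption; the agent's real checksum scheme is not documented here.
check_integrity() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "verified"
    else
        echo "tampered"
    fi
}

# Example: check_integrity /path/to/agent-binary "<expected sha256 digest>"
```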

High CPU Steal Percentage

CPU steal time becomes a concern when it remains consistently high. This metric is particularly relevant for virtual machines and indicates that your VM is waiting for the hypervisor to allocate CPU resources. High steal percentages suggest:

  • VM resource contention on the physical host
  • Oversubscribed virtualisation environment
  • Need for resource allocation review
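To measure steal time directly, you can sample the aggregate `cpu` line in `/proc/stat` twice and compare the counters. The sketch below shows one way to do that calculation; the function name `steal_pct` and the sampling interval are assumptions, and Server Scout may compute its figure differently.

```shell
#!/bin/sh
# Sketch: steal percentage between two samples of the aggregate "cpu" line
# from /proc/stat. Fields after the label are: user nice system idle iowait
# irq softirq steal guest guest_nice.
steal_pct() {
    # $1 is the earlier "cpu ..." line, $2 is the later one
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user..steal (guest is already counted in user)
            steal = $9
            if (NR == 1) { t0 = total; s0 = steal }
            else         { t1 = total; s1 = steal }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (s1 - s0) / dt
        }'
}

# Usage on a live system:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A longer interval between the two samples smooths out short spikes, which matches the emphasis on steal that stays consistently high.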

High IO Wait Percentage

Elevated IO wait percentages signal that your CPU is frequently waiting for disk operations to complete. This typically indicates a disk bottleneck that can significantly impact system performance.

Common causes include:

  • Slow or failing storage devices
  • Insufficient disk IOPS for current workload
  • Poorly optimised database queries
  • Inadequate storage configuration

# Monitor IO wait in real time (iostat is part of the sysstat package)
iostat -x 1

# Show which processes are generating disk I/O (usually requires root)
iotop
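If you want a single IO wait figure to log or compare over time, `vmstat` reports it in its `wa` column. The sketch below finds that column by name in the header row rather than by a fixed position, since the column layout can differ between vmstat versions. The function name `iowait_from_vmstat` is an assumption for illustration.

```shell
#!/bin/sh
# Sketch: extract the "wa" (IO wait %) column from vmstat output by looking
# up its position in the header row, then print the value from the last row.
iowait_from_vmstat() {
    awk '
        $1 == "r" && $2 == "b" {                  # column-name header row
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        col && $1 ~ /^[0-9]+$/ { value = $col }   # remember the latest data row
        END { if (value != "") print value }
    '
}

# Usage on a live system:
#   vmstat 5 2 | iowait_from_vmstat
```

The first row vmstat prints is an average since boot, so the usage line asks for two samples and the function keeps only the last one, which covers the most recent interval.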

Taking Action on Health Alerts

When health issues appear in your summary, prioritise them based on severity and potential impact. Critical issues like high temperatures or agent tampering require immediate attention, whilst others like pending reboots can often be scheduled during maintenance windows.

The real-time nature of the health summary means that as you resolve issues, the alerts will disappear and you'll return to the "All systems normal" status, providing immediate feedback on your remediation efforts.

Regular monitoring of the Server Health Summary helps maintain proactive server management, allowing you to address potential problems before they impact your services or users.

Frequently Asked Questions

How does Server Scout's Server Health Summary work?

The Server Health Summary automatically evaluates key system metrics and displays either "All systems normal" when everything is running smoothly, or specific alerts when issues are detected. It is generated from the latest metrics snapshot and updates in real time, so you always have current information about your server's health.

What does the "reboot required" alert mean in Server Scout?

A "reboot required" alert indicates your system has a pending OS reboot, typically after installing security updates or kernel patches. Whilst your server continues running normally, the reboot ensures all updates take effect properly. You can check this on Ubuntu/Debian with "ls /var/run/reboot-required" or on CentOS/RHEL with "needs-restarting -r".

When does Server Scout alert for high CPU temperature?

Server Scout raises a high CPU temperature alert when the CPU temperature exceeds 85 degrees Celsius. Elevated temperatures can lead to thermal throttling, reduced performance, and potential hardware damage. A high reading often indicates inadequate cooling, dust accumulation, failing thermal paste, or excessive CPU load over extended periods.

What does agent integrity status mean in Server Scout?

Agent integrity status has three states: Verified means checksums match expected values and your agent is authentic; Unverified indicates an older agent version that may need updating; Tampered means checksums don't match, suggesting agent files have been modified, which requires immediate investigation as it could indicate a security issue.

How do I fix the failed systemd units alert in Server Scout?

Server Scout flags failed systemd units when more than 10 units are in a failed state. To troubleshoot, use "systemctl --failed" to view failed units and "systemctl status unit-name" to check a specific unit's status. Failed units can indicate service crashes, configuration issues, or dependency problems affecting system functionality.

What causes high IO wait percentage alerts?

High IO wait percentage alerts signal that your CPU is frequently waiting for disk operations to complete, indicating a disk bottleneck impacting system performance. Common causes include slow or failing storage devices, insufficient disk IOPS, poorly optimised database queries, or inadequate storage configuration. Monitor with the "iostat -x 1" or "iotop" commands.

What is CPU steal time in Server Scout monitoring?

CPU steal time is the time your VM spends waiting for the hypervisor to allocate CPU resources, so it is particularly relevant for virtual machines. It becomes a concern when it remains consistently high. High steal percentages suggest VM resource contention on the physical host, an oversubscribed virtualisation environment, or a need for a resource allocation review.
