System Health and Security Metrics Explained

Server Scout continuously monitors essential system health and security indicators that provide early warning of potential issues. These metrics complement performance monitoring by offering insights into your system's stability, security posture, and maintenance requirements. Understanding these indicators helps you maintain robust, secure servers and plan preventive maintenance effectively.

System Security Health Metrics

The following table outlines the core system health and security metrics collected by Server Scout:

Metric	Description	Collection Tier	Ideal Value
`entropy`	Available entropy in kernel random pool	Medium (30s)	≥256
`oom_kills`	Cumulative Out-Of-Memory kill count	Medium (30s)	0
`ntp_synced`	System clock synchronisation status	Glacial (1hr)	true
`reboot_required`	Whether reboot needed for updates	Glacial (1hr)	false
`package_updates`	Number of pending package updates	Glacial (1hr)	Informational
`selinux_status`	SELinux enforcement mode	Daily (24hr)	enforcing
`firewall_status`	Host firewall active state	Daily (24hr)	active
`integrity`	Agent script integrity verification	Every payload	ok

Critical Security Indicators

Entropy Pool Health

The entropy metric tracks the randomness available in the kernel's random pool, as reported by /proc/sys/kernel/random/entropy_avail. This entropy feeds cryptographic operations including TLS handshakes, SSH key generation, and certificate creation, all of which depend on high-quality randomness.

Values consistently below 256 indicate entropy starvation, which can cause applications to block whilst waiting for sufficient randomness. This manifests as:

Slow SSH connections during key exchange
Web server delays during TLS handshakes
Database connection timeouts for encrypted connections
Application hangs during certificate generation

Modern systems with hardware random number generators (such as the RDRAND instruction on Intel/AMD processors) rarely experience entropy depletion, and since Linux 5.6 /dev/random no longer blocks once the pool has been initialised. However, virtualised environments, embedded systems, or heavily loaded cryptographic services may still encounter issues, particularly on older kernels.

If entropy consistently runs low, investigate the root cause first, then consider installing rng-tools or haveged to supplement the entropy pool.
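
As a rough illustration of the threshold described above, the following sketch classifies a single entropy reading against the value 256. The function name classify_entropy and its output labels are hypothetical and are not part of Server Scout.

```shell
# Hypothetical helper: classify one entropy reading against the 256 threshold.
# Prints "ok", "low", or "unknown" (for input that is not a whole number).
classify_entropy() {
    value="$1"
    # Reject empty input and anything containing a non-digit character.
    case "$value" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$value" -lt 256 ]; then
        echo "low"
    else
        echo "ok"
    fi
}

# Usage on a Linux host:
#   classify_entropy "$(cat /proc/sys/kernel/random/entropy_avail)"
```

A single low reading is not conclusive; the guidance above concerns values that stay low over time, so a check like this is best run repeatedly.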

Out-Of-Memory Kill Events

The oom_kills metric is read from the oom_kill counter in /proc/vmstat, which tracks how many times the kernel's Out-Of-Memory killer has terminated processes since boot. This cumulative counter should remain at zero on healthy systems.

Any OOM kill represents a serious memory exhaustion event where the system forcibly terminated processes to prevent complete system failure. The OOM killer selects victims based on memory usage, process importance, and OOM score adjustments.

When oom_kills increases, immediately investigate:

# Check recent OOM events
dmesg | grep -i "killed process"

# Review system memory pressure (needs pressure stall information, kernel 4.20+)
cat /proc/pressure/memory

# Identify memory-heavy processes
ps aux --sort=-%mem | head -20

Cross-reference OOM kills with Server Scout's memory metrics. Look for periods where mem_available_mb approached zero and swap usage reached the capacity reported by mem_swap_total_mb. The timing correlation helps identify which applications triggered memory exhaustion.
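
Because the counter is cumulative, spotting a new event means comparing the current reading with an earlier one. The sketch below shows one way to do that. Both function names are hypothetical, and how a previous reading is stored is left to the caller.

```shell
# Read the cumulative oom_kill counter from a vmstat-format file.
# Defaults to /proc/vmstat; the optional path argument allows a saved copy.
# Prints 0 when the field is absent (older kernels do not expose it).
read_oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 }
         END { if (!found) print 0 }' "${1:-/proc/vmstat}"
}

# Print how many OOM kills occurred between two readings.
new_oom_kills() {
    previous="$1"
    current="$2"
    # The counter restarts at zero after a reboot, so a smaller current
    # value means every kill it records happened since the last reading.
    if [ "$current" -lt "$previous" ]; then
        echo "$current"
    else
        echo $((current - previous))
    fi
}

# Usage:
#   new_oom_kills "$saved_value" "$(read_oom_kills)"
```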

Firewall Protection Status

The firewall_status metric indicates whether your host firewall is active. Server Scout checks multiple firewall implementations:

firewalld (RHEL/CentOS/Fedora default)
nftables (modern netfilter frontend)
iptables (traditional netfilter interface)

An "inactive" firewall status on internet-facing servers represents a significant security exposure. Without host-level filtering, your server relies entirely on network firewalls and application-level security.

However, firewall status should be evaluated contextually. Servers behind well-configured network firewalls or in isolated network segments may intentionally disable host firewalls to reduce complexity. The key is ensuring appropriate network-level protection exists.
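
A check across several firewall implementations has to reduce multiple probe results to one status. The sketch below assumes a simple rule: the host counts as protected if any probed implementation reports "active". The function name summarise_firewall is hypothetical, and the agent's actual logic may differ.

```shell
# Reduce per-implementation probe results to a single status.
# Each argument is the state reported for one firewall implementation;
# anything other than the exact string "active" counts as not active.
summarise_firewall() {
    for state in "$@"; do
        if [ "$state" = "active" ]; then
            echo "active"
            return 0
        fi
    done
    echo "inactive"
}

# Usage on a systemd host (unit names are assumptions and vary by distribution):
#   summarise_firewall "$(systemctl is-active firewalld)" \
#                      "$(systemctl is-active nftables)"
```

An active service unit does not prove that meaningful rules are loaded, so a fuller check would also inspect the ruleset, which normally requires root.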

SELinux Enforcement

On RHEL-family systems (Red Hat, CentOS, Fedora), selinux_status reports the current enforcement mode from getenforce:

enforcing: SELinux actively blocks policy violations (recommended)
permissive: SELinux logs violations but allows them (debugging mode)
disabled: SELinux completely inactive

"Enforcing" mode provides mandatory access controls that significantly limit attack impact even if applications are compromised. Many compliance frameworks require SELinux enforcement on production systems.

"Permissive" mode indicates a temporary debugging state—acceptable during troubleshooting but inappropriate for production. "Disabled" SELinux removes an important security layer and may violate organisational policies.
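
The getenforce command prints Enforcing, Permissive, or Disabled. A minimal sketch of mapping that output to the lowercase values listed above follows. The function name and the "unknown" fallback are assumptions made for illustration.

```shell
# Map getenforce output to a lowercase status label.
# Prints "unknown" for empty or unexpected input, for example on hosts
# where SELinux tooling is not installed.
normalise_selinux() {
    mode=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$mode" in
        enforcing|permissive|disabled) echo "$mode" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   normalise_selinux "$(getenforce 2>/dev/null)"
```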

System Maintenance Indicators

Time Synchronisation

The ntp_synced metric indicates whether your system clock synchronises with authoritative time sources. Server Scout checks this via timedatectl on systemd systems or by examining chrony/ntpd status on older distributions.

Accurate time synchronisation is critical for:

Certificate validation: SSL/TLS certificates have validity periods checked against system time
Authentication protocols: Kerberos, OAuth, and multi-factor authentication rely on time accuracy
Log correlation: Distributed systems require synchronised timestamps for troubleshooting
Database replication: Many database clusters require time synchronisation
Compliance: Audit logs must have accurate timestamps

Servers with ntp_synced: false may experience authentication failures, certificate errors, or difficulties correlating events across systems.
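
On systemd hosts, timedatectl exposes an NTPSynchronized property with the value yes or no. The sketch below converts that value to the boolean form used by the metric. The function name is hypothetical, and treating any value other than "yes" as unsynchronised is a conservative assumption.

```shell
# Convert the NTPSynchronized property value to "true" or "false".
# Anything other than "yes" (including empty output) is reported as false.
ntp_synced_from() {
    case "$1" in
        yes) echo "true" ;;
        *)   echo "false" ;;
    esac
}

# Usage on a systemd host:
#   ntp_synced_from "$(timedatectl show --property=NTPSynchronized --value)"
```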

Reboot Requirements

The reboot_required indicator signals when installed updates require a system restart to take effect. Detection methods vary by distribution:

Debian/Ubuntu: Presence of /var/run/reboot-required file
RHEL/CentOS: Output from needs-restarting utility
Other distributions: Similar mechanisms checking for kernel updates or core system library changes
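
The Debian/Ubuntu check reduces to testing for a flag file, which the following sketch illustrates. The function name is hypothetical, and the optional path argument exists only so the helper can be exercised against a different file.

```shell
# Report whether the reboot flag file exists.
# Defaults to the Debian/Ubuntu location named above.
reboot_required() {
    flag_file="${1:-/var/run/reboot-required}"
    if [ -f "$flag_file" ]; then
        echo "true"
    else
        echo "false"
    fi
}

# Usage on Debian/Ubuntu:
#   reboot_required
# On RHEL-family hosts, "needs-restarting -r" signals the same condition
# through its exit status instead of a file.
```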

Kernel updates, core library updates (glibc, systemd), and some security patches require reboots. Whilst not immediately urgent, plan maintenance windows to apply these restarts and fully activate security updates.

Package Updates

The package_updates count shows pending updates from your distribution's package manager (apt, dnf, yum). This provides visibility into your patch management status without requiring separate tools.

Large numbers of pending updates may indicate:

Infrequent maintenance schedules
Failed automatic update mechanisms
Manual update policies requiring review

Regular patching reduces security exposure and ensures access to bug fixes. However, production systems typically require change control processes rather than immediate automatic updates.
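
For a quick manual count on apt-based systems, the output of "apt list --upgradable" can be filtered as sketched below. This is an illustration rather than the agent's method: the function name is hypothetical, and apt's human-readable output is not guaranteed to stay stable between releases.

```shell
# Count upgradable packages from "apt list --upgradable" output on stdin.
# Each upgradable package line contains "[upgradable from: <version>]".
count_apt_upgradable() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c '\[upgradable from: ' || true
}

# Usage:
#   apt list --upgradable 2>/dev/null | count_apt_upgradable
```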

Agent Integrity Verification

The integrity metric provides tamper detection for the Server Scout agent itself. Each data payload includes a SHA-256 checksum of the agent script, which the dashboard compares against known-good signatures.

"ok" status confirms the agent hasn't been modified, whilst "modified" indicates potential tampering or corruption. This helps detect:

Unauthorised agent modifications
File system corruption affecting the agent
Compromise attempts targeting monitoring infrastructure
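
The comparison itself can be sketched as follows, assuming the GNU sha256sum utility is available. The function name, its arguments, and the choice to report an unreadable file as "modified" are assumptions for illustration. The real comparison happens on the dashboard side.

```shell
# Compare a file's SHA-256 checksum with an expected hex digest.
# Prints "ok" on a match and "modified" otherwise, including when the
# file cannot be read.
verify_integrity() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" 2>/dev/null | awk '{ print $1 }')
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "ok"
    else
        echo "modified"
    fi
}

# Usage (path and digest are placeholders):
#   verify_integrity /path/to/agent-script "<known-good sha256 digest>"
```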

Security Posture Assessment

These metrics collectively provide a basic security health check without requiring dedicated security tools. They complement performance monitoring by highlighting security-relevant system states.

Immediate Action Required:

oom_kills > 0: Investigate memory exhaustion causes
selinux_status: "disabled" on production RHEL systems
firewall_status: "inactive" on internet-facing servers
integrity: "modified" indicates potential tampering
entropy consistently < 100: Risk of cryptographic delays

Maintenance Window Actions:

reboot_required is true: plan a restart during the maintenance window
package_updates > 0: schedule update installation
ntp_synced is false: configure time synchronisation
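
The two lists above can be expressed as a small lookup, sketched below. The function name and tier labels are hypothetical. The sketch also simplifies the guidance: it judges a single reading, and it ignores context such as whether the host is internet-facing or a production RHEL system.

```shell
# Map one metric reading to a response tier:
# "immediate", "maintenance", or "none".
triage_metric() {
    name="$1"
    value="$2"
    tier="none"
    case "$name" in
        oom_kills)
            if [ "$value" -gt 0 ] 2>/dev/null; then tier="immediate"; fi ;;
        entropy)
            if [ "$value" -lt 100 ] 2>/dev/null; then tier="immediate"; fi ;;
        selinux_status)
            if [ "$value" = "disabled" ]; then tier="immediate"; fi ;;
        firewall_status)
            if [ "$value" = "inactive" ]; then tier="immediate"; fi ;;
        integrity)
            if [ "$value" = "modified" ]; then tier="immediate"; fi ;;
        reboot_required)
            if [ "$value" = "true" ]; then tier="maintenance"; fi ;;
        ntp_synced)
            if [ "$value" = "false" ]; then tier="maintenance"; fi ;;
        package_updates)
            if [ "$value" -gt 0 ] 2>/dev/null; then tier="maintenance"; fi ;;
    esac
    echo "$tier"
}

# Usage:
#   triage_metric firewall_status inactive
```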

Monitoring Trends:

Track these metrics over time to identify patterns. A gradually increasing oom_kills count suggests growing memory pressure. Frequent reboot_required states may indicate that an aggressive update policy needs refinement.

Correlation with Performance Metrics

System health metrics gain context when correlated with performance indicators:

OOM kills vs. memory usage: Compare oom_kills timing with mem_available_mb and swap utilisation to identify memory pressure patterns
Entropy depletion vs. network activity: Low entropy during high net_tx_bytes periods may indicate TLS-heavy workloads
Update requirements vs. system stability: Correlate pending updates with system error rates or performance degradation

Best Practices

Establish baseline values for your environment and set appropriate alerting thresholds. Security metrics often have binary good/bad states, making them suitable for immediate notifications rather than trend analysis.

Document your organisation's acceptable states for each metric. Some environments intentionally disable certain security features for performance or compatibility reasons—the key is making these decisions consciously rather than by oversight.

Regular review of these metrics during maintenance windows ensures your servers maintain good security hygiene alongside optimal performance.

Frequently Asked Questions

What is entropy and why does it matter for server security?

Entropy measures the randomness available to the kernel for cryptographic operations like generating SSL/TLS keys, secure random numbers, and authentication tokens. Low entropy (below 256 bits) can cause applications to block waiting for randomness, slowing down HTTPS connections and security operations. Modern kernels with hardware random number generators rarely have entropy issues, but virtual machines sometimes do.

What does an OOM kill mean and how serious is it?

An OOM (Out of Memory) kill means the kernel's OOM killer terminated a process because the system ran out of memory and swap. This is a serious event indicating severe memory pressure. The killed process is chosen by a heuristic that targets the largest memory consumer. Any non-zero oom_kills value should be investigated immediately. Solutions include adding RAM, fixing memory leaks, or adjusting workload.

Why should NTP synchronisation be monitored?

NTP synchronisation (ntp_synced) ensures your server clock is accurate. Unsynchronised clocks cause problems with TLS certificate validation, log correlation across servers, distributed database consistency, authentication token expiry, and scheduled task timing. Time drift can cause subtle, hard-to-diagnose issues. Server Scout checks whether the system clock is NTP-synchronised and reports a boolean status.

What does the integrity metric check?

The integrity metric provides agent tamper detection by verifying that the Server Scout agent script has not been modified. Each payload includes a SHA-256 checksum of the script, which the dashboard compares against known-good signatures. This helps ensure that the monitoring data you receive is trustworthy and that the agent has not been compromised. The check is reported with every payload rather than on a fixed collection tier.

How do I use the reboot_required and package_updates metrics?

The reboot_required metric indicates whether pending system updates need a reboot to take effect, typically kernel or core library updates. The package_updates metric shows how many updates are available. Together they help you plan maintenance windows. Monitor reboot_required for timely security patching and track package_updates to ensure systems stay current with security fixes.
