Service Monitoring Metrics Explained

Server Scout's service monitoring capabilities provide essential visibility into the health of your systemd services, helping you detect failures, track service states, and ensure critical services remain operational. These metrics are collected hourly as part of the Glacial tier, offering a comprehensive view of your system's service landscape.

Understanding Service States

Core Service Metrics

Server Scout tracks four key service-related metrics that work together to provide complete service visibility:

| Metric | Type | Description |
| --- | --- | --- |
| services | Array | Detailed information for each monitored service |
| services_running | Integer | Count of services currently in the running state |
| services_total | Integer | Total number of monitored services |
| failed_units | Integer | Count of all failed systemd units system-wide |

The services array contains detailed information for each monitored service, with four attributes per service:

| Attribute | Description | Example Values |
| --- | --- | --- |
| name | Service unit name | nginx.service, sshd.service, mysql.service |
| state | Simplified service state | running, stopped, failed |
| sub_state | Detailed systemd sub-state | running, exited, dead, failed, start-pre |
| enabled | Boot-time startup configuration | true, false |
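
As a sketch of how these pieces fit together, the aggregate counts can be derived from the services array. The exact payload encoding is an assumption for illustration; the field names follow the tables above.

```python
# Illustrative shape of the services array; the payload format is an
# assumption, but the four attributes match the table above.
services = [
    {"name": "nginx.service", "state": "running", "sub_state": "running", "enabled": True},
    {"name": "sshd.service", "state": "running", "sub_state": "running", "enabled": True},
    {"name": "logrotate.service", "state": "stopped", "sub_state": "exited", "enabled": True},
    {"name": "mysql.service", "state": "failed", "sub_state": "failed", "enabled": True},
]

# The aggregate metrics follow directly from the array.
services_total = len(services)
services_running = sum(1 for s in services if s["state"] == "running")

print(services_running, services_total)  # 2 4
```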

Service State Classification

Server Scout simplifies systemd's complex state model into three primary states:

Running: The service is active and operating normally. This corresponds to systemd's "active" state with sub-states like "running" for daemon processes or "exited" for one-shot services that completed successfully.

Stopped: The service is not currently running. This is normal for disabled services or services that have been intentionally stopped. The systemd state is typically "inactive" with sub-state "dead".

Failed: The service has encountered an error and is not functioning. This maps to systemd's "failed" state and always requires investigation. Common causes include configuration errors, missing dependencies, or application crashes.
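
The three-state simplification described above can be sketched as a mapping from systemd's active state and sub-state. This is an illustrative approximation, not Server Scout's exact logic.

```python
def classify(active_state: str, sub_state: str) -> str:
    """Collapse systemd's state model into the three simplified states.
    A sketch of the mapping described above, not the agent's exact rules."""
    if active_state == "failed" or sub_state == "failed":
        return "failed"
    if active_state == "active":
        # Covers daemons (sub-state "running") and one-shot services that
        # completed successfully (sub-state "exited").
        return "running"
    return "stopped"

print(classify("active", "running"))   # running
print(classify("active", "exited"))    # running
print(classify("inactive", "dead"))    # stopped
print(classify("failed", "failed"))    # failed
```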

Choosing Services to Monitor

Strategic Service Selection

The Server Scout agent can monitor up to 16 systemd services, so selecting the right services is crucial for effective monitoring. Focus on services that are critical to your server's primary functions.

Web Services: If your server hosts websites or web applications, monitor your web server (nginx.service, apache2.service, or httpd.service) and any application servers (php-fpm.service, uwsgi.service).

Database Services: Database availability is often critical. Monitor services like mysql.service, mariadb.service, postgresql.service, or redis.service depending on your stack.

Infrastructure Services: Essential system services warrant monitoring. Always include sshd.service for remote access, and consider systemd-resolved.service for DNS resolution or chrony.service/ntp.service for time synchronisation.

Mail Services: For mail servers, monitor both your mail transfer agent (postfix.service, exim4.service) and your mailbox delivery service (dovecot.service).

Application-Specific Services: Include any custom applications or third-party services critical to your server's function, such as docker.service, fail2ban.service, or custom application daemons.

Configuration Considerations

When configuring service monitoring, consider the service's role and expected behaviour:

  • Always-running services: Web servers, databases, and SSH should typically show as "running" and "enabled"
  • On-demand services: Some services like cups.service might legitimately be stopped when not needed
  • Maintenance services: Services like logrotate.service or backup scripts might show as "stopped" with sub-state "exited" after successful completion
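
These role-based expectations can be encoded as a simple check. The role names and rules below are illustrative assumptions, not agent configuration options.

```python
# Hypothetical helper: flag services whose observed state differs from what
# their role leads us to expect. Role names and rules are assumptions.
EXPECTED = {
    "always_running": {"running"},
    "on_demand":      {"running", "stopped"},
    "maintenance":    {"running", "stopped"},  # "stopped"/"exited" after success is normal
}

def unexpected(role: str, state: str) -> bool:
    """Return True when the observed state warrants attention for this role."""
    return state not in EXPECTED.get(role, {"running"})

print(unexpected("always_running", "stopped"))  # True: e.g. nginx down
print(unexpected("maintenance", "stopped"))     # False: backup finished normally
print(unexpected("on_demand", "failed"))        # True: failed always matters
```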

Failed Units: System-Wide Monitoring

Beyond Configured Services

The failed_units metric provides broader visibility than your configured service list. While the services array only tracks your selected services, failed_units counts all systemd units in a failed state across the entire system.

This metric should ideally be zero. Any non-zero value indicates that something on your system has failed and requires attention. Failed units might include:

  • Services you haven't explicitly configured for monitoring
  • Mount units for filesystems that failed to mount
  • Timer units that encountered errors
  • Socket units that failed to bind

Early Warning System

Because failed_units casts a wider net, it often catches problems before they impact your monitored services. For example, a failed mount unit might not immediately affect your web server, but could cause issues when the web server tries to access files on that mount point.

Use failed_units as your primary alerting metric for service-related issues. Set up alerts when this value exceeds zero, then investigate using systemctl --failed to identify the problematic units.
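
The investigation step can be sketched as follows. The snippet parses sample `systemctl --failed` output rather than running the command; on a real host you would capture the command's output, and the unit names shown are invented examples.

```python
# Simulated output of `systemctl --failed --plain --no-legend`; the two
# failed units below are invented examples.
sample = """\
var-backups.mount loaded failed failed /var/backups
certbot.service   loaded failed failed Certbot renewal
"""

# The unit name is the first whitespace-separated field on each line.
failed_units = [line.split()[0] for line in sample.splitlines() if line.strip()]

if len(failed_units) > 0:  # alert whenever failed_units exceeds zero
    print(f"ALERT: {len(failed_units)} failed unit(s): {', '.join(failed_units)}")
```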

Collection Methodology and Timing

Why Glacial Tier?

Service monitoring operates on the Glacial tier (hourly collection) for several important reasons:

Service Stability: Well-configured services rarely change state. Checking every few seconds would waste system resources without providing meaningful additional insight.

System Overhead: The agent uses systemctl show commands to query service states. While lightweight, these commands have more overhead than reading from /proc or /sys filesystems, making them unsuitable for the fast monitoring tiers.

Operational Patterns: Service failures typically persist long enough that hourly detection is sufficient for most operational needs. A service that fails and recovers within an hour often indicates intermittent issues that warrant investigation anyway.

Complementary Monitoring

For faster detection of critical service failures, combine service monitoring with other Server Scout metrics:

  • Monitor process counts for critical services
  • Track network connection states for network services
  • Watch for spikes in system error rates or load averages
  • Use log monitoring for immediate failure detection

Interpreting Service Patterns

Normal Operation Patterns

Stable Services: Most production services should show consistent "running" states with occasional stops for maintenance or updates. Frequent state changes often indicate underlying problems.

Seasonal Services: Some services might legitimately show varying states based on usage patterns. Backup services might appear as "stopped" with "exited" sub-state after successful completion.

Dependency Patterns: Related services often fail together. Database connection failures might precede web application failures, creating a cascade visible in your service metrics.

Failure Patterns and Recovery

Crash Loops: A service alternating rapidly between "running" and "failed" states indicates a crash loop. The service starts, encounters an immediate problem, fails, gets restarted by systemd, and fails again.
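
With hourly samples, a crash loop shows up as repeated transitions into and out of the "failed" state across consecutive collections. A minimal detection sketch, with the flip threshold as an assumed tuning parameter:

```python
def looks_like_crash_loop(history: list[str], min_flips: int = 3) -> bool:
    """Count state transitions that involve "failed"; many such flips in a
    short history suggest systemd is restarting a service that keeps dying."""
    flips = sum(
        1 for a, b in zip(history, history[1:]) if a != b and "failed" in (a, b)
    )
    return flips >= min_flips

# Alternating running/failed samples: a likely crash loop.
print(looks_like_crash_loop(["running", "failed", "running", "failed", "running"]))  # True
# An intentional stop and restart: not a crash loop.
print(looks_like_crash_loop(["running", "running", "stopped", "running"]))           # False
```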

Dependency Failures: Services might fail if their dependencies are unavailable. Check failed services for dependency relationships using systemctl list-dependencies.

Resource Exhaustion: Services might fail due to resource constraints. Correlate service failures with memory, disk space, or file descriptor metrics.

Recovery Strategies

When services show failure states:

  1. Immediate Assessment: Check systemctl status for error details
  2. Log Analysis: Examine service-specific logs and systemd journal entries
  3. Dependency Check: Verify that required services and resources are available
  4. Resource Verification: Ensure adequate memory, disk space, and system resources
  5. Configuration Validation: Check service configuration files for syntax errors

Integration with Other Metrics

Holistic System Health

Service metrics work best when interpreted alongside other Server Scout metrics:

Process Metrics: Cross-reference service states with process counts. A "running" service with zero processes might indicate a problem with process tracking or zombie processes.
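
The cross-reference described here can be sketched as a simple consistency check. The metric shapes and sample data are assumptions for illustration.

```python
def inconsistencies(services, process_counts):
    """Yield names of services reported running but with zero processes,
    as suggested above. Input shapes are assumed for illustration."""
    for s in services:
        if s["state"] == "running" and process_counts.get(s["name"], 0) == 0:
            yield s["name"]

svc = [
    {"name": "nginx.service", "state": "running"},
    {"name": "mysql.service", "state": "running"},
]
procs = {"nginx.service": 4, "mysql.service": 0}

print(list(inconsistencies(svc, procs)))  # ['mysql.service']
```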

Network Metrics: For network services, correlate service states with connection counts and network activity. A running web server with no network connections might indicate binding or firewall issues.

System Load: Service failures often precede or follow system load spikes. Monitor load averages and CPU usage patterns around service state changes.

Memory and Disk: Resource exhaustion commonly causes service failures. Track memory usage and disk space trends to predict and prevent service problems.

Alerting Best Practices

Design alerts that balance responsiveness with noise reduction:

  • Alert immediately on failed_units > 0
  • Create separate alert thresholds for critical vs. non-critical services
  • Use service state history to detect crash loops or frequent restarts
  • Combine service alerts with related system metrics for context
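
The critical/non-critical split can be sketched as a tiered routing function. The service lists and alert levels are example choices, not product features.

```python
# Example criticality tiers: critical services page immediately,
# everything else only warns. The set membership is an assumption.
CRITICAL = {"nginx.service", "mysql.service", "sshd.service"}

def alert_level(name: str, state: str):
    """Return 'page' for failed critical services, 'warn' for other
    failures, and None when no alert is needed."""
    if state != "failed":
        return None
    return "page" if name in CRITICAL else "warn"

print(alert_level("mysql.service", "failed"))   # page
print(alert_level("cups.service", "failed"))    # warn
print(alert_level("nginx.service", "running"))  # None
```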

Service monitoring with Server Scout provides the foundation for maintaining system reliability, but reaches its full potential when integrated with comprehensive system monitoring and operational procedures.


Frequently Asked Questions

What does a failed systemd service mean?

A failed systemd unit (counted in failed_units) means a service crashed, exited with an error, or could not start. Any non-zero failed_units count should be investigated. Use systemctl --failed on the server to see which units have failed and journalctl -u service-name to view logs. Failed services may include critical infrastructure like databases, web servers, or monitoring agents.

How often does Server Scout check service status?

Service metrics are collected on the Glacial tier, once per hour. This includes the services array with per-service details (name, state, sub-state, enabled status), services_running and services_total counts, and failed_units. Hourly collection is appropriate because service state changes are relatively infrequent and the systemd query has higher overhead than reading /proc files.

What information does the services array contain?

The services array provides per-service details including the service name, current state (running, stopped, failed), sub-state (for more detail like "dead" or "exited"), and whether the service is enabled to start at boot. This allows monitoring of specific critical services beyond just the aggregate running/total/failed counts.

How do I monitor a specific critical service with Server Scout?

Server Scout collects the status of your configured services (up to 16) in the services array. The dashboard shows which services are running, stopped, or failed. Set up alerts on the failed_units metric to be notified when any unit fails system-wide. For specific service monitoring, check the services array in the dashboard for the service name and its current state.
