Server Scout's service monitoring capabilities provide essential visibility into the health of your systemd services, helping you detect failures, track service states, and ensure critical services remain operational. These metrics are collected hourly as part of the Glacial tier, offering a comprehensive view of your system's service landscape.
## Understanding Service States
### Core Service Metrics
Server Scout tracks four key service-related metrics that work together to provide complete service visibility:
| Metric | Type | Description |
|---|---|---|
| `services` | Array | Detailed information for each monitored service |
| `services_running` | Integer | Count of services currently in the running state |
| `services_total` | Integer | Total number of monitored services |
| `failed_units` | Integer | Count of all failed systemd units system-wide |
The `services` array contains detailed information for each monitored service, with four attributes per service:
| Attribute | Description | Example Values |
|---|---|---|
| `name` | Service unit name | `nginx.service`, `sshd.service`, `mysql.service` |
| `state` | Simplified service state | `running`, `stopped`, `failed` |
| `sub_state` | Detailed systemd sub-state | `running`, `exited`, `dead`, `failed`, `start-pre` |
| `enabled` | Boot-time startup configuration | `true`, `false` |
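Putting the two tables together, a single hourly sample might look like the following. The field names follow the tables above; the specific services and values are hypothetical:

```python
# Hypothetical example of one hourly service-metrics sample.
sample = {
    "services": [
        {"name": "nginx.service",  "state": "running", "sub_state": "running", "enabled": True},
        {"name": "sshd.service",   "state": "running", "sub_state": "running", "enabled": True},
        {"name": "backup.service", "state": "stopped", "sub_state": "exited",  "enabled": False},
    ],
    "services_running": 2,   # count of entries with state == "running"
    "services_total": 3,     # size of the services array
    "failed_units": 0,       # system-wide failed units, not limited to the list above
}

# The counters are consistent with the array:
assert sample["services_running"] == sum(s["state"] == "running" for s in sample["services"])
assert sample["services_total"] == len(sample["services"])
```

Note that `failed_units` is independent of the `services` array: it can be non-zero even when every monitored service reports `running`.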
### Service State Classification
Server Scout simplifies systemd's complex state model into three primary states:
Running: The service is active and operating normally. This corresponds to systemd's "active" state with sub-states like "running" for daemon processes or "exited" for one-shot services that completed successfully.
Stopped: The service is not currently running. This is normal for disabled services or services that have been intentionally stopped. The systemd state is typically "inactive" with sub-state "dead".
Failed: The service has encountered an error and is not functioning. This maps to systemd's "failed" state and always requires investigation. Common causes include configuration errors, missing dependencies, or application crashes.
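The classification above can be sketched as a small function. This is an illustration of the mapping, not Server Scout's actual implementation:

```python
def simplify_state(active_state: str, sub_state: str) -> str:
    """Collapse systemd's ActiveState/SubState pair into the three
    simplified states described above (illustrative sketch)."""
    if active_state == "failed":
        return "failed"
    if active_state == "active":
        # Covers sub-state "running" for daemons and "exited" for
        # one-shot services that completed successfully.
        return "running"
    # "inactive" with sub-state "dead", and anything else, is stopped.
    return "stopped"
```

For example, `simplify_state("active", "exited")` returns `"running"`, which is why a successfully completed one-shot service is not flagged as a problem.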
## Choosing Services to Monitor
### Strategic Service Selection
The Server Scout agent can monitor up to 16 systemd services, so selecting the right services is crucial for effective monitoring. Focus on services that are critical to your server's primary functions.
Web Services: If your server hosts websites or web applications, monitor your web server (`nginx.service`, `apache2.service`, or `httpd.service`) and any application servers (`php-fpm.service`, `uwsgi.service`).
Database Services: Database availability is often critical. Monitor services like `mysql.service`, `mariadb.service`, `postgresql.service`, or `redis.service` depending on your stack.
Infrastructure Services: Essential system services warrant monitoring. Always include `sshd.service` for remote access, and consider `systemd-resolved.service` for DNS resolution or `chrony.service`/`ntp.service` for time synchronisation.
Mail Services: For mail servers, monitor both incoming (`postfix.service`, `exim4.service`) and delivery services (`dovecot.service`).
Application-Specific Services: Include any custom applications or third-party services critical to your server's function, such as `docker.service`, `fail2ban.service`, or custom application daemons.
### Configuration Considerations
When configuring service monitoring, consider the service's role and expected behaviour:
- Always-running services: Web servers, databases, and SSH should typically show as "running" and "enabled"
- On-demand services: Some services like `cups.service` might legitimately be stopped when not needed
- Maintenance services: Services like `logrotate.service` or backup scripts might show as "stopped" with sub-state "exited" after successful completion
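These expectations can be encoded and checked mechanically. The policy names, service list, and rules below are hypothetical examples, not Server Scout configuration:

```python
# Hypothetical per-service expectations, following the categories above.
EXPECTATIONS = {
    "nginx.service":     "always-running",  # should be running and enabled
    "cups.service":      "on-demand",       # stopped is acceptable
    "logrotate.service": "maintenance",     # stopped/exited after success is normal
}

def unexpected(name: str, state: str, enabled: bool) -> bool:
    """Return True when a service's observed state deviates from its
    expected behaviour (illustrative policy check)."""
    policy = EXPECTATIONS.get(name, "always-running")
    if state == "failed":
        return True  # a failed state always warrants attention
    if policy == "always-running":
        return state != "running" or not enabled
    return False  # on-demand and maintenance services may be stopped
```

A check like this keeps a stopped `cups.service` from generating noise while still flagging a stopped web server.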
## Failed Units: System-Wide Monitoring
### Beyond Configured Services
The `failed_units` metric provides broader visibility than your configured service list. While the `services` array only tracks your selected services, `failed_units` counts all systemd units in a failed state across the entire system.
This metric should ideally be zero. Any non-zero value indicates that something on your system has failed and requires attention. Failed units might include:
- Services you haven't explicitly configured for monitoring
- Mount units for filesystems that failed to mount
- Timer units that encountered errors
- Socket units that failed to bind
### Early Warning System
Because `failed_units` casts a wider net, it often catches problems before they impact your monitored services. For example, a failed mount unit might not immediately affect your web server, but could cause issues when the web server tries to access files on that mount point.
Use `failed_units` as your primary alerting metric for service-related issues. Set up alerts when this value exceeds zero, then investigate using `systemctl --failed` to identify the problematic units.
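The investigation step can be sketched by parsing the listing that `systemctl list-units --failed --plain --no-legend` prints, where the first column of each row is the unit name. The sample output and unit names below are hypothetical:

```python
def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from `systemctl list-units --failed --plain
    --no-legend` style output (first whitespace-separated column)."""
    units = []
    for line in listing.splitlines():
        line = line.strip()
        if line:
            units.append(line.split()[0])
    return units

# Example output captured on a hypothetical host:
sample_output = """\
var-backups.mount loaded failed failed /var/backups
certbot.service   loaded failed failed Certbot renewal
"""
failed = parse_failed_units(sample_output)
# Two units here, so this host would report failed_units = 2.
```

In this example both a mount unit and a service unit appear, even though neither needs to be on the configured monitoring list.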
## Collection Methodology and Timing
### Why Glacial Tier?
Service monitoring operates on the Glacial tier (hourly collection) for several important reasons:
Service Stability: Well-configured services rarely change state. Checking every few seconds would waste system resources without providing meaningful additional insight.
System Overhead: The agent uses `systemctl show` commands to query service states. While lightweight, these commands have more overhead than reading from the `/proc` or `/sys` filesystems, making them unsuitable for the fast monitoring tiers.
Operational Patterns: Service failures typically persist long enough that hourly detection is sufficient for most operational needs. A service that fails and recovers within an hour often indicates intermittent issues that warrant investigation anyway.
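As an illustration of this collection approach, the key=value output of `systemctl show` can be parsed as follows. Which properties Server Scout actually queries is an assumption, and the sample output is illustrative:

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show -p ActiveState -p SubState -p UnitFileState <unit>`
    style key=value output into a dict."""
    props = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            props[key] = value
    return props

# Example output for a healthy daemon (illustrative):
example = "ActiveState=active\nSubState=running\nUnitFileState=enabled"
props = parse_show_output(example)
```

Querying a handful of named properties per unit once an hour keeps the per-collection cost negligible.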
### Complementary Monitoring
For faster detection of critical service failures, combine service monitoring with other Server Scout metrics:
- Monitor process counts for critical services
- Track network connection states for network services
- Watch for spikes in system error rates or load averages
- Use log monitoring for immediate failure detection
## Interpreting Service Patterns
### Normal Operation Patterns
Stable Services: Most production services should show consistent "running" states with occasional stops for maintenance or updates. Frequent state changes often indicate underlying problems.
Seasonal Services: Some services might legitimately show varying states based on usage patterns. Backup services might appear as "stopped" with "exited" sub-state after successful completion.
Dependency Patterns: Related services often fail together. Database connection failures might precede web application failures, creating a cascade visible in your service metrics.
### Failure Patterns and Recovery
Crash Loops: A service alternating rapidly between "running" and "failed" states indicates a crash loop. The service starts, encounters an immediate problem, fails, gets restarted by systemd, and fails again.
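One way to spot this pattern from hourly samples is a simple heuristic over the state history. The threshold and the history format below are illustrative assumptions:

```python
def looks_like_crash_loop(history: list[str], min_failures: int = 3) -> bool:
    """Heuristic: a unit that repeatedly re-enters the "failed" state across
    consecutive samples is likely crash-looping (illustrative threshold)."""
    failures = 0
    prev = None
    for state in history:
        if state == "failed" and prev != "failed":
            failures += 1  # count each fresh transition into "failed"
        prev = state
    return failures >= min_failures
```

A real implementation might also consult systemd's own restart accounting (e.g. the `NRestarts` property), but a state-history heuristic like this works with only the hourly samples described above.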
Dependency Failures: Services might fail if their dependencies are unavailable. Check failed services for dependency relationships using `systemctl list-dependencies`.
Resource Exhaustion: Services might fail due to resource constraints. Correlate service failures with memory, disk space, or file descriptor metrics.
### Recovery Strategies
When services show failure states:
- Immediate Assessment: Check `systemctl status` for error details
- Log Analysis: Examine service-specific logs and systemd journal entries
- Dependency Check: Verify that required services and resources are available
- Resource Verification: Ensure adequate memory, disk space, and system resources
- Configuration Validation: Check service configuration files for syntax errors
## Integration with Other Metrics
### Holistic System Health
Service metrics work best when interpreted alongside other Server Scout metrics:
Process Metrics: Cross-reference service states with process counts. A "running" service with zero processes might indicate a problem with process tracking or zombie processes.
Network Metrics: For network services, correlate service states with connection counts and network activity. A running web server with no network connections might indicate binding or firewall issues.
System Load: Service failures often precede or follow system load spikes. Monitor load averages and CPU usage patterns around service state changes.
Memory and Disk: Resource exhaustion commonly causes service failures. Track memory usage and disk space trends to predict and prevent service problems.
### Alerting Best Practices
Design alerts that balance responsiveness with noise reduction:
- Alert immediately on `failed_units` > 0
- Create separate alert thresholds for critical vs. non-critical services
- Use service state history to detect crash loops or frequent restarts
- Combine service alerts with related system metrics for context
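These rules can be combined into a single evaluation pass. The metric field names follow this article, while the critical-service set is an illustrative assumption:

```python
# Hypothetical critical-service list; tailor this to your own stack.
CRITICAL = {"nginx.service", "mysql.service", "sshd.service"}

def evaluate_alerts(metrics: dict) -> list[str]:
    """Apply the alerting rules above to one metrics sample and
    return human-readable alert messages (illustrative sketch)."""
    alerts = []
    # Primary rule: any failed unit anywhere on the system.
    if metrics["failed_units"] > 0:
        alerts.append(f"failed_units is {metrics['failed_units']} (expected 0)")
    # Stricter rule for services designated as critical.
    for svc in metrics["services"]:
        if svc["name"] in CRITICAL and svc["state"] != "running":
            alerts.append(f"critical service {svc['name']} is {svc['state']}")
    return alerts
```

A non-critical stopped service produces no alert here, while a failed critical service triggers both the system-wide rule and the per-service rule.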
Service monitoring with Server Scout provides the foundation for maintaining system reliability, but reaches its full potential when integrated with comprehensive system monitoring and operational procedures.