Understanding Failed Systemd Units

Systemd is the backbone of modern Linux distributions, managing services, processes, and system resources. When systemd units fail, they can indicate serious problems with your server's health. Server Scout provides comprehensive monitoring for failed systemd units, helping you catch and resolve issues before they impact your users.

What Are Failed Systemd Units?

A failed systemd unit is one that tried to start or run but crashed, timed out, or exited with an error code. Most often this is a service, but other unit types such as sockets, timers, and mounts can fail too. The cause may be a configuration problem, a missing dependency, a resource constraint, or an application bug. Unlike stopped services (which are intentionally inactive), failed units indicate that something has gone wrong and requires attention.

Common causes of failed units include:

  • Misconfigured service files
  • Missing executable files or dependencies
  • Permission issues
  • Resource exhaustion (memory, disk space)
  • Network connectivity problems

Enabling Systemd Monitoring in Server Scout

To monitor failed systemd units, enable the systemd_failed metric in your Server Scout configuration:

sudo nano /opt/serverscout/scout.conf

Add or uncomment the following line:

systemd_failed=1

Restart the Server Scout agent to apply the changes:

sudo systemctl restart serverscout

The agent will now count units in the "failed" state every hour as part of the glacial monitoring tier. A unit generally stays in the failed state until it is restarted or reset, so an hourly check still catches it without constant polling. Keep in mind that a new failure can take up to an hour to show up in the dashboard.
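
If you manage many servers, the configuration change described above can be scripted. The following is a minimal sketch rather than an official Server Scout tool: the function name enable_systemd_failed is hypothetical, and it assumes scout.conf uses plain key=value lines with '#' marking a commented-out setting.

```shell
#!/bin/sh
# Hypothetical helper: ensure systemd_failed=1 is set in a scout.conf-style file.
# Assumption: one key=value setting per line, '#' comments out a setting.
enable_systemd_failed() {
    conf="$1"
    if grep -Eq '^[[:space:]]*systemd_failed=' "$conf"; then
        # Setting is already active: force its value to 1.
        sed -i.bak -E 's/^[[:space:]]*systemd_failed=.*/systemd_failed=1/' "$conf"
    elif grep -Eq '^[[:space:]]*#[[:space:]]*systemd_failed=' "$conf"; then
        # Setting exists but is commented out: uncomment it and set it to 1.
        sed -i.bak -E 's/^[[:space:]]*#[[:space:]]*systemd_failed=.*/systemd_failed=1/' "$conf"
    else
        # Setting is absent: append it.
        printf 'systemd_failed=1\n' >> "$conf"
    fi
}

# Example (run as root, then restart the agent as described above):
#   enable_systemd_failed /opt/serverscout/scout.conf
```

The function is safe to run repeatedly: a second run leaves a single systemd_failed=1 line in place. sed keeps a backup copy of the previous file with a .bak suffix.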

Viewing Failed Units in Server Scout

Once enabled, you can monitor failed systemd units through the Server Scout dashboard:

Server Detail Page

Navigate to your server's detail page and locate the System panel. Here you'll find the failed unit count alongside other system metrics. This gives you an at-a-glance view of systemd health across all your monitored servers.

Services List

The services section provides detailed information about individual systemd units, showing:

  • Status: Active, failed, inactive, or other states
  • Enabled/Disabled: Whether the service starts automatically at boot
  • Unit type: Service, socket, timer, etc.

This granular view helps you quickly identify which specific services are experiencing problems.

Health Summary Alerts

Server Scout's health summary automatically flags servers when more than 10 systemd units are in a failed state. This threshold indicates a potentially serious system-wide issue that requires immediate investigation.

Investigating Failed Units

When Server Scout identifies failed systemd units, use these commands to investigate:

List All Failed Units

systemctl list-units --failed

This command shows all currently failed units with their load state and active status.
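
To reproduce a failed-unit count by hand, the same listing can be piped through a small filter. This is a sketch under assumptions: the function names are hypothetical, and the count may not match the agent's figure exactly, since the article does not describe how the agent counts. The --no-legend and --plain options strip the header, footer, and bullet markers so that each line starts with the unit name.

```shell
#!/bin/sh
# Hypothetical helpers that read `systemctl list-units` output on stdin.

# Print the number of non-empty lines, i.e. the number of listed units.
count_failed_units() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Print only the unit names (first column).
failed_unit_names() {
    awk 'NF { print $1 }'
}

# Example:
#   systemctl list-units --failed --no-legend --plain | count_failed_units
#   systemctl list-units --failed --no-legend --plain | failed_unit_names
```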

Examine Specific Unit Status

systemctl status <unit-name>

Replace <unit-name> with the specific service name to see detailed status information, including recent log entries and the reason for failure.
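
For scripting, systemctl show prints the same information as machine-readable KEY=VALUE lines. The sketch below pulls out two properties that service units expose, Result and ExecMainStatus. The function name summarize_failure is hypothetical, and other unit types may not report both properties, in which case "unknown" is printed.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl show` KEY=VALUE lines on stdin and
# print a one-line summary of why the unit failed.
summarize_failure() {
    result=""
    status=""
    while IFS='=' read -r key value; do
        case "$key" in
            Result) result="$value" ;;          # e.g. exit-code, signal, timeout
            ExecMainStatus) status="$value" ;;  # exit status of the main process
        esac
    done
    printf 'result=%s exit_status=%s\n' "${result:-unknown}" "${status:-unknown}"
}

# Example:
#   systemctl show <unit-name> -p Result -p ExecMainStatus | summarize_failure
```

An exit status of 203, for instance, is the code systemd reports when it could not execute the configured binary, which points toward the "missing executable" cause listed earlier.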

Check Service Logs

journalctl -u <unit-name>

Use journalctl to examine the full log history for a specific unit. Add -f to follow logs in real time, or --since "1 hour ago" to limit the time frame.
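
When several units have failed, it can be quicker to print the last few log lines for each one in a single pass. The following is a minimal sketch: show_recent_logs is a hypothetical function name, and the JOURNALCTL variable exists only so the command can be swapped out when trying the script on a machine without systemd.

```shell
#!/bin/sh
# Command used to read logs; override only for dry runs or testing.
JOURNALCTL="${JOURNALCTL:-journalctl}"

# Hypothetical helper: read unit names on stdin (first column of each line)
# and print the last N journal lines for each unit. N defaults to 20.
show_recent_logs() {
    lines="${1:-20}"
    while read -r unit _rest; do
        [ -n "$unit" ] || continue        # skip blank lines
        printf '=== %s ===\n' "$unit"
        # </dev/null keeps the log command from consuming the loop's input.
        $JOURNALCTL -u "$unit" -n "$lines" --no-pager </dev/null
    done
}

# Example:
#   systemctl list-units --failed --no-legend --plain | show_recent_logs 30
```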

View Recent System Logs

journalctl --since "1 hour ago" --priority=err

This shows messages from the past hour logged at error priority or more severe, across the whole journal rather than a single unit, which helps identify patterns or related failures.

Setting Up Alerts

Configure alerts in Server Scout to notify you when failed systemd units exceed your defined threshold:

  1. Navigate to the Alerts section in your Server Scout dashboard
  2. Create a new alert rule for the "Systemd Failed Units" metric
  3. Set your desired threshold (consider starting with 1-3 failed units for critical servers)
  4. Configure notification channels (email, Slack, webhooks)
  5. Define alert frequency to avoid spam during extended outages

For production servers, consider setting a low threshold (1-2 failed units) with immediate notifications. Development or staging environments might tolerate higher thresholds.
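
The way a threshold interacts with the built-in health summary limit can be sketched as a small classifier. This is an illustration only, not Server Scout's implementation: the function name and the labels are hypothetical, and it assumes an alert fires once the count reaches the threshold, which may differ from how Server Scout compares the values.

```shell
#!/bin/sh
# From the article: the health summary flags a server when MORE than 10
# units are in the failed state.
HEALTH_SUMMARY_LIMIT=10

# Hypothetical helper: classify a failed-unit count against a chosen threshold.
# Assumption: an alert fires when count >= threshold.
classify_failed_count() {
    count="$1"
    threshold="$2"
    if [ "$count" -gt "$HEALTH_SUMMARY_LIMIT" ]; then
        echo "critical"
    elif [ "$count" -ge "$threshold" ]; then
        echo "alert"
    else
        echo "ok"
    fi
}

# Example: a production server with a threshold of 1
#   classify_failed_count 0 1    # ok
#   classify_failed_count 2 1    # alert
#   classify_failed_count 11 1   # critical
```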

Best Practices

  • Review failed units promptly; they often indicate underlying system problems
  • Investigate patterns in failures across multiple servers
  • Document solutions for recurring issues to speed future resolution
  • Consider setting different alert thresholds based on server criticality
  • Regularly audit your systemd services to remove unnecessary units

By monitoring failed systemd units with Server Scout, you'll maintain better visibility into your server health and catch problems before they escalate into service disruptions.

Frequently Asked Questions

How do I enable systemd monitoring in Server Scout?

To enable systemd monitoring, add 'systemd_failed=1' to your /opt/serverscout/scout.conf file, then restart the Server Scout agent with 'sudo systemctl restart serverscout'. The agent will then count failed units every hour as part of the glacial monitoring tier.

What causes systemd units to fail?

Common causes include misconfigured service files, missing executable files or dependencies, permission issues, resource exhaustion like memory or disk space problems, and network connectivity issues. Failed units indicate something has gone wrong and requires attention, unlike stopped services which are intentionally inactive.

How do I troubleshoot failed systemd units?

Use 'systemctl list-units --failed' to see all failed units, 'systemctl status <unit-name>' for specific unit details, and 'journalctl -u <unit-name>' to examine service logs. You can also check recent system-wide errors with 'journalctl --since "1 hour ago" --priority=err'.

When does Server Scout alert for failed systemd units?

Server Scout's health summary automatically flags servers when more than 10 systemd units are in a failed state, indicating a potentially serious system-wide issue. You can also configure custom alert rules with lower thresholds, such as 1-3 failed units for critical servers.

How often does Server Scout check for failed systemd units?

Server Scout checks for failed systemd units every hour as part of the glacial monitoring tier. A unit generally stays in the failed state until it is restarted or reset, so an hourly check still catches it without constant polling, although a new failure can take up to an hour to appear.

Where can I view failed systemd units in Server Scout?

You can view failed systemd units in the System panel on your server's detail page for an overview, or in the Services section for detailed information about individual units including their status, enabled/disabled state, and unit type.

What threshold should I set for systemd unit alerts?

For production servers, consider setting a low threshold of 1-2 failed units with immediate notifications. Development or staging environments might tolerate higher thresholds. Server Scout's health summary automatically flags servers with more than 10 failed units as having a potentially serious system-wide issue.
