Understanding Metric Collection Tiers

Why Server Scout Uses Tiered Collection

The Server Scout agent employs a sophisticated 5-tier data collection system that many customers initially find surprising. Why does CPU usage update every 5 seconds whilst package updates appear only once daily? The answer lies in the fundamental design philosophy: near-zero footprint monitoring that balances detection speed against resource cost.

Understanding these tiers will help you interpret your dashboard data more effectively and appreciate why certain metrics appear to update at different intervals.

The Resource Cost Problem

A naive monitoring approach would collect every possible metric every 5 seconds. This seems logical—more frequent data means better visibility, right? In practice, this creates substantial overhead:

  • Dozens of command forks per collection cycle
  • Measurable CPU impact from constant subprocess creation
  • Unnecessary strain on system resources
  • Identical data points for metrics that rarely change

Server Scout takes a different approach. The agent is a pure Bash script that reads primarily from /proc and /sys virtual filesystems—kernel-served data with zero disk I/O. Only when necessary does it fork external commands, and then only at intervals appropriate to each metric's rate of change.
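To illustrate the zero-fork style (this is a minimal sketch, not the actual Server Scout source), Bash builtins alone can read kernel-served files without creating a single subprocess:

```shell
#!/usr/bin/env bash
# Sketch: reading kernel-served data with zero command forks,
# using only the Bash `read` builtin and redirection.

# Load averages: three space-separated fields at the start of /proc/loadavg.
read -r load_1m load_5m load_15m _ < /proc/loadavg

# Memory: pick fields out of /proc/meminfo without forking grep or awk.
while read -r key value _; do
    case "$key" in
        MemTotal:)     mem_total_kb=$value ;;
        MemAvailable:) mem_available_kb=$value ;;
    esac
done < /proc/meminfo

echo "load_1m=$load_1m mem_total_kb=$mem_total_kb mem_available_kb=$mem_available_kb"
```

Because `read` and `case` are shell builtins and `/proc` files are served from kernel memory, this pattern touches no disk and forks no processes.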

The Five Collection Tiers

Fast Tier: Every 5 Seconds

The Fast Tier captures the most volatile and critical metrics—those that change rapidly and require immediate alerting capabilities.

What's collected:

Metric Category | Metrics | Source
CPU Usage | cpu_percent, cpu_user, cpu_system, cpu_iowait, cpu_steal, cpu_nice, cpu_irq, cpu_softirq | /proc/stat
CPU Information | cpu_cores, cpu_model, cpu_temp | /proc/cpuinfo, /sys/class/thermal
Memory Core | mem_percent, mem_used_gb, mem_total_gb, mem_available_mb, mem_cached_mb, mem_buffers_mb | /proc/meminfo
Memory Detail | mem_swap_used_mb, mem_swap_total_mb, mem_dirty_mb, mem_shmem_mb, mem_slab_reclaimable_mb | /proc/meminfo

Why these metrics are Fast Tier:

  • CPU utilisation can spike from 5% to 95% within seconds during traffic bursts or batch jobs
  • Memory pressure can escalate rapidly, especially in containerised environments
  • These metrics are essential for real-time alerting on performance issues

Resource cost: ~50-100ms CPU per 5-second cycle. All data comes from reading two virtual files (/proc/stat and /proc/meminfo) served directly by the kernel—no disk I/O, no command forks.
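A CPU-percent figure of this kind can be derived from two snapshots of `/proc/stat`. The sketch below assumes the standard kernel field order (user, nice, system, idle, iowait, irq, softirq, steal) and is an illustration, not the agent's actual code:

```shell
#!/usr/bin/env bash
# Sketch: CPU utilisation from deltas between two /proc/stat snapshots.
# Variable names are illustrative, not the agent's own.

read_cpu() {
    # First line: "cpu  user nice system idle iowait irq softirq steal ..."
    read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
    busy=$((user + nice + system + irq + softirq + steal))
    total=$((busy + idle + iowait))
}

read_cpu; busy1=$busy; total1=$total
sleep 1          # the agent would use its 5-second cycle here
read_cpu; busy2=$busy; total2=$total

cpu_percent=$(( (busy2 - busy1) * 100 / (total2 - total1) ))
echo "cpu_percent=$cpu_percent"
```

The delta approach matters: the raw `/proc/stat` counters are cumulative since boot, so a single read says nothing about current load.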

Medium Tier: Every 30 Seconds

The Medium Tier covers metrics that change frequently but don't require 5-second granularity for effective monitoring.

What's collected:

Metric Category | Metrics | Source
Network I/O | net_rx_bytes, net_tx_bytes, net_rx_errors, net_tx_errors, net_rx_dropped, net_tx_dropped | /proc/net/dev
Network Identity | net_interface, net_ip, net_mac | /proc/net/dev, system interfaces
Disk I/O | disk_io_read_bytes, disk_io_write_bytes | /proc/diskstats
Virtual Memory | page_faults, page_faults_major, swap_in_pages, swap_out_pages | /proc/vmstat
TCP Connections | tcp_connections, tcp_established, tcp_time_wait, tcp_close_wait, tcp_listen | /proc/net/tcp, /proc/net/tcp6
System Activity | context_switches, open_fds, oom_kills, entropy | Various /proc files

Why these metrics are Medium Tier:

  • Network and disk I/O counters are cumulative—30-second intervals provide sufficient granularity for rate calculations
  • TCP connection states change relatively frequently but don't require instant detection
  • Page faults and context switches trend over minutes rather than seconds

Resource cost: ~10-50ms CPU per 30-second cycle. Still purely virtual filesystem reads with no external commands.
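Reading the network counters follows the same zero-fork pattern. The sketch below (not the agent's actual code) parses `/proc/net/dev`, whose first two lines are column headers and whose interface names are terminated by a colon:

```shell
#!/usr/bin/env bash
# Sketch: cumulative RX/TX byte counters from /proc/net/dev, no forks.

lines=0
{
    read -r _ && read -r _      # skip the two header lines
    # Per-interface format: iface: rx_bytes packets errs drop fifo
    #                       frame compressed multicast tx_bytes ...
    while IFS=' :' read -r iface rx_bytes _ _ _ _ _ _ _ tx_bytes _; do
        echo "$iface rx_bytes=$rx_bytes tx_bytes=$tx_bytes"
        lines=$((lines + 1))
    done
} < /proc/net/dev
```

These values are cumulative since boot; the dashboard turns them into per-second rates by differencing consecutive samples, as described under "Understanding Counter Metrics" below.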

Slow Tier: Every 5 Minutes

The Slow Tier handles metrics that change gradually or represent inherently averaged data.

What's collected:

Metric Category | Metrics | Source
System Load | load_1m, load_5m, load_15m | /proc/loadavg
Process Counts | processes_running, processes_blocked, processes_zombie, processes_total | /proc/stat
Disk Usage | disk_percent, disk_used_gb, disk_total_gb | df command
Mount Details | disk_mounts array with mount points, devices, filesystems, usage | df command, /proc/mounts

Why these metrics are Slow Tier:

  • Load averages are kernel-calculated averages over 1, 5, and 15 minutes—collecting them every 5 seconds adds no information
  • Disk space changes gradually; 5-minute intervals catch storage issues well before they become critical
  • Process counts typically trend over minutes

Resource cost: ~100-200ms CPU per 5-minute cycle. This tier requires forking the df command but only once per cycle.
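The single `df` fork per cycle can be parsed entirely with Bash builtins. This sketch (illustrative, not the agent's source) uses `df -Pk` because the POSIX `-P` flag guarantees one line per filesystem, which keeps the field positions stable:

```shell
#!/usr/bin/env bash
# Sketch: one df fork per slow-tier cycle, output parsed in Bash.

root_seen=""
while read -r device _ _ _ pct mount; do
    [ "$device" = "Filesystem" ] && continue   # skip the header row
    echo "mount=$mount device=$device disk_percent=${pct%\%}"
    [ "$mount" = "/" ] && root_seen=yes
done < <(df -Pk)
```

Mount points containing spaces would need extra care in real code; the sketch relies on the mount point being the final field.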

Glacial Tier: Every Hour

The Glacial Tier covers metrics that rarely change but have high collection overhead.

What's collected:

Metric Category | Metrics | Source
Services | services array, services_running, services_total, failed_units | systemctl commands
Time Sync | ntp_synced | timedatectl or NTP status
Updates | package_updates, reboot_required | apt/dnf/zypper commands

Why these metrics are Glacial Tier:

  • Service states typically change only during deployments or maintenance
  • Package updates are discovered weekly or monthly
  • These checks require multiple external command forks with non-trivial overhead
  • Checking service status every 5 seconds would consume significant CPU for metrics that change perhaps once per month

Resource cost: ~500ms-2s CPU per hour. Multiple systemctl forks plus package manager queries.

Daily Tier: Every 24 Hours

The Daily Tier captures essentially static system information.

What's collected:

Metric Category | Metrics | Source
System Identity | os, kernel, arch, virtualization, hostname, agent_version, device_type | Various system commands
Security Status | selinux_status, firewall_status | getenforce, firewall status commands

Why these metrics are Daily Tier:

  • OS version and kernel change only during major updates
  • Hostname and architecture are effectively static
  • Security configurations change infrequently
  • These checks involve multiple command forks, a cost that is acceptable only at daily intervals

Resource cost: ~200ms-1s CPU per day. Several forks for system detection commands.

Understanding Counter Metrics

Several metrics (net_rx_bytes, net_tx_bytes, disk_io_read_bytes, disk_io_write_bytes, context_switches) are cumulative counters. The agent reports the raw cumulative values, but the dashboard calculates and displays rates per second by computing deltas between consecutive data points.

For example:

  • Agent reports network RX bytes: 1,000,000 then 1,010,000 (30 seconds later)
  • Dashboard calculates: (1,010,000 - 1,000,000) ÷ 30 = 333 bytes/second
  • You see the rate, not the cumulative counter

This approach is more accurate than attempting rate calculations within the agent and aligns with industry-standard monitoring practices.
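The worked example above reduces to a simple delta calculation. The sketch below also guards against a counter reset (for example after a reboot), a case any rate calculation must handle:

```shell
#!/usr/bin/env bash
# The dashboard-side rate calculation from the example above (sketch):
# rate = (current_counter - previous_counter) / interval_seconds

prev=1000000
curr=1010000
interval=30

if [ "$curr" -lt "$prev" ]; then
    rate=""    # counter reset (e.g. reboot): skip this sample
else
    rate=$(( (curr - prev) / interval ))
fi
echo "net_rx rate: ${rate} bytes/second"   # → net_rx rate: 333 bytes/second
```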

Data Retention and Downsampling

Server Scout stores and displays data at different granularities depending on the time range:

Time Range | Data Points | Granularity | Source
1 hour | ~720 points | Raw 5-second data | Direct from Fast/Medium tiers
6 hours | ~720 points | 30-second averages | Downsampled
24 hours | ~720 points | 2-minute averages | Downsampled
7 days | ~672 points | 15-minute averages | Downsampled

Raw 5-second data is retained for 24 hours, then automatically pruned. Averaged data provides historical context whilst maintaining reasonable storage requirements and dashboard performance.

Handling Network Outages

The agent includes sophisticated data spooling to ensure no metrics are lost during connectivity issues:

  • When unable to reach the dashboard, payloads are stored locally in /opt/scout-agent/spool/
  • Up to 720 spool files are retained (approximately 1 hour of Fast Tier data)
  • When connectivity returns, spooled data is automatically replayed with historical timestamps
  • The dashboard processes replayed data to fill gaps in your charts

This ensures continuous monitoring even during network outages or dashboard maintenance.
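The spool-and-replay pattern can be sketched as follows. The spool directory and 720-file cap come from the article; the function names and the `send_payload` stand-in are hypothetical, not the agent's real code:

```shell
#!/usr/bin/env bash
# Sketch of spool-and-replay. SPOOL_DIR and MAX_SPOOL match the article;
# everything else is illustrative.

SPOOL_DIR=${SPOOL_DIR:-/opt/scout-agent/spool}
MAX_SPOOL=720

send_payload() { :; }   # stand-in for the real upload; always succeeds here

spool_payload() {
    mkdir -p "$SPOOL_DIR"
    # Nanosecond timestamp filenames preserve ordering for replay.
    printf '%s\n' "$1" > "$SPOOL_DIR/$(date +%s%N).json"
    # Enforce the file cap by removing the oldest files beyond it.
    ls -1t "$SPOOL_DIR" | tail -n +"$((MAX_SPOOL + 1))" | while read -r old; do
        rm -f "$SPOOL_DIR/$old"
    done
}

replay_spool() {
    for f in "$SPOOL_DIR"/*.json; do
        [ -e "$f" ] || break                    # glob matched nothing
        send_payload "$(cat "$f")" && rm -f "$f"
    done
}
```

Deleting a spool file only after a successful send is what guarantees no data loss: a payload that fails to upload simply stays queued for the next replay attempt.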

Total Resource Footprint

The tiered approach achieves remarkable efficiency:

  • Memory usage: <3 MB RSS
  • CPU usage: <100ms total per 5-second cycle (average <0.1% on modern hardware)
  • Disk I/O: Virtually zero (except brief spool writes during outages)
  • Network traffic: ~2-5 KB per payload, compressed

Compare this to traditional monitoring agents that often consume 50-100 MB RAM and measurable CPU even when idle.

Force Collection for Troubleshooting

You can trigger all collection tiers immediately using:

/opt/scout-agent/scout-agent.sh --refresh

This is useful:

  • After configuration changes (new services, mount points)
  • When troubleshooting specific metrics
  • To verify the agent can collect all metric types

The --refresh flag bypasses normal timing intervals and executes all five tiers in sequence.

Practical Implications

Understanding the tier system helps you interpret your dashboard effectively:

  • Immediate issues (CPU spikes, memory exhaustion) appear within 5 seconds
  • Performance trends (network throughput, disk I/O) update every 30 seconds
  • Capacity planning (disk space, load averages) updates every 5 minutes
  • Configuration changes (new services, updates) appear within an hour
  • System changes (kernel updates, hostname changes) appear daily

This tiered approach ensures you get rapid alerting on critical issues whilst maintaining the lightest possible footprint on your servers. The agent intelligently matches collection frequency to each metric's characteristics—volatile metrics get frequent attention, stable metrics get occasional checks.

The result is comprehensive monitoring that's virtually invisible to your server's performance, proving that effective monitoring doesn't require heavy resource consumption.

Frequently Asked Questions

What are Server Scout metric collection tiers?

Server Scout uses a 5-tier collection schedule to balance monitoring granularity with resource efficiency. Fast (5 seconds) collects CPU and memory from /proc. Medium (30 seconds) covers network, TCP, and VMstat. Slow (5 minutes) handles load, processes, and disk. Glacial (1 hour) checks services. Daily (24 hours) gathers system identity. Each tier is optimised for how frequently that data typically changes.

Why does Server Scout use different collection intervals?

Different metrics change at different rates and have different overhead costs. CPU and memory can fluctuate rapidly and are cheap to read from /proc, so they are collected every 5 seconds. Service states rarely change and the systemd query is heavier, so hourly collection is appropriate. System identity (OS, kernel) changes only on upgrades, so daily collection is sufficient. This tiered approach keeps the agent lightweight.

How does the collection tier affect dashboard time ranges?

The dashboard shows data at different resolutions depending on the time range: 1 hour shows raw data points, 6 hours shows 30-second averages, 24 hours shows 2-minute averages, and 7 days shows 15-minute averages. Metrics collected less frequently than the display resolution appear as individual points. For example, hourly service data shows as one point per hour even in the 7-day view.

What is the performance impact of the Server Scout agent?

The agent uses less than 3 MB of RAM and under 100ms of CPU time per 5-second collection cycle. The fast tier reads only from /proc virtual filesystems, requiring no disk I/O. This near-zero footprint means the agent does not measurably affect the performance of monitored servers, even on small instances. The tiered collection ensures heavier operations run infrequently.