Disk Metrics Explained

Understanding Your Server's Storage Performance

Disk metrics are among the most critical indicators of server health and performance. Unlike CPU and memory, which can recover quickly from temporary spikes, disk problems tend to be persistent and can bring an entire system to a halt. Server Scout monitors both disk usage and I/O performance across multiple tiers to give you comprehensive visibility into your storage subsystem.

Storage is typically the slowest component in your server's architecture. RAM access is measured in nanoseconds and a CPU clock cycle takes a fraction of a nanosecond. Disk operations take tens to hundreds of microseconds on SSDs and several milliseconds on spinning disks, which is orders of magnitude slower. This makes disk one of the most common bottlenecks in server performance, which is why Server Scout tracks disk metrics so closely.

Disk Usage Metrics (Slow Tier - Every 5 Minutes)

Server Scout collects disk usage statistics every 5 minutes from all mounted filesystems, filtering out pseudo-filesystems like tmpfs, proc, and sysfs to focus on real storage devices.

Primary Disk Metrics

Metric	Description	Source
`disk_percent`	Usage percentage of the primary mount point (/)	`df` command output
`disk_used_gb`	Disk space used on the primary mount in gigabytes	`df` command output
`disk_total_gb`	Total disk capacity of the primary mount in gigabytes	`df` command output

The disk_percent metric is your primary indicator of storage capacity. A common rule of thumb is 80/90: start investigating when usage exceeds 80%, and take urgent action when it reaches 90%. Memory pressure can be relieved by freeing cache, and CPU spikes often resolve themselves. Used disk space, by contrast, only shrinks when something deletes data, whether that is an administrator, a cleanup job, or log rotation.
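As a rough illustration of where these three numbers come from, the sketch below computes df-style figures for the root filesystem with Python's os.statvfs. It assumes "GB" means binary gigabytes (GiB) and that the percentage follows df's convention of used / (used + available). Server Scout's agent may round or define these values differently, and the function name is hypothetical.

```python
import os

GIB = 1024 ** 3  # assumption: "GB" here means binary gigabytes (GiB)


def root_disk_usage(path: str = "/") -> dict:
    """Compute df-style usage figures for the filesystem containing `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize                  # full size, including root-reserved blocks
    used = (st.f_blocks - st.f_bfree) * st.f_frsize    # blocks actually in use
    avail = st.f_bavail * st.f_frsize                  # space an unprivileged user can still write
    # df computes Use% against used + available, so root-reserved blocks count as
    # "full" for regular users. (df also rounds this up to a whole percent.)
    denom = used + avail
    percent = 100.0 * used / denom if denom else 0.0
    return {
        "disk_percent": round(percent, 1),
        "disk_used_gb": round(used / GIB, 2),
        "disk_total_gb": round(total / GIB, 2),
    }


if __name__ == "__main__":
    print(root_disk_usage("/"))
```

Because the denominator excludes root-reserved space, the percentage can read slightly higher than used / total, which is the behaviour administrators usually expect from df.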

Comprehensive Mount Point Monitoring

Server Scout doesn't just monitor your root filesystem. The disk_mounts array provides detailed information about every mounted storage device on your system:

Field	Description	Purpose
`mount`	Mount path (e.g., `/`, `/home`, `/var`)	Identifies the filesystem location
`device`	Block device (e.g., `/dev/sda1`, `/dev/nvme0n1p1`)	Shows the underlying hardware
`fs`	Filesystem type (ext4, xfs, btrfs, etc.)	Indicates filesystem capabilities and limitations
`total_gb`	Total capacity in gigabytes	Baseline for calculating usage
`used_gb`	Used space in gigabytes	Actual consumption
`percent`	Usage percentage	Quick visual indicator
`inodes_percent`	Inode usage percentage	Tracks filesystem metadata usage
`ro`	Read-only flag	Critical error indicator

The Hidden Threat: Inode Exhaustion

The inodes_percent metric tracks something many administrators overlook until it causes problems. Inodes are filesystem metadata structures that track individual files and directories. Every file and directory consumes one inode regardless of its size; hard links to the same file share a single inode.
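An inode percentage can be derived from the counters that os.statvfs exposes: f_files is the total number of inodes and f_ffree the number still free. The sketch below assumes one-decimal rounding, which Server Scout may not use, and the helper name is hypothetical.

```python
import os
from typing import Optional


def inodes_percent(path: str = "/") -> Optional[float]:
    """Share of inodes in use on the filesystem containing `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems (btrfs, for example) allocate inodes on demand and
        # report a total of 0, so a percentage is meaningless there.
        return None
    used = st.f_files - st.f_ffree
    return round(100.0 * used / st.f_files, 1)
```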

You can have gigabytes of free disk space but still be unable to create new files if you've exhausted your inodes. This commonly occurs on:

Mail servers with millions of small email files
Log directories with numerous small log files
Development environments with extensive dependency trees (looking at you, node_modules)
Backup systems storing many incremental snapshots

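When inodes_percent reaches 100%, you'll see "No space left on device" errors even though free disk space remains. Freeing bytes won't help on its own, because deleting one large file releases only one inode. The fix is to reduce the number of files, for example by deleting or archiving large collections of small ones.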
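To find where the inodes went, count entries under each top-level directory. This sketch walks a tree with os.walk, stays on one filesystem (inodes are per-filesystem), and counts files, directories, and symlinks as an approximation of inode use. Hard links are counted once per name, so it can overstate slightly. The helper names are hypothetical.

```python
import os
from collections import Counter


def _on_device(path: str, dev: int) -> bool:
    """True if `path` (not following symlinks) lives on device `dev`."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False  # vanished or unreadable: skip it


def files_per_subdir(root: str, top: int = 10) -> list[tuple[str, int]]:
    """Approximate inode use under each immediate child of `root`, largest first."""
    counts: Counter = Counter()
    root_dev = os.stat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories on other filesystems: inodes are counted per filesystem.
        dirnames[:] = [d for d in dirnames if _on_device(os.path.join(dirpath, d), root_dev)]
        rel = os.path.relpath(dirpath, root)
        # Entries directly in `root` are charged to `root`; deeper ones to their top-level child.
        bucket = root if rel == "." else os.path.join(root, rel.split(os.sep)[0])
        # One inode per name is an approximation: hard links share an inode.
        counts[bucket] += len(filenames) + len(dirnames)
    return counts.most_common(top)
```

Running `files_per_subdir("/var")` is a reasonable first step on a server whose /var filesystem is short of inodes. Expect it to take a while on trees with millions of files.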

Read-Only Remount: The Canary in the Coal Mine

The ro flag in the mount information serves as an early warning system for serious filesystem problems. When the Linux kernel detects filesystem corruption, hardware errors, or other critical issues, it often remounts the filesystem as read-only to prevent further damage.

A filesystem that has been remounted read-only indicates:

Potential disk hardware failure
Filesystem corruption
I/O errors or timeouts
Power loss during critical write operations

This requires immediate attention, as the system can no longer write to the affected filesystem.
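Some mounts are read-only by design (snap packages and ISO images, for instance), so an alert is most useful when a mount flips from read-write to read-only between two collections. The sketch below compares two samples, assuming the mount state is taken from the options column of /proc/mounts; both helper names are hypothetical.

```python
def read_only_map(mounts_text: str) -> dict[str, bool]:
    """Map each mount point in /proc/mounts content to True if it is mounted read-only."""
    state = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            state[fields[1]] = "ro" in fields[3].split(",")
    return state


def newly_read_only(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Mount points that were read-write in the previous sample and are read-only now."""
    return sorted(m for m, ro in current.items() if ro and previous.get(m) is False)
```

Alerting on the transition rather than on the flag itself keeps intentionally read-only mounts quiet while still catching a kernel-initiated remount.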

Disk I/O Metrics (Medium Tier - Every 30 Seconds)

While usage metrics tell you how much space you're consuming, I/O metrics reveal how hard your storage subsystem is working. Server Scout collects these every 30 seconds to capture performance patterns.

Metric	Description	Source
`disk_io_read_bytes`	Cumulative bytes read from all block devices	`/proc/diskstats` column 6 (sectors read) × 512
`disk_io_write_bytes`	Cumulative bytes written to all block devices	`/proc/diskstats` column 10 (sectors written) × 512

These are cumulative counters that continuously increase throughout system uptime. The Server Scout dashboard converts these into meaningful rates (bytes per second) by calculating the delta between consecutive measurements.
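The delta calculation can be sketched as follows. One assumption to flag: a counter that goes backwards (after a reboot, for example) has no meaningful rate for that interval, so this sketch returns None rather than a negative value. How the Server Scout dashboard actually handles resets isn't stated here, and the helper name is hypothetical.

```python
from typing import Optional


def io_rate(prev_bytes: int, curr_bytes: int, interval_s: float) -> Optional[float]:
    """Bytes per second between two samples of a cumulative counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # Counter went backwards: the host rebooted (or the counter wrapped),
        # so skip this interval instead of reporting a negative spike.
        return None
    return delta / interval_s
```

With 30-second samples, a read counter moving from 1,000,000,000 to 1,150,000,000 bytes gives `io_rate(1_000_000_000, 1_150_000_000, 30)`, which is 5,000,000 bytes per second (5 MB/s).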

Understanding I/O Patterns

Read and write patterns tell different stories about your system's behaviour:

High read I/O typically indicates:

Database query processing
File serving (web servers, file shares)
Backup or archival operations
Search indexing
Cold cache scenarios where data isn't in memory

High write I/O suggests:

Log generation
Database transactions
Temporary file creation
Swap activity (concerning if persistent)
File downloads or data ingestion

The agent parses these values directly from /proc/diskstats, converting the sector counts (512-byte units) to bytes for easier interpretation. This approach provides accurate, low-overhead monitoring without executing external commands in the collection path.
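The parsing step can be sketched as below. One assumption to flag: /proc/diskstats lists whole disks alongside their partitions (sda and sda1) and stacked devices such as device-mapper volumes, so summing every line would count the same I/O more than once. This sketch sums only names that appear in /sys/block and skips loop, RAM, device-mapper, and md devices. The agent's actual device selection isn't documented here, and the helper names are hypothetical.

```python
import os

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors, whatever the device's real sector size

# Assumption: virtual or stacked devices that would duplicate physical-disk I/O
SKIP_PREFIXES = ("loop", "ram", "zram", "dm-", "md")


def whole_disks() -> set[str]:
    """Whole-disk names from /sys/block (partitions are not listed at this level)."""
    try:
        return set(os.listdir("/sys/block"))
    except FileNotFoundError:
        return set()


def parse_diskstats(text: str, disks: set[str]) -> tuple[int, int]:
    """Sum bytes read and written across whole physical disks."""
    read_bytes = write_bytes = 0
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 10:
            continue
        name = cols[2]
        if name not in disks or name.startswith(SKIP_PREFIXES):
            continue
        read_bytes += int(cols[5]) * SECTOR_BYTES   # column 6 (1-based): sectors read
        write_bytes += int(cols[9]) * SECTOR_BYTES  # column 10 (1-based): sectors written
    return read_bytes, write_bytes
```

On a live host, `parse_diskstats(open("/proc/diskstats").read(), whole_disks())` returns the two cumulative byte counters.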

The Relationship Between Usage and I/O

Disk usage and I/O metrics are interconnected in ways that can create cascading performance problems:

Disk Space and Write Performance

When disk usage approaches 100%, write operations begin failing, but the problems start well before that point. ext2, ext3, and ext4 reserve 5% of blocks for the root user by default (adjustable with tune2fs -m), so system services can keep writing even when the disk appears "full" to regular users. Other filesystems handle this differently or not at all.
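The root reserve is visible through os.statvfs as the gap between f_bfree (all free blocks) and f_bavail (free blocks available to unprivileged users). A sketch, with a hypothetical helper name; note it reports the reserve that is still free, which shrinks once root starts consuming it.

```python
import os


def reserved_space(path: str = "/") -> dict:
    """Split free space into what regular users can use and what only root can use."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free_all = st.f_bfree * st.f_frsize    # free blocks, including the root reserve
    free_user = st.f_bavail * st.f_frsize  # free blocks available to unprivileged users
    reserved = free_all - free_user        # remaining root-only reserve
    return {
        "free_bytes": free_all,
        "user_available_bytes": free_user,
        "root_reserved_bytes": reserved,
        "reserved_pct_of_total": round(100.0 * reserved / total, 2) if total else 0.0,
    }
```

On a default ext4 filesystem with plenty of free space, reserved_pct_of_total should come out close to 5.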

As available space decreases:

Filesystem fragmentation increases
Write performance degrades
Temporary file operations fail
Log rotation may stop working
Applications may crash or behave unpredictably

I/O Wait and System Performance

Heavy disk I/O directly impacts the cpu_iowait metric you'll see in Server Scout's CPU monitoring. When processes are waiting for disk operations to complete, they're not using CPU cycles, but they're also not making progress. High iowait percentages indicate your system is spending significant time waiting for storage operations.

This creates a performance bottleneck where:

Applications become unresponsive
Load averages increase without high CPU usage (Linux counts tasks blocked in uninterruptible disk I/O towards the load average)
User experience degrades significantly
System appears to "hang" during heavy I/O
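To make the link to cpu_iowait concrete, here is the standard way to compute iowait from two snapshots of the aggregate cpu line in /proc/stat. This is a sketch under that assumption; Server Scout's own calculation may differ in detail, and the helper names are hypothetical.

```python
def cpu_times(stat_text: str) -> list[int]:
    """Return the aggregate 'cpu' line of /proc/stat as a list of jiffy counters."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before: list[int], after: list[int]) -> float:
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    # Column order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already included in user and nice, so only the
    # first eight columns go into the total.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total > 0 else 0.0
```

Sampling /proc/stat twice, a few seconds apart, and passing both snapshots through cpu_times gives the iowait share for that window.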

Practical Monitoring Thresholds

Effective disk monitoring requires different approaches for usage and I/O metrics:

Usage Thresholds

80% usage: Begin investigation and cleanup planning
90% usage: Urgent action required
95% usage: Critical—immediate intervention needed
90% inode usage: Start monitoring file creation patterns
Any read-only remount: Emergency response required
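These thresholds translate directly into a per-mount severity check. In the sketch below the severity labels are hypothetical names, and the highest matching severity wins.

```python
from typing import Optional


def usage_severity(disk_percent: float, inodes_percent: Optional[float], read_only: bool) -> str:
    """Map one mount's metrics onto the usage thresholds (highest severity wins)."""
    if read_only:
        return "emergency"   # any read-only remount
    if disk_percent >= 95:
        return "critical"    # immediate intervention
    if disk_percent >= 90:
        return "urgent"      # urgent action required
    if disk_percent >= 80 or (inodes_percent is not None and inodes_percent >= 90):
        return "warning"     # investigate / watch file creation patterns
    return "ok"
```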

I/O Rate Guidelines

I/O thresholds depend heavily on your hardware, but general patterns include:

Sustained writes >100MB/s: Monitor for application issues
Read rates consistently exceeding write rates: Typical for read-heavy workloads such as web, file, and database serving
Write rates consistently exceeding read rates: May indicate logging issues or data ingestion problems
Sudden I/O pattern changes: Investigate for new processes or system changes
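These guidelines are qualitative, so any code version has to pick definitions. The sketch below works on a window of per-interval rates in bytes per second. It makes three assumptions. "Sustained" means every sample in the window exceeds 100 MB/s, taken as 100 × 10^6 bytes per second. "Consistently" means every sample. A "sudden change" means the mean of the last few samples differs from the earlier baseline by a factor of three or more. All names and defaults are hypothetical.

```python
from statistics import mean
from typing import Sequence


def io_observations(read_rates: Sequence[float], write_rates: Sequence[float], *,
                    write_limit: float = 100e6, change_factor: float = 3.0,
                    recent: int = 4) -> list[str]:
    """Flag I/O patterns in a window of per-interval rates (bytes/s)."""
    notes: list[str] = []
    pairs = list(zip(read_rates, write_rates))
    if not pairs:
        return notes
    if all(w > write_limit for _, w in pairs):
        notes.append("sustained-high-writes")
    if all(w > r for r, w in pairs):
        notes.append("write-dominated")
    # Compare the latest `recent` samples of total I/O against the earlier baseline.
    totals = [r + w for r, w in pairs]
    if len(totals) > recent:
        baseline = mean(totals[:-recent])
        latest = mean(totals[-recent:])
        if baseline > 0:
            ratio = latest / baseline
            if ratio >= change_factor or ratio <= 1 / change_factor:
                notes.append("pattern-change")
    return notes
```

With 30-second collection, a window of 10 samples covers five minutes, and the "recent" slice of four samples covers the last two.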

Data Collection and Storage Efficiency

Server Scout's approach to disk monitoring balances comprehensiveness with efficiency:

Usage metrics are collected every 5 minutes because disk space changes gradually. More frequent collection would waste resources without providing meaningful additional insight.

I/O metrics are collected every 30 seconds because I/O patterns can change rapidly and you need sufficient granularity to identify performance problems and correlate them with application behaviour.

The agent reads directly from /proc/diskstats rather than executing commands like iostat or iotop, maintaining its lightweight footprint while providing accurate data.

Using Disk Metrics for Proactive Management

Effective disk monitoring goes beyond reactive alerting. Use these metrics proactively:

Trend analysis: Track usage growth rates to predict when intervention will be needed
Pattern recognition: Identify daily, weekly, or monthly I/O patterns to optimise maintenance windows
Capacity planning: Use historical data to size future storage requirements
Performance correlation: Compare disk I/O with application performance metrics to identify bottlenecks
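Trend analysis can start as simply as a least-squares line through recent disk_used_gb samples, extrapolated to the disk's capacity. This sketch assumes roughly linear growth, which real disks often violate (log rotation and cleanup jobs produce sawtooth patterns), so treat the result as an estimate. The helper name is hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(days: Sequence[float], used_gb: Sequence[float], total_gb: float) -> Optional[float]:
    """Fit a straight line to (day, used_gb) samples and extrapolate to total_gb."""
    n = len(days)
    if n < 2 or n != len(used_gb):
        raise ValueError("need at least two paired samples")
    mx = sum(days) / n
    my = sum(used_gb) / n
    sxx = sum((x - mx) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope, in GB per day
    slope = sum((x - mx) * (y - my) for x, y in zip(days, used_gb)) / sxx
    if slope <= 0:
        return None  # flat or shrinking usage: no projected fill date
    intercept = my - slope * mx
    day_full = (total_gb - intercept) / slope
    return max(0.0, day_full - days[-1])
```

For example, daily samples of 50, 52, 54, and 56 GB on a 100 GB disk give `days_until_full([0, 1, 2, 3], [50, 52, 54, 56], 100)`, which is 22.0 days from the last sample.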

The 5-minute collection interval for usage metrics provides sufficient granularity for trend analysis, while the 30-second I/O collection captures performance variations without overwhelming you with data.

Understanding your disk metrics helps you maintain system reliability, plan capacity upgrades, and identify performance bottlenecks before they impact users. In Server Scout's dashboard, you can view these metrics across multiple time ranges to spot both immediate issues and long-term trends, ensuring your storage subsystem remains healthy and performant.

Frequently Asked Questions

What are disk inodes and why do they matter?

Inodes are data structures that store metadata about files and directories (permissions, ownership, timestamps). On ext4 (and ext2/ext3), the number of inodes is fixed when the filesystem is created. Running out of inodes (high inodes_percent) means you cannot create new files even if disk space is available. This commonly occurs on filesystems with millions of small files, such as mail servers or cache directories.

What is the difference between disk usage and disk I/O metrics?

Disk usage metrics (disk_percent, disk_used_gb) measure storage capacity, collected every 5 minutes. Disk I/O metrics (disk_io_read_bytes, disk_io_write_bytes) measure throughput as cumulative byte counters collected every 30 seconds. The dashboard shows I/O as rates per second. High disk usage means you are running out of space; high disk I/O means the storage subsystem is busy.

How should I set disk usage alert thresholds?

Start with an alert at 80% disk usage (disk_percent) as a warning and 90% as critical. Adjust based on how quickly your disk fills and how long it takes to respond to alerts. Servers with rapid log growth or database writes may need lower thresholds like 70%. Always leave enough free space for the operating system, temporary files, and log rotation.
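One way to turn fill speed and response time into a number is to reserve enough headroom to absorb growth during your response time, plus a safety margin. This is a hypothetical rule of thumb, not a Server Scout feature; the 5% margin and the 80% ceiling are assumptions to tune.

```python
def warning_threshold(total_gb: float, growth_gb_per_hour: float, response_hours: float,
                      safety_margin_pct: float = 5.0, ceiling_pct: float = 80.0) -> float:
    """Highest usage % at which a warning still leaves time to respond before the disk fills."""
    growth_pct_per_hour = 100.0 * growth_gb_per_hour / total_gb
    headroom = growth_pct_per_hour * response_hours + safety_margin_pct
    # Never go above the usual 80% starting point, and never below 0%.
    return round(max(0.0, min(ceiling_pct, 100.0 - headroom)), 1)
```

A 500 GB disk growing 5 GB per hour (1% per hour) with a 24-hour response time needs 24% + 5% of headroom, so `warning_threshold(500, 5, 24)` returns 71.0, in line with the 70% suggested above for fast-growing servers. A slowly growing disk stays at the 80% ceiling.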

What does the disk_mounts array contain?

The disk_mounts array provides per-mount details including the mount point path, device name, filesystem type, total and used capacity in GB, usage percentage, inode usage percentage, and read-only status. This lets you monitor individual mount points separately when a server has multiple partitions, network mounts, or attached storage volumes.
