
Building SAN Path Monitoring Before Total Storage Failure: /proc/scsi Detection Scripts for Production Arrays

Server Scout

The Silent Path Loss: When SAN Redundancy Fails Without Warning

Last month, a hosting provider running 40 production servers lost their primary SAN controller at 2 AM. Standard monitoring showed normal disk performance right until the moment every virtual machine froze simultaneously. The problem: their storage array had been running on a single path for three weeks without anyone noticing.

This scenario plays out more frequently than most sysadmins realise. Modern SAN arrays excel at maintaining performance even when half their connectivity disappears, but your OS-level monitoring tools remain blissfully unaware of the growing single point of failure.

The solution lies in parsing /proc/scsi/scsi output and device-mapper statistics that reveal path health long before iostat notices any performance degradation.

Parsing /proc/scsi/scsi for Path State Detection

While multipath -ll provides human-readable output, /proc/scsi/scsi contains the raw device states that matter for automated monitoring:

#!/bin/bash
# Walk /proc/scsi/scsi and report every SCSI host the kernel has registered.
# Reading the file directly avoids the extra cat/grep and keeps the loop in
# the current shell.
while read -r line; do
    if [[ $line =~ Host:[[:space:]]*scsi([0-9]+) ]]; then
        host_id="${BASH_REMATCH[1]}"
        echo "Checking SCSI host $host_id path health"
    fi
done < /proc/scsi/scsi

This approach catches vendor-specific error codes that don't surface in standard I/O statistics. EMC, NetApp, and HPE arrays each report path degradation differently, but all write status changes to the SCSI subsystem first.

Device-Mapper Statistics That Reveal Hidden Failures

The /sys/block/dm-*/ hierarchy exposes device-mapper UUIDs (under dm/) and queue settings (under queue/) that change during path transitions. Most monitoring systems ignore these metrics entirely:

# Report queue depth for every dm-multipath device on the system.
for dm_device in /sys/block/dm-*; do
    if [[ -f "$dm_device/dm/uuid" ]]; then
        uuid=$(cat "$dm_device/dm/uuid")
        # dm-multipath devices carry an "mpath-" prefix in their UUID
        if [[ $uuid =~ mpath- ]]; then
            queue_depth=$(cat "$dm_device/queue/nr_requests")
            echo "Multipath device ${dm_device##*/}: queue depth $queue_depth"
        fi
    fi
done
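Queue settings only tell part of the story; the kernel's own per-path state is visible through dmsetup status. For multipath targets the status line includes one state letter per path (A for active, F for failed), though the exact field layout varies by kernel version. The sketch below therefore just counts standalone F tokens rather than parsing field positions:

```shell
#!/bin/bash
# Sketch: count failed paths as reported by device-mapper itself.
# The token-counting approach is a deliberate simplification; it avoids
# depending on the exact multipath status field layout.
count_failed_paths() {
    # stdin: `dmsetup status` output, one line per device
    tr -s ' ' '\n' | grep -cx 'F' || true
}

if command -v dmsetup >/dev/null 2>&1; then
    failed=$(dmsetup status --target multipath 2>/dev/null | count_failed_paths)
    if [ "${failed:-0}" -gt 0 ]; then
        echo "ALERT: device-mapper reports $failed failed path(s)"
    fi
fi
```

Because count_failed_paths reads stdin, the parsing logic can be exercised offline against captured status lines before it ever touches a production array.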

Building Automated Detection Scripts

Effective multipath monitoring requires parsing output from multiple sources because no single command provides the complete picture.

Analysing multipath -ll Output for Path State Changes

The multipath -ll command shows current path priorities, but the key insight is tracking state transitions over time. Active paths switching to standby mode often indicate controller issues hours before complete failure.

A practical monitoring script checks for paths marked as "failed" or "faulty" while also tracking priority changes. Many SAN arrays demote path priorities as a precursor to complete path loss.
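A minimal sketch of that idea, assuming bad paths show up as "failed" or "faulty" tokens in multipath -ll output (the state file location is an arbitrary choice, not a multipath convention):

```shell
#!/bin/bash
# Sketch: count failed/faulty paths and alert on *transitions* by
# comparing against the previous run's count.
STATE_FILE="${STATE_FILE:-/var/tmp/mpath_bad_count}"   # arbitrary location

count_bad_paths() {
    # stdin: `multipath -ll` output; count lines marking a bad path
    grep -Ec 'failed|faulty' || true
}

if command -v multipath >/dev/null 2>&1; then
    current=$(multipath -ll 2>/dev/null | count_bad_paths)
    previous=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
    echo "$current" > "$STATE_FILE"
    if [ "$current" -gt "$previous" ]; then
        echo "ALERT: failed/faulty path count rose from $previous to $current"
    fi
fi
```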

Setting Up Alerting Thresholds for Path Degradation

Unlike CPU or memory monitoring, storage path alerts should trigger on any reduction in available paths. There's no "warning threshold" for redundancy loss - either you have multiple paths or you don't.

The most effective approach involves tracking the total number of active paths per LUN and alerting immediately when that count drops. A hosting provider with dual-controller arrays should never see fewer than two active paths per device.
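A sketch of that per-LUN count, assuming the usual multipath -ll layout where map headers start in column 0 and path lines carry an "active" state token:

```shell
#!/bin/bash
# Sketch: count active paths per multipath map and flag anything below
# MIN_PATHS. Reads `multipath -ll`-style text on stdin so the logic can
# be tested offline; in production: multipath -ll | check_path_counts
MIN_PATHS="${MIN_PATHS:-2}"

check_path_counts() {
    awk -v min="$MIN_PATHS" '
        function report() {
            if (n < min) printf "ALERT: %s has %d active path(s)\n", map, n
        }
        # Map headers start in column 0, e.g. "mpatha (3600...) dm-2 ..."
        /^[^[:space:]]/ { if (map != "") report(); map = $1; n = 0 }
        / active /      { n++ }
        END             { if (map != "") report() }
    '
}
```

Setting MIN_PATHS per environment keeps the same script usable on quad-path fabrics where dropping to two paths already deserves a page.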

Implementation: Real-World Monitoring Setup

Production deployments require handling different storage vendor outputs and integrating with existing monitoring infrastructure without adding significant overhead.

Script Configuration for Different Storage Vendors

Each storage vendor reports path information slightly differently. EMC VNX arrays use different device naming conventions than NetApp E-Series, but both follow the same /proc/scsi parsing patterns.

The most robust approach involves checking for vendor-specific strings in /proc/scsi/scsi output first, then applying the appropriate parsing logic for that storage type.
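One way to sketch that dispatch, keyed on the Vendor field in /proc/scsi/scsi (DGC is the SCSI vendor ID that EMC CLARiiON/VNX arrays report; the other strings are illustrative, so match whatever your hardware actually shows):

```shell
#!/bin/bash
# Sketch: pick a parsing strategy from the first Vendor field in
# /proc/scsi/scsi. Vendor strings are illustrative placeholders.
detect_vendor() {
    # stdin: /proc/scsi/scsi content; print the first Vendor token
    awk -F': ' '/Vendor:/ { split($2, v, " "); print v[1]; exit }'
}

vendor=""
[ -r /proc/scsi/scsi ] && vendor=$(detect_vendor < /proc/scsi/scsi)
case "$vendor" in
    DGC|EMC) echo "Applying EMC VNX parsing rules" ;;
    NETAPP)  echo "Applying NetApp parsing rules" ;;
    HP|HPE)  echo "Applying HPE parsing rules" ;;
    *)       echo "Vendor '$vendor' unrecognised, using generic parsing" ;;
esac
```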

Integration with Existing Monitoring Systems

SAN path monitoring integrates naturally with service monitoring systems that already track systemd services and hardware health. The bash-based approach means no additional dependencies or agent overhead.

Many teams combine this with socket-level analysis techniques for comprehensive infrastructure monitoring. The storage path detection becomes another data source feeding into existing alerting workflows.
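For alerting systems that speak the Nagios/Icinga plugin convention, the check only needs the right exit codes (0 for OK, 2 for CRITICAL). A sketch, using a stand-in "map active_count" input format in place of whichever detection logic feeds it:

```shell
#!/bin/bash
# Sketch: wrap a path check as a Nagios/Icinga-style plugin.
path_check_plugin() {
    # stdin: one "map active_count" pair per line
    awk '$2 < 2 { bad = 1 } END { exit bad ? 2 : 0 }'
}

# Demonstration with canned input; mpathb is down to one path.
printf 'mpatha 2\nmpathb 1\n' | path_check_plugin
status=$?
if [ "$status" -eq 0 ]; then
    echo "OK: all multipath devices fully redundant"
else
    echo "CRITICAL: a device has fewer than two active paths"
fi
# A real plugin would finish with: exit "$status"
```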

For teams managing multiple storage arrays, Server Scout's alerting system can track path status across dozens of servers simultaneously. The lightweight bash agent approach means adding SAN monitoring doesn't impact the systems being monitored.

This monitoring approach proved invaluable for the hosting provider mentioned earlier. After implementing path detection scripts, they discovered that storage path failures followed predictable patterns - controller temperature spikes preceded path degradation by an average of 90 minutes. Their hardware monitoring now includes both path state and thermal analysis.

The key lesson: redundancy only works if you know when you've lost it. Standard Linux I/O monitoring tools focus on performance metrics, but storage reliability depends on path diversity that requires dedicated monitoring scripts.

Modern storage arrays fail gracefully until they don't. Building path redundancy detection through /proc/scsi analysis ensures you'll know about single-points-of-failure before they become complete system failures. For more details on implementing this type of system-level monitoring, check the official Linux SCSI documentation for kernel interface specifications.

FAQ

How often should multipath status checks run without impacting performance?

Every 30 seconds provides adequate coverage without measurable I/O overhead. The /proc filesystem reads are cached and parsing multipath output typically completes in under 50ms on production systems.
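Cron can't fire more often than once a minute, so a 30-second cadence is easiest with a systemd timer. A sketch, with hypothetical unit names (pair the timer with a san-path-check.service that runs the script once):

```ini
# /etc/systemd/system/san-path-check.timer  (hypothetical unit name)
[Unit]
Description=SAN path health check every 30 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
# Default timer accuracy is one minute; tighten it for a true 30s cadence.
AccuracySec=1s

[Install]
WantedBy=timers.target
```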

Can this monitoring detect partial path failures that don't show as completely down?

Yes, by tracking queue depth changes and path priority modifications over time. Many SAN arrays reduce path priorities or increase queue depths as early indicators of controller stress before marking paths as failed.

Does this approach work with software-defined storage like Ceph or GlusterFS?

Partially - the /proc/scsi analysis applies to block-level multipath devices, but distributed storage systems require different monitoring approaches focused on cluster health rather than individual path redundancy.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial