Last Tuesday, a Dell R730 with hardware RAID reported all drives as healthy through its OMSA interface. Three hours later, one of the drives failed completely, taking the array offline for four hours whilst a replacement was sourced and the array rebuilt. The SMART data had been screaming warnings for weeks.
RAID controllers prioritise keeping arrays online. Their health reports focus on immediate functionality rather than predictive failures, often missing the subtle degradation patterns that indicate impending drive death. This creates a false sense of security that can cost you dearly.
What RAID Controllers Actually Report
Hardware RAID controllers typically monitor basic drive availability, surface scan results, and obvious failures. They'll catch a drive that stops responding or reports unrecoverable read errors, but they're blind to the gradual degradation that accounts for most predictable failures.
The controller firmware interprets SMART data through its own filters, often suppressing attributes that seem "within normal ranges" even when they show concerning trends.
Critical SMART Attributes Your Controller Ignores
Direct smartctl monitoring reveals attributes that predict failures weeks before they occur:
Reallocated Sector Count (ID 5) - This should be zero on healthy drives. Any non-zero value indicates the drive is already failing at the hardware level. RAID controllers often don't alert until this reaches vendor-specific thresholds that are far too high.
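You can turn that into a scriptable check rather than reading the table by eye. A sketch: the helper name is ours, and it assumes the standard ten-column `smartctl -A` attribute table with the raw value in the last column:

```shell
# Extract the raw Reallocated_Sector_Ct value from `smartctl -A` output.
# Assumes the standard ten-column attribute table, raw value in column 10.
reallocated_count() {
  awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# usage: alert on any non-zero value, ignoring the vendor threshold
# count=$(sudo smartctl -A /dev/sda | reallocated_count)
# [ "${count:-0}" -gt 0 ] && echo "WARNING: /dev/sda has $count reallocated sectors"
```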
Current Pending Sector Count (ID 197) - Sectors waiting to be remapped. Even a handful of pending sectors suggests the drive's error correction is struggling. Check this with:
sudo smartctl -A /dev/sda | grep Current_Pending_Sector
Temperature readings - Drives running consistently above 45°C show significantly higher failure rates. RAID controllers rarely monitor this, but it's visible in SMART attribute 194:
sudo smartctl -A /dev/sda | grep Temperature_Celsius
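To act on that reading rather than eyeball it, a short helper can pull just the temperature. Some vendors append "(Min/Max x/y)" after the raw figure, so this sketch (column positions assumed from standard smartctl output) takes only the first token:

```shell
# Print the current drive temperature from `smartctl -A` output.
# Column 10 is the first token of the raw value; the "(Min/Max x/y)"
# suffix some vendors append is deliberately ignored.
drive_temp() {
  awk '$2 == "Temperature_Celsius" { print $10 }'
}

# usage: flag drives running above the 45°C line
# t=$(sudo smartctl -A /dev/sda | drive_temp)
# [ "${t:-0}" -gt 45 ] && echo "WARNING: /dev/sda running at ${t}°C"
```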
Load Cycle Count vs Power-On Hours - Enterprise drives shouldn't be spinning down frequently. A high load cycle count relative to power-on time indicates either misconfiguration or mechanical stress.
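That ratio is easy to compute directly from the same output. A sketch, again assuming the standard column layout; the "well under one cycle per hour" expectation for always-on drives is a rule of thumb, not a vendor figure:

```shell
# Compute load cycles per power-on hour from `smartctl -A` output.
# On an always-on enterprise drive this should stay well below 1;
# much higher values point at aggressive head parking or mechanical stress.
load_cycle_rate() {
  awk '
    $2 == "Power_On_Hours"   { hours  = $10 }
    $2 == "Load_Cycle_Count" { cycles = $10 }
    END { if (hours > 0) printf "%.2f\n", cycles / hours }
  '
}

# usage: sudo smartctl -A /dev/sda | load_cycle_rate
```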
Building Predictive Monitoring
Run smartctl tests weekly via cron, not just when problems surface:
# /etc/cron.d/smart-monitoring
# Note: -t long only starts the self-test; review results afterwards with smartctl -l selftest
0 2 * * 0 root /usr/sbin/smartctl -t long /dev/sda
0 2 * * 0 root /usr/sbin/smartctl -t long /dev/sdb
Capture and trend the raw values, not just the normalised scores. A pending sector count that jumps from 0 to 8 then drops back to 4 is concerning regardless of the threshold - those sectors were remapped or rewritten, not healed.
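One way to capture those raw values is to append them to a per-drive CSV on each run, so week-over-week jumps stand out. A sketch; the attribute list, file paths, and column positions are our assumptions:

```shell
# Emit CSV rows (date,drive,attribute,raw_value) for the attributes worth
# trending, reading `smartctl -A` output on stdin. Column 10 is the raw value.
smart_raw_csv() {
  awk -v d="$(date +%F)" -v dev="$1" '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Temperature_Celsius)$/ {
      print d "," dev "," $2 "," $10
    }'
}

# usage: append to a per-drive log after each weekly test
# sudo smartctl -A /dev/sda | smart_raw_csv /dev/sda >> /var/log/smart/sda.csv
```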
Temperature monitoring becomes critical in dense server environments. Drives that consistently run hot fail sooner, and this thermal stress often doesn't trigger RAID alerts until after damage accumulates. Our hardware-specific alert thresholds guide covers setting appropriate temperature baselines for different server generations.
Long-Term Error Patterns
The smartmontools documentation details how different error types correlate with failure modes. Media errors cluster before catastrophic failures - a pattern invisible to RAID health checks that only see the current state.
Errors that self-correct through retries still indicate degrading read heads or magnetic media. RAID controllers report these as successful operations, but the raw error counts tell a different story.
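The drive's own error log makes those retries visible. When errors have been logged, `smartctl -l error` prints a cumulative "ATA Error Count" header; this sketch pulls out that number (it prints nothing on a clean log, where smartctl reports "No Errors Logged" instead):

```shell
# Extract the cumulative error count from `smartctl -l error` output.
# Prints nothing if the drive reports "No Errors Logged".
ata_error_count() {
  awk -F': ' '/^ATA Error Count/ { print $2 + 0 }'
}

# usage: sudo smartctl -l error /dev/sda | ata_error_count
```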
Integration with Production Monitoring
Combine SMART monitoring with your existing infrastructure. Server Scout's plugin system can capture these metrics alongside your standard server health data, alerting when SMART attributes cross thresholds that matter for your hardware generation.
The key is consistent data collection rather than crisis response. Drives that pass RAID health checks but show climbing reallocated sectors need immediate attention, not eventual replacement.
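A minimal check in the Nagios exit-code convention (0 = OK, 2 = critical) illustrates the shape of such an alert; the function name, attribute list, and column positions are our assumptions, not any particular plugin API:

```shell
# Exit 0 when the predictive attributes are clean, 2 otherwise, reading
# `smartctl -A` output on stdin. Column 10 is the raw value.
check_smart_predictive() {
  bad=$(awk '
    $2 == "Reallocated_Sector_Ct"  && $10 > 0 { print $2 "=" $10 }
    $2 == "Current_Pending_Sector" && $10 > 0 { print $2 "=" $10 }
  ')
  if [ -n "$bad" ]; then
    echo "CRITICAL: $bad"
    return 2
  fi
  echo "OK: no reallocated or pending sectors"
}

# usage: sudo smartctl -A /dev/sda | check_smart_predictive
```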
Trending SMART data reveals failure patterns weeks before they impact production. Your RAID controller will keep reporting green lights right up until the drive stops responding entirely.