Production databases running DRBD face a fundamental problem: the very automation designed to prevent split-brain scenarios often creates them instead. Automated handlers timeout during high-load scenarios, network partition edge cases trigger unexpected failover behaviour, and the result is often complete data divergence between nodes.
The €34,000 figure isn't hypothetical. It represents the actual cost of restoring service when DRBD's automated split-brain resolution fails during peak trading hours, forcing manual recovery procedures that could have been avoided with proper monitoring.
Understanding DRBD Split-Brain Detection Fundamentals
DRBD's connection states reveal split-brain conditions well before automated handlers execute. The /proc/drbd interface provides real-time status information that standard monitoring tools often ignore.
Reading /proc/drbd Status Output
cat /proc/drbd
The output shows critical connection states: cs:Connected indicates normal operation, while cs:StandAlone signals a split-brain condition requiring immediate manual intervention. The ro:Primary/Secondary field shows role assignments, and ds:UpToDate/UpToDate indicates data synchronisation status.
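These fields can be extracted programmatically. A minimal parsing sketch, run here against a sample status line in the /proc/drbd format rather than the live file (the sample values are illustrative; on a real node you would read /proc/drbd directly):

```shell
#!/bin/sh
# Sample resource line in the format /proc/drbd uses; on a live node,
# read /proc/drbd instead of this variable.
sample=' 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----'

# Pull out the connection state, roles, and disk states by field prefix.
cs=$(printf '%s\n' "$sample" | grep -o 'cs:[A-Za-z]*'  | cut -d: -f2)
ro=$(printf '%s\n' "$sample" | grep -o 'ro:[A-Za-z/]*' | cut -d: -f2)
ds=$(printf '%s\n' "$sample" | grep -o 'ds:[A-Za-z/]*' | cut -d: -f2)

echo "connection=$cs roles=$ro disks=$ds"
```

Splitting on the field prefixes rather than whitespace positions keeps the parser working even when resource numbering or flag columns shift between DRBD versions.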
Connection State Indicators
Watch for state transitions that indicate impending problems. cs:WFConnection suggests network issues, whilst cs:NetworkFailure indicates complete communication breakdown. These states often precede split-brain scenarios by several minutes, providing crucial early warning time.
The ro: field's Primary/Secondary values are node roles, not disk states (older DRBD releases label the same field st:). When you see ro:Primary/Primary, or connection states showing cs:StandAlone or cs:Unconnected, you're witnessing the exact moment automated resolution becomes unreliable.
Why Automated Handlers Create False Security
DRBD's automated split-brain handlers rely on predetermined policies that can't account for real-world complexity. The after-sb-0pri and after-sb-1pri handlers work well in laboratory conditions but fail under production stress.
Handler Timeout Scenarios
High I/O load causes handler timeouts. When your database is processing thousands of transactions per second, the time required for automated resolution exceeds DRBD's timeout thresholds. The result: handlers abandon the resolution attempt, leaving both nodes in Primary state.
Network latency compounds the problem. Geographic replication introduces round-trip delays that automated handlers can't accommodate: a resolution exchange needing twenty round trips turns a 200ms network delay into 4 seconds of handler execution time, well beyond DRBD's default timeout values.
Network Partition Edge Cases
Asymmetric network failures break automated resolution completely. Node A can reach Node B, but B cannot reach A. Automated handlers on each node make different resolution decisions, creating the exact split-brain scenario they're designed to prevent.
Firewall changes introduce similar problems. A misconfigured iptables rule blocks DRBD traffic in one direction, triggering handler execution that assumes total network failure when partial connectivity exists.
Building Manual Detection Protocols
Manual intervention protocols seem more complex initially, but they provide reliability that automation cannot match. The key is building detection systems that identify problems before they become crises.
Essential /proc/drbd Monitoring Commands
Create a simple monitoring script that parses connection states:
#!/bin/sh
# Parse /proc/drbd and flag connection states that indicate split-brain.
grep -E 'cs:|ro:|ds:' /proc/drbd | while IFS= read -r line; do
    if printf '%s\n' "$line" | grep -q 'cs:StandAlone\|cs:Unconnected'; then
        echo "CRITICAL: Split-brain condition detected"
        # Trigger manual intervention procedures here
    fi
done
Early Warning Signs in Status Output
Watch for patterns that indicate instability. Rapid state changes between cs:Connected and cs:WFConnection suggest network instability that will eventually trigger split-brain scenarios. Log these transitions and alert when frequency exceeds normal baselines.
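Transition-frequency checking can be sketched as a count against a baseline. The log format, sample entries, and threshold of 3 below are all illustrative assumptions, not DRBD output:

```shell
#!/bin/sh
# Count how often the connection state flapped to WFConnection in a
# status log and warn when the count exceeds a baseline threshold.
log='2024-01-01T10:00:01 cs:Connected
2024-01-01T10:00:31 cs:WFConnection
2024-01-01T10:01:01 cs:Connected
2024-01-01T10:01:31 cs:WFConnection
2024-01-01T10:02:01 cs:WFConnection
2024-01-01T10:02:31 cs:WFConnection'

flaps=$(printf '%s\n' "$log" | grep -c 'cs:WFConnection')
threshold=3

if [ "$flaps" -gt "$threshold" ]; then
    echo "WARNING: $flaps WFConnection transitions exceed baseline of $threshold"
fi
```

In production, the baseline should come from your own observed transition rates during healthy operation rather than a fixed constant.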
Data synchronisation warnings appear in the ds: field. ds:Inconsistent states during normal operation indicate underlying storage problems that automated handlers cannot resolve.
Production Implementation Strategy
Successful manual protocols require integration with existing monitoring infrastructure and clearly defined response procedures.
Monitoring Integration Points
Integrate DRBD status checking into your regular monitoring cycle. Server Scout's service monitoring capabilities can track DRBD daemon health alongside connection state parsing, providing comprehensive distributed filesystem oversight.
Set up alerting thresholds based on connection state duration rather than simple binary status. A brief cs:WFConnection state is normal; prolonged disconnection requires intervention.
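Duration-based alerting can be sketched by recording when a degraded state was first observed and comparing against a grace period. The epoch timestamps and 120-second grace window below are illustrative; a live check would use `$(date +%s)` and persist the first-seen time between polls:

```shell
#!/bin/sh
# Alert only when a degraded connection state has persisted beyond a
# grace period, rather than on the first sample.
state='WFConnection'
first_seen=1700000000     # epoch seconds when this state was first observed
now=1700000200            # epoch seconds at the current sample
grace=120                 # seconds a transient state is tolerated

elapsed=$((now - first_seen))
if [ "$state" != "Connected" ] && [ "$elapsed" -gt "$grace" ]; then
    echo "ALERT: cs:$state has persisted for ${elapsed}s"
fi
```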
Response Procedure Templates
Document exact commands for split-brain resolution before you need them. Create step-by-step procedures covering primary selection, secondary invalidation, and full resynchronisation. Test these procedures during maintenance windows, not during production emergencies.
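A runbook template along these lines can be kept under version control and rehearsed safely. The resource name r0 and the DRY_RUN guard are assumptions for illustration; the drbdadm commands follow the standard DRBD manual split-brain recovery sequence:

```shell
#!/bin/sh
# Split-brain resolution template. DRY_RUN=1 (the default) only prints
# the commands, so the procedure can be rehearsed during maintenance
# windows; set DRY_RUN=0 on a real node after choosing the victim.
RES="${RES:-r0}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# On the node whose changes will be discarded (the split-brain victim):
run drbdadm secondary "$RES"
run drbdadm disconnect "$RES"
run drbdadm -- --discard-my-data connect "$RES"

# On the surviving node (only needed if it is in cs:StandAlone):
run drbdadm connect "$RES"
```

The dry-run default means an administrator paging through the runbook under pressure cannot discard data by accident; the destructive path requires an explicit DRY_RUN=0.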
Establish communication protocols between team members. Split-brain resolution requires coordination between multiple administrators to prevent conflicting actions that worsen data divergence.
The cost of implementing manual detection protocols is minimal compared to data loss recovery expenses. Cross-platform monitoring solutions show consistent ROI when they prevent single points of failure in critical infrastructure.
Building robust DRBD monitoring requires understanding that automation fails precisely when you need it most. Manual detection protocols provide the reliability that automated handlers promise but cannot deliver.
FAQ
How often should I check /proc/drbd status to catch split-brain conditions early?
Check every 30 seconds during normal operation, increasing to every 5 seconds during network maintenance or high-load periods. Rapid polling during critical windows catches state transitions before they become permanent split-brain scenarios.
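The interval selection above can be sketched as a small helper driven by an operational-mode flag (the mode names are illustrative):

```shell
#!/bin/sh
# Pick the /proc/drbd polling interval, in seconds, from the current
# operational mode: tight polling during risky windows, relaxed otherwise.
poll_interval() {
    case "$1" in
        maintenance|high-load) echo 5 ;;   # risky windows: poll every 5s
        *)                     echo 30 ;;  # normal operation: every 30s
    esac
}

echo "normal:      $(poll_interval normal)s"
echo "maintenance: $(poll_interval maintenance)s"
```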
Can DRBD's automated handlers be safely disabled in production environments?
Yes, disabling automated handlers and implementing manual procedures reduces data loss risk. Set after-sb-0pri disconnect; and after-sb-1pri disconnect; to prevent automatic resolution attempts that often make problems worse.
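In the resource configuration this looks like the fragment below (the resource name r0 is illustrative; syntax follows DRBD 8.x-style resource files):

```
resource r0 {
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }
}
```

With all three policies set to disconnect, DRBD drops the connection on split-brain detection and waits for an administrator, rather than attempting to pick a survivor on its own.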
What's the typical recovery time difference between automated handler failures and manual intervention?
Manual intervention typically restores service within 10-15 minutes with proper procedures, while failed automated resolution can require hours of data consistency checking and potential full resynchronisation from backups.