
PostgreSQL Cross-Region Failover Detection: Building 20-Second Split-Brain Alerts Through Socket State Analysis

· Server Scout

Most PostgreSQL clusters fail silently across datacenters. Your monitoring shows green whilst your primary has been unreachable for minutes, replicas promote themselves independently, and applications start writing to multiple masters.

The standard approach waits for application errors before triggering alerts. By then, you're dealing with conflicting writes, manual data reconciliation, and potentially hours of downtime. Real failover monitoring needs to detect problems at the socket level before database queries reveal the chaos.

Here's how to build PostgreSQL monitoring that catches multi-datacenter failures in under 20 seconds using TCP connection analysis and replication state verification.

Understanding Multi-Datacenter PostgreSQL Failure Modes

PostgreSQL clusters across regions fail in predictable patterns. Network partitions cause the most dangerous scenarios - each datacenter believes it's the only survivor and promotes its replica to primary. Traditional monitoring misses this because each individual database reports healthy status.

Replication Lag Detection Strategies

Replication lag appears first in WAL sender processes, not in application queries. The pg_stat_replication view shows real-time lag through LSN comparisons:

  • sent_lsn vs flush_lsn indicates network transmission delays
  • write_lag and replay_lag reveal processing bottlenecks
  • Connection state changes in pg_stat_activity show broken replication streams

Socket-level monitoring catches these failures faster than polling database views. TCP connection states reveal broken replication before PostgreSQL's internal timeouts trigger.
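The LSN arithmetic behind these comparisons can be sketched as a small monitoring helper. This is a minimal sketch, assuming psql is on the PATH and a suitable connection string; the function names are illustrative, not from any specific tool:

```python
import subprocess

def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/3000148' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def transmission_lag_bytes(sent_lsn: str, flush_lsn: str) -> int:
    """Bytes the primary has sent that the replica has not yet flushed."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(flush_lsn)

def check_replication_lag(conninfo: str) -> list[tuple[str, int]]:
    """Query pg_stat_replication via psql; return (replica_addr, lag_bytes) pairs."""
    out = subprocess.run(
        ["psql", conninfo, "-Atc",
         "SELECT client_addr, sent_lsn, flush_lsn FROM pg_stat_replication"],
        capture_output=True, text=True, check=True).stdout
    lags = []
    for line in out.splitlines():
        addr, sent, flush = line.split("|")
        lags.append((addr, transmission_lag_bytes(sent, flush)))
    return lags
```

The pure LSN helpers make the lag computation testable without a live cluster; only check_replication_lag touches the database.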

Split-Brain Scenario Identification

Split-brain occurs when multiple nodes believe they're primary. The pg_is_in_recovery() function identifies replica status, but network partitions prevent cross-region verification. Socket analysis of replication connections provides earlier detection - when primary-replica TCP connections drop, investigate immediately.
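A cross-region verifier can poll pg_is_in_recovery() on every node and flag the moment more than one claims to be primary. A minimal sketch, assuming psql connectivity to each node from a neutral vantage point (node names here are hypothetical):

```python
import subprocess

def is_primary(conninfo: str) -> bool:
    """True if the node reports pg_is_in_recovery() = f, i.e. believes it is primary."""
    out = subprocess.run(
        ["psql", conninfo, "-Atc", "SELECT pg_is_in_recovery()"],
        capture_output=True, text=True, check=True).stdout.strip()
    return out == "f"

def detect_split_brain(primary_flags: dict[str, bool]) -> list[str]:
    """Given {node: is_primary}, return the offending nodes when more
    than one node claims primary; empty list means no split-brain."""
    primaries = [node for node, flag in primary_flags.items() if flag]
    return primaries if len(primaries) > 1 else []
```

Run the checks from a third location where possible: a monitor inside either datacenter is itself partitioned along with its local node.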

Setting Up Cross-Region Health Check Infrastructure

Effective PostgreSQL monitoring requires checks at multiple levels: database state, replication health, and network connectivity between datacenters.

TCP Connection State Monitoring Commands

Monitor PostgreSQL replication connections using ss -tan to inspect connection states on the replication port (ss -tuln lists only listening sockets, so it will never show established replication streams):

# Check replication connections across regions
ss -tan | grep ':5432' | grep -E 'ESTAB|SYN-SENT|CLOSE-WAIT'

This reveals connection problems before pg_stat_replication updates. Look for connections stuck in SYN-SENT (can't reach the replica) or CLOSE-WAIT (the replica disappeared).
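The grep pipeline above can be replaced with a small parser that returns structured results for alerting. A sketch assuming modern iproute2 ss -tan column order (State, Recv-Q, Send-Q, Local, Peer):

```python
def suspicious_states(ss_output: str, port: int = 5432) -> list[tuple[str, str]]:
    """Scan `ss -tan` output for states that suggest a broken replication
    stream on the given port; returns (state, peer_endpoint) pairs."""
    bad = {"SYN-SENT", "CLOSE-WAIT", "FIN-WAIT-1", "FIN-WAIT-2"}
    suffix = f":{port}"
    hits = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        state, local, peer = fields[0], fields[3], fields[4]
        if state in bad and (local.endswith(suffix) or peer.endswith(suffix)):
            hits.append((state, peer))
    return hits
```

Feeding it live data is a one-liner with subprocess.run(["ss", "-tan"], ...); keeping the parser pure makes it easy to test against captured output.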

Automated Failover Detection Timing

Set detection intervals based on your RTO requirements:

  • Socket state checks: every 10 seconds
  • Replication lag verification: every 15 seconds
  • Cross-region connectivity tests: every 30 seconds

This timing catches failures within 20 seconds while avoiding false positives from temporary network hiccups.
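The three cadences above can be driven by one scheduler loop. A minimal sketch; the check names and one-second tick are assumptions, not a prescribed design:

```python
import time

CHECK_INTERVALS = {  # seconds, matching the RTO-driven schedule above
    "socket_state": 10,
    "replication_lag": 15,
    "cross_region_connectivity": 30,
}

def checks_due(elapsed_seconds: int) -> list[str]:
    """Return which checks fire at a given tick of a one-second scheduler loop."""
    return [name for name, interval in CHECK_INTERVALS.items()
            if elapsed_seconds > 0 and elapsed_seconds % interval == 0]

def run_scheduler(handlers: dict) -> None:
    """Drive the checks forever; handlers maps check name -> callable."""
    tick = 0
    while True:
        time.sleep(1)
        tick += 1
        for name in checks_due(tick):
            handlers[name]()
```

Every 30 seconds all three checks coincide, giving a full picture; the worst-case gap for any single check stays within the 20-second target only for the socket and lag checks, which is why those carry the alerting burden.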

Implementation: Step-by-Step Health Check Setup

Implement monitoring in layers - start with basic connectivity, then add replication-specific checks, finally cross-datacenter validation.

Step 1: Primary Database Health Verification

Verify primary status and connection counts. Check that the database accepts connections and isn't in recovery mode:

Create a monitoring script that connects to PostgreSQL and runs SELECT pg_is_in_recovery(). A result of false indicates primary status. Monitor connection counts through pg_stat_activity to catch connection exhaustion before it affects replication.
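Both checks fit in a single round trip. A sketch assuming psql is available; the warning threshold is a hypothetical value you would derive from your own max_connections setting:

```python
import subprocess

MAX_CONNECTIONS_WARN = 180  # hypothetical: warn below a max_connections of 200

def primary_health(conninfo: str) -> dict:
    """One round trip: recovery flag plus current backend count."""
    out = subprocess.run(
        ["psql", conninfo, "-Atc",
         "SELECT pg_is_in_recovery(), count(*) FROM pg_stat_activity"],
        capture_output=True, text=True, check=True).stdout.strip()
    return parse_health_row(out)

def parse_health_row(row: str) -> dict:
    """Parse psql -A output like 'f|42' into a structured health record."""
    in_recovery, conn_count = row.split("|")
    return {
        "is_primary": in_recovery == "f",
        "connections": int(conn_count),
        "connection_pressure": int(conn_count) >= MAX_CONNECTIONS_WARN,
    }
```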

Step 2: Replica Status and Lag Monitoring

Query pg_stat_replication on the primary to verify replica connections. Key metrics include:

  • client_addr confirms expected replica IPs
  • state should show 'streaming'
  • sent_lsn minus flush_lsn calculates network lag
  • replay_lag indicates processing delays

Alert when any replica disappears from this view or lag exceeds your threshold (typically 10-30 seconds for most applications).
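The alert rules above reduce to a pure function over pg_stat_replication rows. A sketch with an assumed row shape (client_addr, state, replay_lag in seconds) and a hypothetical 30-second threshold:

```python
LAG_THRESHOLD_SECONDS = 30.0  # tune to your application's tolerance

def replica_alerts(rows: list[dict], expected_replicas: set[str]) -> list[str]:
    """rows: dicts with client_addr, state, replay_lag (seconds or None).
    Returns human-readable alert strings; empty list means healthy."""
    alerts = []
    seen = {r["client_addr"] for r in rows}
    for missing in sorted(expected_replicas - seen):
        alerts.append(f"replica {missing} missing from pg_stat_replication")
    for r in rows:
        if r["state"] != "streaming":
            alerts.append(f"replica {r['client_addr']} in state {r['state']}")
        if r["replay_lag"] is not None and r["replay_lag"] > LAG_THRESHOLD_SECONDS:
            alerts.append(f"replica {r['client_addr']} replay_lag {r['replay_lag']:.0f}s")
    return alerts
```

Checking against a fixed set of expected replica IPs is what catches the silent case: a replica that drops out of the view entirely produces no row, so lag thresholds alone would never fire.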

Step 3: Network Path Validation Between Datacenters

Test connectivity using both application ports and replication-specific connections. Use nc -z to verify port accessibility without establishing full database connections:

Test the PostgreSQL port (5432) and any streaming replication ports. This catches firewall changes, routing problems, or load balancer failures before they affect replication.
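The nc -z probe has a direct stdlib equivalent that is easier to embed in a monitoring script. A minimal sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Equivalent of `nc -z host port`: attempt a full TCP handshake, then close.
    Returns False on refusal, timeout, or unreachable routes alike."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Keep the timeout short: a probe that hangs for the OS default (minutes, on some stacks) defeats the 20-second detection budget.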

Step 4: Cross-Region Socket State Analysis

Implement socket monitoring similar to socket-state monitoring of MySQL replication. Check /proc/net/tcp for connection states between datacenter database servers.

Parse connection states to identify broken replication streams. Look for entries where local and remote addresses match your database servers but connection state indicates problems.
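A sketch of that parsing, IPv4 only (IPv6 lives in /proc/net/tcp6); /proc/net/tcp stores addresses as little-endian hex, and the state column is a hex code (01 = ESTABLISHED, 02 = SYN_SENT, 08 = CLOSE_WAIT). Helper names are illustrative:

```python
TCP_STATES = {"01": "ESTABLISHED", "02": "SYN_SENT", "04": "FIN_WAIT1",
              "05": "FIN_WAIT2", "08": "CLOSE_WAIT", "0A": "LISTEN"}

def decode_endpoint(hexaddr: str) -> tuple[str, int]:
    """'0100007F:1538' -> ('127.0.0.1', 5432); IPv4 bytes are little-endian."""
    ip_hex, port_hex = hexaddr.split(":")
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in (6, 4, 2, 0)]
    return ".".join(octets), int(port_hex, 16)

def replication_sockets(proc_tcp_text: str, port: int = 5432) -> list[tuple]:
    """Return (local, peer, state) for every entry touching the replication port."""
    results = []
    for line in proc_tcp_text.splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local = decode_endpoint(fields[1])
        peer = decode_endpoint(fields[2])
        state = TCP_STATES.get(fields[3], fields[3])
        if port in (local[1], peer[1]):
            results.append((local, peer, state))
    return results
```

In production, read the file with open("/proc/net/tcp").read() on each check; the parser stays pure so it can be tested on captured snapshots.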

Testing and Validating Your Monitoring Setup

Validation requires controlled failure injection. Test scenarios that mirror real production failures.

Step 5: Simulating Common Failure Scenarios

Test network partitions using iptables rules to block replication traffic:

  • Block specific database ports between datacenters
  • Introduce packet loss using tc netem
  • Simulate primary server failure through service stops

Verify your monitoring detects each scenario within 20 seconds. False positives are better than missed failures in multi-datacenter setups.

Step 6: Alert Integration and Notification Chains

Integrate detection scripts with your alerting system. For unified infrastructure monitoring, ensure PostgreSQL alerts include context about related systems - network switches, load balancers, and application servers.

Configure alert escalation paths specific to database failures. Split-brain scenarios require immediate human intervention, not automated failover.
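Structuring the alert payload up front makes the context requirement concrete. A sketch with a hypothetical webhook endpoint; the payload fields are illustrative, not a fixed schema:

```python
import json
import urllib.request

ALERT_WEBHOOK = "https://alerts.example.internal/hook"  # hypothetical endpoint

def build_alert(check: str, node: str, detail: str, related: list[str]) -> dict:
    """Bundle the failing check with related-infrastructure context."""
    return {
        "source": "postgres-failover-monitor",
        "check": check,
        "node": node,
        "detail": detail,
        "related_systems": related,  # switches, load balancers, app servers
        "requires_human": check == "split_brain",  # never auto-failover on split-brain
    }

def send_alert(alert: dict) -> None:
    """POST the alert as JSON to the configured webhook."""
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)
```

The requires_human flag is the code-level expression of the escalation rule above: split-brain alerts page a person rather than feeding any automated failover path.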

Common Pitfalls and Troubleshooting

Network monitoring often shows green whilst database replication silently breaks. Common issues include:

Monitoring the wrong metrics: CPU and memory stats won't reveal replication lag. Focus on PostgreSQL-specific metrics and socket states.

Insufficient timing granularity: 5-minute checks miss transient failures that cause permanent replication breaks. Use sub-minute intervals for critical checks.

False positive management: alert thresholds need per-environment tuning, just as they do for hardware monitoring. Network latency varies between cloud providers and regions.

Server Scout's PostgreSQL monitoring includes built-in replication lag detection through both database queries and socket analysis. The lightweight agent runs these checks without adding load to your database servers, providing real-time alerts when cross-region failures occur.

FAQ

How quickly can socket-level monitoring detect PostgreSQL split-brain scenarios?

Socket state analysis typically detects replication connection failures within 10-20 seconds, compared to 2-5 minutes for application-level health checks. The key is monitoring TCP connection states between database servers rather than waiting for query timeouts.

What's the minimum monitoring interval for reliable PostgreSQL failover detection?

Check socket states every 10 seconds and replication lag every 15 seconds. More frequent checks create unnecessary load without improving detection time, while longer intervals risk missing transient failures that cause permanent replication breaks.

Should automated failover trigger immediately when monitoring detects problems?

Never automate immediate failover for split-brain scenarios. Use monitoring for rapid human alerting instead. Automated failover should only trigger after multiple confirmation checks and manual validation of the failure scope.

Ready to Try Server Scout?

Start monitoring your servers and infrastructure in under 60 seconds. Free for 3 months.

Start Free Trial