Most PostgreSQL clusters fail silently across datacenters. Your monitoring shows green whilst your primary has been unreachable for minutes, replicas promote themselves independently, and applications start writing to multiple masters.
The standard approach waits for application errors before triggering alerts. By then, you're dealing with conflicting writes, manual data reconciliation, and potentially hours of downtime. Real failover monitoring needs to detect problems at the socket level before database queries reveal the chaos.
Here's how to build PostgreSQL monitoring that catches multi-datacenter failures in under 20 seconds using TCP connection analysis and replication state verification.
Understanding Multi-Datacenter PostgreSQL Failure Modes
PostgreSQL clusters across regions fail in predictable patterns. Network partitions cause the most dangerous scenarios - each datacenter believes it's the only survivor and promotes its replica to primary. Traditional monitoring misses this because each individual database reports healthy status.
Replication Lag Detection Strategies
Replication lag appears first in WAL sender processes, not in application queries. The `pg_stat_replication` view shows real-time lag through LSN comparisons:
- `sent_lsn` vs `flush_lsn` indicates network transmission delays
- `write_lag` and `replay_lag` reveal processing bottlenecks
- Connection state changes in `pg_stat_activity` show broken replication streams
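Lag in bytes can be derived directly from any two LSNs. A minimal sketch (the `X/Y` LSN format is PostgreSQL's standard representation; the helper names here are our own):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn: str, flush_lsn: str) -> int:
    """Bytes sent by the primary but not yet flushed on the replica."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(flush_lsn)

print(lag_bytes("16/B374D848", "16/B374D7E8"))  # 96 bytes behind
```

In practice you would feed `sent_lsn` and `flush_lsn` from `pg_stat_replication` into `lag_bytes` and alert when the delta keeps growing across consecutive checks.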
Socket-level monitoring catches these failures faster than polling database views. TCP connection states reveal broken replication before PostgreSQL's internal timeouts trigger.
Split-Brain Scenario Identification
Split-brain occurs when multiple nodes believe they're primary. The `pg_is_in_recovery()` function identifies replica status, but network partitions prevent cross-region verification. Socket analysis of replication connections provides earlier detection - when primary-replica TCP connections drop, investigate immediately.
Setting Up Cross-Region Health Check Infrastructure
Effective PostgreSQL monitoring requires checks at multiple levels: database state, replication health, and network connectivity between datacenters.
TCP Connection State Monitoring Commands
Monitor PostgreSQL replication connections using `ss` to identify replication port states (note that `ss -tuln` would show only listening sockets, and `ss` abbreviates TCP state names):

```shell
# Check replication connections across regions
ss -tan | grep ':5432' | grep -E 'ESTAB|SYN-SENT|CLOSE-WAIT'
```
This reveals connection problems before `pg_stat_replication` updates. Look for connections stuck in SYN-SENT (the replica can't be reached) or CLOSE-WAIT (the replica closed or disappeared and the connection was never torn down).
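A small parser can turn raw `ss -tan` output into alerts; note that `ss` prints abbreviated state names such as ESTAB, SYN-SENT, and CLOSE-WAIT. A sketch (the function names are ours, not part of any standard tooling):

```python
import subprocess

PROBLEM_STATES = {"SYN-SENT", "CLOSE-WAIT"}  # ss abbreviates TCP state names

def replication_problems(ss_output: str, port: int = 5432) -> list[str]:
    """Return ss output lines on the given port whose TCP state suggests a broken peer."""
    problems = []
    for line in ss_output.splitlines():
        fields = line.split()
        if fields and fields[0] in PROBLEM_STATES and f":{port}" in line:
            problems.append(line)
    return problems

def check_live(port: int = 5432) -> list[str]:
    """Run ss (from iproute2) and report suspicious replication connections."""
    out = subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout
    return replication_problems(out, port)
```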
Automated Failover Detection Timing
Set detection intervals based on your RTO requirements:
- Socket state checks: every 10 seconds
- Replication lag verification: every 15 seconds
- Cross-region connectivity tests: every 30 seconds
This timing catches failures within 20 seconds while avoiding false positives from temporary network hiccups.
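The three cadences can be driven from a single loop; a sketch of just the scheduling logic (the check names are illustrative):

```python
# Check intervals in seconds, matching the cadence described above.
INTERVALS = {
    "socket_state": 10,
    "replication_lag": 15,
    "cross_region": 30,
}

def checks_due(elapsed_s: int) -> list[str]:
    """Return the checks that should fire at a given second of the monitoring loop."""
    return [name for name, every in INTERVALS.items()
            if elapsed_s > 0 and elapsed_s % every == 0]
```

A real loop would sleep in 5- or 10-second ticks and dispatch each due check, so the fastest failure class is probed at least twice inside the 20-second detection budget.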
Implementation: Step-by-Step Health Check Setup
Implement monitoring in layers - start with basic connectivity, then add replication-specific checks, finally cross-datacenter validation.
Step 1: Primary Database Health Verification
Verify primary status and connection counts. Check that the database accepts connections and isn't in recovery mode:
Create a monitoring script that connects to PostgreSQL and runs `SELECT pg_is_in_recovery()`. A result of `false` indicates primary status. Monitor connection counts through `pg_stat_activity` to catch connection exhaustion before it affects replication.
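A sketch of the decision logic, with the database access shown only as a commented outline (the `psycopg2` usage and connection string are our assumptions, not something this setup prescribes):

```python
def assess_primary(in_recovery: bool, active_conns: int, max_conns: int,
                   headroom: float = 0.9) -> str:
    """Classify primary health from pg_is_in_recovery() and connection counts."""
    if in_recovery:
        return "NOT_PRIMARY"           # node is a replica; was it demoted unexpectedly?
    if active_conns >= max_conns * headroom:
        return "CONN_EXHAUSTION_RISK"  # connections may soon starve replication
    return "OK"

# Gathering the inputs might look like this (untested outline, psycopg2 assumed):
# import psycopg2
# conn = psycopg2.connect("host=db1 dbname=postgres connect_timeout=5")
# with conn.cursor() as cur:
#     cur.execute("SELECT pg_is_in_recovery()")
#     in_recovery, = cur.fetchone()
#     cur.execute("SELECT count(*) FROM pg_stat_activity")
#     active, = cur.fetchone()
#     cur.execute("SHOW max_connections")
#     max_conns = int(cur.fetchone()[0])
#     print(assess_primary(in_recovery, active, max_conns))
```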
Step 2: Replica Status and Lag Monitoring
Query `pg_stat_replication` on the primary to verify replica connections. Key metrics include:
- `client_addr` confirms expected replica IPs
- `state` should show 'streaming'
- `sent_lsn` minus `flush_lsn` calculates network lag
- `replay_lag` indicates processing delays
Alert when any replica disappears from this view or lag exceeds your threshold (typically 10-30 seconds for most applications).
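The alerting rule can be expressed as a pure function over `pg_stat_replication` rows; a sketch (the row shape and the 30-second default threshold are our assumptions):

```python
def replica_alerts(rows: list[dict], expected_replicas: set[str],
                   max_lag_s: float = 30.0) -> list[str]:
    """Compare pg_stat_replication rows against the expected replica set.

    Each row needs 'client_addr', 'state', and 'replay_lag' (seconds, or None).
    """
    alerts = []
    seen = {r["client_addr"] for r in rows}
    for missing in sorted(expected_replicas - seen):
        alerts.append(f"replica {missing} missing from pg_stat_replication")
    for r in rows:
        if r["state"] != "streaming":
            alerts.append(f"replica {r['client_addr']} in state {r['state']}")
        lag = r.get("replay_lag")
        if lag is not None and lag > max_lag_s:
            alerts.append(f"replica {r['client_addr']} replay lag {lag:.0f}s")
    return alerts
```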
Step 3: Network Path Validation Between Datacenters
Test connectivity using both application ports and replication-specific connections. Use nc -z to verify port accessibility without establishing full database connections:
Test the PostgreSQL port (5432) and any streaming replication ports. This catches firewall changes, routing problems, or load balancer failures before they affect replication.
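The same check can be scripted with a plain TCP connect, equivalent to `nc -z -w3 host port`; a minimal sketch:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("replica.dc2.example.internal", 5432)  -- hostname is illustrative
```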
Step 4: Cross-Region Socket State Analysis
Implement socket monitoring similar to socket-state monitoring of MySQL replication. Check /proc/net/tcp for connection states between datacenter database servers.
Parse connection states to identify broken replication streams. Look for entries where local and remote addresses match your database servers but connection state indicates problems.
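/proc/net/tcp encodes addresses as little-endian hex and states as two-digit codes (01 = ESTABLISHED, 02 = SYN_SENT, 08 = CLOSE_WAIT). A sketch of the decoding, IPv4 only (the function names are ours):

```python
# Subset of TCP state codes from the kernel's tcp_states.h
TCP_STATES = {"01": "ESTABLISHED", "02": "SYN_SENT",
              "06": "TIME_WAIT", "08": "CLOSE_WAIT"}

def parse_hex_addr(hex_addr: str) -> tuple[str, int]:
    """Decode '0F01000A:1538' (little-endian IPv4 hex) into ('10.0.1.15', 5432)."""
    ip_hex, port_hex = hex_addr.split(":")
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)

def broken_states(proc_tcp_text: str, port: int = 5432) -> list[tuple[str, int, str]]:
    """Return (peer_ip, peer_port, state) for non-ESTABLISHED entries on `port`."""
    results = []
    for line in proc_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        _, local_port = parse_hex_addr(fields[1])
        peer_ip, peer_port = parse_hex_addr(fields[2])
        state = TCP_STATES.get(fields[3], fields[3])
        if port in (local_port, peer_port) and state != "ESTABLISHED":
            results.append((peer_ip, peer_port, state))
    return results
```

Feed it `open('/proc/net/tcp').read()` on each database host and alert when entries for your replication peers show up in SYN_SENT or CLOSE_WAIT.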
Testing and Validating Your Monitoring Setup
Validation requires controlled failure injection. Test scenarios that mirror real production failures.
Step 5: Simulating Common Failure Scenarios
Test network partitions using iptables rules to block replication traffic:
- Block specific database ports between datacenters
- Introduce packet loss using `tc netem`
- Simulate primary server failure through service stops
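To keep the simulation auditable, it helps to generate the commands rather than run them blindly; a sketch that builds the `iptables`/`tc` invocations (run them manually, as root, in a test environment only; interface and IP are placeholders):

```python
def partition_commands(peer_ip: str, port: int = 5432,
                       loss_pct: int = 0, iface: str = "eth0") -> list[str]:
    """Build, but do not execute, commands that simulate a network partition."""
    cmds = [f"iptables -A OUTPUT -d {peer_ip} -p tcp --dport {port} -j DROP"]
    if loss_pct:
        cmds.append(f"tc qdisc add dev {iface} root netem loss {loss_pct}%")
    return cmds

for cmd in partition_commands("10.0.2.9", loss_pct=5):
    print(cmd)
```

Remember to remove the rules afterwards (`iptables -D ...`, `tc qdisc del dev eth0 root`) or the partition outlives the test.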
Verify your monitoring detects each scenario within 20 seconds. False positives are better than missed failures in multi-datacenter setups.
Step 6: Alert Integration and Notification Chains
Integrate detection scripts with your alerting system. For unified infrastructure monitoring, ensure PostgreSQL alerts include context about related systems - network switches, load balancers, and application servers.
Configure alert escalation paths specific to database failures. Split-brain scenarios require immediate human intervention, not automated failover.
Common Pitfalls and Troubleshooting
Network monitoring often shows green whilst database replication silently breaks. Common issues include:
Monitoring the wrong metrics: CPU and memory stats won't reveal replication lag. Focus on PostgreSQL-specific metrics and socket states.
Insufficient timing granularity: 5-minute checks miss transient failures that cause permanent replication breaks. Use sub-minute intervals for critical checks.
False positive management: Hardware-specific alert thresholds apply to database monitoring too. Network latency varies between cloud providers and regions.
Server Scout's PostgreSQL monitoring includes built-in replication lag detection through both database queries and socket analysis. The lightweight agent runs these checks without adding load to your database servers, providing real-time alerts when cross-region failures occur.
FAQ
How quickly can socket-level monitoring detect PostgreSQL split-brain scenarios?
Socket state analysis typically detects replication connection failures within 10-20 seconds, compared to 2-5 minutes for application-level health checks. The key is monitoring TCP connection states between database servers rather than waiting for query timeouts.
What's the minimum monitoring interval for reliable PostgreSQL failover detection?
Check socket states every 10 seconds and replication lag every 15 seconds. More frequent checks create unnecessary load without improving detection time, while longer intervals risk missing transient failures that cause permanent replication breaks.
Should automated failover trigger immediately when monitoring detects problems?
Never automate immediate failover for split-brain scenarios. Use monitoring for rapid human alerting instead. Automated failover should only trigger after multiple confirmation checks and manual validation of the failure scope.