Last September, a regional investment bank discovered that their €340,000 trading-system outage wasn't caused by their shiny new infrastructure. The culprit was hiding three network hops away, inside the AS/400 mainframe running their payroll and settlement systems.
The bank's expensive IBM monitoring stack showed everything green. The mainframe performance dashboards displayed healthy CPU usage. Their COBOL batch jobs completed within SLA windows. Yet traders couldn't access client portfolios, and settlement processing ground to a halt every Tuesday morning.
The Gateway Server Investigation
The breakthrough came from an unexpected source: socket analysis on their Linux gateway servers. These machines bridged the mainframe to their modern infrastructure, handling thousands of COBOL application requests daily.
Their systems administrator started examining /proc/net/tcp during the next Tuesday morning incident. What he found changed everything.
grep -c ':05A5' /proc/net/tcp    # port 1445 is 05A5 in hex; /proc/net/tcp stores ports as hex
847
Port 1445 handled AS/400 job queue communications. Normal connection counts never exceeded 200. During the outage window, socket states told the real story: hundreds of connections stuck in CLOSE_WAIT (state 08 in /proc/net/tcp). The remote end had closed, but the local side was never released, because the mainframe applications were abandoning database connections after batch processing instead of shutting them down cleanly.
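The ad-hoc counting above can be generalised into a reusable helper that breaks connections down by TCP state. A minimal sketch in POSIX sh and awk; the table-file parameter is my addition, so the same function works on live /proc/net/tcp or on a saved snapshot:

```shell
#!/bin/sh
# Count sockets in a /proc/net/tcp-style table matching a local port and a
# TCP state. Both fields are hex: port 1445 is 05A5; state 01 is ESTABLISHED,
# 06 is TIME_WAIT, 08 is CLOSE_WAIT.
count_state() {  # usage: count_state <table_file> <hex_port> <hex_state>
    awk -v port=":$2$" -v st="$3" '
        NR > 1 && $2 ~ port && $4 == st { n++ }   # $2 = local addr, $4 = state
        END { print n + 0 }
    ' "$1"
}

# Live check on a Linux gateway (guarded so the sketch runs anywhere):
if [ -r /proc/net/tcp ]; then
    printf 'CLOSE_WAIT on 1445: %s\n' "$(count_state /proc/net/tcp 05A5 08)"
fi
```

Pointing the helper at snapshots saved during an incident lets you replay the state distribution afterwards instead of relying on live capture.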
Network Socket Patterns Expose Mainframe Bottlenecks
The team built monitoring around socket state transitions. They tracked connection lifecycle patterns between the Linux gateways and mainframe systems, focusing on three key metrics:
- Active connection counts per mainframe service port
- Socket state distribution ratios (ESTABLISHED vs CLOSE_WAIT vs TIME_WAIT)
- Connection establishment rates versus cleanup rates
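The second metric can be sketched as a ratio check that flags a port when CLOSE_WAIT sockets pile up relative to live ones. The 25% threshold below is an illustrative value, not the bank's actual alert level:

```shell
#!/bin/sh
# Alert on a CLOSE_WAIT buildup for one service port, reading a
# /proc/net/tcp-style table (ports and states are hex: 01 = ESTABLISHED,
# 08 = CLOSE_WAIT).
closewait_alert() {  # usage: closewait_alert <table_file> <hex_port>
    awk -v port=":$2$" '
        NR > 1 && $2 ~ port {
            if ($4 == "01") est++
            if ($4 == "08") cw++
        }
        END {
            printf "established=%d close_wait=%d\n", est + 0, cw + 0
            denom = (est > 0 ? est : 1)        # avoid divide-by-zero on idle ports
            exit (cw / denom > 0.25 ? 1 : 0)   # non-zero exit => raise an alert
        }
    ' "$1"
}
```

Run from cron against /proc/net/tcp, the non-zero exit code plugs straight into whatever alerting hook you already have.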
This approach revealed that COBOL batch jobs were creating connection leaks every time they processed large settlement files. The mainframe's internal monitoring couldn't see this because it only tracked CPU and memory usage, not network resource consumption.
File Descriptor Monitoring Catches Application Health Issues
Parallel monitoring of /proc/PID/fd on gateway processes provided the missing piece. The DB2 connection pool processes showed steadily climbing file descriptor counts during batch windows.
ls /proc/$(pgrep -o db2_bridge)/fd | wc -l    # -o picks the oldest PID if several match
1847
Normal operations used around 200 file descriptors. The pattern was clear: each Tuesday's payroll run triggered a cascade of connection leaks that eventually starved the trading systems of database resources.
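That check is easy to wrap into a recurring alert. A sketch under stated assumptions: the process name db2_bridge comes from the story above, and the 1,000-descriptor ceiling is illustrative, chosen as roughly five times the ~200-fd baseline:

```shell
#!/bin/sh
# Open-descriptor count for a PID, read from /proc/<pid>/fd.
fd_count() {  # usage: fd_count <pid>
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Alert when the oldest process matching a name crosses the fd limit.
check_fds() {  # usage: check_fds <process_name> <limit>
    pid=$(pgrep -o -x "$1") || return 0   # nothing running: nothing to check
    n=$(fd_count "$pid")
    echo "$1 pid=$pid fds=$n"
    [ "$n" -le "$2" ]                     # non-zero exit => alert / restart
}

check_fds db2_bridge 1000
```

Because the leak grows steadily through the batch window, even a one-minute polling interval gives plenty of warning before the pool is exhausted.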
Implementation Results and Cost Analysis
The socket analysis methodology they developed caught the next incident 40 minutes before it impacted trading systems. Early detection allowed them to restart the gateway processes proactively, clearing the leaked connections.
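The early detection can be sketched as a growth comparison between two snapshots of /proc/net/tcp taken a few minutes apart: if connections to a port are opened much faster than they are cleaned up, the leak is already underway. The 05A5 port (1445 in hex) carries over from earlier; the growth threshold of 50 is my assumption, not the bank's figure:

```shell
#!/bin/sh
# Sockets whose local port matches (hex) in a saved /proc/net/tcp snapshot.
port_conns() {  # usage: port_conns <snapshot_file> <hex_port>
    awk -v port=":$2$" 'NR > 1 && $2 ~ port' "$1" | wc -l
}

# Compare two snapshots taken an interval apart; alert on runaway growth.
growth_alert() {  # usage: growth_alert <before_snapshot> <after_snapshot> <hex_port>
    delta=$(( $(port_conns "$2" "$3") - $(port_conns "$1" "$3") ))
    echo "net connection growth: $delta"
    [ "$delta" -le 50 ]                   # non-zero exit => proactive gateway restart
}
```

Snapshotting with `cp /proc/net/tcp /var/tmp/tcp.$(date +%s)` from cron keeps the history needed for the comparison.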
More importantly, they identified the root cause: a COBOL subroutine wasn't properly releasing DB2 cursors after processing settlement batches larger than 50,000 records.
Avoiding Proprietary Tool Costs
Previously, the bank had considered purchasing IBM's Tivoli monitoring extensions for mainframe application performance management, quoted at €180,000 annually. Their /proc-filesystem approach to monitoring the COBOL workloads cost them nothing beyond internal development time.
The lightweight monitoring scripts consumed less than 2MB of memory per gateway server and produced more actionable alerts than their existing enterprise monitoring stack.
Scaling Across Multiple Mainframes
Success with the AS/400 led them to apply similar techniques across their z/OS systems. They discovered that socket analysis could predict mainframe performance issues across different applications:
- CICS transaction processing showed connection buildup patterns 15 minutes before response time degradation
- IMS batch job socket behaviour indicated queue exhaustion before jobs started failing
- TSO user session socket states revealed security policy violations and resource hogging
The bank now monitors 12 mainframe systems through their Linux gateway infrastructure, catching performance issues before they cascade into business-critical failures. Their monitoring approach demonstrates how modern alerting can bridge legacy and contemporary infrastructure without requiring proprietary agents on the mainframes themselves.
Socket analysis transformed their incident response from reactive firefighting to proactive prevention. Tuesday morning settlement processing now completes without affecting trading systems, and their mainframe performance visibility rivals what they achieved on modern Linux infrastructure.
This approach to infrastructure monitoring proves that understanding network behaviour often reveals more about application health than traditional performance metrics. When vendor monitoring tools show green but users experience problems, the answers frequently hide in the network layer that connects your systems.
FAQ
Can this socket analysis approach work with other mainframe vendors besides IBM?
Yes, the technique works with any mainframe system that communicates through TCP connections. We've seen similar success with Unisys and Fujitsu mainframes, as long as you have Linux gateway servers handling the network communication.
How much overhead does continuous /proc monitoring add to gateway servers?
The monitoring scripts consume less than 2MB of memory and negligible CPU resources. The shell-based approach, which reads the kernel's standard proc(5) interfaces directly, is far more efficient than agent-based solutions.
What's the minimum connection volume needed to make socket pattern analysis effective?
You need at least 50-100 concurrent connections to establish meaningful baselines. Below that threshold, connection state changes become too sporadic to identify clear patterns indicating application problems.