Re-posting an OPNET APM Xpert user's recent troubleshooting experience: Run-Time Application Dependency Maps for the Network Team.
It was 8:47am when he got the call. Five helpdesk incidents reported in a five-minute period for the recently upgraded portfolio management application. Massive spikes in response times. Angry employees claiming to be paralyzed by "the slow network."
Employees in multiple departments rely on the portfolio manager application to retain customers, make trading decisions, and drive revenue. As usual, the network team was on the hook to either find the problem or prove "it's not the network." And because this was a Class A incident, the CIO required real-time updates on the troubleshooting status. Fun.
The good news is that AppResponse Xpert revealed right away that this was an application problem, not a network problem. Each transaction seemed to be stuck on a web server, either because the web server had issues or because back-end application tiers were held up for some reason. Rather than passing the incident over to the systems team, he decided to dig deeper.
The first hurdle was figuring out which of the 1,000+ systems in the datacenter were talking to the web server for the slow transactions. He asked three different people involved in managing the application to explain the architecture and dependencies. No clear answers emerged. This was a complicated service-oriented architecture (SOA) with lots of moving parts, and no system-level alarms provided useful clues.
Based on his knowledge of the front tier, he queried AppMapper Xpert for a current multi-tier application dependency map. Ten critical servers apparently hosted different components of the guilty application at run time. But AppMapper also revealed an unexpected relationship: some load was bypassing the load balancer and going directly to a server named "credit1." Perhaps specific web services had not been correctly configured to hit the load balancer? Perhaps the upgrade over the weekend had something to do with it?
CPU, memory, and other metrics from SNMP queries seemed normal. But upon closer inspection with AppResponse Xpert, response time spikes were visible from credit1 that lined up exactly with the reported user experience problems, and were not present in the other two systems supporting the credit tier. Some service calls were likely directed to that single server based on a stale configuration file. He walked over to the head of development with a PDF showing the dependency map, the load balancer bypass, and the response time spikes.
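For illustration only, a stale endpoint entry of this kind might look something like the sketch below. The server name "credit1" comes from the story; the property name, port, path, and the load balancer hostname "credit-vip" are hypothetical assumptions, not details from the source.

```properties
# Stale entry (hypothetical): one web service calls the credit server
# directly, bypassing the load balancer entirely.
credit.service.endpoint=http://credit1:8080/creditService

# Corrected entry (hypothetical): all calls go through the load
# balancer's virtual hostname, which distributes load across the
# three servers in the credit tier.
credit.service.endpoint=http://credit-vip:8080/creditService
```

If a weekend upgrade restored or copied an old file containing the first form, the bypass would appear exactly as the dependency map showed: traffic flowing straight to credit1 while its two peers sat mostly idle.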
Ten minutes later, a configuration change was made to the application to drive all traffic through the load balancer. And that was it. The rest of the week proved less stressful.
Contact your Interlink Account Manager to find out how to become an OPNET Synergy Partner.