Friday, June 12, 2015

The Art Of Troubleshooting

I've worked tech support for a long time.  Too long, actually.  I've watched people come and go, new ideas get implemented and then tanked, and old ideas get rehashed as new ideas and then tanked.  The one constant is the methodology of troubleshooting.

I'd like to say it's an art.  It is a skill, but I don't know if it can actually be taught.  I've seen a lot of people who can pass all these wonderful tests and then crap themselves when a down-system call comes in.

Case in point: Down system with a HIGHLY AGITATED customer.  His boss is yelling at him, the users are yelling at him.  Everyone wants the system back online but no one seems to know how to get to that state.

Phones are offline but powered up.

The server is running.

The Phone switch is running.

The IP Network is working for everyone.

Tier 1 gets the call and the customer explodes on them.  The customer wants troubleshooting to start now: they have been down for over an hour and need this fixed immediately.

Troubleshooting has to begin with the flow of data.  Data is life.

Approach the problem from the most likely to the least likely point of failure.  The phones that are down are all in the same physical building.  They span multiple phone switches, but each of those phone switches also has working phones, so the phone switches themselves are unlikely to be the problem.
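That "what do the broken ones share that the working ones don't" reasoning can be sketched in a few lines of Python. This is only an illustration: the inventory, the device names and the suspect_dependencies helper are all made up, and it assumes you already know what each phone depends on.

```python
# Hypothetical inventory: every name below is invented for illustration.
# "deps" is the set of things each phone relies on to reach the phone system.
INVENTORY = [
    {"name": "phone-101", "working": False,
     "deps": {"phone-switch-a", "net-switch-3", "server"}},
    {"name": "phone-102", "working": False,
     "deps": {"phone-switch-b", "net-switch-3", "server"}},
    {"name": "phone-201", "working": True,
     "deps": {"phone-switch-a", "net-switch-1", "server"}},
    {"name": "phone-202", "working": True,
     "deps": {"phone-switch-b", "net-switch-2", "server"}},
]


def suspect_dependencies(devices):
    """Return dependencies shared by every failing device and by no working one."""
    failing = [d["deps"] for d in devices if not d["working"]]
    working = [d["deps"] for d in devices if d["working"]]
    if not failing:
        return set()
    # Start with whatever all the failing devices have in common...
    suspects = set.intersection(*failing)
    # ...then clear anything a working device also depends on.
    for deps in working:
        suspects = suspects - deps
    return suspects


if __name__ == "__main__":
    print(suspect_dependencies(INVENTORY))
```

With this made-up inventory the server and both phone switches are cleared, because working phones use them too. What is left is the one thing only the dead phones have in common.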

The customer reset the phones and the phone switches, and that didn't fix anything, so he called TAC (which is the right thing to do when you are stuck).  The only thing I can fault him on was panicking when things went south.

Panic. Never. Helps.

Follow the Hitchhiker's Guide to the Galaxy's rule: "Don't Panic".

In this case we had to look from Most Likely to Least Likely.  It is HIGHLY unlikely that 50 phones all suddenly go bad at exactly the same time with exactly the same symptoms.  Think from the phone out.

Phones are network devices first and foremost.

Does the phone have power? Yes
Does the phone boot up? Yes
Can the phone get a DHCP address? No
Does it come up with a cached IP address? Yes
Can you ping the phone's IP address from the server? No
Can you ping the phone's IP address from the phone switch? No
Can you ping the server from the phone? No
Can you ping the phone switch from the phone? No

From these quick tests we've determined the phone is getting power from the network PoE switch (so the network cable between the phone and network switch is likely OK).  The phone boots up, which indicates the phone itself is OK (no hardware/firmware failure).  The phone DOES NOT get a DHCP address (potential network problem).  The phone does indicate it's using its cached DHCP address (so the network used to work), but you cannot ping it from the phone equipment it has to connect to in order to operate.  You also cannot ping the phone equipment from the phone sooooo... SURVEY SAYS????

*ding* IP Network Problem!

Rebooting the network switch that the phones were connected to corrected the problem, and the phones logged in successfully.  Total time to identify the point of failure was around 10 minutes.
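For anyone who thinks better in code, the "phone out" elimination above can be written down as a rough Python sketch. The check names, the diagnose function and the wording of each verdict are my own assumptions for illustration, not a vendor tool. The answers are entered by hand from the tests above; nothing here actually pings anything.

```python
# Answers to the quick tests from this call, entered by hand.
# The key names are hypothetical labels, not anything a phone system reports.
OBSERVED = {
    "has_power": True,
    "boots": True,
    "dhcp_lease": False,
    "cached_ip": True,
    "ping_phone_from_server": False,
    "ping_phone_from_phone_switch": False,
    "ping_server_from_phone": False,
    "ping_phone_switch_from_phone": False,
}

PING_CHECKS = (
    "ping_phone_from_server",
    "ping_phone_from_phone_switch",
    "ping_server_from_phone",
    "ping_phone_switch_from_phone",
)


def diagnose(results):
    """Walk the checks from the phone outward and name the likely trouble spot."""
    if not results["has_power"]:
        return "power or cabling"
    if not results["boots"]:
        return "phone hardware or firmware"
    reachable = [results[name] for name in PING_CHECKS]
    if not results["dhcp_lease"] and not any(reachable):
        # No lease and nothing answers in either direction.
        return "IP network"
    if not results["dhcp_lease"]:
        # Assumption: traffic flows but no lease points at the DHCP service.
        return "DHCP service"
    if not all(reachable):
        return "IP network"
    # Assumption: if the network checks out, look at the phone system itself.
    return "server or phone switch"


if __name__ == "__main__":
    print(diagnose(OBSERVED))
```

The order is the point: each question rules out a layer before you move on to the next one, so you never waste time on the server while the phone can't even reach it.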

I'm not bashing the customer AT ALL.  I just want to use this as an example of how important clear thinking, identification of the problem and an understanding of data flow are to proper troubleshooting.

1 comment:

Jason Maxham said...

This is a great troubleshooting story and I like how you quickly narrowed in on the source of the problem. You're absolutely right about how unlikely it would be that 50 phones would simultaneously malfunction! Especially if the system was recently working, the law of parsimony favors a single cause. The network switch was a dependency all the broken phones had in common, so that's a great place to start the investigation.

I wrote an article called "A Common Problem" about exactly this kind of scenario, and how important it is to be aware of context and dependencies while troubleshooting.