Friday, December 24, 2010

Working backwards

Sometimes I get a bug that I just can't figure out. If the problem is reproducible with a good test case, it's usually easy to narrow down quickly. But what do you do if your product crashed once on a customer's server and hasn't failed again?

Well, you start with the logs and diagnostic files. Usually we can figure it out from tracepoints and a core file. But sometimes this doesn't work.

In cases like this I don't like to throw in the towel without doing something. It feels like defeat (probably because it is). Instead, I always try to figure out what additional information could have helped me solve the problem.

How did we get to the point of failure? If I can identify two or three paths to the failure point and can't infer which one was taken, I'll add some tracepoints to those paths. Or maybe I can add assertions on those paths to detect the error a bit earlier.
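
As a rough illustration (a hypothetical sketch in Python, not code from any real product), suppose a buffer release can be reached from either a timeout handler or a disconnect handler, and a core file can't tell me which one ran. The names `Connection`, `release_buffer`, `on_timeout`, and `on_disconnect` are all made up for this example. The tracepoint records which path was taken, and the check catches a double release at the moment it happens rather than at some later crash.

```python
import logging

log = logging.getLogger("trace")


class Connection:
    """Hypothetical resource that owns a buffer."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.buffer = bytearray(16)


def release_buffer(conn, caller):
    # Tracepoint: record which path led to the release.
    log.debug("release_buffer: conn=%s caller=%s", conn.conn_id, caller)
    # Assertion: fail here, at the second release, instead of crashing later.
    # Raised explicitly so the check still runs under `python -O`.
    if conn.buffer is None:
        raise AssertionError(
            f"double release of conn {conn.conn_id} via {caller}"
        )
    conn.buffer = None


def on_timeout(conn):
    # Path 1 to the release point.
    release_buffer(conn, "on_timeout")


def on_disconnect(conn):
    # Path 2 to the release point.
    release_buffer(conn, "on_disconnect")
```

If both handlers ever fire for the same connection, the second call fails immediately and the message names the path that got there second, which is exactly the piece of information that was missing the first time.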

Since the problem isn't reproducible, they won't help me now, but they might help me in the future. If the problem does occur again (and 'not reproducible' really just means 'very rare'), hopefully these diagnostics will get me one step closer to the actual problem. And if it doesn't hit any of my new tracepoints or assertions when it recurs, that's useful (and potentially maddening) too.

Of course I still might not be able to figure out what's happening. Then I add another round of tracepoints and assertions. Each failure gets me one step closer to the solution.
