Saturday, August 28, 2010

Debugging tip: timing holes

Parallel programming is hard. Really hard. And, as we're now firmly in the era of multi-core computing, it's an increasingly important skillset.

A couple of days ago we hit an assertion in some code. The failing test case reproduced the problem 2 out of 300 times. So it was reproducible, but highly intermittent.

The developer who was debugging the problem speculated that there might be a timing hole. There was a section of code which set two variables kind of like this:

section->hasRedObjects = true;
section->baseOfRedObjects = basePointer;

He hypothesized that if one thread ran through this code while another thread read the same fields, the second thread might see that the section has red objects but, briefly, find baseOfRedObjects still NULL instead of the correct address somewhere in the middle of the section.

We used a really simple technique to prove this: make the timing hole bigger.

section->hasRedObjects = true;
usleep(10000); // sleep for 10 ms
section->baseOfRedObjects = basePointer;

Adding a 10ms sleep between the two assignments caused the problem to occur nearly every time, confirming his hypothesis.
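
Here is a minimal, self-contained sketch of the same experiment, for anyone who wants to try it at home. It is not our real code: RacySection, readerSeesTimingHole and the storage buffer are made-up names, and I've used std::atomic fields (an assumption; the snippets above are plain fields) so that the demo itself has no undefined behavior. Even with each field individually atomic, the pair is still updated in two steps, and that gap is the timing hole.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the real section structure.
struct RacySection {
    std::atomic<bool> hasRedObjects{false};
    std::atomic<char*> baseOfRedObjects{nullptr};
};

// Runs one writer and one reader against a fresh section. Returns true if
// the reader ever saw hasRedObjects set while baseOfRedObjects was still null.
bool readerSeesTimingHole(std::chrono::microseconds gap) {
    assert(gap.count() >= 0);
    static char storage[64];          // pretend section memory
    RacySection section;
    std::atomic<bool> readerRunning{false};
    std::atomic<bool> writerDone{false};
    bool sawHole = false;             // written only by the reader thread

    std::thread reader([&] {
        readerRunning.store(true);
        while (!writerDone.load()) {
            if (section.hasRedObjects.load() &&
                section.baseOfRedObjects.load() == nullptr) {
                sawHole = true;       // flag is set, base is not: the hole
            }
        }
    });

    while (!readerRunning.load()) {
        std::this_thread::yield();    // wait until the reader is polling
    }

    section.hasRedObjects.store(true);
    std::this_thread::sleep_for(gap); // make the timing hole bigger
    section.baseOfRedObjects.store(storage + 32);

    writerDone.store(true);
    reader.join();                    // after join, sawHole is safe to read
    return sawHole;
}
```

Calling readerSeesTimingHole(std::chrono::milliseconds(10)) should report the hole on nearly every run, while a gap of zero should report it only occasionally, if at all. That difference is the whole trick.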

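(We fixed the problem by getting rid of the bool hasRedObjects field. We defined a new function, bool hasRedObjects() { return NULL != baseOfRedObjects; }. With only one field, there is no in-between state for another thread to observe: an aligned pointer-sized store is a single write on mainstream hardware, so a reader sees either the old value or the new one and the timing hole is eliminated.)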
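
A rough sketch of the single-field version follows. The names SingleFieldSection and setRedObjects are hypothetical, and wrapping the pointer in std::atomic is my assumption rather than what our code does; it just makes the "one indivisible write" property explicit to the compiler instead of relying on the hardware.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-field replacement: the pointer itself is the flag.
struct SingleFieldSection {
    std::atomic<char*> baseOfRedObjects{nullptr};

    // Derived from the pointer, so it can never disagree with it.
    bool hasRedObjects() const {
        return baseOfRedObjects.load(std::memory_order_acquire) != nullptr;
    }

    // One store publishes both "has red objects" and where they start.
    void setRedObjects(char* base) {
        assert(base != nullptr);      // null is reserved for "no red objects"
        baseOfRedObjects.store(base, std::memory_order_release);
    }
};
```

A reader that needs the address can load baseOfRedObjects once and test the result for NULL, which gets it the answer and the address in a single read.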
