A couple of days ago we hit an assertion failure in some code. The failing test case reproduced the problem only 2 times out of 300. So it was reproducible, but highly intermittent.
The developer who was debugging the problem speculated that there might be a timing hole (a race condition between threads). There was a section of code which set two variables kind of like this:
section->hasRedObjects = true;
section->baseOfRedObjects = basePointer;
He hypothesized that if one thread ran through this code while another thread read the same fields, the second thread might see that the section has red objects, but, briefly, might think they start at NULL instead of the correct address somewhere in the middle of the section.
We used a really simple technique to prove this: make the timing hole bigger.
section->hasRedObjects = true;
usleep(10000); // sleep for 10 ms
section->baseOfRedObjects = basePointer;
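To make the idea concrete, here is a minimal, self-contained sketch of the same experiment. Everything in it is hypothetical: RacySection, readerSeesInconsistentState, and the 64-byte storage buffer are stand-ins, not the real code. It also uses std::atomic fields and std::this_thread::sleep_for instead of plain fields and usleep, so that the demo itself has no undefined behavior. One thread spins until it sees the flag, then checks whether the base pointer is still null; the other performs the two stores with an adjustable pause between them.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the section structure described in the post.
struct RacySection {
    std::atomic<bool> hasRedObjects{false};
    std::atomic<const char*> baseOfRedObjects{nullptr};
};

// Returns true if the reader saw hasRedObjects == true while
// baseOfRedObjects was still null. A larger delay widens the window.
bool readerSeesInconsistentState(std::chrono::milliseconds delay) {
    assert(delay.count() >= 0);

    static const char storage[64] = {};  // pretend this is the section's memory
    RacySection section;
    std::atomic<bool> readerReady{false};
    bool sawNullBase = false;  // written by the reader, read only after join()

    std::thread reader([&] {
        readerReady.store(true);
        // Spin until the flag becomes visible, then look at the base at once.
        while (!section.hasRedObjects.load()) {
        }
        sawNullBase = (section.baseOfRedObjects.load() == nullptr);
    });

    // Make sure the reader is already running before the writes start.
    while (!readerReady.load()) {
        std::this_thread::yield();
    }

    section.hasRedObjects.store(true);
    std::this_thread::sleep_for(delay);  // the artificially widened timing hole
    section.baseOfRedObjects.store(storage + 32);

    reader.join();
    return sawNullBase;
}
```

With a delay of zero the reader will usually get lucky and see a valid base, which is the "2 out of 300" situation. With a delay of 10 ms or more the reader should see the inconsistent state on nearly every run, which is what turns a hunch into a confirmed diagnosis.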
(We fixed the problem by getting rid of the bool hasRedObjects field. We defined a new function, bool hasRedObjects() { return NULL != baseOfRedObjects; }. With only one field, a single pointer store updates everything at once, so a reader can no longer see the flag set while the base is still NULL, and the timing hole is eliminated.)
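A rough sketch of what the single-field version might look like follows. The Section struct and the setBaseOfRedObjects helper are hypothetical names, not the real code. The sketch also makes one assumption explicit that the description above leaves implicit: it wraps the pointer in std::atomic, because portable C++ does not guarantee that a plain pointer store is atomic, even though an aligned pointer-sized store is on most common platforms.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-field version of the section structure.
struct Section {
    // The one and only piece of state: null means "no red objects".
    std::atomic<const char*> baseOfRedObjects{nullptr};

    // Derived from the pointer, so it can never disagree with it.
    bool hasRedObjects() const {
        return nullptr != baseOfRedObjects.load(std::memory_order_acquire);
    }

    // Hypothetical setter: one store publishes both "has red objects"
    // and "where they start" at the same instant.
    void setBaseOfRedObjects(const char* basePointer) {
        assert(basePointer != nullptr);
        baseOfRedObjects.store(basePointer, std::memory_order_release);
    }
};
```

A reader that needs the address should load baseOfRedObjects once and test that value for null, rather than calling hasRedObjects() and then loading the pointer a second time; that way it only ever works with a single snapshot of the field.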