Armed with a text editor

mu's views on program and recipe! design

June 2006

Painful Debugging Posted 2006.06.03 15:32 PDT

The last couple days I went through one of the more painful debugging experiences of my life. The symptom was that one of our OggFlac unit tests in Mutagen was failing. But not on x86; that would be too easy. Instead it was only failing on AMD64 systems, which is reminiscent of a bug I covered before. Thankfully Pete was able to provide direct network access to an AMD64 machine for me to test on; otherwise we'd undoubtedly still be fighting this bug. As it was, we spent at least four hours each on Thursday and Friday.

So knowing that this was a painful bug, how do we debug such symptoms?

Tighten our tests

First things first, try to isolate the problem as much as possible. The failing test did a small sequence of things, starting from a nearly blank file, adding some stuff, deleting it, adding other stuff, and then failing the test on line 39. So pare it down. As it turns out all we need are the three lines 37–39: add, save, test.

Examine the code in question

Now that we know which methods reveal the bug, we can look into the code behind them. This can help find many kinds of errors. I think it helped us tighten our code a bit. But after many hours of staring at it and trying various things it was still no help.

Compare to passing code

We had a passing version of the same test on x86. Was the problem in our code, or in the crosscheck we were performing? Taking the output produced on x86 and testing it with flac on x64 (and vice versa) showed it was a problem in our code, and not the tool. Drat. But hey, the file is the same up until that last step. That confirms everything else so far is good.

Since looking at the file gave us no good clues, other than the broken file in question looked like garbage, it was time to delve deeper. I coded up a tracing wrapper, which injects tracing primitives into python code. I invoked this from the OggFLAC tests, and traced ogg, oggvorbis, and oggflac modules under mutagen.

Note: the linked version is the final version. Earlier versions lacked tracefile, function return values, and indentation.

Confuse yourself over and over

The only differences for the longest time, after filtering out inconsequential differences, were at the failure spots themselves. The passing code would continue; the failing code would report an exception. I would add more tracing. The same. What else can I trace? There has to be a difference somewhere. So I added tracefile.

The dreaded heisenbug

And the size of the tracelog exploded. I spent time coping with that when I should have been realizing that there is no longer any difference. The good and bad trace were too different (because of our fallback code for the AMD64 mmap problems in python) so make them more similar. Somehow the failing code started working. What did I do? I added tracefile. Huh? Great...now I can't even observe my bug without it morphing.

In concurrent programming, or code with extreme speed needs, heisenbugs can be a real problem. Fortunately Mutagen is neither.

The final push

The thing about the final push is you never know when it will hit. When you first start debugging you're sure it's right around the corner. Then the more you work at it without solving it, the more convinced you become that something insane is wrong with your tools. So here's a reconstructed stream of consciousness from when I first saw the heisenbug.

Take all tracing out, but leave in tracefile. Okay, it's still working. Take out tracefile. Okay, it fails. Hey, it fails on x86 too. Hallelujah! Wait. Huh? Pare down tracefile. Use it and take out #read. It fails. Put #read back and take out #write. It works. Leave in #read but simplify it to just recurse the call. It fails. What's left?

Eureka

I fiddle with it and find the only piece I need is self.tell() before calling self.read. So I mention my bewilderment to Joe and he finds a gem: this bug has been reported to Python before. And denied. Apparently rightfully so. Six month old tested code is to blame. I fix it up by using explicit seek calls, and everything passes.

Lessons learned

There's always another dark corner of your toolchain left to be learned, often the hard way. In our case it was the fact that interleaved read and write operations on a file pointer, without explicitly resetting the stream position, are undefined. They work most of the time on the implementations we run on, but they fail sometimes. I feel we did an okay job of challenging and proving our assumptions right or wrong, but it was really hard to make the leap that let us trust the new barely tested code that was revealing the failure, and stop trusting the old tested code that was causing it. I'd like to do better about that in the future.

In the meantime, Mutagen 1.4, with corrected file handling code on 64-bit systems, here we come!

(0 Comments ) (0 Trackbacks) debug mutagen python