Armed with a text editor

mu's views on program and recipe! design

December 2005

Where do tags come from? Posted 2005.12.29 11:56 PST

Library-based music players are great. They let you use all sorts of data to tweak your listening habits, or to find the song you want, or to help make sure you don't downloadbuy a song twice by accident. But aside from the data that can be automatically generated like how many times it's played a track, or when the last time was, metadata has to come from somewhere. I don't know of any online sources that provide the richness of data that I want, whether I purchase the track on disc or online. Most of the time I have to tag tracks myself. Here's a quick history of the ways I've done it before Quod Libet.

Early offerings

When I first started ripping my collection in the late 1990s, WinAmp was the pinacle program; MusicMatch Jukebox was a helpful ripper with its own ambitions. I would rip a CD with the latter, and then fix up its tags with the former. Tag editing was limited to ID3v1 and the state of the art tag editor was a simple form allowing me to specify a Title, an Artist, an Album, a Comment, a Genre, a Year and a Track number. If I wanted to edit all the tracks of an album, I had to duplicate the information across the tracks manually.

I believe MusicMatch worked differently and would actually allow me to specify some of these items in bulk. However it would append its own frame to the end of a song, effectively overwriting the real ID3 tag, and when an ID3 tag was appended, it couldn't find its frame. Yes, ID3 tags are a hack. At this stage they weren't even an elegant hack, and intercompatibility wasn't ready.

Free alternatives

Some years later I used the free software equivalents: XMMS and grip. Grip would query CDDB or FreeDB for track information, and automatically add it to an encoded MP3 or Ogg Vorbis track. Thankfully there was no extra frame conflict; ID3v1/ID3v2 had become standard for MP3s by this time, and Ogg Vorbis had its own comment standards. Now that the tools agreed, they worked together well. When CDDB had all the information correct there was no editing work to do; when it did not, grip allowed me to edit it before ripping. The interface was similar to WinAmp's in that there was a form with several entries, but since grip shared items such as Artist and Album, it saved a lot of copy and paste.

Migrations

Not all my albums had good coverage on CDDB. One frequent problem would be spelling and accents for unusual names. After one session, half of my discs had been ripped under artist Moxy Fruvous and half under Moxy Früvous. Other times I would want to change storage conventions, or learn new information that meant my previous tags were wrong. Or I started using Rhythmbox and wanted more informative tags—it's too bad that Rhythmbox is only now starting to get editing capabilities.

At this point grip would be no help unless I were to re-rip and re-encode the tracks, so I turned to cantus. Cantus was the first program I used to rename files based on their tags, or retag files based on their names. Unfortunately cantus was tuned for MP3 with ID3v1 style ID3v2 tags, and was extremely buggy in the version that added support for Ogg Vorbis. I'm sure it's fixed by now, but I've never looked back. Cantus among others never really handled arbitrary tags very well, instead they handled only the ID3v1-esque predefined set.

All together now

When Joe started writing Quod Libet, I knew it was time to bring all the pieces together. I needed good support for converting between filenames and tags, and i needed great ways to apply the same changes to several songs at once. Many programs (the command-line id3v2 tool) would allow you to apply certain changes (setting the artist) to several tracks, but would not support symmetric operations (removing a comment). I would make Quod Libet support doing both at once. As other time-savers became obvious Joe would add them. At this point I'm comfortable saying that Quod Libet's tag editing is top notch. I hear EasyTag is good, but I can't imagine anything actually being better than Quod Libet for how I use it.

(1 Comments ) (0 Trackbacks) metadata quodlibet

Python format string vulnerabilities (3): addendum Posted 2005.12.17 20:26 PST

Got my first patch in to Python. The important parts are just as I sent them; the cleanup bits were spiffed up and done better. Thanks for the fast work, Neal!

(0 Comments ) (0 Trackbacks) patch python

Python format string vulnerabilities (3) Posted 2005.12.17 12:04 PST

A while back I pointed to http://python.org/sf/1365916 as I lamented how long it was going without response. Since then good things have happened. Armin Rigo and Neal Norwitz weighed in, first to see if Joe was willing to check for other bad format strings in mmapmodule.c, and then when I pointed them to my work they enthusiastically supported me creating a patch.

Creating a patch for this problem which crosses over many parts of the C implementation of Python is no easy task. Having looked at the simple output of my format checker, I expected things to be pretty cut and dry. Some of them are. Parsing an int format into a size_t variable is as bad as parsing it into a long. Scenarios like this need to be changed to use an intermediate variable of an explicitly understood size.

But some of them aren't. The ones that seem most confusing are the signed/unsigned conflicts. There are only subtle differences between signed and unsigned integers, but sometimes these differences are very desirable. Furthermore the difference on the Python parsing layer between signed and unsigned isn't the same as on the C level. If I read the code correctly, when parsing a signed integer, Python will first check that the value is between -2**(bits-1) and 2**(bits-1)-1 and raise an error if it is not. But it treats unsigned integers as bitfields, so does not verify the value is between 0 and 2**bits-1.

For the changes I'm making this is pretty good. It means user code won't suddenly start throwing exceptions after replacing a signed format with an unsigned one. However in the odd chance they were relying on the out of signed-bounds exception on a semantically unsigned value, there could still be issues when Python stops throwing the exception.

For now I'm planning to change each ParseTuple signed-format with unsigned variable to use an unsigned format, but I'll leave alone unsigned-format with signed variable. Fixing bugs like this without impacting everyone is impossible, but hopefully this is the right compromise.

(0 Comments ) (0 Trackbacks) python

Developing Mutagen (5): Unit Tests Posted 2005.12.08 21:58 PST

I finished describing the important capabilities, and the design decisions I made for Mutagen to support them cleanly last time. Now I'm going to talk about the other half. Literally.

A wonderful little program that makes me feel terribly underpaid for all my python software, SLOCCount, pits Mutagen's ID3 support at 1042 lines, plus 58 lines to be shared with non-ID3 formats. The associated unit tests are 1069 lines.

I didn't quite follow a regiment of test-first-programming, or test-driven-development, but I came pretty close. I'll tell you one thing: it feels really good. There have been a couple times where I've made large implementation changes, and due to the body of tests I felt secure doing so. One such change was a switch from manual seek/read/write code, to mmap.move code. Then when there were repercussions on 64-bit platforms, I brought back the manual code as a fallback and was able to quickly extend the tests to fake the failure in order to test the fallback. Rather than spend two more weeks in some form of beta, I was immediately confident enough to release my changes.

Metaprogramming unit tests

Enough bragging about unit tests; it's time to introduce one of my favorite testing patterns. In testing there are bound to be batches of very similar tests that need to be run on different sets of data. With something like Mutagen the prime example is frames. Each frame needs to be tested to guarantee that a change won't silently break it. Its operations include the three I've previously covered: reading from the bytestream, writing back to the bytestream, and the frame == eval(repr(frame)) circle.

I could write a series of test cases with very similar looking tests. Do something like:

def test_TXXX_repr(self):
    frame = TXXX.fromData(..., data)
    frame2 = eval(repr(frame))
    self.assertEquals(frame, frame2)

Repeat this 100 times for 100 supported tags, and then go on to the read-write tests, and do something like the following another hundred times:

def test_TXXX_write(self):
    frame = TXXX.fromData(..., data)
    data2 = frame._writeData()
    self.assertEquals(data, data2)

Eeeeewwww. I just finished showing how I avoided this kind of mess in the implementation of Mutagen, I'm not about to create a new mess here. What's the next incremental step? Data-driven tests. Collect all the data in one place, and test it all.

framedata = [
    (TXXX, txxx_bytestream_here),
    (TPE3, tpe3_bytestream_here),
    ...
]

def test_frames_repr(self):
    for fid, data in framedata:
        frame = fid.fromData(..., data)
        frame2 = eval(repr(frame))
        self.assertEquals(frame, frame2)

def test_frames_write(self):
    for fid, data in framedata:
        frame = fid.fromData(..., data)
        data2 = frame._writeData()
        self.assertEquals(data, data2)

That's definitely an improvement, right? Just plop all frames in the framedata list and they're all tested. It does have two fatal flaws. First it exits early on any failure, and only tests the frames that come before the first failing frame. Second there is no indication of what frame it was testing when it fails. Could I print out the frame ID before each assertEquals? Sure. Do I want to? Absolutely not. Clean test output is essential to catching failing tests; and since the tests should pass more often than they fail, it's the wrong effort in the wrong place.

Python's dynamic support for metaprogramming comes to the rescue. A test case is any class derived from unittest.TestCase; a test is just a method on that class whose name begins with "test". Python will let me create a class on the fly out of little more than a dictionary. If that class happens to derive from TestCase, it's good to go. So I keep the list of framedata from above, and instead create some dictionaries to turn into classes.

framedata = [ ... ]
repr_tests = {}
write_tests = {}

# build dictionaries filled with methods ready to become tests
for fid, data in framedata:

    def repr_test(self, fid=fid, data=data):
        frame = fid.fromData(..., data)
        frame2 = eval(repr(frame))
        self.assertEquals(frame, frame2)
    repr_tests["test_%s_repr" % fid.__name__] = repr_test

    def write_test(self, fid=fid, data=data):
        frame = fid.fromData(..., data)
        data2 = frame._writeData()
        self.assertEquals(data, data2)
    write_tests["test_%s_write" % fid.__name__] = write_test

# turn the dictionaries into classes
# class foo(bar): def baz():... <-> foo = type('foo', (bar,) {'baz':...})
testcase = type('TestRepr', (unittest.TestCase,), repr_tests)
testcases.append(testcase)
testcase = type('TestWrite', (unittest.TestCase,), write_tests)
testcases.append(testcase)

Steps to success

All the problems I mentioned above are solved. Each test runs independently so all frames are tested and all failures are reported. In each failure, the name of the tag and the kind of the operation are part of the failure output so it's obvious which test failed. There are just a few things to keep in mind.

Scopes and closures in Python don't always work like you expect. In the above code, I use def test(self, fid=fid, data=data) for a very specific reason: if I don't, the binding is purely dynamic. From the scope of the function, Python searches out until a name binding of fid (or data) is used. By the time these functions run, the names have one binding at best, which is the last entry in the framedata list. I'd have 100 tests gleefully testing one frame if I wrote it that way. By listing it in the default arguments, the value is captured at function creation time, each of the 100 tests will test a different frame, and I am happy.

Know your framework. The framework within which I run my tests will execute the tests in the testcases list. If you use unittest.main() instead, it needs to be able to find them via introspection. Name them individually in the module's global namespace (TestRepr = type(...); TestWrite = type(...)), and it should work.

And finally, don't confuse the number of tests, or lines of code that are tests, with the coverage or quality of the tests. Don't get hung up on making everything meta either. While Mutagen has a sanity test for every frame (and a meta-test that tests if each frame is tested!), that's only a quarter of its test code; most of the remaining tests are standard simple tests. The added cognitive complexity of meta-testing is not worth it for the remaining tests, but boy does it work well for frames!

Any questions?

With this I think my series on developing Mutagen is complete. If you have any questions, or there are areas you'd like to read more about, please ask!

(0 Comments ) (0 Trackbacks) mutagen

Developing Mutagen (4): Meta-Leverage Posted 2005.12.07 21:23 PST

After covering the dual is-a has-a (or has-many) tree of Frame and Spec classes last time, I promised I'd share some of the other advantages of this data-driven metaprogramming structure. Here's where it gets really cool. With some slight trickery in a couple base-class functions, every last Frame class becomes very easy to use.

This entry is going to use a lot of code, because that's where the benefits are most apparent and most useful. To make sense of it, remember that each class has a member called _framespec which is a list of specs. Each spec has a read, write, and validate method, and stores its own preferred name.

Construction

Most useful python classes allow you to create an instance of a class based on a set of values. For example, the canonical text frame has an encoding and a list of strings. It would be really cool if I could construct a TXXX frame by passing it an encoding, a description, and a list of strings. So how can I enable this? There's a framespec defined already:

_framespec = [
  EncodingSpec('encoding'),
  EncodedTextSpec('desc'),
  MultiSpec('text', EncodedTextSpec('text'), sep=u'\u0000')
]

I could just define an __init__ method that mirrors it like so:

def __init__(self, encoding=None, desc=None, text=None):
    self.encoding = encoding
    self.desc = desc
    self.text = text

This mostly does the trick, but it's really easy to pass invalid data and generate a nonsense frame. That's a shame because specs have a validate method, and it's easy, but ugly, to fix:

def __init__(self, encoding=None, desc=None, text=None):
    self.encoding = self._framespec[0].validate(self, encoding)
    self.desc = self._framespec[1].validate(self, desc)
    self.text = self._framespec[2].validate(self, text)

No, that's really ugly. Not only is there no useful semantic connection between 0 and encoding, 1 and desc, or 2 and text, but there's all this boilerplate self._framespec[n].validate() repeated for each spec. How to fix that? Go meta!

def __init__(self, encoding=None, desc=None, text=None):
    for spec in self._framespec:
        setattr(self, spec.name, spec.validate(self, locals()[spec.name]))

I've now removed the repeated code within the __init__ method, and partly fixed the ugliness. But this still needs to appear in each relevant frame class. I'd much rather write something a little harder just once, rather than something a little easier twenty times.

On top of that, using locals()[spec.name] smells funny to me. It's not bad, but there's something about it I don't like, perhaps how easy it would be to accidentally clash a name if I added code before this for loop. If I'm willing to trade away the function signature, I can use an explicit dictionary...

def __init__(self, **kwargs):
    for spec in self._framespec:
        validated = spec.validate(self, kwargs.get(spec.name, None))
        setattr(self, spec.name, validated)

...and lose the ability to use positional parameters. That's not too bad. One more tweak and I can have positional parameters again: pull in a positional variadic and process it; then process the remaining specs as keyword arguments in the kwargs dictionary.

def __init__(self, *args, **kwargs):
    for spec, val in zip(self._framespec, args):
        setattr(self, spec.name, spec.validate(self, val))
    for spec in self._framespec[len(args):]:
        validated = spec.validate(self, kwargs.get(spec.name, None))
        setattr(self, spec.name, validated)

These last two versions fare very well as a base-class Frame.__init__ method. The latter is used almost verbatim in Mutagen. As an example of why this is a good factoring, consider the other task Mutagen's Frame.__init__ performs: if passed an existing frame as its sole argument, it copy-validates it into the new frame. If this code were all over the place, many different versions would have to be made. As is, a loop like the kwargs loop fetches values from the passed frame, validates and sets them.

Representation

Little is worse than having a simple data structure instance, but not being able to easily view its contents quickly and succinctly in the interactive interpreter. Hot on the heels of the constructor implementation, it should be almost obvious how we can construct a single unified __repr__ method. Best yet, by leveraging the constructor I just wrote, it's easy to follow recommendations and make a representation that eval()s to the same frame.

def __repr__(self):
    values = []
    for attr in self._framespec:
        values.append('%s=%r' % (attr.name, getattr(self, attr.name)))
    return '%s(%s)' % (type(self).__name__, ', '.join(values))

Easing access

There's one more touch that makes Mutagen's frames easy to use. While there is a lot of structure to the frame's data, most of the time client code only really cares about one part. In a text frame that's almost uniformly the list of string values. If there's only one value in that list, then the client code only cares about one string.

To facilitate this access simplification, Mutagen defines __str__ and __eq__, and then some list access methods which forward directly to its list. These include __getitem__, __iter__, append, and extend. The stringification function returns a single string composed of the list joined by NULL characters. Equality handles a comparison against another text frame, or against a string. The list methods make it easier to modify or refer to single strings in the list. Oddly Mutagen doesn't have a __setitem__; apparently we haven't needed it yet in any of our client code.

Unforunately I haven't seen a good way to specify this data less frequently than in in each frame class that specifies a framespec. Perhaps priority values could be given to spec classes, and this implementation could be generated with a true metaclass. So far I don't think that would be a winning trade over the straightforwardness of this code. While there will be repeated code, none of it is as full of boilerplate copy-mispaste chances as the initialization code.

Mutagen tidbits

That's the first half of Mutagen. I've described the interesting parts of the design now, and have just a few more tidbits about the implementation that I want to share. If you're wondering how I could only have a few tidbits left when there's an entire second half, then you'll have to wait for the next entry when I dive into the second half. The second half is arguably more important than the first, even though the first half does all the work. But that will wait. For now, just some tidbits.

ID3v2.2

The set of available frames changes significantly between ID3v2.2, ID3v2.3, and ID3v2.4. By and large 2.3 and 2.4 are drop-in compatible, and I won't go into the incompatibilities here. However 2.2 uses an entirely different data structure format, often using six bytes to 2.3 and 2.4's ten, and with three-character frame IDs instead of four. However the payload is almost the same.

Mutagen handles this by defining a series of subclasses of the final 2.3/2.4 frames. The user-defined text frame is class TXX(TXXX): pass; The starting key is class TKE(TKEY): pass; and so forth. The data reading code in Frame handles the headers, but then the payload is shared. Finally all 2.2 frames are "upgraded" to their 2.4 equivalents by executing:

if len(type(frame).__name__) == 3:       # if it's a 2.2 frame,
    frame = type(frame).__base__(frame)  # use the copy-validate from above!

ID3v1 and ID3v1.1

The first versions of ID3, ID3v1 and 1.1 aren't going away any time soon. As I mentioned in the first of this series, they're extremely limited with only 128 bytes of storage including the header TAG, and thus can only encode a very small set of predefined set of values. Since I prefer rich and flexible metadata, I detest ID3v1. But many portable devices only understand ID3v1. Of those that understand some form of ID3v2, many do not understand the most recent. Despite my distaste for ID3v1, Mutagen cannot ignore it.

Mutagen implements this with a completely separate read and write loop, and translates to and from ID3v2.4 tags internally to shield the client code from too many differences. There are definite downsides to this if ID3v1 is important to you user, as frame value length limits are not enforced locally; instead the data is truncated to fit when it is written. Oh well; that's a small sacrifice for incorporating a legacy format into this modern library.

Remember: the second half of Mutagen awaits next entry.

(0 Comments ) (0 Trackbacks) mutagen

Developing Mutagen (3): Data-Driven Metaprogramming Posted 2005.12.06 21:18 PST

Yesterday I described a lot of the complexities of the ID3v2 frame structures. By grouping frames into text frames, URL frames, and other frames, and then subgrouping appropriately, it's possible to avoid some code redundancy. But considering the number of remaining frames, it's really not enough. It would also be a royal pain to test all the copy-pasted implementations. So what can we do?

FrameSpecs

Mutagen solves this with data-driven metaprogramming on a _framespec. (Don't bother looking for this in the ID3 specifications; it's a Mutagen construct). This member stores a list of specifiers (specs). Each spec is an instance of a class derived from Spec. Although each spec knows only four things, the four things are enough to provide really rich functionality. It knows:

Using specs

To tie this down let's look at how to implement a text frame class. First we'll examine the following framespec, which corresponds directly to the text frame structure of one byte specifying encoding followed by a payload of text data:

_framespec = [
  EncodingSpec('encoding'),
  EncodedTextSpec('text'),
]

The EncodingSpec knows how to read a byte off the stream, and return its interpreted integer value. The EncodedTextSpec knows how to read the text data, decode it through the stored encoding, and return it. To load the value of a single frame, the base class Frame exposes a factory function: give it some information about the tag and the frame, and also pass it the payload data of the frame, and it will return a instance filled with the data from the payload bytestream. This factory function then iterates across all its framespecs, calls its read method, and stores its return value on the frame instance as a member data value with the spec's name.

MultiSpec

Since the spec receives the frame whose data it is reading or writing, you might wonder why it doesn't store the value directly. The reason: flexibility. The framespec above isn't sufficient. While most text frames may have a single value, the ID3v2.4 spec allows for multiple strings separated by NULLs in most cases. In other scenarios it allows for multiple pairs of such strings. Rather than understand the NULLs in client code, or build list behavior into each relevant spec, the indirection allows us to put all the logic in one spec.

MultiSpec is a meta-spec. It doesn't know how to read from or write to a bytestream, though it does have a name. It is the one spec that stores multiple values as a list. When it's asked to read a bytestream, it iterates over its sub-specs calling their read methods as if it were a Frame, as many times as is necessary to exhaust the available data. So using MultiSpec, the real TextFrame._framespec is:

_framespec = [
  EncodingSpec('encoding'),
  MultiSpec('text', EncodedTextSpec('text'), sep='\u0000'),
]

Reading a frame

Now when SomeTextFrame.fromData is called to read a frame, the process will be as follows:

Writing a frame

Writing a frame has symmetrical steps:

Extensibility

Doing things this way makes it really easy to add additional frames to Mutagen. If the frame structure is composed of already specified piece, just reuse the spec. If there's a new one, implement it as a new spec. List out the specs as they can appear in the frame, and it not only just works, it's practically documented. For instance the PairedTextFrames (TIPL and TMCL) have

_framespec = [
  EncodingSpec('encoding'),
  MultiSpec('people',
    EncodedTextSpec('involvement'),
    EncodedTextSpec('person))
]

TXXX has

_framespec = [
  EncodingSpec('encoding'),
  EncodedTextSpec('desc'),
  MultiSpec('text', EncodedTextSpec('text'), sep=u'\u0000')
]

URL frames (other than WXXX which has an encoded text description) have just

_framespec = [
  Latin1TextSpec('url')
]

Even APIC is almost understandable just by reading its framespec.

_framespec = [
  EncodingSpec('encoding'),
  Latin1TextSpec('mime'),
  ByteSpec('type'),
  EncodedTextSpec('desc'),
  BinaryDataSpec('data')
]

Next entry I'll cover some more benefits of this structure; data driven metaprogramming is an exceptional fit for Mutagen.

(0 Comments ) (0 Trackbacks) mutagen

Developing Mutagen (2): ID3v2.4 Frames Posted 2005.12.05 21:24 PST

Last entry I discussed a very high level view of how ID3v2.4 metadata is stored as a tag which is composed of several frames. Now I'm going to switch gears and discuss frames from the bottom up, and how it influenced Mutagen's design. You will probably want to refer to the ID3v2.4 frame list several times during this discussion.

Frames

All frames start with ten bytes of header with the following semantic, and are immediately followed by their payload:

    FRAM : four bytes of the frame ID
      00 : two bytes of binary flags
    1234 : four bytes containing the size of the frame payload

Text frames

The vast plurality of predefined frames are text frames. All of these text frames have payloads of the form:

But some have higher level semantic requirements. The frames TIPL and TMCL, for example, require pairs of strings listing functions or instruments paired with artist or person. Frames TRCK and TPOS are supposed to be numeric, and may be either one number or two separated by a slash. Frame TKEY is supposed to be one of A-G followed optionally by either b or # and then optionally by m, all in order to indicate A through G sharp, natural, or flat major or minor; or a single o for offkey. Frames TFLT and TMED have similarly complicated schemes to represent specific information chosen from various partly orthogonal sets. Other frames store dates in a specific format, or numbers that aren't allowed the slash. Finally TXXX stores a name before its main payload.

URL frames

Semantically there are only three types of URL frame. They are all like a text frame with a predefined latin-1 encoding for their text (URLs are expected to be ASCII-clean while artist names are not). The first two differ by how many are allowed in a tag - either one, or one with a given URL. The third, WXXX, is analagous to TXXX, and stores a URL with a user-defined description.

Other frames

There are a slew of other frames. They all have different layouts. For instance APIC has an encoding byte, a null-terminated name in Latin-1, a byte indicating the type of picture, a null-terminated description in the indicated encoding, and a binary blob. On the other end of the complexity spectrum, PCNT stores just an encoded number.

Implementation

How does Mutagen handle this? Primarily by inheritance. Mutagen defines a Frame type with the underpinnings, and then for the text related frames there are five intermediate classes: TextFrame, NumericTextFrame, NumericPartTextFrame, TimeStampTextFrame, and PairedTextFrame. Finally for each frame ID, Mutagen defines a subclass of the appropriate TextFrame class. Most of these subclasses are one line classes, with a description stored in their docstring. Occasionally a frame—such as the genre frame TCON—requires unique processing. Mutagen handles URL frames with two intermediate classes—UrlFrame and UrlFrameU—along with WXXX for the final unique URL frame type. The other frames tend to share little or none of their payload structure, so they generally inherit directly from Frame or FrameOpt (a variant of frame I'll describe next entry).

With what I've described so far, you should have a decent picture of the arbitrary complexity of the ID3 frames. The obvious implementation of the Frame class hierarchy I've described saves a lot of the work of writing many individual handlers, or huge condition trees. There's still a lot of redundancy: many frame types store an encoding and a list of thus-encoded strings, many frames store numbers—some in binary and some as text—and many store arbitrary type indicators. Except within the intermediate classes above, these are uniform in neither positioning nor meaning. Mutagen needs to be able to read and write each frame, and also to present a usable programmer's interface. How can we satisfy these needs while not going crazy with cut and paste?

I've got one big trick up my sleeve to reduce the redundancy, which I'll talk about in my next entry.

(0 Comments ) (0 Trackbacks) mutagen

Developing Mutagen (1): Background Posted 2005.12.02 21:24 PST

Mutagen is a pure Python ID3v2 library I developed to improve Quod Libet's MP3/ID3 support. It's objectively one of the better things I've written. There are several parts of it I feel particularly good about and will talk about over this series. But first I want to cover three pieces of terminology.

Tags and frames

When people started adding metadata to their music files, pretty soon everyone called it "tagging MP3s," and then went on to refer to the artist tag or title tag. This makes a lot of sense, as the process of tagging a file with an artist and a title seems a lot like attaching a series of price tags to the song. But that's not the way I'll use the term tag here.

ID3 tags are a set of informal standards—there's no formal specification submitted to any standards body—and it has specific meanings to the words tag and frame. The frame is the atomic unit that corresponds to the price tag above. It's the name-value pair. The tag is a collection of frames. There can be more than one tag per file, but in practice there is rarely more than one (two if you count the ID3v1 tag). In contrast, while there can be any positive number of frames in a tag, there are rarely fewer than a standard complement of artist, title, album, tracknumber, and genre.

For this series, I will use the term tag when I mean a structured collection of frames, and frame when I mean a structured and encoded name-value pair.

Frame IDs

ID3v1 came with a strict 128 byte block from which you could poll a small predefined set of information. The location in the block would tell you what name implicitly goes with it. Since this was insufficient—both the limited names and the limited space for values—ID3v2 was designed to use name-value pairs. Unlike VorbisComments or APEv2 tags which have free-form descriptive names and free-form binary or unicode string values, ID3v2 was designed to be extremely space-conscious and uses a short predefined character "names" (frame IDs) with variable- or fixed-length value fields.

This leads to frame IDs like TPE2 for text-frame performer 2 (the band, orchestra, or accompaniment), APIC for attached picture (cover or other images), and WCOP for URL to copyright information. There's over 130 defined frame IDs, of which over 50 are three-letter ID3v2.2 IDs, and over 80 are four-letter ID3v2.3 or ID3v2.4 IDs.

Next entry I will talk about the frame structure, and how that influenced Mutagen's design.

(0 Comments ) (0 Trackbacks) mutagen

Python format string vulnerabilities (2) Posted 2005.12.01 21:41 PST

After finding Python's mmap implementation fails on 64-bit systems due to a simple format string problem, I was distressed. This is code that works on the systems it was developed on. It's code that compiles without error on systems it fails on. It's due to a subtle format string inaccuracy that doesn't even receive the help of the compiler checking for invalid association. Yet the consequences are anything but subtle.

Unit tests are a great way to catch these sorts of bugs. If the mmap module had been developed with Test First Programming, chances are very good that this behavior would have been covered with a test and caught on the first supported 64-bit platform with mmap. I hope when Python fixes the mmap bug that my sample reproduction code is turned into a unit test. Then if things blow up on some future 128-bit system we'll find out sooner than a bug report.

But that's not enough. I want my programs to be bullet-proof. I want something I write in Python and test on 32-bit systems to work on 64-bit systems. Always. Or at least to blow up early at import. To proactively ensure this I crafted and now present one ugly program to check format strings in PyArg_ParseTuple, PyArg_ParseTupleAndKeywords, or (poorly) Py_BuildValue. The results (below) are surprisingly comforting.

There are very few errors that are obviously in this 32-bit/64-bit clash. There are a lot of unclear cases. Is a size_t the size of an int or a long? Is it always that size, or should it be transferred through a variable of known size such as an unsigned long? There are also plenty of false mismatches from my format checker, but since the numbers aren't insanely large it's good to err in that direction.

What is the best way to approach Python parsing format strings in the future? I offer two alternatives, and seek others. Do you think the current code needs to change?

Format Check Results

When both unknown and unsigned-mismatches are filtered, these are the results for trunk/Python and trunk/Modules as of r39818. Note bltinmodule is the only pure long vs int conflict outside of mmapmodule (although the size_t vs int/long goes one way in mmapmodule and another in bsddbmodule). The conflict in bltinmodule isn't currently a problem, as the parsed value is not used—it appears to exist solely for cleaner error reporting. Anything too weird is a limitation of my checker.

mactoolboxglue.c:143                      `err' type `OSErr' >< i: int
pythonrun.c:864                   `filename' type `PyObject' >< z: char
pythonrun.c:864                            `text' type `int' >< z: char
bltinmodule.c:1910                     `reverse' type `long' >< i: int
socketmodule.c:1590                `buflen' type `socklen_t' >< i: int
socketmodule.c:3071                     `fd' type `SOCKET_T' >< i: int
mmapmodule.c:465                        `size' type `size_t' >< l: long
mmapmodule.c:538                 `dest' type `unsigned long' >< i: int
mmapmodule.c:538                  `src' type `unsigned long' >< i: int
mmapmodule.c:538                `count' type `unsigned long' >< i: int
mmapmodule.c:867                 `access' type `access_mode' >< i: int
mmapmodule.c:955                 `access' type `access_mode' >< i: int
xxmodule.c:207                               `b' type `long' >< #: int
_bsddb.c:2583                       `flags' type `u_int32_t' >< i: int
_bsddb.c:3751                       `flags' type `u_int32_t' >< i: int
_bsddb.c:3778                       `flags' type `u_int32_t' >< i: int
_bsddb.c:3802                       `flags' type `u_int32_t' >< i: int
_bsddb.c:3825                     `timeout' type `u_int32_t' >< i: int
_bsddb.c:3825                       `flags' type `u_int32_t' >< i: int
_bsddb.c:4256                       `flags' type `u_int32_t' >< i: int
_bsddb.c:4371                       `flags' type `u_int32_t' >< i: int
operator.c:23              `a1' type `PyObject *a1; int a2;' >< O: PyObject
operator.c:23              `a2' type `PyObject *a1; int a2;' >< i: int
zlibmodule.c:126                         `input' type `Byte' >< s: char
zlibmodule.c:203                         `input' type `Byte' >< s: char
zlibmodule.c:399                         `input' type `Byte' >< s: char
zlibmodule.c:469                         `input' type `Byte' >< s: char
zlibmodule.c:781                           `buf' type `Byte' >< s: char
zlibmodule.c:781                   `adler32val' type `uLong' >< l: long
zlibmodule.c:799                           `buf' type `Byte' >< s: char
zlibmodule.c:799                     `crc32val' type `uLong' >< l: long
bsddbmodule.c:473                     `recno' type `recno_t' >< i: int
bsddbmodule.c:512                     `recno' type `recno_t' >< i: int
bsddbmodule.c:794                     `reclen' type `size_t' >< i: int
clmodule.c:886                             `x' type `int x;' >< i: int
_cursesmodule.c:379                     `attr' type `attr_t' >< l: long
_cursesmodule.c:388                     `attr' type `attr_t' >< l: long
_cursesmodule.c:426                     `attr' type `attr_t' >< l: long
_cursesmodule.c:436                     `attr' type `attr_t' >< l: long
_cursesmodule.c:472                     `attr' type `attr_t' >< l: long
_cursesmodule.c:482                     `attr' type `attr_t' >< l: long
_cursesmodule.c:517                     `attr' type `attr_t' >< l: long
_cursesmodule.c:546                     `attr' type `attr_t' >< l: long
_cursesmodule.c:603                      `ch1' type `chtype' >< l: long
_cursesmodule.c:603                      `ch2' type `chtype' >< l: long
_cursesmodule.c:692                     `attr' type `attr_t' >< l: long
_cursesmodule.c:863                     `attr' type `attr_t' >< l: long
_cursesmodule.c:872                     `attr' type `attr_t' >< l: long
_cursesmodule.c:907                     `attr' type `attr_t' >< l: long
_cursesmodule.c:916                     `attr' type `attr_t' >< l: long
_cursesmodule.c:1010                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1020                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1056                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1066                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1390                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1399                    `attr' type `attr_t' >< l: long
_cursesmodule.c:1677      `(int *) &event.bstate' type `int' >< l: long
fcntlmodule.c:174                          `arg' type `char' >< i: int
skipping ../python/trunk/Modules/_hashopenssl.c:405 pop from empty list

See also unfiltered ParseTuple and unfiltered BuildValue logs.

(0 Comments ) (0 Trackbacks)