Armed with a text editor

mu's views on program and recipe! design

Developing Mutagen (4): Meta-Leverage Posted 2005.12.07 21:23 PST (#)

After covering the dual is-a has-a (or has-many) tree of Frame and Spec classes last time, I promised I'd share some of the other advantages of this data-driven metaprogramming structure. Here's where it gets really cool. With some slight trickery in a couple base-class functions, every last Frame class becomes very easy to use.

This entry is going to use a lot of code, because that's where the benefits are most apparent and most useful. To make sense of it, remember that each class has a member called _framespec which is a list of specs. Each spec has a read, write, and validate method, and stores its own preferred name.

Construction

Most useful python classes allow you to create an instance of a class based on a set of values. For example, the canonical text frame has an encoding and a list of strings. It would be really cool if I could construct a TXXX frame by passing it an encoding, a description, and a list of strings. So how can I enable this? There's a framespec defined already:

_framespec = [
  EncodingSpec('encoding'),
  EncodedTextSpec('desc'),
  MultiSpec('text', EncodedTextSpec('text'), sep=u'\u0000')
]

I could just define an __init__ method that mirrors it like so:

def __init__(self, encoding=None, desc=None, text=None):
    self.encoding = encoding
    self.desc = desc
    self.text = text

This mostly does the trick, but it's really easy to pass invalid data and generate a nonsense frame. That's a shame because specs have a validate method, and it's easy, but ugly, to fix:

def __init__(self, encoding=None, desc=None, text=None):
    self.encoding = self._framespec[0].validate(self, encoding)
    self.desc = self._framespec[1].validate(self, desc)
    self.text = self._framespec[2].validate(self, text)

No, that's really ugly. Not only is there no useful semantic connection between 0 and encoding, 1 and desc, or 2 and text, but there's all this boilerplate self._framespec[n].validate() repeated for each spec. How to fix that? Go meta!

def __init__(self, encoding=None, desc=None, text=None):
    for spec in self._framespec:
        setattr(self, spec.name, spec.validate(self, locals()[spec.name]))

I've now removed the repeated code within the __init__ method, and partly fixed the ugliness. But this still needs to appear in each relevant frame class. I'd much rather write something a little harder just once, rather than something a little easier twenty times.

On top of that, using locals()[spec.name] smells funny to me. It's not bad, but there's something about it I don't like, perhaps how easy it would be to accidentally clash a name if I added code before this for loop. If I'm willing to trade away the function signature, I can use an explicit dictionary...

def __init__(self, **kwargs):
    for spec in self._framespec:
        validated = spec.validate(self, kwargs.get(spec.name, None))
        setattr(self, spec.name, validated)

...and lose the ability to use positional parameters. That's not too bad. One more tweak and I can have positional parameters again: pull in a positional variadic and process it; then process the remaining specs as keyword arguments in the kwargs dictionary.

def __init__(self, *args, **kwargs):
    for spec, val in zip(self._framespec, args):
        setattr(self, spec.name, spec.validate(self, val))
    for spec in self._framespec[len(args):]:
        validated = spec.validate(self, kwargs.get(spec.name, None))
        setattr(self, spec.name, validated)

These last two versions fare very well as a base-class Frame.__init__ method. The latter is used almost verbatim in Mutagen. As an example of why this is a good factoring, consider the other task Mutagen's Frame.__init__ performs: if passed an existing frame as its sole argument, it copy-validates it into the new frame. If this code were all over the place, many different versions would have to be made. As is, a loop like the kwargs loop fetches values from the passed frame, validates and sets them.

Representation

Little is worse than having a simple data structure instance, but not being able to easily view its contents quickly and succinctly in the interactive interpreter. Hot on the heels of the constructor implementation, it should be almost obvious how we can construct a single unified __repr__ method. Best yet, by leveraging the constructor I just wrote, it's easy to follow recommendations and make a representation that eval()s to the same frame.

def __repr__(self):
    values = []
    for attr in self._framespec:
        values.append('%s=%r' % (attr.name, getattr(self, attr.name)))
    return '%s(%s)' % (type(self).__name__, ', '.join(values))

Easing access

There's one more touch that makes Mutagen's frames easy to use. While there is a lot of structure to the frame's data, most of the time client code only really cares about one part. In a text frame that's almost uniformly the list of string values. If there's only one value in that list, then the client code only cares about one string.

To facilitate this access simplification, Mutagen defines __str__ and __eq__, and then some list access methods which forward directly to its list. These include __getitem__, __iter__, append, and extend. The stringification function returns a single string composed of the list joined by NULL characters. Equality handles a comparison against another text frame, or against a string. The list methods make it easier to modify or refer to single strings in the list. Oddly Mutagen doesn't have a __setitem__; apparently we haven't needed it yet in any of our client code.

Unforunately I haven't seen a good way to specify this data less frequently than in in each frame class that specifies a framespec. Perhaps priority values could be given to spec classes, and this implementation could be generated with a true metaclass. So far I don't think that would be a winning trade over the straightforwardness of this code. While there will be repeated code, none of it is as full of boilerplate copy-mispaste chances as the initialization code.

Mutagen tidbits

That's the first half of Mutagen. I've described the interesting parts of the design now, and have just a few more tidbits about the implementation that I want to share. If you're wondering how I could only have a few tidbits left when there's an entire second half, then you'll have to wait for the next entry when I dive into the second half. The second half is arguably more important than the first, even though the first half does all the work. But that will wait. For now, just some tidbits.

ID3v2.2

The set of available frames changes significantly between ID3v2.2, ID3v2.3, and ID3v2.4. By and large 2.3 and 2.4 are drop-in compatible, and I won't go into the incompatibilities here. However 2.2 uses an entirely different data structure format, often using six bytes to 2.3 and 2.4's ten, and with three-character frame IDs instead of four. However the payload is almost the same.

Mutagen handles this by defining a series of subclasses of the final 2.3/2.4 frames. The user-defined text frame is class TXX(TXXX): pass; The starting key is class TKE(TKEY): pass; and so forth. The data reading code in Frame handles the headers, but then the payload is shared. Finally all 2.2 frames are "upgraded" to their 2.4 equivalents by executing:

if len(type(frame).__name__) == 3:       # if it's a 2.2 frame,
    frame = type(frame).__base__(frame)  # use the copy-validate from above!

ID3v1 and ID3v1.1

The first versions of ID3, ID3v1 and 1.1 aren't going away any time soon. As I mentioned in the first of this series, they're extremely limited with only 128 bytes of storage including the header TAG, and thus can only encode a very small set of predefined set of values. Since I prefer rich and flexible metadata, I detest ID3v1. But many portable devices only understand ID3v1. Of those that understand some form of ID3v2, many do not understand the most recent. Despite my distaste for ID3v1, Mutagen cannot ignore it.

Mutagen implements this with a completely separate read and write loop, and translates to and from ID3v2.4 tags internally to shield the client code from too many differences. There are definite downsides to this if ID3v1 is important to you user, as frame value length limits are not enforced locally; instead the data is truncated to fit when it is written. Oh well; that's a small sacrifice for incorporating a legacy format into this modern library.

Remember: the second half of Mutagen awaits next entry.

Categories: mutagen