Armed with a text editor

mu's views on program and recipe! design

Developing Mutagen (1): Background Posted 2005.12.02 21:24 PST

Mutagen is a pure Python ID3v2 library I developed to improve Quod Libet's MP3/ID3 support. It's objectively one of the better things I've written. There are several parts of it I feel particularly good about and will talk about over this series. But first I want to cover three pieces of terminology.

Tags and frames

When people started adding metadata to their music files, pretty soon everyone called it "tagging MP3s," and then went on to refer to the artist tag or title tag. This makes a lot of sense, as the process of tagging a file with an artist and a title seems a lot like attaching a series of price tags to the song. But that's not the way I'll use the term tag here.

ID3 is a set of informal standards (no formal specification has been submitted to any standards body), and it gives specific meanings to the words tag and frame. The frame is the atomic unit that corresponds to the price tag above. It's the name-value pair. The tag is a collection of frames. There can be more than one tag per file, but in practice there is rarely more than one (two if you count the ID3v1 tag). In contrast, while there can be any positive number of frames in a tag, there are rarely fewer than a standard complement of artist, title, album, track number, and genre.

For this series, I will use the term tag when I mean a structured collection of frames, and frame when I mean a structured and encoded name-value pair.
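To make the distinction concrete, here is a minimal sketch of the two terms as data structures. The class names Frame and Tag, their fields, and the values helper are hypothetical, chosen only for illustration; they are not Mutagen's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One name-value pair: the 'price tag' in the analogy above."""
    name: str
    value: str


@dataclass
class Tag:
    """A structured collection of frames."""
    frames: list = field(default_factory=list)

    def values(self, name):
        # A tag may hold several frames with the same name, so return a list.
        return [f.value for f in self.frames if f.name == name]


tag = Tag([Frame("artist", "Some Artist"), Frame("title", "Some Title")])
artists = tag.values("artist")
```

A real frame is also encoded, which this sketch leaves out entirely.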

Frame IDs

ID3v1 came with a strict 128-byte block from which you could pull a small predefined set of information. A value's position in the block implicitly told you which name went with it. Since this was insufficient, both in the limited names and in the limited space for values, ID3v2 was designed to use name-value pairs. Unlike VorbisComments or APEv2 tags, which have free-form descriptive names and free-form binary or Unicode string values, ID3v2 was designed to be extremely space-conscious and uses short predefined character "names" (frame IDs) with variable- or fixed-length value fields.
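The fixed-position idea behind ID3v1 can be sketched in a few lines of Python. This is an illustration rather than Mutagen's code: the function name parse_id3v1 is hypothetical, the field widths follow the commonly documented ID3v1 layout, and the Latin-1 decoding is an assumption, since ID3v1 does not record an encoding.

```python
import struct

# Commonly documented ID3v1 layout: a "TAG" marker, then fixed-width fields.
# 3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes.
ID3V1_LAYOUT = struct.Struct("3s30s30s30s4s30sB")


def parse_id3v1(block):
    """Return a dict of ID3v1 fields, or None if block is not an ID3v1 tag."""
    if len(block) != ID3V1_LAYOUT.size:
        return None
    marker, title, artist, album, year, comment, genre = ID3V1_LAYOUT.unpack(block)
    if marker != b"TAG":
        return None

    def text(raw):
        # Values are padded to their fixed width; cut at the first NUL byte.
        # Latin-1 is an assumption made for this sketch.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    # The name of each value comes only from its position in the block.
    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "genre": genre,  # a single byte indexing a predefined genre list
    }


# Build a sample block; struct pads short values with NUL bytes.
sample = ID3V1_LAYOUT.pack(b"TAG", b"Song", b"Band", b"Record", b"2005", b"", 12)
info = parse_id3v1(sample)
```

Both limits mentioned above are visible here: there is no slot for anything outside the predefined fields, and no text value can exceed 30 bytes.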

This leads to frame IDs like TPE2 for text-frame performer 2 (the band, orchestra, or accompaniment), APIC for attached picture (cover or other images), and WCOP for URL to copyright information. There are over 130 defined frame IDs, of which over 50 are three-letter ID3v2.2 IDs, and over 80 are four-letter ID3v2.3 or ID3v2.4 IDs.
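As a rough sketch of how much a frame ID tells you on its own, the helper below sorts an ID by length and leading letter. The name describe_frame_id is hypothetical and is not part of Mutagen. It relies on two conventions: the ID length separates ID3v2.2 from the later versions, and IDs starting with T are text frames while those starting with W are URL frames. It does not check the ID against the list of defined frames.

```python
import re

# Frame IDs are three or four characters drawn from A-Z and 0-9.
FRAME_ID = re.compile(r"[A-Z0-9]{3,4}")


def describe_frame_id(frame_id):
    """Return (version family, broad kind) for a frame ID."""
    if not FRAME_ID.fullmatch(frame_id):
        raise ValueError("not a well-formed frame ID: %r" % (frame_id,))

    # Three letters means ID3v2.2; four letters means ID3v2.3 or ID3v2.4.
    version = "ID3v2.2" if len(frame_id) == 3 else "ID3v2.3/ID3v2.4"

    if frame_id.startswith("T"):
        kind = "text"
    elif frame_id.startswith("W"):
        kind = "URL"
    else:
        kind = "other"
    return version, kind


results = {fid: describe_frame_id(fid) for fid in ("TPE2", "APIC", "WCOP", "TP2")}
```

TP2 here is the three-letter ID3v2.2 counterpart of TPE2.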

In the next entry I will talk about the frame structure, and how that influenced Mutagen's design.

Categories: mutagen