Reinventing the Database



  • The LiveJournal code has something called the "Directory Search", which searches for users by age, location, etc. (On LiveJournal.com, it's only available to paid users - after reading the rest of my post, you'll be able to guess why). I noticed someone was trying to set it up on their own install and got errors about LJ::UserSearch being missing, and became curious. Turns out LJ::UserSearch is a separate module that requires installation. Why? Because it's a Perl XS module (written in C). What it does is... interesting.

    It appears to keep an array of 8-byte structures (one for every user that exists or ever existed - the user ID is used as the array index) with things like age, gender, region etc. When someone does a search, it scans over all of them and creates an array of pointers to all matching items. (Technically, it only scans the entire list for the first match criterion - subsequent criteria are checked against the list of items matching the first one.) It then clears items not matching all the criteria from the list, converts it to a Perl list of the user IDs, and returns it.

    Where does the data in these structures come from? Well, there's a background task that polls for new and modified users and updates a table in the database (yes, packed is exactly what you'd expect it to be):

    CREATE TABLE usersearch_packdata (
    userid INT UNSIGNED NOT NULL PRIMARY KEY,
    packed CHAR(8) BINARY,
    mtime INT UNSIGNED NOT NULL,
    good_until INT UNSIGNED,
    INDEX (mtime),
    INDEX (good_until)
    )


    Oh, and the module is also not thread-safe (due to heavy use of global variables) and not 64-bit clean (due to pseudo-pointer arithmetic that assumes pointers are 32 bits and could probably be done more cleanly as real pointer arithmetic anyway). Just to make life that little bit more interesting. Fortunately, it'll only crash on 64-bit systems if the array in question crosses a 32-bit boundary, and how likely is that to happen anyway?
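
For anyone trying to picture the scan described above, here is a rough Python sketch of the idea, not the actual LJ::UserSearch code (which is C). The 8-byte field layout, the names pack_user and search, and the predicate-style criteria are all assumptions made up for illustration; the post only says the record holds "things like age, gender, region".

```python
import struct

# Hypothetical 8-byte layout (the real one is not given in the post):
# age (1 byte), gender (1 byte), region id (2 bytes), flags (4 bytes).
RECORD = struct.Struct("<BBHI")
assert RECORD.size == 8


def pack_user(age, gender, region, flags=0):
    """Pack one user's searchable attributes into 8 bytes."""
    return RECORD.pack(age, gender, region, flags)


def search(table, criteria):
    """Return the user IDs whose records satisfy every criterion.

    table    -- bytes/bytearray; the record for user ID n is at offset n * 8
    criteria -- list of predicates taking (age, gender, region, flags)
    """
    if not criteria:
        return []
    first, rest = criteria[0], criteria[1:]
    count = len(table) // RECORD.size
    # Only the first criterion scans every record ever allocated.
    matches = [uid for uid in range(count)
               if first(*RECORD.unpack_from(table, uid * RECORD.size))]
    # Later criteria only look at the survivors of the previous pass.
    for pred in rest:
        matches = [uid for uid in matches
                   if pred(*RECORD.unpack_from(table, uid * RECORD.size))]
    return matches


# User ID 0 is left as an all-zero record; IDs 1..3 are made-up users.
table = bytearray(RECORD.size)
table += pack_user(25, 1, 10)
table += pack_user(30, 2, 10)
table += pack_user(25, 2, 20)

found = search(table, [lambda a, g, r, f: a == 25,
                       lambda a, g, r, f: r == 10])
```

Note that the cost of the first pass grows with the highest user ID ever issued, not with the number of matches, which is the part an indexed database query would avoid.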



  • Okay, this is just insane. There are four background tasks, a database table, and a flatfile. bin/worker/directory-meta polls the database for rows that need updating and new users, and then updates the usersearch_packdata table appropriately. bin/worker/search-updater checks for changes to usersearch_packdata and updates the flatfile. bin/worker/search-master does the actual searches - it loads in the data from the flatfile at startup and gets changes to existing users from the database. bin/worker/search-slave does something, but I'm not quite sure what. (Then there's gearmand, which is another process used for communication with the workers.)

    Making new users searchable requires a restart of search-master - search-updater does this after every 5000 entries updated (but not more than once a minute) if you have ljworkerctl installed, but they don't appear to have released ljworkerctl.

    Oh, and I still can't figure out how to get it to work.
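
The restart rule mentioned above (every 5000 updated entries, but not more than once a minute) can be sketched roughly like this. This is a guess at the logic, not the search-updater code: the class name RestartThrottle, its parameters, and the way the two limits interact are assumptions, and the real worker is Perl.

```python
import time


class RestartThrottle:
    """Decide when a restart is due: after `batch_size` updates,
    but never sooner than `min_interval` seconds after the last one."""

    def __init__(self, batch_size=5000, min_interval=60.0,
                 clock=time.monotonic):
        self.batch_size = batch_size
        self.min_interval = min_interval
        self.clock = clock          # injectable so it can be tested
        self.pending = 0            # updates since the last restart
        self.last_restart = None    # clock value at the last restart

    def record_updates(self, n=1):
        """Count n updated entries; return True if a restart should happen now."""
        self.pending += n
        now = self.clock()
        due = self.pending >= self.batch_size
        cooled = (self.last_restart is None
                  or now - self.last_restart >= self.min_interval)
        if due and cooled:
            self.pending = 0
            self.last_restart = now
            return True
        return False


# Example with a fake clock so the behaviour is deterministic.
fake_time = [0.0]
throttle = RestartThrottle(clock=lambda: fake_time[0])
results = [throttle.record_updates(4999),   # not enough updates yet
           throttle.record_updates(1),      # 5000 reached: restart
           throttle.record_updates(5000)]   # enough updates, but too soon
fake_time[0] = 61.0
results.append(throttle.record_updates(0))  # a minute has passed: restart
```

In this version, updates that arrive during the one-minute cool-down are not lost; they stay pending and trigger the restart on the first call after the minute is up.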



  • I find it greatly amusing, given how people implementing tacky web-apps almost always try to solve every problem by banging on it with a relational database regardless of how appropriate that may be, that you've managed to find somebody making a ridiculous mess by failing to use one in one of the rare situations where it would have been a really good idea.

