XML, SQL, PHP and it's mine (sorry, long)

Arantor

OK, so this one is somewhere between a horror story and a confession. This one is 100% mine and in whatever passes for a defence, this was 2008 and I had theoretically 5 years of PHP, though it's probably closer to 2.5 years of experience done twice. I was generally not looking at other peoples' code at this point in time, and was working solo so I never learned from feedback.

So, in 2008, I made a search engine for online comics. It isn't up any more mostly because it proved to be an insane time sink, because I was very proud of the quality of the source data - which meant you could do a lot with it.

It was primarily geared to those long running comic strips where you had a regular cast of characters and it did things like track which characters were speaking, which characters were present even if not saying anything. And I actually tracked individual panels of each strip because I wanted to be able to provide a meaningful search result - instead of hotlinking the actual images, which would have been awful (no ad revenue for the creators!), I created a sort of text-based representation of the strip.

In hindsight there were many, many, many things wrong with all of this - not the intent but the execution - so I will attempt to explain myself and hope that other people learn from this.

WTF 1: XML
Well, XML is pretty much the ideal format for this when you think about it. It's textual data with semantic extras. But let me give you a sample of it from one of the strips I had covered - Angst Tech. Just the first in the list.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE comic SYSTEM "comics.dtd">
<comic id="angsttech" numid="15" status="archive">
	<comicmeta>
		<comicname>Angst Technology</comicname>
		<baseurl>http://www.inktank.com/archives/AT/archive.cfm</baseurl>
		<copyright>Barry T. Smith</copyright>
		<forumurl>http://www.inktankforums.com/</forumurl>
		<earliest>2000-08-28</earliest>
		<latest>2005-06-20</latest>
	</comicmeta>
	<strips speakers="named">
... some entries cut to get you a better example of how not to...
		<strip id="strip3">
			<url>http://www.inktank.com/archives/AT/index.cfm?nav=3</url>
			<title>The Webmaster: Being Discreet</title>
			<date>2000-08-30</date>
			<authors>
				<author>Barry T. Smith</author>
			</authors>
			<panels>
				<panel>
					<text type="spoken" speaker="Marc">We need to come up with a signal for when we don't like a particular candidate. Something subtle, like quietly placing our cell phones on he table.</text>
					<text type="spoken" speaker="Dante">Subtle. Got it.</text>
				</panel>
				<panel>
					<text type="inset">45 minutes later</text>
					<text type="spoken" speaker="Marc">Well Maynard, everything seems to be in order. You'd make a fine webmaster.</text>
					<text type="spoken" speaker="Marc">Do you have any questions for us? Our benefits perhaps?</text>
					<text type="present" speaker="Dante"></text>
				</panel>
				<panel>
					<text type="spoken" speaker="Maynard">Does this job involve a lot of spurting blood?</text>
					<text type="spoken" speaker="Maynard">Gushing blood?</text>
					<text type="spoken" speaker="Maynard">With the bleeding and the spurting...</text>
					<text type="spoken" speaker="Maynard">...blood...</text>
					<text type="spoken" speaker="Maynard">...messy...</text>
				</panel>
				<panel>
					<text type="sound">SLAM!!</text>
					<text type="present" speaker="Marc"></text>
					<text type="present" speaker="Dante"></text>
				</panel>
			</panels>
		</strip>

So, yes, I recorded a bunch of text types to denote the content (all of which was filterable), and yes in hindsight I should have declared <inset> and <spoken> tags but this was my first outing in XML and as such it could have done with anyone with a brain looking at it.

But that's not really the WTF. The real WTF with this is how the XML was built. I didn't build it by hand. I made an OpenOffice spreadsheet for each comic I did which had a macro in it to take a table of data and build XML out of it which could be copy/pasted.

WTF 2: parsing
So I had a parser script. Which put it all into MySQL. No, I didn't have any kind of CMS for this, I did it solely with a parser script.

The theory I had was that if I never provided a front-facing interface I never had to worry about XML injection or anything.

So I had a script that ran on the command line and only on the command line - and yes, it used Zend_Console to do this. And it would nuke the contents of a given strip in the database and rebuild it fresh. I figured I could just upload the XML file to a specified directory on the server outside of webroot and run the script, done.

At least I used one of the existing PHP extensions for handling the XML rather than rolling my own, I wasn't that insane. Plus by specifying a DTD (and writing said DTD) I could even have some attempt at pre-validation on it.

WTF 3: database storage
I stored everything in the database line by line. This was necessary for indexing purposes because filtering was per line - if you wanted to find something a character said, you needed to be able to filter on that. Or if you wanted to find a caption or something you needed to be able to filter on that too. Or even filter by character.

The XML may have indicated a name but the names were normalised internally to ids, so that was something (and it let me notice typos by having a list of 'names' in a strip)

The only problem is, it didn't occur to me initially to have a baked version of a strip. And given how much meta data could be attached to a search result... an awful lot of meta-data was captured. If a strip was in a book of strips, that was known in the database. If it was tagged in some way (and tags were used to denote story arcs), again that was captured.

This potentially meant for 10 results per page, I could end up doing 40 queries to fetch it all. Bad design there.

WTF 4: NSFW
I implemented a crude NSFW filter on strips. Strips could be marked NSFW manually if there was no inappropriate language but the strip itself was otherwise NSFW. The Cyanide & Happiness guys got pretty much all their strips marked NSFW if I remember rightly.

The rest was done with a simple regex with a list of words in it. In hindsight it is curious what ended up on that list since it wasn't populated by my being a prude in the first place, it was all material that came from the comics. At least I had the temerity to \b in the right places to avoid that clbuttic type mistake, but the idea of even having a NSFW filter seems crazy in hindsight, especially given that at the time I was listing Cyanide & Happiness and the Menage a 3 comic strips.

WTF 5: indexing
There were some smart things in the indexing; because I had total control over data entry, I could spot obvious typos in the source material and fix them at index time - preserving the original as-displayed but matching either spelling.

This wouldn't have been quite so bad if it weren't so abused. Because I was still all about having formatting be represented fairly, I ended up embedding various HTML directly in the data too, usually the strong and em tags, but also entities for various purposes. I even put smileys in that list, too, looking at it.

But that wasn't even the WTF. At the time my first go to was pure PHP... and I found Zend_Search_Lucene. Seemed pretty good... if it weren't for the fact that a few hundred KB of index would require over 128MB of RAM to process. And the fact that it regularly corrupted the index.

It was around this point I switched to using Sphinx for the indexing. It was much faster than a PHP implementation of Lucene and seemed to work better too, but the way I solved the various issues - storing strips, having the relevant items be baked properly on display... I ended up using the xml_pipe option for Sphinx, as in PHP puts everything into the database, only to get it out of the database again, only this time wrapping it in different XML to get it to Sphinx. Ugh.

WTF 6: heavy load

So as I mentioned, 40 queries per page wasn't pretty. So at that point it occurred to me to bake strips pre-rendered. And I did this on my local PC rather than the server, to limit the server workload.

I had a script dump a PHP file containing an array of a given strip's details, one file per strip (meaning that at this point, the largest beast in the collection, Bruno the Bandit, generated in excess of 2000 files) which would then be synched to the server.

On the flip side, the naming scheme for the files meant I didn't have to have the database stuff on the server any more since I could build everything on my PC - database for baking purposes, XML dump for Sphinx etc. could all be done locally, I only had to synch to the server, upload everything and then trigger Sphinx's indexer to run (this being years before it got live updates)

WTF 7: graphs
I had all this data, I thought it was a shame not to use it. So I wrote a script to make graphs out of it all.

At a minimum, every strip would have pie charts of 'SFW vs NSFW distribution', 'number of strips from guest artists vs the main artists', bar charts of how many strips the comic tended to have, if a strip had multiple authors there would be a pie chart showing the breakdown, a breakdown of spoken words/thought bubbles/inset boxes etc.

On some level it was kind of cool, but largely not. The one thing I did do that was kind of cool was the timeline. I had each comic on it, ordered by earliest-latest strips, with one pixel per day as the scale, so you could see which comics were out on which days.

This was mashed up into a fun HTML environment where I wanted the legend on the left to be fixed in place with the graph part itself scrollable, which ended up having two different images and mucking around with CSS to get it right, but it worked in the end.

In hindsight, it taught me a variety of how not to do this, how not to implement things that I thought were cool but actually weren't, and in general how not to implement things. But the idea still seems cool.

dkf

@Arantor said:

But that wasn't even the WTF. At the time my first go to was pure PHP... and I found Zend_Search_Lucene. Seemed pretty good... if it weren't for the fact that a few hundred KB of index would require over 128MB of RAM to process. And the fact that it regularly corrupted the index.

Oh my… that's… why did someone try reimplementing Lucene in something else in the first place? And if it was corrupting indices, why did you struggle on with it? (Crapping on the indices is a very bad sign indeed.)

Arantor

Well, not everyone is in a position to run Java components on a server. People do use shared hosting, you know.

At first I struggled on with it because I didn't really know any better, and just routed around the problem.

Frank

Even though it was a time sink I bet you learned a fuck ton of stuff. I bet if you open that code base today it would give you shivers.

I was also in the grouping of teach yourself programming and programmed for years alone. Wish I had a mentor or did some kind of apprenticeship back in those early days.

To Jeff's credit Stack Overflow improved me 1000%. But I am still learning...always learning.

Good story.

Arantor

I did, mostly how not to do things, but it's all good. And yes, I did crack open the code to check how bad some of the WTFs really were.

cartman82

Nice. I like this deconstruction of your own stuff, even if it seems a bit humblebrag-y in places.

Just for reference:

http://www.inktank-studios.com/comics/comic/phone-signals/

Arantor

Humblebrag? That wasn't something I was going for. I was just trying to tell it kind of how i was and accept that 6 years ago I made a lot of mistakes ;)

ben_lubar

@Arantor said:

Humblebrag? That wasn't something I was going for. I was just trying to tell it kind of how i was and accept that 6 years ago I made a lot of mistakes ;)

One great thing about not being an irredeemable idiot is that you can look back a few years and say "wow, I was dumb back then, but I am slightly less dumb now"

Arantor

'slightly' being the important word here.

cartman82

@Arantor said:

Humblebrag? That wasn't something I was going for. I was just trying to tell it kind of how i was and accept that 6 years ago I made a lot of mistakes

Poking fun with 'humblebrag'.

The point is, this is more like a post-mortem of a project than an outright WTF. Some parts were bad, some good. Some have improved over the lifetime of the project. I appreciate the time you took to put this together.

I like the idea of tracking one's programming skills through project history. I can barely stand to look at the code I wrote at the beginning of my current project, not to mention a few projects ago. Even my code formatting has changed, which is an additional source of annoyance (THANK YOU Resharper's "reformat code"). IMO if you look at your work from a few years ago and don't cringe, something has gone wrong with your career.

Arantor

It was a post mortem, sure, and they're all WTFs - more importantly, all the WTFs are fundamental design problems, not just implementation problems. Post mortem seemed like the best format for explaining the WTFery.

Frank

I am less dumb these days. I am glad other people on this planet can admit the same.

dkf

@Arantor said:

Well, not everyone is in a position to run Java components on a server. People do use shared hosting, you know.

Yeah, but having to put so much work into working around the bugginess of something just to save a few bucks/pounds of service costs… TRWTF is failing to value your time correctly.

Arantor

Well, when you're building something that other people will be using and giving it away in the process (a la Zend Framework) that kind of thing does happen. Mind you, Zend Framework is largely a source of WTFs.

boomzilla

@cartman82 said:

IMO if you look at your work from a few years ago and don't cringe, something has gone wrong with your career.

Definitely. This actually looks like a cool project. A tiny bit of the semantic web, actually done.

@Frank said:

I am less dumb these days.

I usually feel less dumb than a younger self, but in retrospect that's not always true.

chubertdev

It sounds like you were learning really well how to complete tasks, even if your root understanding of the technically was lagging behind.

Arantor

I learned how to perform some tasks, and learned a lot about the underlying science in some of these things. Stood me well in the years to come.

The sad part is that I actively feel less imaginative these days. I'm so bogged down in the detail and the technology that being imaginative is at least 3x harder now.

It was a cool project. I've always loved the idea of having meaningful semantic data available to do cool things with, but what I found at the time - and in other fields - is that very frequently the quality of data is too low to do anything with. That's what killed the project in the end; it took me too long to get the data together, and I didn't trust anyone else to provide it up to the same standard.

chubertdev

@Arantor said:

The sad part is that I actively feel less imaginative these days. I'm so bogged down in the detail and the technology that being imaginative is at least 3x harder now.

I've often wondered about that part. Does the fact that I worry less about the technology mean that I'm better at it, and more concerned about getting results? I've used Python for about 5 months, and I feel so comfortable in it that trying to figure out what to do isn't a hurdle to me anymore, I just do it, and the nature of the changes that I make are of far bigger concern.

Arantor

That's not quite what I mean.

I'm crazy fluent in PHP by now. The thought of writing anything in PHP does not scare me in the slightest. The thought, however, of building something without designing it to the nth degree, on the other hand... there's my problem.

When I started the above, I didn't really know how I was going to get from A to B. A fair amount of it was 'figuring it out as I went', but now I wouldn't do that again. I'd pre-plan everything.

chubertdev

Being able to do it, versus knowing how to do it well, then?

Arantor

Once I know what I'm building, I can just build it. But all the design work is where I have issues... I used to be able to wing it and not worry about all the consequences, and more importantly to not worry about the consequences at least until I got there - now I can't get out of the design phase.

cartman82

I feel the same way. The more you improve, the more you know how easily the obvious solution now can be a stone around your neck a few months later.

Once I have a path, I can blast through it. But give me a few choices without the one obvious 'best' solution and I'm stuck weighting pros and cons to the point that I start bringing work home and falling behind. Ahh, to be a junior again...

dkf

@cartman82 said:

But give me a few choices without the one obvious 'best' solution and I'm stuck weighting pros and cons to the point that I start bringing work home and falling behind.

That's the real difference between different ranks of software engineer: how quickly they can sort through such piles of unclear requirements to come up with a workable long-term-supportable solution. It's not about banging out the code, it's about figuring out what code needs to be banged out.

Experience helps a lot. You can go “oh, I've seen this sort of before and those solutions over there failed miserably; let's avoid them”. Knowing plenty of theory and lots about the abstract models of the systems in use also both help.

PJH

@dkf said:

That's the real difference between different ranks of software engineer: how quickly they can sort through such piles of unclear requirements to come up with a workable long-term-supportable solution. It's not about banging out the code, it's about figuring out what code needs to be banged out.

Deja-vu - sorta. And he even came up with another half-assed requirement which was ambiguous while clearing up the previous one (or at least I'm left with the impression he thinks he has...)

Arantor

So completely business as usual?

cartman82

@dkf said:

That's the real difference between different ranks of software engineer: how quickly they can sort through such piles of unclear requirements to come up with a workable long-term-supportable solution. It's not about banging out the code, it's about figuring out what code needs to be banged out.

Experience helps a lot. You can go “oh, I've seen this sort of before and those solutions over there failed miserably; let's avoid them”. Knowing plenty of theory and lots about the abstract models of the systems in use also both help.

Hopefully, one day I'll become good enough to be able to do that.

dkf

@cartman82 said:

Hopefully, one day I'll become good enough to be able to do that.

Keep practicing. Keep trying. Keep watching for what works and what doesn't, both in your own stuff and other people's. (Some other people seem to think I'm good at this sort of thing, but I'm definitely still learning. I know a few things in a few areas, but not really all that much. Got a reasonable memory though. )

darkmatter

@dkf said:

Lucene

The only implementation I've seen of it Lucene was terrible. It indexed the development database by default and then returned those results in the production environment as if they all got promoted. It took way too much work to make the implementation functional for any situation where the dev environment was not simply a mirror of production.

Arantor

Was it the native PHP implementation? Because that's probably worse ;)

darkmatter

C# actually, Mr. Necro-man.

Speaking of terrible PHP implementations, how about when they stole the Soap class name from NuSoap (basically the biggest SOAP plugin of the time, afaik) when they implemented it in PHP5, but then didn't keep the functionality the same, such that EVERY PIECE OF NUSOAP CODE EVER fails by default. You had to either fix the php config to exclude native soap implementation or rename the NuSoap class. GG lack of namespacing in older PHP.

Arantor

That sucks.

But seriously... take a look at Zend_Search_Lucene. Regularly corrupts databases, consumes 128M+ of RAM just to parse a < 1MB database...

chubertdev

@Arantor said:

That sucks.

But seriously... take a look at Zend_Search_Lucerne. Regularly corrupts databases, consumes 128M+ of RAM just to parse a < 1MB database...

I read that wrong.

Arantor

No, no you didn't. It is full of retarded.

chubertdev

@Arantor said:

No, no you didn't. It is full of retarded.

Edited for emphasis.

Arantor

Ohhhhhhhhhhhhhhhh

another_sam

@chubertdev said:

Zend_Search_Lucerne

Useful for searching haystacks?

Arantor

@another_sam said:

Useful for searching haystacks?

No.

chubertdev

@another_sam said:

Useful for searching milkstacks?

FTFY

NO idea what that is

dkf

@Arantor said:

But seriously... take a look at Zend_Search_Lucene. Regularly corrupts databases, consumes 128M+ of RAM just to parse a < 1MB database...

Hmm, that might explain a few things…

Arantor

But yes, I learned to avoid that firmly afterwards

Not all of Zend Framework is terrible. Just large swathes of it.

another_sam

Whoosh?

http://www.dpi.nsw.gov.au/agriculture/pastures/pastures-and-rangelands/species-varieties/a-z/lucerne

chubertdev

@another_sam said:

Whoosh?

http://www.dpi.nsw.gov.au/agriculture/pastures/pastures-and-rangelands/species-varieties/a-z/lucerne

Did you just whoosh a whoosh?

Whooshception!

http://static.caloriecount.about.com/images/medium/lucerne-milk-reduced-fat-18113.jpg

another_sam

@chubertdev said:

Did you just whoosh a whoosh?

Maybe. I didn't know if @Arantor knew what Lucerne was other than a city in Switzerland or if he got the reference and was trying to squash my attempt at humour. Or maybe I missed something, wouldn't be the first time.

boomzilla

@another_sam said:

or if he got the reference and was trying to squash my attempt at humour.

He just seemed really angry at Zend to me. Unknown if that overrode the humor or prevented him from noticing.

flabdablet

@dkf said:

You can go “oh, I've seen this sort of before and those solutions over there failed miserably; let's avoid them”.

That's where TDWTF is doing the programming world a great service.