The Bug Report Factory



  • I'm working on a system that calculates statistics on all the transactions that go in and out of our company - slightly more than a million a day.

    To handle this amount of data, we do lots of preprocessing in nightly jobs:

    • Daily data-deltas are copied from the production servers to a dedicated statistics server.
    • Once the data is on the dedicated server, it is processed through a number of transforming queries that incorporate as much of the statistical business logic as possible (one of these queries is over 500 lines of SQL).
    • After about 30 minutes, the data is ready to be consumed by the report generating front-end.

    That is the normal nightly flow, anyway. But one night was out of the ordinary: we had to enable statistics on a new system, and this meant we had to move more than 6 months' worth of data - 69 million records in total.

    So much data was impossible to handle in just one night - in fact, it would take three days to process it all. So we set up a new job that wouldn't run nightly, but separately from the other jobs, so they wouldn't disturb one another. This new job would be considerate of the database server (yes, it is possible to crash a database server with sheer load) and handle the data in batches of 10,000 records at a time, looping until there was no more data left to process - roughly the loop sketched just below.
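
    (A minimal sketch of that shape, with made-up helper names - fetch_next_batch and transform - rather than the real framework calls:)

        BATCH_SIZE = 10_000  # be kind to the database server

        def run_backfill(connection):
            while True:
                batch = fetch_next_batch(connection, limit=BATCH_SIZE)  # hypothetical helper
                if not batch:
                    break  # no more data left to process
                transform(connection, batch)  # run the transforming queries on this batch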

    I was the one charged with the task of creating this job.

    The framework used for building these jobs has a nice feature: if an exception is thrown while a job runs, the stack trace is emailed to the developers (roughly the pattern sketched below). This lets us promptly notify the rest of the company that we might have some data corruption at the moment, and lets us quickly hunt down and fix these bugs, minimizing the amount of corrupt data that needs fixing.
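
    (The actual mechanism is internal to our framework; this is only a hedged sketch of the general pattern, using Python's standard smtplib and traceback with made-up addresses:)

        import smtplib
        import traceback
        from email.message import EmailMessage

        def run_with_error_mail(job, smtp_host, developers):
            # Run the job; if it blows up, mail the stack trace to the developers.
            try:
                job()
            except Exception:
                msg = EmailMessage()
                msg["Subject"] = "Job failed: " + getattr(job, "__name__", "job")
                msg["From"] = "jobs@example.com"          # made-up sender
                msg["To"] = ", ".join(developers)
                msg.set_content(traceback.format_exc())   # the stack trace
                with smtplib.SMTP(smtp_host) as smtp:
                    smtp.send_message(msg)
                raise  # don't swallow the failure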

    This was a very efficient system and, as it turned out, it would be used a lot. You see, the job I created had a bug in it: I forgot to set which database the transforming queries should run on, and so the query failed. A stack trace was produced and emailed to me.
    But this special job was built with a loop, and the loop found that it still had data left to process, so it ran the query again, and of course it failed again. And again. And again.
    In fact, it was left to fail several times a second overnight.

    It's not like I had started it and just gone home. You see, before the transforming part of the job could start, all the data had to be downloaded (and processed by yet another transforming query). This meant that it didn't start to fail until several hours after I had gone home for the day.

    When I came to work the next day, I was eager to see how the job was coming along, so I logged into the server and tailed the log file. I usually start tail with the -n 500 parameter so I don't miss anything should the last lines be part of some stack trace. This also means I have to wait a little while for all the lines to be printed, but it's usually pretty fast, so I don't mind. So I tailed, and lines flew by. And they flew by, and they just kept on coming, and they were all stack traces.

    This is when I thought to myself: "Fuck".

    I went and stopped the job. Then I opened my Mail application and it started downloading. I downloaded the first 10,000 emails and deleted them all in bulk. Then the next 20,000. And the next 9,000. Another 10,000, and then 12,000. The emails just kept coming in a steady stream of batches of 256.

    We tried emailing our Exchange supplier, but they couldn't really do anything except tell me that I had 220,000 items in my inbox.

    And with this, my Friday was pretty much ruined.



  • And now you've learned to test your code on one item before looping through them all.  I learned a similar lesson when my code to find and delete specific files trashed a system because I screwed up one key if statement.  That was special.



  • We have a similar exception reporting system where ALL of the developers get ALL of the exceptions for ALL of the projects. I shut down Outlook one evening (which I never do) and stayed home the next day. When I came back in, I found [b]900[/b] emails waiting for me. I have Outlook set to "permanently delete" these garbage emails, but that means the rule only runs on my local instance of Outlook, not the server. It is also slow - running the rule on 900 emails in the morning takes about 20 minutes.



  • I once wrote a script that would email me every time it ran.  I then went on holiday for 2 weeks... upon my return I found out that the morning after I left, someone had added the script to a cron job set to run every minute, and I had 20,000 emails to download.  Fun times.



  • How to handle flooded Exchange accounts.

    @nobody said:

    I went and stopped the job. Then I opened my Mail application and it started downloading. I downloaded the first 10,000 emails and deleted them all in bulk. Then the next 20,000. And the next 9,000. Another 10,000, and then 12,000. The emails just kept coming in a steady stream of batches of 256.

    We tried emailing our Exchange supplier, but they couldn't really do anything except tell me that I had 220,000 items in my inbox.

    And with this, my Friday was pretty much ruined.

    Well, sorry if it's a bit late now, but you never know when there might be a next time.... 

    Exchange server supports IMAP access.  Install a half-decent free client and you can use a pattern match to delete all that crud straight off the server.  Your 'supplier' should have known that.
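
    For what it's worth, a rough sketch of that approach with Python's standard imaplib (the hostname, credentials and search pattern are all made up - adjust to whatever the error mails actually look like):

        import imaplib

        # Talk IMAP straight to the Exchange server - no client-side download needed.
        with imaplib.IMAP4_SSL("mail.example.com") as imap:
            imap.login("user", "password")
            imap.select("INBOX")
            # Match the flood of error mails by subject.
            _, data = imap.search(None, '(SUBJECT "stack trace")')
            for num in data[0].split():
                imap.store(num, "+FLAGS", "\\Deleted")  # flag for deletion on the server
            imap.expunge()  # and actually remove them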




  • My old job had a real problem with change control.  Since we refused to actually force our customers to upgrade to the latest version at a regular interval, anytime a bug was fixed I had to post the fix to about 5 different code branches.  For each of those branches, a project manager had to approve the fix and add it to an access list before I could check it in.

    They had a frequent habit of not telling me when my stuff was on the access list until right before the build.  So I wrote a simple script to monitor the access list and SMS me whenever my bugs got added.  Of course it broke while I was on vacation.  One message every 5 minutes, all night long, and no way to stop the thing from where I was.



  • @vt_mruhlin said:

    My old job had a real problem with change control.  Since we refused to actually force our customers to upgrade to the latest version at a regular interval, anytime a bug was fixed I had to post the fix to about 5 different code branches.  For each of those branches, a project manager had to approve the fix and add it to an access list before I could check it in.

    They had a frequent habit of not telling me when my stuff was on the access list until right before the build.  So I wrote a simple script to monitor the access list and SMS me whenever my bugs got added.  Of course it broke while I was on vacation.  One message every 5 minutes, all night long, and no way to stop the thing from where I was.




    Ah, this one's a lot easier to handle: kill whatever is sending you the SMS messages, and leave your phone turned off for 3 days. After three days, any undelivered messages will disappear into thin air.



  • I love the thread title.

    I've got myself stuck in infinite loops before. Usually it just brought the server to its knees. I did it enough times that I now inspect any loops I create and examine them for potential issues like that. If one could logically (not through a bug) go into an infinite loop, I'd put a max counter in it - something like the sketch below. What fun.
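
    (The limit itself is arbitrary, and the helpers are made-up names:)

        MAX_ITERATIONS = 100_000  # generous upper bound for this particular job

        iterations = 0
        while work_remaining():       # hypothetical loop condition
            process_next_item()       # hypothetical loop body
            iterations += 1
            if iterations >= MAX_ITERATIONS:
                raise RuntimeError("loop exceeded %d iterations - probably stuck" % MAX_ITERATIONS)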



  • @DaveK said:

    @nobody said:

    I went and stopped the job. Then I opened my Mail application and it started downloading. I downloaded the first 10,000 emails and deleted them all in bulk. Then the next 20,000. And the next 9,000. Another 10,000, and then 12,000. The emails just kept coming in a steady stream of batches of 256.

    We tried emailing our Exchange supplier, but they couldn't really do anything except tell me that I had 220,000 items in my inbox.

    And with this, my Friday was pretty much ruined.

    Well, sorry if it's a bit late now, but you never know when there might be a next time....

    Exchange server supports IMAP access.  Install a half-decent free client and you can use a pattern match to delete all that crud straight off the server.  Your 'supplier' should have known that.

    Maybe they were aware of the Exchange bug where, if there are more than 65535 mails in a folder, it won't let you open that folder over IMAP... (At least, this was true the last time I tried to do something similar - I sincerely hope it's been fixed by now!)



  • Youch. The lesson (and the source of the WTF) is, always test it without the loop. Then put the loop in, and make sure that any exceptions stop it dead.



  • @LightningDragon said:

    Youch. The lesson (and the source of the WTF) is, always test it without the loop. Then put the loop in, and make sure that any exceptions stop it dead.



    The exception never made it to the outer workings of the loop; it was swallowed by some misbehaving framework code that decided to log it rather than rethrow it.... Now it always rethrows...



  • @LightningDragon said:

    Youch. The lesson (and the source of the WTF) is, always test it without the loop. Then put the loop in, and make sure that any exceptions stop it dead.

    The real lesson should be: always have a rate limiter on anything that sends mails automatically. If you have sent 20 error mails in the past 10 seconds, it's time to start dropping them on the floor rather than sending them. The rate limit can be determined by measuring the maximum length of time between your inspections of your inbox, and dividing this period by the maximum number of mails you ever want to see in there. I typically find "1 per minute, burst of 10" to be about right.
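
    As a rough sketch (made-up names, Python standard library only), that "1 per minute, burst of 10" policy is just a token bucket:

        import time

        class MailRateLimiter:
            """Token bucket: allow a burst, then a steady drip."""

            def __init__(self, rate_per_second=1 / 60.0, burst=10):
                self.rate = rate_per_second     # 1 token per minute
                self.burst = burst              # up to 10 mails back to back
                self.tokens = float(burst)
                self.last = time.monotonic()

            def allow(self):
                now = time.monotonic()
                # Refill for the time elapsed, capped at the burst size.
                self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return True
                return False  # over the limit: drop the mail on the floor

        # In the error handler (send_error_mail is a made-up function):
        # limiter = MailRateLimiter()
        # if limiter.allow():
        #     send_error_mail(stack_trace)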



  • @asuffield said:

    The real lesson should be: always have a rate limiter on anything that sends mails automatically.

    No kidding. Of course, then you have to debug the rate-limiter code.

    My personal worst: sending about 100 emails to everybody on a 20-person development team in just a few seconds. The initial problem wasn't so bad, but all the "what the hell are these messages doing in my inbox?" replies that went to everybody else on the list made for an interesting couple of hours.

    The worst I've seen was a "perfect storm" of a misconfigured script, a badly-formatted email address, and a mailer that interpreted addresses with no user name as "to everybody". Sending several dozen obscurely-worded, machine-generated emails to everybody in the company (all 300 of us) was probably not the best advertising for our new automated test system.



  • @nobody said:

    @LightningDragon said:

    Youch. The lesson (and the source of the WTF) is, always test it without the loop. Then put the loop in, and make sure that any exceptions stop it dead.



    The exception never made it to the outer workings of the loop; it was swallowed by some misbehaving framework code that decided to log it rather than rethrow it.... Now it always rethrows...

    That the framework did that possibly qualifies as a separate WTF. Only exceptions which get all the way out of the program should be logged.
     

