If at first you don't succeed, try again 20 times. Or: How we spammed our customers



  • It seems that every time we touch our firewall rules, something in the depths of our arcane system setup decides that it can't send mail any longer. This time, we replaced one physical box with a VM and had to do a bunch of routing to make it visible to the outside world, which caused our main WWW server to clam up for no particular reason. Of course, we didn't notice for five days, until Sales got a call from a longtime customer noting that he hadn't received his usual order confirmation email.

    A quick dive into the mail queue showed ~47,000 messages waiting to be delivered. Considering we normally send maybe 50 or so mails a day from that server, and there should only have been a five-day backlog, that number was an itsy bit concerning.

    The code says it better than I can, but I'm sure you can guess already. Copied and pasted:

        // Setup the overload script so that we attempt to send a few times before giving up the ghost
            $overload = 0;
            do {
            // if we loop more then once, sleep for a second.
                if ($overload)
                    sleep(1);
            // Send the message
                $sent = mail($to,
                             $subject,
                             $message,
                             $mail_headers,
                             "-f $sender_email");
            } while (++$overload <= OVERLOAD && !$sent);
        // Error?
            if (!$sent)
                trigger_error('Unknown error sending email -- please go back and try again', E_USER_FATAL);
    

    I've seen this code for two years and never fully grokked the fullness of its insanity. OVERLOAD is a constant defined in the configuration as 20. It's used in other places throughout the code, mostly when dealing with potentially recursive things that should, by rights, never be recursive. You know, like category trees.

    For those not familiar with trigger_error, it's a PHP function that lets programmers raise their own errors in the same way PHP raises its big nasty built-in ones. It can also be configured to call a user-defined handler function. In our case, we reuse that handler for assert() calls too, and we get a nice happy email every time it's called. On WWW, our end users also get a nice shiny pretty "oops!" page when the error is fatal.

    If you missed that, the email function tries to send an email when it can't send an email.

    Even though the mail failure was bad enough to cause the PHP mail() function to return false, the mail was actually queued successfully. This means that each normal mail was sent twenty times, and then a failure mail was sent twenty times, and then a failure mail about that failure mail was sent twenty times, and then... well, I don't know when it stopped, but it did. And of course, the mail daemon diligently started sending four-hour delivery warnings, and then timeout bounces. Before we knew it, there were 47k mails in the queue. Argh.

    The decision was made to simply unclog the mail queue and let the deliveries happen, after removing all the notifications, bounces, and error emails. Only after deliveries started did we fully realize that every customer who had caused mail to be sent would be getting twenty copies. Thankfully, most of our customers are technical, and when they called to complain, it was more of a "hey, you guys broke something" than "omg what are you doing stop that!"

    The code has now been "fixed." Yay, I can finally do error logging! I hate this codebase sometimes.
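
    For the curious, here is a rough sketch of the general shape such a fix could take. It is not the actual change; the names send_with_retry and MAX_ATTEMPTS are made up for illustration. The idea is a small, bounded retry where failures go to a log rather than to a handler that sends yet more email. The sender is passed in as a callable so the loop can be exercised without a real mail server.

```php
<?php
// Hypothetical sketch: send_with_retry and MAX_ATTEMPTS are illustrative
// names, not taken from the original codebase.

const MAX_ATTEMPTS = 3;

/**
 * Try to send a message a bounded number of times.
 *
 * $send is any callable returning true on success (in production it
 * could wrap mail()). $log is a callable that records a failure
 * message somewhere that is NOT email, e.g. 'error_log'.
 */
function send_with_retry(callable $send, callable $log, int $maxAttempts = MAX_ATTEMPTS): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($send()) {
            return true;
        }
        // Record the failure in a log; never email about a mail failure.
        $log("mail attempt $attempt of $maxAttempts failed");
    }
    return false;
}
```

    A call might look like send_with_retry(function () use ($to, $subject, $message) { return mail($to, $subject, $message); }, 'error_log'). Note that this does nothing about mail() returning false for a message that was in fact queued; when that can happen, any retry still risks duplicates, so a single attempt plus a log entry may be the safer choice.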



  • Heh, I've seen that bug before. Ours was equally bad. It was in ASP.NET, and about once per new developer, someone would come up with the great idea that instead of showing a "Sorry" page on errors, he'd just send the user to the homepage. Then one day an edge case came around: data in our feed that hadn't been considered caused the homepage to crash (I know, TRWTF was the lack of unit tests and input validation). So it sent us an email and then redirected the user to the homepage... which then crashed, sent an error, and sent the user to the homepage... then repeat ad nauseam.

    But I still think the core WTF of your system was that mail sent back false when it had sent the mail. Admittedly that'd just bring it down to an infinite loop though :P 



  • The mail function is a standard one in PHP, so the only fault on their part could be a faulty configuration. Otherwise it's either PHP's fault for not returning true when a mail has been queued up, the mail server's fault for giving an error but also queuing the mail, or not really anybody's fault because the SMTP protocol isn't specific enough (which I don't know).



  • Yeah, I know it's a standard function in PHP, but reporting that a message wasn't sent when it actually was is a pretty big screwup. I'm not even trying to guess who caused it, just saying that it seems like the worst thing that went wrong in the whole puzzle :P

