Yahoo + POS website = fun



  •    A month or so ago, one of the websites under my care got slammed hard for a solid three weeks to a month.

       

    What really confused all of us was the gradual increase in bandwidth without any matching increase in access_log, error_log, or Google Analytics hits, so we suspected foul play. That break in the bandwidth is when we implemented an abuse throttle that would temporarily ban IPs with more than 60 connections to the server in the span of 1 minute. Unfortunately we had to make some minor "tweaks", as we were banning a number of corporate gateways and Googlebot, which allowed the culprit to get back into the swing of things like nothing had happened. It's at that point we figured out who or what was responsible. ~$4,800 USD in unscheduled bandwidth charges later, Yahoo turns out to have found a circular hole in the rat's nest of mod_rewrite rules for the website. More intelligent robots like Googlebot and MSN have a bunch of sanity checks to prevent this, both to save themselves money and to avoid killing what they're aggregating... but not Yahoo. Yahoo appears to get impatient and actually re-requests a transaction if it's taking too long, exacerbating an already bad situation. Then, to make things even more fun, looking back over the logs it looks like its various threads would often grab the same jpg or swf object repeatedly... just in case the file had changed in the span of a minute or so.

     

    The obvious WTF is that our website is a POS that needs to be printed out, burned, and mailed to the genius who implemented his own version of MVC that doesn't even come close to what an MVC framework is. The other WTF in my mind is that Yahoo appears oblivious to the havoc it creates with a search aggregator bot that is borderline DoS'ing its targets.
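
    For anyone wanting to try the same kind of throttle, here is a minimal sketch of the idea described above: count connections per IP over a sliding 60-second window, ban anything that goes over 60, and skip the check for an allowlist (the "tweak" for corporate gateways and Googlebot). The class name, the 300-second ban length and the allowlist parameter are all assumptions for illustration; the real throttle's implementation isn't shown in this thread.

```python
import time
from collections import defaultdict, deque


class ConnectionThrottle:
    """Temporarily ban clients that open too many connections in a time window.

    Hypothetical sketch: limit/window follow the numbers in the post
    (60 connections per minute); ban_seconds is an assumed value.
    """

    def __init__(self, limit=60, window=60.0, ban_seconds=300.0, allowlist=()):
        self.limit = limit
        self.window = window
        self.ban_seconds = ban_seconds
        self.allowlist = set(allowlist)   # e.g. known gateways or crawlers
        self._hits = defaultdict(deque)   # ip -> timestamps of recent connections
        self._banned_until = {}           # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Record one connection from ip and return True if it may proceed."""
        now = time.monotonic() if now is None else now

        # Allowlisted addresses are never counted or banned.
        if ip in self.allowlist:
            return True

        # Reject while an earlier ban is still in force.
        until = self._banned_until.get(ip)
        if until is not None and until > now:
            return False

        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()

        # More than `limit` connections inside the window: start a ban.
        if len(hits) > self.limit:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

    Note that an allowlist is exactly the hole a misbehaving crawler walks back in through, so it probably wants its own, looser limit rather than a free pass.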



  • @Ion9 said:

    The other WTF in my mind is that Yahoo appears oblivious to the havoc it creates with a search aggregator bot that is borderline DoS'ing its targets.

    A few years ago I had a problem with Googlebot DoSing a web app. The problem was that we had a random GET var on most requests to get around stupid caching proxies, and the bot got stuck in a loop and couldn't get out. By the time the DB server stopped accepting connections and the load hit 100, Googlebot was doing 900 GETs per second.



  • Yahoo's bot is also the single worst offender here -- Yahoo Slurp accounts for 183.62 MB, Googlebot for 31.15 MB and MSNBot-media for 16.25 MB.

    I'm really not sure what it's doing to make it use nearly 6 times as much bandwidth as Google, or 11 times as much as MSN, though I'm not using mod_rewrite anywhere so I can rule out your explanation!



  • Oooooooooooold.

    I once had an HTTP server behind a DSL line for some temporarily hosted data. After I linked to an image on it from a forum, many search engines started checking even the start page.

    That's nice - and most search engines didn't cost much traffic. They only made one connection at a time, and only about once a day.

    Not Yahoo, which repeatedly read the very same page over and over again - which cost a noticeable amount of bandwidth. I ended up banning Yahoo's bot by Apache rules, as the site couldn't be found on Yahoo anyway.



  •  I just checked my stats, fearing the worst.

    And What The Fuck?

     #   Hits           Files          KBytes           Visits         Hostname

     1   165  0.48%     115  0.45%     88944  12.39%     58  0.93%     llf320038.crawl.yahoo.net
     2   202  0.59%     172  0.67%     71044   9.90%    105  1.69%     llf520173.crawl.yahoo.net
     3   173  0.51%     139  0.54%     43039   6.00%     81  1.30%     llf520064.crawl.yahoo.net
     4    42  0.12%      28  0.11%     41219   5.74%     11  0.18%     llf320056.crawl.yahoo.net
     5    70  0.21%      53  0.21%     34655   4.83%     28  0.45%     llf520125.crawl.yahoo.net
     6   106  0.31%      95  0.37%     31317   4.36%     57  0.92%     llf520190.crawl.yahoo.net
     7    29  0.09%      25  0.10%     22313   3.11%     14  0.23%     llf520107.crawl.yahoo.net
     8     8  0.02%       8  0.03%     21892   3.05%      0  0.00%     (hostname blank)
     9    38  0.11%      22  0.09%     18813   2.62%     18  0.29%     llf520165.crawl.yahoo.net
    10    37  0.11%      31  0.12%     16527   2.30%      1  0.02%     (hostname blank)

     



  • @OperatorBastardusInfernalis said:

    Oooooooooooold.
    I know that the search engines' conquest to ass rape the internet, one site at a time, is old... but that just makes things worse. You'd think that in a symbiotic relationship (search & content) they wouldn't be parasitical or destructive toward the tech and coders who are not OCD about looking for flaws in their sites.



  • What gets me about this is that it not only cost the OP's company $4800 in bandwidth charges, it also probably cost Yahoo additional money in bandwidth charges, too. And if they're doing this to other sites, which it appears they are, they're really screwing themselves. There's all this talk of lost ad revenue at Yahoo. I wonder how much they could make back by writing decent software?





  • @dcardani said:

    What gets me about this is that it not only cost the OP's company $4800 in bandwidth charges, it also probably cost Yahoo additional money in bandwidth charges, too.

    The difference between the OP's bandwidth costs and Yahoo's would be dramatic. Companies like Yahoo lease their own lines and have a static pool of bandwidth, whereas the OP is obviously paying on some kind of commitment plan. In the long run this would require Yahoo to have more lines coming in, but the cost is not nearly what you think it is, at least relative to the $4800 the OP had to pay in overages.



  •  Yeah, the difference between a conglomerate web search company's network setup and a website with 200K unique visits... would be like comparing a small office LAN to a national telecom. The site is parked in oblivion and meant to only serve about 350-500 GB a month, not 1.8 TB.



  • @sentix said:

    Yeah, the difference between a conglomerate web search company's network setup and a website with 200K unique visits... would be like comparing a small office LAN to a national telecom. The site is parked in oblivion and meant to only serve about 350-500 GB a month, not 1.8 TB.

    I should take this opportunity to point out that TRWTF is that the OP's server is not using bandwidth throttling to prevent massive overcharges.  I mean, what if I was just some jerk who wanted to cost his company a lot of money?  I could sit and suck down tons of bandwidth before anyone noticed. 



  • @OperatorBastardusInfernalis said:

    Not Yahoo, which repeatedly read the very same page over and over again - which cost a noticeable amount of bandwidth. I ended up banning Yahoo's bot by Apache rules, as the site couldn't be found on Yahoo anyway.

     

    If you don't want to be indexed, why not use a robots.txt file? Or doesn't Yahoo honour it anymore?
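
    For what it's worth, robots.txt doesn't have to be all-or-nothing. A rough sketch, assuming the crawler honours the non-standard Crawl-delay directive (as far as I know Yahoo documented support for it in Slurp, while Googlebot ignores it): the example below uses Python's standard urllib.robotparser to check what a given set of rules would tell each bot. The rules themselves, including the 10-second delay and the /cgi-bin/ path, are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow Slurp down, keep everyone out of /cgi-bin/.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /cgi-bin/

User-agent: *
Disallow: /cgi-bin/
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Slurp may still index normal pages, just with a requested delay.
print("Slurp may fetch /index.html:", rules.can_fetch("Slurp", "/index.html"))
print("Slurp may fetch /cgi-bin/search:", rules.can_fetch("Slurp", "/cgi-bin/search"))
print("Slurp crawl delay:", rules.crawl_delay("Slurp"))

# Other bots fall through to the '*' entry, which sets no delay.
print("Googlebot crawl delay:", rules.crawl_delay("Googlebot"))
```

    This only checks what the file asks for; whether a particular bot actually obeys the delay is up to the bot.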



  • @Dalden said:

    If you don't want to be indexed, why not use a robots.txt file? Or doesn't Yahoo honour it anymore?
     

     

    Well, we do want to be indexed; for the website's target market it's #1 and #2 or #3 in the search results for all of its core keywords. Just wish it didn't involve the server being raped in the process.



  • @sentix said:

    Well, we do want to be indexed; for the website's target market it's #1 and #2 or #3 in the search results for all of its core keywords. Just wish it didn't involve the server being raped in the process.
     

    Well of course you're #1 - your site has thousands and thousands of pages of keyword-related content - all almost identical except for the URL, of course. Actually, after raping your server, you are lucky they didn't accuse you of "blacklisted SEO tactics" and ban your site. Abusing a poor Yahoo indexing bot just to crank your listings up - shame on you!

