Amazon's Cloud may be a little too Elastic



  • AWS Elastic Load Balancer sends 2 Million Netflix API Requests to Wrong Customer

    ELB is a load-balancing service that you can use to spread incoming traffic across many different EC2 server instances.   ELB, like all things in the AWS cloud, is a dynamic service that scales up and down on its own (managed by AWS internally) based on the number of inbound requests you have.

    When ELB instances are swapped in or out, the IP addresses of the load-balancing servers that clients connect to change.

    Amazon keeps this from being a glaring problem by setting very low TTLs (time-to-live) on the domain name mappings for those machines, but clients that cache the IP address and do not honor the TTL may still try to connect to the old ELB address after your app has moved to a replacement.
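
    A minimal sketch of what honoring the TTL means on the client side, in Python. Nothing here is AWS's or Netflix's actual code: `TtlRespectingCache`, `CachedAddress`, the injected `resolve` callable, and the example hostname are hypothetical names used for illustration. The resolver is injected because Python's standard `socket.getaddrinfo` does not report a record's TTL, so a real client would need a DNS library that does.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedAddress:
    ip: str
    expires_at: float  # clock reading at which this entry becomes stale


class TtlRespectingCache:
    """Hypothetical client-side cache that re-resolves once the TTL has elapsed."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # resolve(hostname) -> (ip, ttl_seconds); a stand-in for a real DNS lookup.
        self._resolve = resolve
        self._clock = clock
        self._entries: Dict[str, CachedAddress] = {}

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        entry = self._entries.get(hostname)
        # Re-resolve when there is no entry or the TTL has run out.
        if entry is None or now >= entry.expires_at:
            ip, ttl = self._resolve(hostname)
            entry = CachedAddress(ip=ip, expires_at=now + ttl)
            self._entries[hostname] = entry
        return entry.ip


if __name__ == "__main__":
    # Fake resolver: the name moves to a new address after the first answer.
    answers = iter([("10.0.0.1", 60.0), ("10.0.0.2", 60.0)])
    fake_now = [0.0]
    cache = TtlRespectingCache(lambda host: next(answers), clock=lambda: fake_now[0])
    print(cache.lookup("elb.example.com"))  # first lookup hits the resolver
    fake_now[0] = 61.0
    print(cache.lookup("elb.example.com"))  # TTL expired, so it resolves again
```

    A client that skips the expiry check keeps returning the first address indefinitely, which is how requests can end up at a load balancer address that has since been handed to someone else.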

    For example, this customer received Netflix’s API traffic from one (or more) clients for 4 days, accounting for 30% of their system's daily traffic. That is bandwidth and server capacity the customer has to pay for, spent on traffic that isn't theirs.

    Considering that the old ELB may have been assigned to a new customer, this is where you end up with the frequent forum questions: “Why am I getting all this traffic for a different site?”

     



  • Tragic, but it's not really Amazon's fault people are ignoring the TTL value.



  • @blakeyrat said:

    Tragic, but it's not really Amazon's fault people are ignoring the TTL value.

    Flawed design. It should be easy for Amazon's customers to do the right thing and hard for them to do the wrong thing. When they do the wrong thing, it really oughtn't adversely impact other customers.

    No, I don't have a better design right now; they're not paying me to come up with one.



  • @fennec said:

    @blakeyrat said:
    Tragic, but it's not really Amazon's fault people are ignoring the TTL value.

    Flawed design.

    And a perfect example of how an engineer can design something that, on paper, is perfect, but when exposed to the real world has issues. Good Aesop for that, if you need one.

    @fennec said:

    It should be easy for Amazon's customers to do the right thing and hard for them to do the wrong thing.

    ??? Amazon's customers aren't involved. Various random ISPs around the world that inexplicably ignore the TTL value in DNS are doing the wrong thing. The customers who use Elastic Load Balancing are innocent victims, as is Amazon.

    @fennec said:

    No, I don't have a better design right now; they're not paying me to come up with one.

    If it was only 15 or 30 minutes of traffic, of course you could fix this by just letting ELB DNS names "lie fallow" for a few hours before switching from one client server to another. Since he saw traffic over four days I doubt this is practical...

    The solution is what they're probably already doing: ignore the problem, and refund any unwarranted costs it produces.

    Now what I really want to know is whether this affects ELBs using HTTPS! Then your site is serving up bad security certificates to the poor browser that thought it was getting Netflix, and some sucker Netflix customer thinks their site has been pwned. That would be pretty much a disaster... but since I haven't heard of it happening, it's probably a non-issue.



  • @blakeyrat said:

    @fennec said:
    It should be easy for Amazon's customers to do the right thing and hard for them to do the wrong thing.

    ??? Amazon's customers aren't involved. Various random ISPs around the world that inexplicably ignore the TTL value in DNS are doing the wrong thing. The customers who use Elastic Load Balancing are innocent victims, as is Amazon.

    Amazon don't just get to throw up their hands and say "Oh, blame the ISPs!" - they're designing, marketing and selling services for the Internet; it's they who are responsible for the way those systems perform in the Real World where not everyone on the Internet follows the rules. It's significantly Amazon's fault that they don't defend against this.

    Amazon's customer here is Netflix. Setting something up so their traffic is going to the wrong place is easy: easy enough that it happened, and easy enough that there are dozens of other cases where it's happened. A better-designed system would make this much more difficult and extraordinary. Amazon's load balancers, therefore, are simply not really a well-designed system in that respect. They've cut corners; their software is defective. Which happens all the time at all sorts of places, naturally, and sometimes it matters more than others.

    As for the solution: you're almost certainly right that the best solution for Amazon's bottom line, especially in the short- to intermediate-term, probably *is* to make it Someone Else's Problem. I would hardly hold this up as a standard of excellence, though, and wouldn't recommend the tactic for an organization without Amazon's clout that can't get away with quite so much.



  • @fennec said:

    Amazon don't just get to throw up their hands and say "Oh, blame the ISPs!"

    Sure they do.

    @fennec said:

    It's significantly Amazon's fault that they don't defend against this.

    Defend against it... how? Physically chainsaw their way into every ISP that doesn't respect TTL and forcibly configure their servers?

    @fennec said:

    Amazon's customer here is Netflix. Setting something up so their traffic is going to the wrong place is easy: easy enough that it happened, and easy enough that there are dozens of other cases where it's happened.

    I don't think you understand the problem. It has absolutely nothing to do with Amazon's load balancer, except that it assumes DNS servers follow the rules.

    You're welcome to call that "defective", I guess.

    @fennec said:

    As for the solution: you're almost certainly right that the best solution for Amazon's bottom line, especially in the short- to intermediate-term, probably is to make it Someone Else's Problem.

    It is Someone Else's Problem. I really honestly think you have no clue what the actual WTF is here.

    Look, Amazon has three choices:
    1) Spend millions of dollars and years of effort trying to modify their system to work with defective DNS servers
    2) Spend millions of dollars and years of effort trying to get ISPs to fix their buggy-ass DNS servers
    3) Just shrug and refund any erroneous charges that result from this

    They've picked option 3. Cope.

    @fennec said:

    I would hardly hold this up as a standard of excellence, though, and wouldn't recommend the tactic for an organization without Amazon's clout that can't get away with quite so much.

    On the contrary; it's all the ISPs with buggy DNS servers that are "getting away with something". Amazon's following the rules. Amazon's clients are following the rules.



  • @blakeyrat said:

    @fennec said:
    I would hardly hold this up as a standard of excellence, though, and wouldn't recommend the tactic for an organization without Amazon's clout that can't get away with quite so much.

    On the contrary; it's all the ISPs with buggy DNS servers that are "getting away with something". Amazon's following the rules. Amazon's clients are following the rules.

    But the article says it's the clients that incorrectly cache the IP addresses, not the ISPs.

    Edit: not Amazon's clients, Amazon's clients' clients.



  • @fennec said:

    they are responsible for the way those systems perform in the Real World where not everyone on the Internet follows the rules.
     

    This is going to sound more hostile than is intended, but: in what rational school of thought can responsibility be assigned to people who come up with rules (that, when followed, mean things work nicely) instead of people who don't follow those rules?

    It's the old tradeoff between freedom and responsibility: if you are free to put whatever you want on the internet, you bear the responsibility of making sure your part works correctly, not the other people.  The only time "the other people" should bear that responsibility is if they are enforcing what you can and cannot connect to the net.  So in this case, the responsibility really should lie wholly on those entities that are not correctly honoring the TTL specifications.



  • @too_many_usernames said:

    @fennec said:

    they are responsible for the way those systems perform in the Real World where not everyone on the Internet follows the rules.
     

    This is going to sound more hostile than is intended, but: in what rational school of thought can responsibility be assigned to people who come up with rules (that, when followed, mean things work nicely) instead of people who don't follow those rules?

    It's the old tradeoff between freedom and responsibility: if you are free to put whatever you want on the internet, you bear the responsibility of making sure your part works correctly, not the other people.  The only time "the other people" should bear that responsibility is if they are enforcing what you can and cannot connect to the net.  So in this case, the responsibility really should lie wholly on those entities that are not correctly honoring the TTL specifications.

    The problem here is the lack of distinction between being "technically correct" and the "real world". The former is what you're arguing, as are other people, while the latter is what it is. While it is technically correct that Amazon's services should work with low TTLs, in the real world they do not because it is a well-known fact that there are services out there that do not respect the TTLs of DNS. Because it is well known, the responsibility is not as black and white as it might seem on the surface. Yes, it's still the fault of the services that do not honor the TTLs, but at the same time it is also Amazon's fault for blindly assuming that all services will do the Right Thing ™. While Amazon should be able to trust that their service will work correctly, they also should have done something to ensure there is a reasonable workaround when a faulty service caches DNS records longer than it should. What this is, I do not know. But those gurus at Amazon are infinitely smarter than I - at least, they theoretically should be.



  • @dohpaz42 said:

    The problem here is the lack of distinction between being "technically correct" and the "real world". The former is what you're arguing, as are other people, while the latter is what it is. While it is technically correct that Amazon's services should work with low TTLs, in the real world they do not because it is a well-known fact that there are services out there that do not respect the TTLs of DNS.



    I agree... it's like designing a webpage according to standards and blaming the browser manufacturers when it does not render properly...



  • ♿ (Parody)

    I don't really know much about how TTLs work, and I don't really care. The bottom line is that Amazon and their customers look like idiots, not unlike blue screens from flaky windows drivers making MS look incompetent.



  • @Spectre said:

    But the article says it's the clients that incorrectly cache the IP addresses, not the ISPs.

    Ok well let's slow down and define "clients". Amazon: no problem. Amazon's clients (Netflix, guy who posted that ticket): no problem. Amazon's clients' clients (Netflix API users): problem. Personally in this case, I'd call Netflix the client and the guy using the Netflix API the consumer, to save confusion. But... point is, neither Amazon nor Netflix is at fault.

    Why do I mention the ISPs? While it's technically possible the consumer is running their own DNS server, it's much, much, much more likely that they're using a DNS server owned by their ISP. It's the misconfigured DNS server that's causing the problem. So yeah, I could have been pedantic and made sure I always said "misconfigured DNS server" instead of "ISP", but I'm not the pedantic type.

    @dohpaz42 said:

    Yes, it's still the fault of the services that do not honor the TTLs, but at the same time it is also Amazon's fault for blindly assuming that all services will do the Right Thing™.

    But now you're assuming: do you even know that's the case?

    For all we know from what's presented here, this issue was brought up on day one of Amazon Load Balancer design meetings, and Amazon decided to ignore the technical issue in favor of just compensating victims of it. System design is about more than just a class tree and some APIs.

    @boomzilla said:

    I don't really know much about how TTLs work, and I don't really care. The bottom line is that Amazon and their customers look like idiots, not unlike blue screens from flaky windows drivers making MS look incompetent.

    What the...?

    How did you read that article, and come away with the impression that Amazon and their customers look like idiots?


  • ♿ (Parody)

    @blakeyrat said:

    @boomzilla said:
    I don't really know much about how TTLs work, and I don't really care. The bottom line is that Amazon and their customers look like idiots, not unlike blue screens from flaky windows drivers making MS look incompetent.

    What the...?

    How did you read that article, and come away with the impression that Amazon and their customers look like idiots?

    I must admit that I never saw that particular article. I read some other coverage that talked about Netflix, though as a user of neither Netflix nor AWS, I mostly skimmed it, and didn't pay attention to the technical details.

    In any case... Firstly, Netflix looks like idiots from the consumer point of view, because stuff doesn't work. Then the consumer finds out that Netflix's issues come from something Amazon is doing. Do you suppose the average Netflix consumer can even spell TTL? All he knows is that he can't watch his movies because of something Netflix and/or Amazon did. The rest of the Internets were working for him, right?

    How do you not see this? You're usually the first guy to try to see things from the end user's perspective.



  • @boomzilla said:

    In any case...Firstly, Netflix looks like idiots from the consumer point of view, because stuff doesn't work.

    Sometimes shit happens and stuff doesn't work the way it should and there's no way for anybody to fix it. Also assigning blame to Amazon when virtually all load balancing systems (with more than one consumer attached) have the same flaw seems kind of unfair. Amazon's just the biggest. Also, we have no statistics to show how often it's happened-- for all we know this is the first time ever.

    The funny thing is I'm usually the first person who gets on the whole: "shit doesn't work, you need to fucking fix it" bandwagon. But in this case, I understand that there's virtually nothing Amazon can do, and definitely nothing economical Amazon can do, so I accept it.

    @boomzilla said:

    Then the consumer finds out that Netflix's issues come from something Amazon is doing.

    No it doesn't. Where is this misconception coming from? Amazon doesn't magically control every DNS server on Earth! Why do you think it does? Unless you expand "something Amazon is doing" to the ridiculous extreme of "offering a service other companies can subscribe to."

    @boomzilla said:

    All he knows is that he can't watch his movies because of something Netflix and/or Amazon did.

    Now you're assuming: the traffic in question was API calls. Which could mean there were consumers, or consumer devices, that were unable to watch movies. But it just as likely means some spammer's website to get Google clicks based on movie titles stopped working. The fact is, we don't know the impact of this problem to consumers, we only know the impact to the Amazon client who mistakenly got the traffic. (Which was basically: "huh. That's weird.")

    @boomzilla said:

    How do you not see this? You're usually the first guy to try to see things from the end user's perspective.

    You prove to me the end user was affected, and yeah I'll get angry. Right now, AFAIK, that evidence doesn't exist, it's just an assumption rattling around in your skull. But I'll aim that anger at the correct place: that user's ISP, the one who fucked up their DNS server. I'm not going to blame Amazon or Netflix for something that isn't their fault.

    Going back to the Windows bluescreen example: you're entirely right that a bluescreen always looks like Microsoft's fault. But it's our duty as People Who Know Better to trace the blame and blame the correct party and, depending on the circumstances, put pressure on the correct party to fix the problem. Just as I'd expect an expert in insurance companies to be able to correctly assign the blame for AIG's collapse, something I'm completely unqualified to do.



  • I agree with blakey in this thread, and he's been presenting his argument without any trollbait.  +1


  • ♿ (Parody)

    @blakeyrat said:

    You prove to me the end user was affected, and yeah I'll get angry. Right now, AFAIK, that evidence doesn't exist, it's just an assumption rattling around in your skull. But I'll aim that anger at the correct place: that user's ISP, the one who fucked up their DNS server. I'm not going to blame Amazon or Netflix for something that isn't their fault.

    Yeah, I have no clue. Like I said, I've only skimmed stuff, and saw something to the effect that Netflix was affected. I'll agree with you about the root cause here. I'm just talking about the PR issues involved, which, as usual have very little relation to the technical issues. Maybe it didn't make much of a difference to any end users.

    @blakeyrat said:

    Just as I'd expect an expert in insurance companies to be able to correctly assign the blame for AIG's collapse, something I'm completely unqualified to do.

    That's easy, and you only need to know about NY politics, not insurance. It's Eliot Spitzer's fault. He chased out the CEO who knew WTF he was doing. Note that all the funny business happened after Greenberg left. But Spitzer got to the governor's mansion so I guess it all worked out in the end.



  • @boomzilla said:

    @blakeyrat said:
    You prove to me the end user was affected, and yeah I'll get angry. Right now, AFAIK, that evidence doesn't exist, it's just an assumption rattling around in your skull. But I'll aim that anger at the correct place: that user's ISP, the one who fucked up their DNS server. I'm not going to blame Amazon or Netflix for something that isn't their fault.

    Yeah, I have no clue. Like I said, I've only skimmed stuff, and saw something to the effect that Netflix was affected. I'll agree with you about the root cause here. I'm just talking about the PR issues involved, which, as usual have very little relation to the technical issues. Maybe it didn't make much of a difference to any end users.

    @blakeyrat said:

    Just as I'd expect an expert in insurance companies to be able to correctly assign the blame for AIG's collapse, something I'm completely unqualified to do.

    That's easy, and you only need to know about NY politics, not insurance. It's Eliot Spitzer's fault. He chased out the CEO who knew WTF he was doing. Note that all the funny business happened after Greenberg left. But Spitzer got to the governor's mansion so I guess it all worked out in the end.

     

    Umm... what?

    If you think this is the fault of anything recent, you understand a lot less than you think you do.  The crash is the result of economic screwups that have been building since the Reagan administration and were more or less inevitable since Clinton's time.


  • ♿ (Parody)

    @Mason Wheeler said:

    @boomzilla said:
    That's easy, and you only need to know about NY politics, not insurance. It's Eliot Spitzer's fault. He chased out the CEO who knew WTF he was doing. Note that all the funny business happened after Greenberg left. But Spitzer got to the governor's mansion so I guess it all worked out in the end.

    Umm... what?

    If you think this is the fault of anything recent, you understand a lot less than you think you do.  The crash is the result of economic screwups that have been building since the Reagan administration and were more or less inevitable since Clinton's time.

    If you thought I said that, then your reading comprehension is a lot worse than you think it is.

    He was talking about AIG (and presumably their credit default swaps), not the larger crisis. And that business really got going after Spitzer's extortion prosecution witch hunt drove Greenberg out of AIG. I mean, there was lots of stupid going on, but I think that without Client #9, AIG probably could have avoided the bailout.



  • @Sutherlands said:

    I agree with blakey in this thread, and he's been presenting his argument without any trollbait. +1

    Oh. Uh. Your mom's a whore?



  • @blakeyrat said:

    @Sutherlands said:

    I agree with blakey in this thread, and he's been presenting his argument without any trollbait. +1

    Oh. Uh. Your mom's a whore?

    Errr, he said trollbait. Everybody knows about Sutherlands's mother's interesting choice of employment, and some have even enjoyed it; I mean, the flesh trade is not that bad and it was what paid for his education. But trollbait should be more pungent or ruthless and possibly inaccurate, and your statement is neither.

    I encourage you to do better



  • @blakeyrat said:

    @Sutherlands said:

    I agree with blakey in this thread, and he's been presenting his argument without any trollbait. +1

    Oh. Uh. Your mom's a whore?

    -1

    @serguey123 said:

    @blakeyrat said:

    @Sutherlands said:

    I agree with blakey in this thread, and he's been presenting his argument without any trollbait. +1

    Oh. Uh. Your mom's a whore?

    Errr, he said trollbait. Everybody knows about Sutherlands's mother's interesting choice of employment, and some have even enjoyed it; I mean, the flesh trade is not that bad and it was what paid for his education. But trollbait should be more pungent or ruthless and possibly inaccurate, and your statement is neither.

    I encourage you to do better

    -2! D=


  • @serguey123 said:

    I encourage you to do better

    Your mom is... two whores?

    Edit: reminded me of this:



  • @Sutherlands said:

    -2! D=

    (v)(;,,;)(v)



  • @boomzilla said:

    That's easy, and you only need to know about NY politics, not insurance. It's Eliot Spitzer's fault. He chased out the CEO who knew WTF he was doing. Note that all the funny business happened after Greenberg left. But Spitzer got to the governor's mansion prostitutes so I guess it all worked out in the end.
     

    ESTFY.



  • @blakeyrat said:

    @serguey123 said:
    I encourage you to do better
    Your mom is... two whores?

    Sweeeet!  How is that bad?  The only problem I have with this is that it is a bit vague and can become multiple choice:

    1. Fat whore
    2. Siamese twins whores
    3. Adopted by two womens
    4. Combination of the above

    So pick one or add to the list, the sky is the limit!



  • @blakeyrat said:

    @dohpaz42 said:
    Yes, it's still the fault of the services that do not honor the TTLs, but at the same time it is also Amazon's fault for blindly assuming that all services will do the Right Thing™.

    But now you're assuming: do you even know that's the case?

    For all we know from what's presented here, this issue was brought up on day one of Amazon Load Balancer design meetings, and Amazon decided to ignore the technical issue in favor of just compensating victims of it. System design is about more than just a class tree and some APIs.

    You're right, but my point wasn't to assign blame*. In fact, even if I were Netflix (or some other customer), I'd expect not to have this issue happen because I would expect that the necessary precautions and fail safes were in place; that regardless of how the system is implemented, and regardless of the invisible middle-men involved (in this case, DNS), that my traffic would not get redirected to some poor schmuck's website instead of mine (and vice versa for the poor schmuck).

    Yes, that's a catch-22, because I'm expecting a "perfect world" situation, which rarely happens in the real world. But, I also keep in mind that there are no perfect services, and Shit Happens ™. Hopefully something will be done - to whatever extent possible - to reduce the chance of this happening again.



  • @dohpaz42 said:

    You're right, but my point wasn't to assign blame*

    Your asterisk doesn't go anywhere.

    Anyway, the real problem is that you're making this into some huge federal case, when it's more a "eh, that's a bit weird" kind of situation. If and when this bug causes actual damage, then let's all get outraged, 'kay? Right now it's (probably) completely harmless. Lighten up, Francis.

    @dohpaz42 said:

    Hopefully something will be done - to whatever extent possible - to reduce the chance of this happening again.

    And you assume Amazon hasn't already done this because...?



  • @blakeyrat said:

    @dohpaz42 said:
    You're right, but my point wasn't to assign blame*

    Your asterisk doesn't go anywhere.

    It does if you read the tags. :)



  • @dohpaz42 said:

    @blakeyrat said:
    @dohpaz42 said:
    You're right, but my point wasn't to assign blame*

    Your asterisk doesn't go anywhere.

    It does if you read the tags. :)

    But blakey gets his posts emailed to him, which excludes the tags. Instead, his asterisk resolver got lost following a bad DNS entry through AWS's redirected virtual servers. It eventually got picked up as "* These claims have not been evaluated by the FDA". This caused him much confusion and anger, and he is blaming you, even though it was the post emailer that dropped the relevant information, not you. Fortunately, CS compensates you for the lost tag. Unfortunately, you've just lost a customer.



  •  * this is a phishing footnote for Blakey. It contains no relevant information.



  • @Xyro said:

    But blakey gets his posts emailed to him, which excludes the tags.

    Oh. This explains a lot, actually*.

    @Xyro said:

    Instead, his asterisk resolver got lost following a bad DNS entry through AWS's redirected virtual servers.

    Nah, it's the ISP. It's always the ISP*.

    @Xyro said:

    This caused him much confusion and anger, and he is blaming you the world...

    Also because it's Tuesday.



  • @dohpaz42 said:

    Filed under: * For example why blakey was so mad about his e-mail quota getting farked

    That was my work email.

    And the real issue is that tags get wiped every so often, so in a couple months when those tags are actually gone, you'll find my post is spot-on.



  • @blakeyrat said:

    And the real issue is that tags get wiped every so often, so in a couple months when those tags are actually gone, you'll find my post is spot-on.

    It is interesting the things that I'm learning today. But, that's okay, in a couple of months this thread (as well as the other) will be buried on page n+1, so it's as moot as a tree falling in a forest. :)



  • @dohpaz42 said:

    @blakeyrat said:

    And the real issue is that tags get wiped every so often, so in a couple months when those tags are actually gone, you'll find my post is spot-on.

    It is interesting the things that I'm learning today. But, that's okay, in a couple of months this thread (as well as the other) will be buried on page n+1, so it's as moot as a tree falling in a forest. :)

    But what about posterity?  Someone might want to quote an old thread (it happens) or necro post or something to do with making blakeyrat happy (very important). We should do what Spectate does: he keeps images of all his old posts, so the information is always there, never lost.



  • @too_many_usernames said:

    This is going to sound more hostile than is intended, but: in what rational school of thought can responsibility be assigned to people who come up with rules (that, when followed, mean things work nicely) instead of people who don't follow those rules?

    IT Security.  TYVM.  HTH.  HAND.

    That having been said, it should be pretty trivial to put a monitor on the 'fallow' IPs, and if people are still hitting them, don't use them for something else yet.  One could even have the customers come up with an app that Amazon could use to notify the customers' customers that they have IP cache issues, and how to fix them (including the possibilities of bitching to their ISPs or switching ISPs).  It's all pretty trivial if you just stop to reflect that tcpdump exists.  And it doesn't matter if the traffic's encrypted, because all you care about is that the traffic *exists*, when the IP address isn't being given to anybody.
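
    The monitoring idea could be sketched roughly as follows, in Python. This is an illustration under assumptions, not a description of how AWS manages its address pool: `FallowPool`, `quiet_period`, and the method names are hypothetical, and the traffic observations (for example from a packet capture) are assumed to arrive as calls to `record_hit`.

```python
from typing import Dict, List


class FallowPool:
    """Hypothetical tracker: an address is reusable only after a quiet period."""

    def __init__(self, quiet_period: float) -> None:
        # Seconds without any observed traffic required before reuse.
        self.quiet_period = quiet_period
        # Maps each retired address to the last time traffic (or retirement) was seen.
        self._last_seen: Dict[str, float] = {}

    def retire(self, ip: str, now: float) -> None:
        """Mark an address as released by its previous owner."""
        self._last_seen[ip] = now

    def record_hit(self, ip: str, now: float) -> None:
        """Note stray traffic to a retired address, restarting its quiet period."""
        if ip in self._last_seen:
            self._last_seen[ip] = now

    def reusable(self, now: float) -> List[str]:
        """Return the retired addresses that have been quiet for long enough."""
        return sorted(
            ip
            for ip, seen in self._last_seen.items()
            if now - seen >= self.quiet_period
        )


if __name__ == "__main__":
    pool = FallowPool(quiet_period=3600.0)
    pool.retire("10.0.0.1", now=0.0)
    pool.retire("10.0.0.2", now=0.0)
    pool.record_hit("10.0.0.1", now=3000.0)  # a stale client is still calling
    print(pool.reusable(now=3600.0))
```

    Only the existence of traffic matters here, not its content, so encrypted requests count the same as plain ones. The cost is that an address hit by a persistent stale client never becomes reusable, so a real system would presumably need an upper bound on how long it waits.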



  • @tgape said:

    @too_many_usernames said:
    This is going to sound more hostile than is intended, but: in what rational school of thought can responsibility be assigned to people who come up with rules (that, when followed, mean things work nicely) instead of people who don't follow those rules?

    IT Security.  TYVM.  HTH.  HAND.

    That having been said, it should be pretty trivial to put a monitor on the 'fallow' IPs, and if people are still hitting them, don't use them for something else yet.  One could even have the customers come up with an app that Amazon could use to notify the customers' customers that they have IP cache issues, and how to fix them (including the possibilities of bitching to their ISPs or switching ISPs).  It's all pretty trivial if you just stop to reflect that tcpdump exists.  And it doesn't matter if the traffic's encrypted, because all you care about is that the traffic *exists*, when the IP address isn't being given to anybody.

    Your solution: Put a monitor on the fallow IPs and don't use them if someone is still hitting them.

    Why it doesn't work: The article mentioned they had 4 days of bad traffic.  That's a long time to have the addresses be fallow.  Couple this with IPv4 addresses being in high demand, and you have a very expensive non-solution.

     

    Your solution: Create an app that tells customers that they have cache issues, and how to fix them.

    Why it doesn't work: ISPs have to care.



  • @Sutherlands said:

    Your solution: Put a monitor on the fallow IPs and don't use them if someone is still hitting them.

    Why it doesn't work: The article mentioned they had 4 days of bad traffic.  That's a long time to have the addresses be fallow.  Couple this with IPv4 addresses being in high demand, and you have a very expensive non-solution.

    It didn't take four days to suddenly be a problem.  It took four days to fix the problem.  Considering the amount of traffic the story mentioned, I don't think it would've taken a very long fallow period to notice that some people were continuing to hit it a lot.

    @Sutherlands said:

    Your solution: Create an app that tells customers that they have cache issues, and how to fix them.

    Why it doesn't work: ISPs have to care.

    ISPs would care, if they started getting the volume of support calls that could generate, or if their customers started leaving them in droves because of their poor service.  Admittedly, I'll be leaving Comcast as soon as they have viable competition in my area (traditional Verizon DSL at 17,000 feet from the CO does not count as viable competition...)

    Alternate solution: have the customer's apps modified so their client side portion incorporates a DNS engine that works.  Any customer that doesn't want to comply gets charged the fees for all of the bandwidth that's taken up by their clients using stale IP addresses, and any additional time for the fallow IPs that would be released but for this traffic.



  • @tgape said:

    It didn't take four days to suddenly be a problem.  It took four days to fix the problem.

    They didn't fix the problem. The traffic went away after 4 days either because the "bad" DNS name got assigned to a different AWS customer, or because the ISP that didn't respect the TTL value finally expired the DNS entry and re-queried it.

    @tgape said:

    ISPs would care,

    You are... extremely optimistic. Just like they care about the IPv6 rollout?

    @tgape said:

    Alternate solution: have the customer's apps modified so their client side portion incorporates a DNS engine that works.

    But the bad DNS servers are owned by the ISPs. The same ISPs that haven't fixed the problem for the last few decades that the TTL value has been required. The customer has nothing to do with the problem. Neither does Amazon. As we've said before in this thread.

