The discovery of Apache ZooKeeper’s poison packet



  • The discovery of Apache ZooKeeper’s poison packet

    Just a few :wtf:s here:

    :wtf: ZooKeeper does not validate the length field of an incoming packet. Turns out you can ask it to handle a packet of 1.7GB which, when you try to allocate a buffer for that...for some odd reason...gets an out of memory error from the Linux kernel. It seems to me I should be able to put a finger on why...
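
    A minimal sketch of the missing check, assuming a wire format with a 4-byte big-endian length prefix followed by the payload. `FrameReader`, `readFrame`, and `MAX_FRAME_BYTES` are hypothetical names for illustration, not ZooKeeper's actual code, and the 1 MiB cap is an arbitrary assumption:

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper for illustration only; not ZooKeeper's implementation.
final class FrameReader {
    // Assumed upper bound for a single frame; choose a value that fits the protocol.
    static final int MAX_FRAME_BYTES = 1024 * 1024;

    // Reads a 4-byte big-endian length prefix, validates it, then reads the payload.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0 || length > MAX_FRAME_BYTES) {
            // Reject before allocating, so a corrupt length cannot trigger a huge allocation.
            throw new IOException("Invalid frame length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }
}
```

    With a check like this, a corrupted length becomes an ordinary I/O error on one connection instead of an allocation failure.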

    :wtf: Zookeeper uses a leadership strategy where one single thread of many threads is elected leader and handles that role (that's not a wtf). Part of the role includes pre-processing the incoming packets (also not a wtf). But when the pre-processor tries to allocate a 1.7GB buffer, it gets an out of memory exception...which was not handled. Let me say that again: Not caught. So now the elected leader thread has died...and the group would elect another leader, but the group doesn't know the leader is dead because the exception that caused its death went into the bit bucket. This leaves the group leaderless and, for some odd reason, it doesn't really work well without a leader. I suppose it could be the fact that there's no thread receiving packets anymore.
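
    One way to keep a dying thread from vanishing silently is an uncaught-exception handler that records the failure where the rest of the group can see it. This is only a sketch under that assumption: `WatchedWorker` and its fields are hypothetical names, and nothing here reflects how ZooKeeper is actually structured.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical wrapper for illustration only.
final class WatchedWorker {
    // Holds whatever Throwable killed the worker, if any.
    final AtomicReference<Throwable> failure = new AtomicReference<>();
    // Released once the worker has died from an uncaught Throwable.
    final CountDownLatch died = new CountDownLatch(1);

    Thread start(Runnable task) {
        Thread t = new Thread(task, "worker");
        // The handler runs in the dying thread, just before it terminates.
        t.setUncaughtExceptionHandler((thread, error) -> {
            failure.set(error);
            died.countDown();
        });
        t.start();
        return t;
    }
}
```

    A supervisor could wait on `died` and then trigger a new election or shut the process down, rather than carrying on without a leader.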

    :wtf: Why was the input length bad? Well, it got replaced by ^#^!garbage#&!*@& from a bug. But this is TCP/IP, so the corruption in the packet should be detected, right? Well, they're using transport mode (which I don't understand well) but because they're in that mode, the Linux kernel skips validation of the checksum for the payload of the TCP packet. That goes back to RFC 3948, which defines transport mode and says, "... the client MAY forgo the validation of the TCP/UDP checksum." MAY...not SHOULD, not MUST, not even IT IS A GOOD IDEA. (It is not.) But the authors of the Linux kernel, in their wisdom, interpreted "MAY" to mean "ABSOLUTELY NEVER EVER SHALL WE CHECK IT!" Nosireee, we'll just let that corruption flow right on into the application! :wtf: So, anyway, should we call this an RFC bug? Or a Linux bug? See, the thing is, transport mode is normally used for VPN networks, which would have other ways of detecting the problem. Funny thing, though, the RFC and Linux both assume no one would ever have a good reason to use transport mode for anything other than VPN.... How about let's say it is a bug in both?

    :wtf: And whose fault is the last WTF, the garbage in the packet? Well, that's not entirely clear. But it appears that if you are running ZooKeeper version waning.moon under Linux kernel version high.noon and that, in turn, runs under the Xen paravirtual machine version winter.solstice and you use the aesni-intel module in Linux, it seems to clobber occasional TCP/IP packets at random. Was it Intel's aes-ni instruction that did it? Was it something in the aesni-intel module? Was it something in Xen? Have fun guessing.

    ...and you thought your bugs were hard to track down.

    ...and you thought that :wtf:s aren't everywhere. Zookeeper, Linux, RFC, Intel(?), Xen(?)...if any of them had worked right, no problem. Instead, we have the bugz from hell.



  • Isn't this the IPSec can mangle TCP packets and not checksum them bug/feature?



  • @MathNerdCNU said:

    Isn't this the IPSec can mangle TCP packets and not checksum them bug/feature?

    Ummmm....I think so, but that description would be the short-short version.



  • Just checking. When I read about the ZooKeeper thing a few weeks ago that was the gist I got of it. At least that was the root cause, I think. There was still a bug in ZooKeeper but fixing it wouldn't fix/mitigate IPSec passing on corrupt packets.



  • Yes, this is Ars Technica reposting the original blog post, which was a week or two ago.



  • Isn't the "author" the original author? Normally Ars Technica is pretty on-key with security-issues.



  • Yes, the byline is the same.

    The infobox above the article explains it:

    This story originally appeared on PagerDuty.

    May 7, 2015 http://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/
    May 13, 2015 http://arstechnica.com/information-technology/2015/05/the-discovery-of-apache-zookeepers-poison-packet/



  • So repost-ception. Glad I'm not going insane and thought this was something new.



  • I hadn't seen any details on it before. I posted it today because it finally made Schneier on Security.



  • Apache Zookeeper

    Should've used IMPS for that.



  • @CoyneTheDup said:

    That goes back to RFC 3948, which defines transport mode and says, "... the client MAY forgo the validation of the TCP/UDP checksum."

    If a client MAY forgo the validation, that clearly means that the application MUST NOT rely on it having been done.

    @CoyneTheDup said:

    But the authors of the Linux kernel, in their wisdom, interpreted "MAY" to mean "ABSOLUTELY NEVER EVER SHALL WE CHECK IT!"

    Well, when the application may not rely on the check anyway, there is no point in doing it. Besides, all the tunnelling applications do indeed do it (on the decapsulated/decoded packet). And there are cases where shaving off those CPU cycles does help, like on router or NAS hardware. So they interpreted the "MAY" as "we will, because we have cases where it makes sense and we don't know of anything so far that would break".

    @CoyneTheDup said:

    So, anyway, should we call this an RFC bug? Or a Linux bug?

    It's squarely a ZooKeeper bug. The RFC is clear, so it does not have a bug. The Linux kernel follows the RFC, so it does not have a bug. ZooKeeper relies on something the RFC says MAY not be there, so it has a bug.
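
    If the application cannot rely on the transport checksum, it can carry its own. A minimal sketch, assuming a made-up frame layout of a 4-byte CRC32 followed by the payload; `ChecksummedFrame`, `wrap`, and `unwrap` are hypothetical names, not anything ZooKeeper provides:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical frame format for illustration: [4-byte CRC32][payload].
final class ChecksummedFrame {
    // Prepends the CRC32 of the payload.
    static byte[] wrap(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(4 + payload.length)
                .putInt((int) crc.getValue())
                .put(payload)
                .array();
    }

    // Verifies the CRC32 and returns the payload, or throws if the frame is corrupt.
    static byte[] unwrap(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("Frame too short");
        }
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int expected = buf.getInt();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != expected) {
            throw new IllegalArgumentException("Checksum mismatch");
        }
        return payload;
    }
}
```

    A CRC guards against accidental corruption only; it does nothing against deliberate tampering.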



  • @Maciejasjmj said:

    Should've used IMPS for that.

    Could have worked as well. 😄

    @Bulb said:

    If a client MAY forgo the validation, that clearly means that the application MUST NOT rely on it having been done.

    @Bulb said:

    Well, when the application may not rely on the check anyway, there is no point in doing it. Besides, all the tunnelling applications do indeed do it (on the decapsulated/decoded packet). And there are cases where shaving off those CPU cycles does help, like on router or NAS hardware. So they interpreted the "MAY" as "we will, because we have cases where it makes sense and we don't know of anything so far that would break".

    Not buying. The proper way to do this would be to allow the application to request that this check be bypassed in the interests of saving time. As it is, the user of the protocol is left to say to himself, "Oooo, the client MAY not do this so I must always do it, just in case the client did not." Since no one is even going to know that they need to check, always omitting the check in the kernel is an invitation to the kind of breakage reported here.



  • @CoyneTheDup said:

    As it is, the user of the protocol is left to say to himself

    Given that transport mode is an advanced option for special cases, the user should know. Or not meddle with things they don't understand.

    It would be reasonable to provide an option, because in some cases doing the checksum in the stack is the more efficient option, though in most it is not. But it's still fine this way.



  • @Bulb said:

    Given that transport mode is an advanced option for special cases, the user should know. Or not meddle with things they don't understand.

    Be reasonable: How would the user know that? They're running on, say, Microsoft Windows: How about you go check the source right quick and see if that O/S happens to have chosen to bypass the check?


  • Discourse touched me in a no-no place

    @CoyneTheDup said:

    Be reasonable: How would the user know that?

    The user in this case is best considered to be the ZooKeeper application (or perhaps its developers).



  • @dkf said:

    The user in this case is best considered to be the ZooKeeper application (or perhaps its developers).

    Yes. Application developers. Not O/S developers, where this decision was made.

    The point is that it is not reasonable for a standard to assume the user will always know or know how to check. You don't allow O/S designers to bypass safety items in a standard simply because some people don't want to waste a bit of CPU.

    "...a client MAY bypass the check but, if it proposes to do so, MUST provide a control mechanism and MUST do so only if the application requests..."

    There's also ambiguity here in the term "client". It could be taken to mean O/S or Application.


  • Java Dev

    Indeed. I think the alternatives in RFC writing are saying the client SHOULD or MUST forgo the validation, but that would imply there are significant arguments, from an RFC-writing standpoint, for not doing the validation.



  • @CoyneTheDup said:

    Apache ZooKeeper

    I want to hire whoever is in charge of naming things in Apache.

    @CoyneTheDup said:

    "... the client MAY forgo the validation of the TCP/UDP checksum." MAY...not SHOULD, not MUST, not even IT IS A GOOD IDEA. (It is not.) But the authors of the Linux kernel, in their wisdom, interpreted "MAY" to mean "ABSOLUTELY NEVER EVER SHALL WE CHECK IT!" Nosireee, we'll just let that corruption flow right on into the application! So, anyway, should we call this an RFC bug? Or a Linux bug?

    The RFC is very clear: you don't have to check it. I suppose the best choice would be to let the application override this.

    But, just from this fragment, I'd interpret it as "you have to set the checksum to its correct value, but only because legacy devices might drop the packet if you don't, don't actually bother checking it".



  • @CoyneTheDup said:

    ...it gets an out of memory exception...which was not handled. Let me say that again: Not caught...

    You cannot catch OOM exceptions in a multi-threaded Java application and expect to go on doing useful work, because you do not know what other threads may have died due to the same OOM condition. That is, right before the OOM was thrown to the leader, all other threads may have tried to allocate a reasonably sized object from the heap and also got OOM exceptions. There can also be issues with lower-level resources not allocated or cleaned up properly in other threads due to the OOM.

    There may be some small cases where OOM can be handled gracefully, but in the general case the only thing to do is to exit the program as gracefully as possible at that point.

    So if anything, this is more of a Java :wtf: than ZooKeeper.


  • Discourse touched me in a no-no place

    @quijibo said:

    You cannot catch OOM exceptions in a multi-threaded Java application and expect to go on doing useful work, because you do not know what other threads may have died due to the same OOM condition.

    That applies to most multi-threaded systems and even on single-threaded systems too (because it could be another process that's the memory hog). Recovering from being out of memory is hard except in one specific case: where you're seeking to allocate a single large chunk and that's what fails. That happens to also be the easiest case to test.
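
    That one recoverable case can be sketched as below, assuming the failure really does come from the single oversized request and not from a heap that is already exhausted. `LargeAllocation` and `tryAllocate` are hypothetical names for illustration:

```java
// Hypothetical helper for illustration only.
final class LargeAllocation {
    // Returns a buffer of the requested size, or null if the JVM cannot satisfy the request.
    static byte[] tryAllocate(int size) {
        try {
            return new byte[size];
        } catch (OutOfMemoryError e) {
            // Nothing was allocated, so the heap is in the same state as before the attempt.
            // This says nothing about failures of small allocations elsewhere in the process.
            return null;
        }
    }
}
```

    The caller still has to decide what to do with a null result, for example dropping the connection that asked for the buffer.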



  • @quijibo said:

    You cannot catch OOM exceptions in a multi-threaded Java application and expect to go on doing useful work, because you do not know what other threads may have died due to the same OOM condition. That is, right before the OOM was thrown to the leader, all other threads may have tried to allocate a reasonably sized object from the heap and also got OOM exceptions. There can also be issues with lower-level resources not allocated or cleaned up properly in other threads due to the OOM.

    There may be some small cases where OOM can be handled gracefully, but in the general case the only thing to do is to exit the program as gracefully as possible at that point.

    So if anything, this is more of a Java :wtf: than ZooKeeper.

    According to the architecture overview, each instance is on a separate server. So this wouldn't apply since different virtual engines would be involved.

    Now, as to the sanctity (?) of communicating with other servers after an out of memory exception, well... 😕

    @dkf said:

    That applies to most multi-threaded systems and even on single-threaded systems too (because it could be another process that's the memory hog). Recovering from being out of memory is hard except in one specific case: where you're seeking to allocate a single large chunk and that's what fails. That happens to also be the easiest case to test.

    Then I read this. ☀


  • Discourse touched me in a no-no place

    @CoyneTheDup said:

    Then I read this.

    My point is that you could chisel the program out of raw machine code (likely an insane amount of work) and it would still have the problem that small-chunk memory allocation failure is hard to recover from. It's hard precisely because you can't allocate memory from then on for an uncertain amount of time.

    Fortunately, it usually doesn't happen in practice except where something has a bad memory leak.

