Brittle tests



  • In our project, we have a bunch of “brittle tests”, meaning they sometimes fail randomly without the code actually being broken. This is usually because they test the timing of various functions, or occasionally that something else isn't taking too long. They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    And I would like to know:

    1. whether having such “brittle” tests is a :wtf:, and how big, and
    2. how people normally test that asynchronous operations do what they are supposed to.
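
    For illustration, one of them looks roughly like this TypeScript sketch (a stand-in for our actual C++-style code; the sleep wrapper is simplified, and the test runs under mocha):

    ```typescript
    import { strict as assert } from 'assert';

    // Promise-based stand-in for Thread::sleep(100)
    const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

    it('sleep(100) takes no more than 130 ms', async () => {
      const start = Date.now();
      await sleep(100);
      // Brittle: fails whenever the machine happens to be busy
      assert.ok(Date.now() - start <= 130);
    });
    ```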

  • Impossible Mission - B

    @Bulb A test that doesn't reliably tell you anything about the state of your project is useless by definition, because reliably telling you about that state is the entire point of having tests.


  • Java Dev

    @Bulb Intermittent failures significantly reduce a test's value, as they necessitate rerunning the test on every failure. The possibility of the test returning success when it should fail reduces its value to zero immediately.


  • Winner of the 2016 Presidential Election

    @Bulb said in Brittle tests:

    In our project, we have a bunch of “brittle tests”, meaning they sometimes fail randomly without the code actually being broken. This is usually because they test the timing of various functions, or occasionally that something else isn't taking too long. They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    And I would like to know:

    1. whether having such “brittle” tests is a :wtf:, and how big, and
    2. how people normally test that asynchronous operations do what they are supposed to.

    Those seem more like profiling than tests, and should probably be separated out and treated as such.


  • Fake News

    @Bulb said in Brittle tests:

    They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    It sounds like you're testing the OS or the language library. Such tests should be nuked, because they depend heavily on what else is running when you run them; AFAIK, the OS is not strictly required to resume your thread within that time span.

    If instead you want to test that some event happens when a period of time is exceeded, abstract the wall-clock time source into a class that you can substitute in tests with one reporting either a time shortly after the start of the test or a time in the future.
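
    A minimal TypeScript sketch of that idea (Clock, SystemClock, and FakeClock are made-up names, not from any particular library):

    ```typescript
    interface Clock {
      now(): number; // milliseconds
    }

    class SystemClock implements Clock {
      now(): number { return Date.now(); }
    }

    class FakeClock implements Clock {
      constructor(private current = 0) {}
      now(): number { return this.current; }
      advance(ms: number): void { this.current += ms; }
    }

    // Code under test asks the injected Clock for the time instead of
    // reading the wall clock directly.
    function hasTimedOut(clock: Clock, startedAt: number, timeoutMs: number): boolean {
      return clock.now() - startedAt >= timeoutMs;
    }

    // The test jumps straight past the timeout instead of actually waiting:
    const clock = new FakeClock();
    const startedAt = clock.now();
    clock.advance(5000);
    console.assert(hasTimedOut(clock, startedAt, 3000), 'should have timed out');
    ```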


  • FoxDev

    @Bulb said in Brittle tests:

    whether having such “brittle” tests is a :wtf:, and how big

    It's a :wtf: the size of Vermont.

    @Bulb said in Brittle tests:

    how people normally test that asynchronous operations do what they are supposed to.

    I can only speak of NodeJS, as that's where all my TDD experience is, but there's a module for that called sinon (spies, stubs, and mocks) which, when used with mocha (a test runner) and chai (an assertion library), allows you to write tests using fake timers, giving you precise control of timing in a way that completely disconnects it from the underlying runtime, OS, and hardware.
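
    For instance, a test with fake timers might look like this sketch (the delayed callback is invented for illustration):

    ```typescript
    import { expect } from 'chai';
    import sinon from 'sinon';

    describe('delayed callback', () => {
      it('fires after 100 ms of fake time', () => {
        const clock = sinon.useFakeTimers(); // replaces setTimeout & friends
        const callback = sinon.spy();

        setTimeout(callback, 100);

        clock.tick(99);
        expect(callback.called).to.equal(false); // not yet
        clock.tick(1);
        expect(callback.calledOnce).to.equal(true); // on time, every time

        clock.restore();
      });
    });
    ```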



  • I had a chat recently with one of our testing researchers about this very thing. He called it "Test flakiness", and it is apparently becoming quite a problem...



  • Discourse touched me in a no-no place

    @Bulb said in Brittle tests:

    They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    Unless you're able to get some realtime guarantees (even soft realtime would do), those tests are a massive pile of WTF. You need to stop thinking about checking that you have such guarantees, and start thinking about ensuring that the probability of the check not holding stays within an acceptable level; and that will still be a whole-system property, because timing is always a whole-system property.
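
    To make that concrete with made-up numbers: if a single timing check spuriously fails with probability p per run, a suite containing n such checks fails spuriously with probability 1 - (1 - p)^n:

    ```typescript
    // Even a rare spurious failure adds up across a suite.
    const p = 0.001; // per-check spurious failure rate (assumed)
    const n = 200;   // timing checks per suite run (assumed)
    console.log((1 - (1 - p) ** n).toFixed(3)); // ~0.181, about 1 in 5 runs
    ```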

    Whole-system properties cannot be unit tested. Even integration testing them can be hard (unless you're crazy enough to run your own OS on bare metal you've designed yourself; that's what the project I'm on at the moment has done, but we're definitely mad). Note that timing is not the only such property; deadlock freedom is another (as is its inverse: total liveness). They're the sorts of things you throw at a software engineer or computer engineer when you think they're getting a bit too smug.


  • Winner of the 2016 Presidential Election

    @RaceProUK said in Brittle tests:

    giving you precise control of timing in a way that completely disconnects it from the underlying runtime, OS, and hardware.

    And presumably reality.


  • FoxDev

    @Dreikin said in Brittle tests:

    @RaceProUK said in Brittle tests:

    giving you precise control of timing in a way that completely disconnects it from the underlying runtime, OS, and hardware.

    And presumably reality.

    You've never used sinon, have you? :P


  • Winner of the 2016 Presidential Election

    @RaceProUK said in Brittle tests:

    @Dreikin said in Brittle tests:

    @RaceProUK said in Brittle tests:

    giving you precise control of timing in a way that completely disconnects it from the underlying runtime, OS, and hardware.

    And presumably reality.

    You've never used sinon, have you? :P

    Nope, but I'm pretty sure I've been in its proximity.



  • @RaceProUK - not all tests are unit tests; if you want to test that an API that abstracts a system API behaves as expected on multiple types of systems, unit tests will not be of any help.


    @Bulb said in Brittle tests:

    Thread::sleep(100)¹ sleeps for no more than 130 ms

    Very few systems can guarantee such things, so that's not a very good test.
    To put it another way:
    If you expect your API to guarantee such things, then the fact that the test sometimes fails means your API fails at its purpose.
    If you don't expect it to guarantee such things, why test what your API doesn't guarantee?

    There are a few alternatives, though:

    1. You can make a more basic test that checks that your Thread::sleep function does, in fact, terminate. There's a good chance that this is the only guarantee your API provides.

    2. If your API does have other guarantees, e.g. that the sleep function won't return before the time has passed, you can make a test that checks this (see the sketch after this list). If this test fails on some platform even once, that platform has no such guarantee and you must fix up your API there.
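
    A minimal sketch of test (2) in TypeScript, assuming a promise-based sleep(ms) wrapper around the API under test (the names are placeholders):

    ```typescript
    import { strict as assert } from 'assert';

    const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

    async function testSleepNeverReturnsEarly(): Promise<void> {
      const requestedMs = 100;
      const start = process.hrtime.bigint(); // monotonic clock
      await sleep(requestedMs);
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      // Unlike an upper bound, a lower bound like this is a guarantee
      // many sleep APIs actually make.
      assert.ok(elapsedMs >= requestedMs, `woke after only ${elapsedMs} ms`);
    }

    testSleepNeverReturnsEarly();
    ```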

    However, the above won't help you test that your Thread::sleep(100ms) function "generally" terminates in less than 130 ms, and that may still be worth doing (well, I'm assuming it is, anyway; I wouldn't bother doing it myself).

    To do so sanely, you should:

    1. Move the test to a separate performance test suite, which shouldn't run on just any build or on just any machine (only on powerful and preferably dedicated machines). Naturally, you'll need processes for when and how to run it (preferably automated ones).

    2. Make the test check its condition N times, and fail only if it succeeded in fewer than K of them (sketched at the end of this post). That's how you check that your API "generally" succeeds, as opposed to always succeeds.

    3. If your test fails (when run per your process) even once, stop and analyze why it failed and what you should fix (your code, your test, or your process) to prevent it from happening again.

    (3) is particularly important: if you let your test fail randomly, it won't be long before you're ignoring your entire performance test suite (and, if you didn't do (1), your entire test suite), so be careful.

    This is by no means simple, but don't let that discourage you if you really need or really want to do it.
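
    A minimal sketch of point (2) in TypeScript (sleep() and the 100/130 ms numbers stand in for whatever is actually being measured):

    ```typescript
    const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

    // One attempt: did sleep(ms) finish within boundMs?
    async function sleptWithinBound(ms: number, boundMs: number): Promise<boolean> {
      const start = Date.now();
      await sleep(ms);
      return Date.now() - start <= boundMs;
    }

    // Pass if the bound held in at least k of n attempts.
    async function generallyHolds(n: number, k: number): Promise<boolean> {
      let successes = 0;
      for (let i = 0; i < n; i++) {
        if (await sleptWithinBound(100, 130)) successes++;
      }
      return successes >= k;
    }

    generallyHolds(50, 45).then(ok => {
      if (!ok) throw new Error('sleep(100) exceeded 130 ms too often');
    });
    ```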



  • @Bulb said in Brittle tests:

    In our project, we have a bunch of “brittle tests”, meaning they sometimes fail randomly without the code actually being broken. This is usually because they test the timing of various functions, or occasionally that something else isn't taking too long. They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    And I would like to know:

    1. how people normally test that asynchronous operations do what they are supposed to.

    The test in question has issues because of how OS process schedulers work. The number you pass to sleep is the minimum amount of time the process will sleep. The OS doesn't have to reschedule the process exactly 100 ms later, particularly if it has either:

    1. Higher priority processes running.
    2. A lot of processes waiting to be run.

    This is basic Operating Systems stuff. Do they not teach this sort of thing in CS any more?



  • @powerlord said in Brittle tests:

    @Bulb said in Brittle tests:

    In our project, we have a bunch of “brittle tests”, meaning they sometimes fail randomly without the code actually being broken. This is usually because they test the timing of various functions, or occasionally that something else isn't taking too long. They test things like whether Thread::sleep(100)¹ sleeps for no more than 130 ms, or whether, after a CondVar::notify(), the other thread wakes up within 50 ms.

    And I would like to know:

    1. how people normally test that asynchronous operations do what they are supposed to.

    The test in question has issues because of how OS process schedulers work. The number you pass to sleep is the minimum amount of time the process will sleep. The OS doesn't have to reschedule the process exactly 100 ms later, particularly if it has either:

    1. Higher priority processes running.
    2. A lot of processes waiting to be run.

    This is basic Operating Systems stuff. Do they not teach this sort of thing in CS any more?

    3. The system crashes or the power goes out before your process resumes.
    4. A debugger suspends your process during the sleep.
    5. The computer's clock is updated during the sleep, and you wake up from a 100 ms sleep 3 hours earlier than you started. (There's a monotonic clock on most OSes that only moves forward; use that if your program can't handle time moving backwards.)
    6. Instead of a clock, your computer uses Real Time with Bill Maher, so your program can only run for an hour every Friday.
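
    For example, in Node.js the monotonic clock is process.hrtime.bigint(), while Date.now() follows the adjustable wall clock:

    ```typescript
    const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

    async function main() {
      // Monotonic: immune to the wall clock being set backwards mid-sleep.
      const start = process.hrtime.bigint(); // nanoseconds
      await sleep(100);
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      console.log(`slept ~${elapsedMs.toFixed(1)} ms`);
    }

    main();
    ```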

  • Discourse touched me in a no-no place

    @ben_lubar said in Brittle tests:

    Instead of a clock, your computer uses Real Time with Bill Maher, so your program can only run for an hour every Friday.

    👍


  • Notification Spam Recipient

    @powerlord said in Brittle tests:

    This is basic Operating Systems stuff. Do they not teach this sort of thing in CS any more?

    Hmm... IIRC it was only briefly touched on tbh...


  • FoxDev

    @CreatedToDislikeThis said in Brittle tests:

    @RaceProUK - not all tests are unit tests; if you want to test that an API that abstracts a system API behaves as expected on multiple types of systems, unit tests will not be of any help.

    sinon et al. can be used for other types of tests besides unit tests.


  • I survived the hour long Uno hand

    @RaceProUK You don't even really have to use Sinon if you have another method of mocking out Thread.sleep(). The key is, you want to control the scheduling, not leave that up to the OS. Then you schedule the callback immediately and verify that it runs correctly.

    sinon just makes it easier because it controls the passage of time, so you can say "100 ms passed, did the callback run?".

    Ditto for async stuff, really: you want to control the scheduling of routines within the test. So you can say, run this one, run that one, check if this other one ran.
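
    A sketch of that last idea without sinon: gate each routine on a promise the test resolves explicitly, so the test, not the OS, decides what runs when (gate() is a made-up helper):

    ```typescript
    function gate() {
      let open!: () => void;
      const opened = new Promise<void>(r => { open = r; });
      return { open, opened };
    }

    async function demo() {
      const a = gate(), b = gate();
      const ran: string[] = [];

      const taskA = a.opened.then(() => ran.push('A'));
      const taskB = b.opened.then(() => ran.push('B'));

      // "Run this one, run that one, check if this other one ran."
      b.open();
      await taskB;
      a.open();
      await taskA;

      console.assert(ran.join(',') === 'B,A', 'tasks ran in controlled order');
    }

    demo();
    ```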