This wtf is a classic





  • Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.



  • @ammoQ said:

    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.



  • @merreborn said:

    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.

    Microsoft not to blame in any way?? Now I invite you to RTFA where you might notice the part about how the servers are designed to shut down automatically after 50 days. How is Microsoft NOT to blame? Unix servers can run for months or years without needing a reboot. This is another case of "if it ain't broke, don't replace it with M$ products".

    It was the backup server that failed because of a software glitch. The primary server problem was entirely because of the OS. That's like saying the plane crashed because the wheels didn't work, never mind the fact that the engines exploded.



  • @merreborn said:

    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.

    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.

    This is definitely a Microsoft OS issue. 49.7 days is 2^32 seconds, and the crash problem is caused by an overflow of a second counter.  This is a well-documented issue with older versions of Microsoft Windows (I think it's fixed in XP and Windows Server 2003).

    The workaround was to restart the system every 30 days to reset the number. The referenced problems occurred because that restart didn't happen.



  • @Jefffurry said:

    @merreborn said:
    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.

    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.

    This is definitely a Microsoft OS issue. 49.7 days is 2^32 seconds, and the crash problem is caused by an overflow of a second counter.  This is a well-documented issue with older versions of Microsoft Windows (I think it's fixed in XP and Windows Server 2003).

    The workaround was to restart the system every 30 days to reset the number. The referenced problems occurred because that restart didn't happen.



    I had an NT4 server up for nearly a year (without a single reboot). Windows 98 had a problem like you describe though. The software (not OS) was the issue. The server was running software as well. It ran a simple FTP site and an internal website for testing.


    The shutdown is intended to keep the system from becoming overloaded
    with data and potentially giving controllers wrong information about
    flights, according to a software analyst cited by the LA Times.





  • @merreborn said:

    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.


    Even if it was not an OS issue (looks like it is one...), I would never dare to use off-the-shelf Dell hardware with a general-purpose operating system like Windows or RHEL for a critical system like this one. When it comes to tasks like air traffic control, where every failure can cost hundreds of lives, only the finest (not cheapest) hardware with the most reliable (not cheapest) operating system is good enough.



  • @Jefffurry said:


    This is definitely a Microsoft OS issue. 49.7 days is 2^32 seconds, and the crash problem is caused by an overflow of a second counter.  This is a well-documented issue with older versions of Microsoft Windows (I think it's fixed in XP and Windows Server 2003).

    The standard unix time counter uses 31 bits to count the seconds since Jan 1st, 1970,  and will overflow somewhere in 2038. I think you rather mean milliseconds.
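For anyone who wants to check the arithmetic, here is a quick sketch in Python (my choice of language, nothing from the article). It assumes the Windows counter is a 32-bit unsigned count of milliseconds and that Unix time is held in a signed 32-bit count of seconds, which is where the 31 usable bits come from.

```python
from datetime import datetime, timedelta, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# A 32-bit unsigned millisecond counter wraps after 2**32 ms.
wrap_days = 2**32 / MS_PER_DAY

# If the counter really counted seconds, it would take over a century to wrap.
wrap_years_if_seconds = 2**32 / SECONDS_PER_YEAR

# A signed 32-bit count of seconds since the Unix epoch tops out at 2**31 - 1.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_second = epoch + timedelta(seconds=2**31 - 1)

print(f"{wrap_days:.2f} days")                # 49.71 days
print(f"{wrap_years_if_seconds:.1f} years")   # 136.1 years
print(last_second.isoformat())                # 2038-01-19T03:14:07+00:00
```

So 49.7 days only works out for milliseconds, and the Unix counter runs out on 19 January 2038.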



  • I think M$ gets the initial blame for this, as their "GetTickCount" method loops every 49.7 days, but this issue is clearly stated in their documentation.  I've written several applications that use this function to calculate simple performance metrics, or give data transfer rates (I know there are better methods, but laziness usually wins), and even for these simple checks I still made sure to properly account for the time loop.  It's really not that hard.  How that ever passed a code review is beyond me (if there was one).
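A minimal sketch of what "accounting for the time loop" can look like, written in Python with plain integers standing in for the 32-bit tick values. The function names are made up for illustration and are not part of any Windows API. In C the same effect falls out of subtracting two unsigned 32-bit values; here the mask does the modulo 2^32 step explicitly.

```python
MASK32 = 0xFFFFFFFF  # tick values are treated as 32-bit unsigned milliseconds


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Elapsed milliseconds between two tick readings, correct across one wrap."""
    # Subtracting modulo 2**32 gives the right answer even if now_tick < start_tick.
    return (now_tick - start_tick) & MASK32


def naive_elapsed_ms(start_tick: int, now_tick: int) -> int:
    """The buggy version: goes hugely negative once the counter wraps."""
    return now_tick - start_tick


# Start 256 ms before the wrap, read again 100 ms after it.
start, now = 0xFFFFFF00, 0x00000064
print(elapsed_ms(start, now))        # 356
print(naive_elapsed_ms(start, now))  # -4294966940
```

This only holds while the interval being measured is shorter than one full wrap (about 49.7 days), which is fine for transfer rates and performance metrics.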



  • @ammoQ said:

    @merreborn said:
    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.


    Even if it was not an OS issue (looks like it is one...), I would never dare to use off-the-shelf Dell hardware with a general-purpose operating system like Windows or RHEL for a critical system like this one. When it comes to tasks like air traffic control, where every failure can cost hundreds of lives, only the finest (not cheapest) hardware with the most reliable (not cheapest) operating system is good enough.


    The true WTF is that they went from UNIX to Windows. What the hell were they thinking? Come to think of it, every time I've been to an airport since that changeover I've seen at least one BSOD on a monitor, and there are always other problems running rampant through the airport. Granted, the systems crashing usually weren't the air traffic controlling systems, but still.

    Even though I would LOVE to place all of the blame on Microsoft for this one, I have to place the blame on the software writers. If the software required the machines to be rebooted to reset a counter, then that's a software error. If the counter to be reset was something like the tick counter or something like that even, then it's still a software error. Windows was NOT the culprit in this case.



  • @skippy said:

    I think M$ gets the initial blame for this, as their "GetTickCount" method loops every 49.7 days, but this issue is clearly stated in their documentation.  I've written several applications that use this function to calculate simple performance metrics, or give data transfer rates (I know there are better methods, but laziness usually wins), and even for these simple checks I still made sure to properly account for the time loop.  It's really not that hard.  How that ever passed a code review is beyond me (if there was one).


    YES exactly.... except for the initial blame. It's WIDELY documented and well known. It's NOT HARD to take it into consideration when programming, and as you even state you account for it in your programming. This is a poorly written software WTF. They probably gave the job to 20 fresh college grads. Go figure the software would be crappy.



  • @ammoQ said:

    @merreborn said:
    @ammoQ said:
    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    In all fairness, if you RTFA, Microsoft is not to blame in any way.  It was a software issue, not an OS issue.


    Even if it was not an OS issue (looks like it is one...), I would never dare to use off-the-shelf Dell hardware with a general-purpose operating system like Windows or RHEL for a critical system like this one. When it comes to tasks like air traffic control, where every failure can cost hundreds of lives, only the finest (not cheapest) hardware with the most reliable (not cheapest) operating system is good enough.


    When it comes to air traffic control, controller error caused by using rock-solid, but archaic and thus harder-to-use, hardware and software is possibly a greater risk than the failure of modern hardware and software that makes the controllers' lives as easy as possible.

    Personally I'd go for a multiply redundant system, perhaps running the same software simultaneously on different OSes (eg Debian Linux, OpenBSD, Windows Server 2003), on different architectures (eg x86 and Power), and having used different compilers. Have their output closely monitored for even the slightest deviation.

    In my view, however, the two biggest WTFs are

    1) Why couldn't the needed restart be scheduled to occur automatically?
    2) Why does your backup radio system NEED any software? Radio can be done all in hardware, and somewhat fault-tolerant hardware at that.
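The "monitor the redundant outputs for any deviation" idea from the post above could be sketched roughly like this. It is only an illustration in Python: the replica names and the flight-level strings are hypothetical, and a real system would compare far richer data than one string per replica.

```python
from collections import Counter


def vote(outputs):
    """Given {replica_name: output}, return (majority_output, dissenting_replicas).

    Raises RuntimeError when no output is shared by a strict majority.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    dissenters = sorted(name for name, out in outputs.items() if out != value)
    return value, dissenters


# Hypothetical replicas on different OSes/architectures, one of them disagreeing.
readings = {
    "debian-x86": "FL350",
    "openbsd-power": "FL350",
    "win2003-x86": "FL530",
}
print(vote(readings))  # ('FL350', ['win2003-x86'])
```

Any non-empty list of dissenters would be the "slightest deviation" to alert on, while the majority value keeps the controllers supplied with data.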



  • More questions/WTFs:

    3)  Why does it take 3 hours to reboot the system when it fails?  Do they have 3 hours of downtime every month for scheduled reboots?
    4)  This problem must have been first noticed in 2001 (when they implemented the reboot routine).  Why does it take 5 years to get a patch?



  • @skippy said:

    I think M$ gets the initial blame for this, as their "GetTickCount" method loops every 49.7 days, but this issue is clearly stated in their documentation.  I've written several applications that use this function to calculate simple performance metrics, or give data transfer rates (I know there are better methods, but laziness usually wins), and even for these simple checks I still made sure to properly account for the time loop.  It's really not that hard.  How that ever passed a code review is beyond me (if there was one).

    I worked for Harris 20+ years ago for a couple of years. The one thing I NEVER EVER saw anyone do, request, demand, ask for or even joke about was a code review.



  • @Kev777 said:

    http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

     



    best two sentences from the reading

    "The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days"

    couldn't this be set up really easily to happen automatically? come on, who did they hire for this project, Assenture? I could understand manually removing a backup tape or removing the trash from the office, but reboots?


  • @m0ffx said:


    When it comes to air traffic control, controller error caused by using rock-solid, but archaic and thus harder-to-use, hardware and software is possibly a greater risk than the failure of modern hardware and software that makes the controllers' lives as easy as possible.

    This is IMO a complete misconception. A full-fledged desktop system, like Windows, KDE, Mac OS X etc., is well-suited for people who use it for many different tasks, probably simultaneously. You can resize windows, hide them, close them etc. None of these operations are necessary when the computer and the display are used for just one application. Just like a game in full-screen mode, the only controls required are those of the application. Expensive dedicated hardware used for the kind of application we are talking about has been able to provide such user interfaces for decades.



  • I agree that Microsoft isn't to blame - A programmer writing software that handles the lives of thousands of people every day should be smart enough to not use a function with such a well-documented bug.  And even if they wrote the software with the aforementioned bug, why  haven't they released a patch?  This is the fault of some stupid programmer writing software to handle people's lives, using an API function without reading the freely-available documentation or even caring about the well-known bug, and not releasing a patch in years.



  • Doesn't even the Windows EULA say you must not use it for such tasks?



  • @Albatross said:

    I agree that Microsoft isn't to blame - A programmer writing software that handles the lives of thousands of people every day should be smart enough to not use a function with such a well-documented bug.  And even if they wrote the software with the aforementioned bug, why  haven't they released a patch?  This is the fault of some stupid programmer writing software to handle people's lives, using an API function without reading the freely-available documentation or even caring about the well-known bug, and not releasing a patch in years.


    It's not a BUG, it's a FEATURE! ;-P



  • @Cotillion said:

    More questions/WTFs:
    ...
    4)  This problem must have been first noticed in 2001 (when they implemented the reboot routine).  Why does it take 5 years to get a patch?

    It was never seen before 2001 because no MS OS (95, 98, SE and NT4) could ever dream of staying up without blue screening or crashing every 48 hours, let alone 48 days.



  • @mrsticks1982 said:

    @Kev777 said:

    http://www.techworld.com/opsys/news/index.cfm?NewsID=2275

     



    best two sentences from the reading

    "The servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days"

    couldn't this be set up really easily to happen automatically? come on, who did they hire for this project, Assenture? I could understand manually removing a backup tape or removing the trash from the office, but reboots?


    I was about to say that the union official was a WTF for lying about the 49.7-days bug, i.e.

    "are timed to shut down" = "will crash"
    "in order to prevent a data overload" = "in order to cover up my computer illiteracy"

    But look what happens when you assume that the union official is as competent as can possibly be reconciled with the statement:

    "are timed to shut down after 49.7 days" = "are set up to shut down automatically right before the 49.7-days bug would hit"
    "data overload" = "32-bit overflow condition that causes the 49.7-days bug"

    So they do have an automatic reboot in place, but perhaps they'd prefer a manual reboot so that the technician can choose a good time of day based on that day's volume of data traffic, and/or keep an eye on it to make sure it comes back up again.



  • @Albatross said:

    I agree that Microsoft isn't to blame - A programmer writing software that handles the lives of thousands of people every day should be smart enough to not use a function with such a well-documented bug.  And even if they wrote the software with the aforementioned bug, why  haven't they released a patch?  This is the fault of some stupid programmer writing software to handle people's lives, using an API function without reading the freely-available documentation or even caring about the well-known bug, and not releasing a patch in years.

    I blame Microsoft entirely. Not because I have any evidence, or even a valid reason, I just like to so I do. (actually I reckon marketing may again be the culprit)

    As already posted, WhyTF do you replace a working system with cheap crap that is widely known to be unstable and insecure? Does NASA use Windows servers for critical apps where lives are at stake? How many lives are they responsible for?

    And also as noted, what's with the radio needing a f'in pute to operate? I'd like to  suggest that from now on, all aircraft carry CB radios on board. They might not be officially sanctioned, but they'll be more reliable and I'm sure if a pilot was to ask nicely and explain that he/she's trying to get an enormous bloody great chunk of metal full of people safely to the ground, other users would be good enough to clear the channel, or at least just listen in quietly.

    I really don't mean to offend, but what happened to make the US keep getting stupider and stupider? This really is a classic WTF.



  • Yeah, that's not so much of a WTF to me, since I was facing this error far too often. On some cheap systems that had to collect data via DOS applications (well, that's a WTF on its own) we were running Win98(SE). There was a 32-bit counter for milliseconds overflowing and BSODing the system after said 49.7 days. But it only happened in 99.995% of all cases. Guess how surprised I was that NT4 and 2k also had this problem. Weren't they a completely different design from a completely different OS codebase? shrug. It seems that it was worked on there, since it only happened in like 5% of the cases where the counter overflowed, maybe less on some other systems. After I left, finally someone discovered a way to totally get around that BSOD on 2k. So of course we all want to blame M$ for that thing. And really, it's a dorky design. But here the people to blame are those who did not do enough research. And I think by 2k4 the problem was fixed in the latest service packs. But from all the worms shutting down power networks we know that the pro software won't run on service packs. Really. Slap M$ in the face. hard. but blame those people deploying shitty server setups.



  • If the reboot cannot take place automatically (because the proper time to do so is hard to decide in advance), why is there no beep, visual flash, ringing, SMS to the admin, red flag, or message box to remind them to do so, and another one to tell the unskilled technician (WTF? what was he doing there?) whether he succeeded or failed to reboot? (Did they really fail to reboot??? WTF? Maybe they were just restarting some system, as in 'radio software system', not the OS... whatever.)
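Such a reminder would not take much. A rough sketch in Python, assuming the system uptime in milliseconds is available from somewhere: the 30-day restart interval and the 49.7-day limit come from the quoted article, while the function name, the status labels and the 7-day escalation threshold are invented for illustration.

```python
MS_PER_DAY = 86_400_000
WRAP_MS = 2**32              # 32-bit millisecond counter limit, about 49.7 days
RESTART_POLICY_DAYS = 30     # manual restart interval quoted in the article
CRITICAL_MARGIN_DAYS = 7     # hypothetical: escalate when this close to the wrap


def restart_status(uptime_ms: int):
    """Return (level, days_until_wrap) for a given uptime in milliseconds."""
    days_up = uptime_ms / MS_PER_DAY
    days_to_wrap = (WRAP_MS - uptime_ms) / MS_PER_DAY
    if days_up < RESTART_POLICY_DAYS:
        level = "ok"
    elif days_to_wrap > CRITICAL_MARGIN_DAYS:
        level = "overdue"    # beep, flash, SMS to the admin...
    else:
        level = "critical"   # ...and keep nagging until someone reboots
    return level, round(days_to_wrap, 1)


for days in (10, 35, 45):
    print(days, restart_status(days * MS_PER_DAY))
```

Run periodically, anything other than "ok" would trigger the beep, flash or SMS, and the status dropping back to "ok" after a reboot is the confirmation that the restart actually happened.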



  • @triso said:

    @Cotillion said:
    More questions/WTFs:
    ...
    4)  This problem must have been first noticed in 2001 (when they implemented the reboot routine).  Why does it take 5 years to get a patch?
    It was never seen before 2001 because no MS OS (95, 98, SE and NT4) could ever dream of staying up without blue screening or crashing every 48 hours, let alone 48 days.



    When I get home tonight I'm going to find the screenshot. Three hundred some days on NT4 without a crash or reboot. It was NOT the OS. It was the dumbass programmer not realizing that the 32-bit number would be reset to 0 when the maximum was reached. I hate MS just as much as the next guy (probably more), but in this case it was the programmer's fault. 95, 98(SE), ME probably were crash machines that did literally crash every 49.7 days, but NT4 was not, nor was Windows 2000. Stop bashing MS for things that aren't true.



  • @GoatCheez said:



     95, 98(SE), ME probably were crash machines that did literally crash every 49.7 days


    I never had a problem with ME when I used it.

    But then again I was only 14 or 15 and just wanted to play the sims!!



  • @cronthenoob said:

    @GoatCheez said:


     95, 98(SE), ME probably were crash machines that did literally crash every 49.7 days


    I never had a problem with ME when I used it.

    But then again I was only 14 or 15 and just wanted to play the sims!!


    I have never installed ME on any machine I have owned, nor will I ever. I had enough sense to know it would be a piece of junk. I did use 98 and 98SE, although I routinely rebooted, and my memory just doesn't serve me well enough to remember if I had a 98 or 98SE machine up for longer than a month. Probably not though. I definitely remember the NT4 uptime because of how proud I was of it. I REALLY hope I find that screenshot. It was about 10 years ago though, so if I don't, oh well. It's not like you can't find any other screenshots of longer uptimes of NT4 out there. They do exist despite what some may think. I remember when I was bragging about it that someone sent me a screenshot of an uptime greater than a year. I'll ask friends too, I'm sure I have some friends with similar uptime shots.

    It really surprises me how many "intelligent" individuals think that NT4 or Win2k couldn't stay up more than 49.7 days.



  • @cronthenoob said:

    @GoatCheez said:


     95, 98(SE), ME probably were crash machines that did literally crash every 49.7 days


    I never had a problem with ME when I used it.

    But then again I was only 14 or 15 and just wanted to play the sims!!

    I doubt anyone could keep a Windows ME machine up for long enough to try it. (It was seriously buggy, IIRC - after my first experience with it, I avoided it in favour of the more stable - and much better tested - 98SE).



  • @makomk said:

    @cronthenoob said:
    @GoatCheez said:


     95, 98(SE), ME probably were crash machines that did literally crash every 49.7 days


    I never had a problem with ME when I used it.

    But then again I was only 14 or 15 and just wanted to play the sims!!

    I doubt anyone could keep a Windows ME machine up for long enough to try it. (It was seriously buggy, IIRC - after my first experience with it, I avoided it in favour of the more stable - and much better tested - 98SE).



    Amazingly, I had a machine running ME for roughly a year and a half and did not find it to be any less stable than 98SE. Though, the sheer volume of complaints people had about ME leads me to believe that I owned a magic computer.

    sincerely,
    Richard Nixon



  • @ammoQ said:

    Should I ever visit your continent, I will go by ship. Air traffic controlled by a Windows Server... terrific.


    Ships are no safer - Microsoft is responsible for other stuff with unexpected behavior. What's left? Only other Microsoft products like this and this.



  • What about a different design where the servers would not crash/shut down at the same time?

    Or where they would have a backup server with a different uptime cycle (let's say 23 days) so that their "critical application" would still be running in case of "human error".

    The WTF here is designing such a critical system with no hardware or software failure tolerance at all ... (and getting caught on a known bug is just a bonus)
    or maybe they hired a consultant that only knew about Solaris?

