The longest two minutes of my life



  • So I was trying to debug a phantom hardware issue (atapi controller error on ideport1, if you're curious - mobo already replaced, HD checks out fine, but I'm getting ahead of myself). I downloaded WD's diagnostic tools and ran a 'Quick' SMART test to see if my HD had been screaming for help all along and no one was around to hear it.

    Good thing I'd chosen the Quick Test.

    Wha?
    [http://img216.imageshack.us/img216/5266/wabliefzb5.png]

    This was not the estimated time until completion. This was the estimated time for the entire operation.



  • Maybe that's a sign your drive is going kaput. Yes? No?



  • There are two possibilities. You may have been running the test in offline mode rather than captive mode, with a filesystem mounted on that drive, so Windows kept accessing it and kept interrupting the test. The drive will quickly resume it, but constantly stopping and starting slows the whole process down to a crawl.

    Alternatively, you have an older drive that just doesn't implement the quick test option, and ran the long test instead.



  • Judging by the fact that it is running on Vista, I have seen this before. I don't really know how to describe it, but it appears at random times, and sometimes once you see it, it won't go away. What I'm referring to is file operation times. Sometimes you'd go for days with things blazing, but then there'll be a day where simple file operations take more than a few minutes, even on small files. MS is aware of the problem, but it may be a little while before we have a fix. It happens more often on file copy or delete operations.



  • @pitchingchris said:

    Judging by the fact that it is running on Vista, I have seen this before. I don't really know how to describe it, but it appears at random times, and sometimes once you see it, it won't go away. What I'm referring to is file operation times. Sometimes you'd go for days with things blazing, but then there'll be a day where simple file operations take more than a few minutes, even on small files. MS is aware of the problem, but it may be a little while before we have a fix. It happens more often on file copy or delete operations.

    I believe it was mentioned in another thread that copy and move like to use the %TEMP% directory, which will of course double the time it takes and raise the chance of failure.



  • DLGDIAG uses a fixed estimate (or a calculation that isn't re-evaluated in case of delays/problems/whatever - same difference, really) for the Quick test. Usually it'll be right, but if something is broken, the process may freeze every now and then - and that will happen multiple times. If you run the extended test, it'll re-calculate the estimate as it progresses.

    The time it's taken is unrelated to the operating system, though - I have a broken WD Passport that took ages to finish a scan on XP Pro (there's a story behind that drive, but I'm not going to bore you with it if no one wants to hear it).



  • Exactly. The diagnostic tool can't even know how long it takes. All it does is send a command to the drive to perform the quick self-test. After usually 2 minutes, the drive reports back that it is done. There is no indication of how far along the drive is. Even worse, quite a few IDE commands abort the self-test partway through (e.g. looking at the table of all SMART attributes like number of reallocated sectors, temperature, etc).

    "smartctl" for Linux makes the best out of it - it tells you that it initiated the self-test:

    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
    Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 1 minutes for test to complete.
    Test will complete after Wed Nov 14 10:17:35 2007

    Use smartctl -X to abort test.

    and then brings you back to the prompt, without waiting for completion. You can later check for results using "smartctl -l selftest":

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      5305         -

    The real WTF is that you can't see in the log if it contains the test you just ran or not, that is, whether there is a test still in progress. So you need to check the selftest log BEFORE starting a test so you can later see if it has grown by one entry.
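The before/after comparison described above can be sketched in Python. This is only an illustration: `parse_selftest_log` and `new_entry_appeared` are hypothetical helper names, and the regular expression assumes the row layout shown in the sample `smartctl -l selftest` output in this post. It works on text you have already captured, so it does not run smartctl itself.

```python
import re

# Matches one entry row of the self-test log as shown in the post, e.g.
# "# 1  Short offline       Completed without error       00%      5305         -"
# (layout assumed from that sample; other smartctl versions may differ)
ENTRY_RE = re.compile(
    r"^\s*#\s*(\d+)\s+(.*?)\s{2,}(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def parse_selftest_log(text):
    """Return the log entries found in captured 'smartctl -l selftest' output."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            entries.append({
                "num": int(m.group(1)),
                "description": m.group(2),
                "status": m.group(3),
                "remaining_percent": int(m.group(4)),
                "lifetime_hours": int(m.group(5)),
                "lba_of_first_error": m.group(6),
            })
    return entries


def new_entry_appeared(log_before, log_after):
    """True if the log captured after the test has more entries than before."""
    return len(parse_selftest_log(log_after)) > len(parse_selftest_log(log_before))
```

Capture the log once before starting the test and again afterwards, then pass both strings to `new_entry_appeared`. The count comparison assumes the log has not yet filled up: the drive keeps only a limited number of entries, so once old ones start being overwritten the count stops growing.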



  • @OperatorBastardusInfernalis said:

    Exactly. The diagnostic tool can't even know how long it takes.

    The diagnostic tool merely relays the data returned by the drive. The drive does know exactly how long the test should take, and provides this information via the SMART protocol. If the actual time does not match the expected time, something went wrong. Usually the test was interrupted.

     

    All it does is send a command to the drive to perform the quick self-test. After usually 2 minutes, the drive reports back that it is done. There is no indication of how far along the drive is.

    Most modern drives do report their progress on all tests. Older drives may not.

     

    Even worse, quite a few IDE commands abort the self-test partway through (e.g. looking at the table of all SMART attributes like number of reallocated sectors, temperature, etc).

    No, some drives abort tests when receiving a new command. A drive can implement one or more of three modes: in 'captive' mode, all inbound requests (except SMART status requests) are queued until the test completes. In 'offline' mode, a drive must either implement the "Abort Offline collection upon new command" capability, which means that the test stops when any new command is sent to the drive, or the "Suspend Offline collection upon new command" capability, which means that the test is briefly suspended while the command is executed.

    It has got nothing to do with what commands you send. It's merely a question of which drive you use. All the decent modern drives provide captive and offline modes, with the suspend capability.

     

    "smartctl" for Linux makes the best out of it - it tells you that it initiated the self-test:

    ... 

    and then brings you back to the prompt, without waiting for completion. You can later check for results using "smartctl -l selftest":

    And you can get the test duration estimate from smartctl -c. Test progress reporting is in the log.

     

    The real WTF is that you can't see in the log if it contains the test you just ran or not, that is, whether there is a test still in progress.

    This information is returned by smartctl -c:

    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
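As a rough sketch, the status line above can be pulled out of captured `smartctl -c` output and decoded. Two assumptions here are mine rather than the thread's: the function name `decode_selftest_status` is made up, and the number in parentheses is taken to be the SMART self-test execution status byte, with the status code in the high four bits (15 meaning a test is in progress) and the remaining percentage, in steps of ten, in the low four bits.

```python
import re

# Finds the numeric value in a line such as:
# "Self-test execution status:      (   0) The previous self-test routine completed"
STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def decode_selftest_status(capabilities_text):
    """Decode the self-test status value from captured 'smartctl -c' output.

    Returns None if no status line is found.
    """
    m = STATUS_RE.search(capabilities_text)
    if m is None:
        return None
    value = int(m.group(1))
    code = value >> 4                  # assumed: high nibble is the status code
    remaining = (value & 0x0F) * 10    # assumed: low nibble is tens of percent left
    return {
        "raw": value,
        "status_code": code,
        "in_progress": code == 15,     # assumed: 15 means a test is still running
        "percent_remaining": remaining,
    }
```

With the sample line quoted above, the raw value is 0, so `in_progress` comes back false. Polling this until `in_progress` goes false is one way to avoid the before/after log comparison.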




  • @Brother Laz said:

    This was not the estimated time until completion. This was the estimated time for the entire operation.

      What part of "estimated" don't you get?

    ;-)



  • Thanks for the replies. It helped me figure out what was happening: Windows was attempting to defrag because I wasn't doing anything. >.< Second time it completed in 1 minute, 59 seconds. (Which begs the question: if there is an intermittent controller error, and the mobo has been replaced, and the HD is fine, what the hell is causing it?!)

    Still you'd think the tool would at least update the estimated time so it is longer than the actual elapsed time at all times...
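A minimal sketch of what such an update could look like, not how DLGDIAG works. The function name `displayed_total_estimate` and its parameters are made up for illustration: it keeps the fixed estimate while that is still plausible, re-projects from progress when the drive reports any, and never shows a total shorter than the time already spent.

```python
def displayed_total_estimate(initial_estimate_s, elapsed_s, fraction_done=None):
    """Return a total-time estimate (seconds) that is never below elapsed time.

    initial_estimate_s: the fixed estimate the tool started with
    elapsed_s:          seconds spent so far
    fraction_done:      optional progress in (0, 1], if the drive reports it
    """
    if fraction_done:
        # Re-project the total from observed progress
        projected = elapsed_s / fraction_done
    else:
        # No progress information: fall back to the fixed estimate
        projected = initial_estimate_s
    # Never claim the whole operation takes less time than has already passed
    return max(projected, elapsed_s)
```

For example, `displayed_total_estimate(120, 600)` gives 600 rather than the stale 120, which is at least not contradicted by the clock.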



  • @Brother Laz said:

    Which begs the question:

    NO IT DOESN'T. It may "invite" the question or "lead to" the question. "Begging" the question means that the statement is WRONG, in a specific way.

     

    if there is an intermittent controller error, and the mobo has been replaced, and the HD is fine, what the hell is causing it?

     

    Driver bugs. Bad cabling. Environmental electrical noise. Inadequate power supply. Memory or CPU errors generating bad ATA commands. Sunspots.



  • @asuffield said:

    Sunspots.

    What sunspots: ftp://ftp.ngdc.noaa.gov/STP/SOLAR_DATA/SUNSPOT_NUMBERS/2007

    (fricking forum software is throwing a hissy fit at ftp links)

