:fa_database: [Old Forum is still alive] Important Data



  • Skywolf said:

    Say you have crucial business data that absolutely has to be safe.

    So you do the right thing, and invest in a RAID6 system with double parity.

    Which works great. So great that you don't even notice when the first hard drive fails.

    And not even when the second hard drive fails.

    Unfortunately you do notice when the third hard disk fails.

    flabdablet said:

    BACK THE FUCKER UP EVERY DAY.

    Repeat after me: Even if you use RAID, you still need backup.

    Sysadmins who behave as if they don't understand this are TRWTF.

    Skywolf said:
    If they didn't notice that 2 HDDs had failed in their RAID, would you be surprised if I told you that they also didn't notice that their backup had fallen over 2 months ago? It's called WTF for a reason ;)


  • I wouldn't be surprised if some of the less frequent posters don't even know the new forums exist.

    @apapadimoulis can you add some notice to the old forums' template leading people here?



  • Maybe it’s time to dig up the migration topic again...



  • I've found there are two kinds of sysadmins: ones who find the error light on the front of a piece of equipment to be physically painful, and ones who don't even know their equipment has error lights.


  • Discourse touched me in a no-no place

    @MiffTheFox said:

    I've found there are two kinds of sysadmins: ones who find the error light on the front of a piece of equipment to be physically painful, and ones who don't even know their equipment has error lights.

    As a software developer, I find even warnings to be close to physically painful…



  • @dkf said:

    As a software developer, I find even warnings to be close to physically painful…

    I don't think that's a general trait of a software developer based on the compiler warnings all of my inherited projects spit out.



  • And then you watch someone else compile their project.... "853 warnings, 0 errors".



  • @cartman82 said:

    I wouldn't be surprised if some of the less frequent posters don't even know the new forums exist.

    @apapadimoulis can you add some notice to the old forums' template leading people here?

    At least some of the few people still using the old forums do know the new ones exist, and choose to continue using the old ones because Discourse. When the old forums eventually cease to exist, they will cease to be forum users, because they consider no forum to be better than a Discourse forum.



  • Redundancy means that you will only notice it when n+1 components have failed... (Unless, of course, you have a working monitoring system and the monitoring emails are not being sent to an ex-colleague who left the company 2 years ago...)
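
    As a rough sketch of the kind of check that catches the first failure rather than the third: assuming Linux software RAID (the story never says what kind of RAID it was), the kernel reports array state in /proc/mdstat, where an underscore in the [UU_] field marks a missing member. The sample text and the function name below are made up for illustration.

```python
import re

# Hypothetical sample in the style of /proc/mdstat; not taken from the story.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sde1[4] sdd1[3] sdc1[2] sdb1[1](F) sda1[0](F)
      5860147200 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/3] [__UUU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: missing_member_count} for arrays with missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries e.g. "[5/3] [__UUU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            missing = status.group(3).count("_")
            if missing:
                degraded[current] = missing
            current = None
    return degraded


if __name__ == "__main__":
    for name, missing in degraded_arrays(SAMPLE_MDSTAT).items():
        print(f"{name}: {missing} member(s) missing")
```

    In practice you would read the real /proc/mdstat on a schedule and send the result to somebody who still works there.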



  • Didn't we have a contest on that a few weeks ago?



  • @dkf said:

    As a software developer, I find even warnings to be close to physically painful…

    As a Go programmer, all warnings are errors. There is no such thing as a warning.

    Example: http://play.golang.org/p/kAGONTxTQx



  • You are a Go programmer?



  • @ben_lubar said:

    As a Go programmer, all warnings are errors. There is no such thing as a warning.

    As someone who builds my stuff on kind of absurdly high warning levels when possible (and who has submitted a patch or two to Boost in an attempt to make this more possible), I think this is stupid.

    The really nice thing about warnings is that there is lots of room for quality-of-implementation distinctions. Compiler writers can improve warnings -- adding new ones, making old ones more selective to provide fewer false positives -- from version to version. A "no warnings" policy means that either there's no real sense of "this is a valid program" (because whether a program is valid or not varies by compiler and by version) or there's no room for improving the warning analyses or adapting to potential problems as people figure out more potential misuses.



  • @flabdablet is one of those, I think. Ah, actually not.

    (new discoursistency discovered typing this post)


  • Discourse touched me in a no-no place

    @ben_lubar said:

    As a Go programmer, all warnings are errors. There is no such thing as a warning.

    The only warnings I tolerate are for deprecated APIs, and then only when there's nothing to migrate to.

    There's this one API on OSX in the dynamic library loader like this where the thing that Apple suggest you move to is so thoroughly neutered by comparison that it is impossible to reproduce the functionality on top of it without resorting to awful hacks. Other platforms just start with those dreadful hacks though, so it would be technically possible to be less precious…



  • @tufty said:

    @flabdablet is one of those, I think.

    Before the what.thedailywtf debacle, I used to think people making disparaging remarks about Jeff Atwood were merely trying to be fashionable.


  • Winner of the 2016 Presidential Election

    Yeah, it's cool to hate ditzy celebs, so I'm told.



  • @ben_lubar said:

    As a Go programmer,

    This is the first warning. Which is also an error.


  • Impossible Mission Players - A

    @HardwareGeek said:

    At least some of the few people still using the old forums do know the new ones exist, and choose to continue using the old ones because Discourse. When the old forums eventually cease to exist, they will cease to be forum users, because they consider no forum to be better than a Discourse forum.

    Anyone want to test this theory?



  • I won't laugh on this.

    When CSZone of KKCity was about to shut down, the admins decided they'd move to Google Groups, and they even put in the effort to move all the postings there.

    And for me, I decided to move on to another BBS instead.


  • Impossible Mission Players - A

    @cheong said:

    I won't laugh on this.

    Wasn't trying to be funny, I'm genuinely curious.

    I'm even more curious if moving off discourse might entice those that left back on. As much fun as the active users are, it's always good to have flowing blood...



  • Funny you people would dig up this thread as my HDD is starting to fail.

    In all these years dealing with computers, I've never had an internal HDD fail on me. It was bound to happen one day, but the first time is always hard. :cry:

    I use a combination of options 1 and 2 (from @flabdablet's link in the OP) in case you're wondering. Data I don't care about I'll copy over (maybe) to the new disk once I acquire it; data I care somewhat about I've copied to external storage already; data I really care about is kept in sync with more than one independent external store.

    What I take away from this experience is that I'll prefer to buy smaller drives (less than 1 TB) from now on. Drives this big "hurt" when they fail, not to mention they take forever to back up and to check for errors.


  • :belt_onion:

    Have a like out of sympathy.



  • @Zecc said:

    my HDD is starting to fail

    How many re-allocated sectors are showing in its SMART log?



  • @flabdablet said:

    How many re-allocated sectors are showing in its SMART log?

    According to gnome-disks, zero. *shrug* Perhaps its value was reset last time I ran a chkdsk? The value in "worst" is non-zero.

    But draw your own conclusions:

    That read error rate does not look good.

    Here's what smartctl has to say:

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       14111
      3 Spin_Up_Time            0x0027   171   170   021    Pre-fail  Always       -       2425
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       913
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5299
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       879
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       60
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       852
    194 Temperature_Celsius     0x0022   117   104   000    Old_age   Always       -       26
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       84
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       87
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       98
    

    In any case, I've seen the OS struggle to talk to the disk several times already. Win7 in particular seems to hang completely for minutes before it gives up, when it happens.


  • Discourse touched me in a no-no place

    Does anyone actually know what the various numbers in SMART data even mean? My threshold is 140, but the value and worst are 200. Is that good, bad, meaningless?



  • @Zecc said:

    HDD is starting to fail

    Does the manufacturer's name rhyme with "pea gate"?


  • Discourse touched me in a no-no place

    @FrostCat said:

    Does anyone actually know what the various numbers in SMART data even mean?

    @FrostCat said:

    My threshold is 140, but the value and worst are 200. Is that good, bad, meaningless?

    Once it drops to 140, that's bad it would seem.



  • No, it rhymes with Destern Pigital.

    Windows tells me the drive is healthy (so does Linux until I tell it to run a short test actually). But surely the SMART data can't be wrong, can it?



  • I've never had SMART tell me a drive was unsatisfactory until AFTER it was OBVIOUSLY busted. (Endless series of new bad blocks, making nasty scraping sounds, etc.) Except perhaps in a server room, SMART seems pretty useless to me.



  • Okay. So you've got no sectors reallocated, but you have 87 uncorrectable and 84 pending reallocation. That's usually enough to make an OS run dog-slow. Sectors pending reallocation don't actually get reallocated until the OS happens to rewrite them.

    The read error rate is fine, don't worry about it; the raw number there is fairly meaningless, and the cooked number (199) is well above the failure threshold (51).
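
    A minimal sketch of that reading of the table, assuming the column layout shown in the smartctl output above (ID, name, flag, VALUE, WORST, THRESH, type, updated, when-failed, raw). The names SmartAttr, parse_attr_line and has_failed are made up for illustration; the comparison follows the smartctl convention that an attribute counts as failed once its normalized value is at or below the threshold.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized ("cooked") current value
    worst: int   # lowest normalized value seen so far
    thresh: int  # failure threshold for the normalized value
    raw: str     # vendor-specific raw value, kept as text


def parse_attr_line(line):
    """Parse one attribute row of `smartctl -A` style output."""
    parts = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        raw=" ".join(parts[9:]),
    )


def has_failed(attr):
    """Higher normalized values are better; at or below the threshold is a failure."""
    return attr.value <= attr.thresh


if __name__ == "__main__":
    row = "  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       14111"
    attr = parse_attr_line(row)
    print(attr.name, attr.value, attr.thresh, has_failed(attr))
```

    For the read error rate row quoted above, the cooked value is 199 against a threshold of 51, so the check reports no failure whatever the raw number says.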

    If that were my hard drive, I'd be doing the following:

    1. Shut the computer down immediately.
    2. Buy another drive the same size or larger; has to have at least as many total sectors.
    3. Hook the second drive up to a spare power cable and SATA port. If none readily accessible, temporarily steal them from the optical drive.
    4. Boot a live environment (Knoppix is good) via USB and get a root terminal.
    5. Clone the old drive to the new one using ddrescue (it's included in Knoppix and most other recovery-oriented live environments; read the manual thoroughly if you've not used it before).
    6. Run partx -a against the destination disk to tell the kernel about its now-cloned partition table.
    7. Run e2fsck -p -f against each partition on the destination disk to fix any gross filesystem errors caused by holes corresponding to unreadable sectors on the original.
    8. Shut the machine down and swap the new drive into the old one's bay.
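
    Steps 5 to 7 above could be sketched roughly as follows. This only builds and prints the command lines; it runs nothing. The device names /dev/sdX and /dev/sdY, the partition numbers and the mapfile name are placeholders, and clone_plan is a made-up helper. The -f flag is there because ddrescue refuses to overwrite a device without it. Note that e2fsck only applies to ext2/3/4 filesystems; NTFS partitions need a different checker.

```python
import shlex


def clone_plan(source, dest, partitions, mapfile="rescue.map"):
    """Return the commands for steps 5-7 as argument lists; nothing is executed."""
    plan = [
        # Step 5: clone the failing disk; the mapfile records what could not be read.
        ["ddrescue", "-f", source, dest, mapfile],
        # Step 6: tell the kernel about the partition table now on the destination.
        ["partx", "-a", dest],
    ]
    # Step 7: check each cloned partition (ext2/3/4 only).
    for number in partitions:
        plan.append(["e2fsck", "-p", "-f", f"{dest}{number}"])
    return plan


if __name__ == "__main__":
    # Placeholder device names: double-check which disk is which before running anything.
    for command in clone_plan("/dev/sdX", "/dev/sdY", [1, 2]):
        print(shlex.join(command))
```

    Printing the plan first gives you a chance to confirm source and destination are the right way round, since swapping them destroys the data you are trying to save.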

    Edit: if it were only a handful of unreadable/pending sectors, I wouldn't bother replacing the drive.

    Instead I'd ddrescue the drive to /dev/null, just to create a ddrescue log file that lists all the bad sectors; then map a raw device over the drive and ddrescue /dev/zero to that using the same ddrescue log file. That makes ddrescue "rescue" just the "bad" sectors from /dev/zero and write them to the flawed locations on the original drive, forcing the drive to reallocate them.
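
    To see which sectors such a log file actually lists, here is a rough sketch that pulls the bad areas out of a GNU ddrescue mapfile. It assumes the usual mapfile layout: comment lines starting with #, one status line, then rows of position, size and status in hexadecimal bytes, with "-" marking a bad area. The sample contents and the name bad_ranges are invented for illustration.

```python
# Hypothetical mapfile contents; positions and sizes are hexadecimal byte counts.
SAMPLE_MAPFILE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00120000     +               1
#      pos        size  status
0x00000000  0x00117000  +
0x00117000  0x00000200  -
0x00117200  0x00008E00  +
0x00120000  0x00000400  -
"""


def bad_ranges(mapfile_text):
    """Return (offset, length) byte ranges that the mapfile marks as bad ('-')."""
    rows = [
        line.split()
        for line in mapfile_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    ranges = []
    # The first non-comment row is the overall status line, so skip it.
    for pos, size, status in rows[1:]:
        if status == "-":
            ranges.append((int(pos, 16), int(size, 16)))
    return ranges


if __name__ == "__main__":
    for offset, length in bad_ranges(SAMPLE_MAPFILE):
        print(f"bad area at byte {offset}, {length} bytes long")
```

    A short list here is the "handful of sectors" case; a list that keeps growing between runs is the other one.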

    I've seen drives run error-free for years after doing that. But you already have several tens of bad sectors, and they've gone bad quickly enough that the OS hasn't wanted to rewrite even one of them, and in my experience that's usually a sign that the drive hasn't long to live.

    SMART is very useful if you know how drives work and what the numbers mean. If you just expect it to be a magic oracle that tells you when your drive's last workable minute is coming, not so much.



  • Thanks for the help (really), but that sounds like a lot of trouble for a disk whose contents, at least the ones I care about, are already backed up.

    This isn't the first time the drive has shown signs of failing either, so the read error rate hasn't escalated that quickly. A little over two weeks ago it had lots of I/O errors, but after a couple of disk checks it's been holding on all right. Today it's done it again, once, but took it like a champ and seems to be over it.

    All trust is lost, of course, and I expect it to die at any moment. Or to otherwise keep shrugging it off with a cough now and then. Who knows? Time will tell.



  • Disk checks are actually close to the worst thing you can do when a drive starts showing errors, because fsck touches so many important sectors - unless you're talking about the long slow kind of disk check where the OS combs the whole disk for bad sectors and marks them as off limits for the filesystem. But those are about as slow as a disk cloning operation anyway.

    Edit: disk cloning, when there are still not thousands of bad sectors, is also generally quicker than restoring a pile of stuff from more formal backups.



  • Both Firefox and Chrome were crashing on launch, Visual Studio was suspiciously slow...
    I changed HDDs before it was too late. Anything but losing my Terraria map! ;)

    FWP: I can't fit the old disk and the new disk inside my computer case at the same time because of the graphics card. Not that it'd matter, because I'm missing a SATA cable anyway. Plus, the old disk is quickly becoming nothing but a paperweight.

    Also FWP: I miss the reassuring sounds of a mechanical drive. :fa_database: :fa_cogs:



  • @blakeyrat said:

    I've never had SMART tell me a drive was unsatisfactory until AFTER it was OBVIOUSLY busted.

    I have.

    I kept using it and it didn't last much longer before Bad Things started happening. System corruption and the like. Most of my data was still okay though.

    Luckily I had a spare hard drive to swap in.


  • Impossible Mission Players - A

    @Zecc said:

    Also FWP: I miss the reassuring sounds of a mechanical drive. :fa_database: :fa_cogs:

    Certainly. Unless it's the repeating sounds of a Failed Sector Read. Then it's not so much reassuring but anxiety-inducing....



  • @flabdablet said:

    1. Shut the computer down immediately.

    Good for a computer that you just started up, but no-no for a disk that has been spinning continuously for some time (months, years . . .). Such disks are very likely not to start up again once they've spun down. A panic-prodded little tap on the side just as it is trying to start spinning can maybe save the day (or rather the hour necessary to do an ultimate save).


  • Winner of the 2016 Presidential Election

    I r dissapoint that this thread is not subtitled "a triumph of science" but I don't want to be one of the 108 inconsiderate assholes who indiscriminately renames topics.



  • @flabdablet said:

    The read error rate is fine, don't worry about it; the raw number there is fairly meaningless, and the cooked number (199) is well above the failure threshold (51).

    What does the cooked error rate represent, then? Intuitively I'd expect that a number associated with an error rate is good when it's below a threshold.


  • area_deu

    @Arantor said:

    I r dissapoint that this thread is not subtitled "a triumph of science" but I don't want to be one of the 108 inconsiderate assholes who indiscriminately renames topics.

    You could ask the OP if he is ok with you renaming the topic.



  • All of the SMART cooked values are (supposedly) good as long as they're above the associated threshold. This is consistent but in many cases unintuitive. It's exactly the kind of design I've come to expect from hardware people.

    Personally I think the cookery associated with the numbers of unrecoverable sectors, reallocated sectors and sectors pending reallocation is far too lenient. For those, and for temperature readings, I pay attention to the raw numbers.
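
    A small sketch of that stricter reading: ignore the cooked values for the sector counters and flag any non-zero raw count. The attribute IDs match the smartctl output quoted earlier in the thread; the helper name raw_warnings and the 50 °C temperature limit are assumptions for illustration, not anything SMART itself defines.

```python
# Attribute IDs as they appear in the smartctl output quoted earlier in the thread.
WATCHED_RAW = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}
TEMPERATURE_ID = 194


def raw_warnings(raw_by_id, max_temp_c=50):
    """Return warnings based on raw values only.

    raw_by_id maps attribute ID to its raw value as an int.
    max_temp_c is an arbitrary illustrative limit, not a vendor figure.
    """
    warnings = []
    for attr_id, name in WATCHED_RAW.items():
        count = raw_by_id.get(attr_id, 0)
        if count > 0:
            warnings.append(f"{name}: raw count {count}")
    temp = raw_by_id.get(TEMPERATURE_ID)
    if temp is not None and temp > max_temp_c:
        warnings.append(f"Temperature_Celsius: {temp} C")
    return warnings


if __name__ == "__main__":
    # Raw values from the table posted earlier in the thread.
    for warning in raw_warnings({5: 0, 197: 84, 198: 87, 194: 26}):
        print(warning)
```

    With the numbers from the table above, this flags the 84 pending and 87 uncorrectable sectors even though every cooked value still sits comfortably above its threshold.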



  • @Lawrence said:

    no-no for a disk that has been spinning continuously for some time (months, years . . .). Such disks are very likely not to start up again once they've spun down.

    A good point well made.

    Luckily many SATA ports, even those that are not designed for hot swap, will recognize the initial connection of a new drive. Also, most BIOSes won't spin drives down during a restart. For a failing drive in a long-running server, I'd start the recovery process above from step 2. If your server still doesn't recognize the new blank drive, then it's USB enclosure time and a rather slower cloning step.


  • Winner of the 2016 Presidential Election

    Eh, it was implied :stuck_out_tongue: I was more making a reference that of the 108 inconsiderate assholes I try not to be one of them...

    I don't actually mind either way, I just thought 'still alive' reminded me of that song and that I was sad we hadn't had a reference. Now I've made the multi-year old joke, it's all good.



  • @Arantor said:

    I r dissapoint that this thread is not subtitled "a triumph of science"

    Go ahead. The meaning is lost on me though. I'm not following.


  • Winner of the 2016 Presidential Election



  • Ok. I was focussing on the "important data" part of the title, that's why I didn't make the connection.



  • @Lawrence said:

    A panic-prodded little tap on the side just as it is trying to start spinning can maybe save the day

    It's always fun when percussive maintenance actually works.

    I had an external USB hard drive that apparently came with a crappy USB cable because it didn't deliver quite enough juice to spin up the drive when you connected it. Holding it and giving it a little wrist flick while you plugged it in usually gave it just enough help to get it going. (I think I later confirmed that a different USB cable eliminated the problem entirely.)


  • Impossible Mission Players - A

    @anotherusername said:

    crappy USB cable

    :wave: This happened to me as well. I was halfway through writing a pretty upset review when I thought about using a different cable. :headdesk:

