No thread about the GitLab fuckup yet?
-
And straight from the horse's mouth:
https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub
Everything that could be fucked up, was fucked up. It's a miracle they didn't physically blow up the servers while they were at it.
-
Should have used Oracle.
-
-
@Maciejasjmj said in No thread about the GitLab fuckup yet?:
And straight from the horse's mouth
Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load)
-
SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries.
Which idiot upgraded the fucking system and didn't pay attention to YOUR FUCKING DATABASE TOOLING BEING UPDATED? Fuck knows if startup scripts even work.
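A cheap guard against exactly that failure mode is to compare the client and server versions before dumping anything. A rough sketch, assuming standard PostgreSQL client tools on the PATH and a reachable server (the database name is made up):

```bash
#!/bin/bash
set -euo pipefail

# Major.minor version the server reports (e.g. "9.6")
server_version=$(psql -At -c "SHOW server_version" | cut -d. -f1,2)

# Major.minor version of the pg_dump binary on the PATH (e.g. "9.2")
client_version=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -n1)

if [ "$server_version" != "$client_version" ]; then
    echo "pg_dump $client_version does not match server $server_version, refusing to dump" >&2
    exit 1
fi

pg_dump --format=custom --file=/backups/db.dump mydb
```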
-
TL;DR: The backup system I "designed" (hacked together) for a small web development company while in college is more reliable and provides better reporting than that of gitlab.com.
"Professionals" at work.
-
@boomzilla said in No thread about the GitLab fuckup yet?:
Should have used
OracleTFS.
-
I copied this to my new boss and team with "This is the best reason I've ever seen to run regular disaster drills"
-
@Yamikuronue said in No thread about the GitLab fuckup yet?:
I copied this to my new boss and team with "This is the best reason I've ever seen to run regular disaster drills"
Indeed. I mean, how can you even set up a backup system and then not even verify once that it actually works?
Our backups to S3 apparently don’t work either: the bucket is empty
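Even a dumb listing check would have caught the empty bucket. A minimal sketch, assuming the AWS CLI is configured; the bucket name and address are made up:

```bash
#!/bin/bash
set -eu

BUCKET=s3://example-backups/db/

# Newest object under the prefix, as "date time size key"; empty if the bucket has nothing.
latest=$(aws s3 ls "$BUCKET" --recursive | sort | tail -n 1)

if [ -z "$latest" ]; then
    echo "No backups found under $BUCKET" | mail -s "S3 backup check failed" sysadmin@example.com
    exit 1
fi

# Complain if the newest backup is more than two days old.
latest_date=$(echo "$latest" | awk '{print $1}')
if [[ "$latest_date" < "$(date -d '2 days ago' +%F)" ]]; then
    echo "Newest backup is from $latest_date: $latest" | mail -s "S3 backups are stale" sysadmin@example.com
    exit 1
fi
```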
-
@Rhywden Maybe I'm being charitable, but I assume it worked when they started, and they have a process to delete old backups... and somewhere along the way it stopped working, and deleted all the old backups. Which is why I said regular drills -- just because shit worked three years ago doesn't mean it still works.
-
That Google Doc mentioned in the last tweet notes: "This incident affected the database (including issues and merge requests) but not the git repos (repositories and wikis)."
So some solace there for users because not all is lost.
WHAT? That's so, so much worse.
Git repos and actual files are strewn all over people's computers. Those can be replaced. Issues and chats and tickets are actually irreplaceable. When you pay gitlab, you are paying them to take care of those, not the actual code.
Reading that log, it's clear they were NOT ready to take on the responsibility of running this self-hosted service as opposed to letting Amazon handle it. They did the basics, but obviously haven't had enough time to test everything properly.
It's sad. I'd really like to see someone challenge github's monopoly.
The only redeeming light is that they are airing all their screwups into the open. If they survive this, it will turn them into a much more capable hosting company.
-
@Onyx said in No thread about the GitLab fuckup yet?:
Which idiot upgraded the fucking system and didn't pay attention to YOUR FUCKING DATABASE TOOLING BEING UPDATED? Fuck knows if startup scripts even work.
They are bundling all this stuff into their stupid "omnibus" package. It's a pain in the ass managing all these services individually.
-
@Yamikuronue said in No thread about the GitLab fuckup yet?:
and somewhere along the way it stopped working
Which means they ran a shell script without set -e or checking the return values of the relevant commands. Or nobody gets notified when the script itself fails. Both of which are unacceptable; no capable sysadmin would ever write scripts that don't notify anyone when they fail.
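For the return-value variant, even something this crude beats silence (backup_db.sh and the addresses are placeholders):

```bash
#!/bin/bash

# Run the actual backup and look at its exit status instead of throwing it away.
if ! /usr/local/bin/backup_db.sh > /var/log/backup.log 2>&1; then
    # Somebody has to hear about the failure, or the backups silently rot for years.
    mail -s "backup failed on $(hostname)" sysadmin@example.com < /var/log/backup.log
    exit 1
fi
```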
-
@Yamikuronue said in No thread about the GitLab fuckup yet?:
I copied this to my new boss and team with "This is the best reason I've ever seen to run regular disaster drills"
Good idea.
Sending this to mine.
-
@asdf Agreed. I've done it, but I've never claimed it was a good idea to make me do anything sysadmin-y :D
-
@cartman82 They could give this guy/gal a proper title. Like "Master of Disaster".
-
@loopback0 said in No thread about the GitLab fuckup yet?:
Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load)
Fucking nazi mods always finding reasons to ban people. There's no rule against sharing your repository with 47,000 people!
-
@Yamikuronue
Redirecting root's mail to a mailing list which all sysadmins are subscribed to is literally the first thing I did after installing Linux when I set up our dev server back then. ;) And even if you forgot to do that, you should make sure your backup script reports failures somehow. A simple:

```bash
#!/bin/bash
set -e

# Mail the sysadmins if the script exits before reaching the end.
function cleanup {
    echo "Backup script aborted" | mail -s "backup failed" sysadmin@gitlab.com
}
trap cleanup EXIT

# actual script here

# Reached the end without errors, so cancel the failure notification.
trap - EXIT
```
Would be a start.
-
@asdf Yeah, in my case it was deploy scripts; I wrapped the actual deploy script in a second script (for valid reasons I won't get into now), and forgot to return the inner return value. So the deploy "succeeded" with a screen full of errors. Whoops. Thankfully, people noticed pretty quick. Backup scripts I won't touch, for this exact reason.
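In sketch form (made-up file names), the fix is just propagating the inner status:

```bash
#!/bin/bash
# Outer wrapper around the real deploy script.

# ... pre-deploy housekeeping ...

./deploy_inner.sh
status=$?

# ... post-deploy housekeeping ...

# Forget this line and the wrapper exits 0 even when the inner deploy blew up.
exit $status
```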
-
@Yamikuronue said in No thread about the GitLab fuckup yet?:
forgot to return the inner return value
Oh, yeah, that's a classic. Everyone who's ever touched Shell scripts has made that mistake once or twice. ;)
-
@RaceProUK said in No thread about the GitLab fuckup yet?:
That's actually the most fascinating part of this for me. The Google Doc and the YouTube live stream offered a level of transparency in emergency response I've never seen before, which is impressive for their first major incident response (that we know of, obviously). Once GitLab gets their shit together, that kind of makes me want to use their services because I know if something goes wrong, there won't be a disingenuous cover-up.
-
@heterodox said in No thread about the GitLab fuckup yet?:
Once GitLab gets their shit together, that kind of makes me want to use their services because I know if something goes wrong, there won't be a disingenuous cover-up.
You have a point, but I'd be more concerned about the fact they had five backup strategies, all of which failed. And no-one noticed until it was too late.
-
@RaceProUK said in No thread about the GitLab fuckup yet?:
they had five backup strategies, all of which failed
They didn't all fail; some of them weren't even set up.
-
@loopback0 said in No thread about the GitLab fuckup yet?:
some of them weren't even set up
...
...
...
Is it possible to facepalm so hard you bend time?
-
@RaceProUK As said by The Register...
The world doesn't contain enough faces and palms to even begin to offer a reaction to that sentence.
-
@loopback0 I think "weren't even set up" falls pretty heavily under the definition of "failed"
-
@Vault_Dweller It's a bit like that question "If a tree falls and there's no-one to hear it, does it make a sound?"
-
I now have a bet with a coworker that they'll survive a year. He thinks they'll go under because of this incident.
-
@anonymous234 said in No thread about the GitLab fuckup yet?:
There's no rule against sharing your repository with 47,000 people!
-
@Vault_Dweller said in No thread about the GitLab fuckup yet?:
@loopback0 I think "weren't even set up" falls pretty heavily under the definition of "failed"
It's hard to fail if you're not even attempting something.
-
@loopback0 It's a backup strategy, i.e. there was a strategy. Whether it was implemented is another matter.
-
@Vault_Dweller Or rather, the fact that it wasn't implemented was the point of failure
-
Rails
Mmmmm-hmmmm.
@Rhywden said in No thread about the GitLab fuckup yet?:
how can you even set up a backup system and then not even verify once that it actually works?
I worked for a place once that outsourced its backups. Paid hyoooooge money for those backups, we did.
-
@Yamikuronue said in No thread about the GitLab fuckup yet?:
Backup scripts I won't touch, for this exact reason.
I run my backup scripts by hand and eyeball their progress spew.
This approach is of course in no way scalable; I can get away with it because I'm only backing up the one VM host.
It's quite comforting to keep a really close eye on the backup process. Found a failing source drive once just because backup was running slower than I'd come to expect. SMART and RAID logs showed nothing untoward. Did read-speed tests on all the drives in the set individually (hurrah for software RAID), replaced the one running at a quarter of the speed it should, and slapped it into service at my house as a secondary backup; two months later it started reallocating sectors.
-
@Maciejasjmj Well not everything. Apparently the git repos were fine?
-
@RaceProUK said in No thread about the GitLab fuckup yet?:
It's a bit like that question "If a tree falls and there's no-one to hear it, does it make a sound?"
More like "if a tree falls but nobody had ever actually bothered to plant it", surely?
-
What a bunch of gits!
-
@RaceProUK said in No thread about the GitLab fuckup yet?:
You have a point, but I'd be more concerned about the fact they had five backup strategies, all of which failed. And no-one noticed until it was too late.
You only know that because of the transparency, though. The amount of detail you'd get from most other companies is a single, bland statement: "Due to an unscheduled outage in production, about six hours of issues and pull requests were lost, but not any of your files! So you're all good. If you have a paid account or something and you really want some form of compensation, then fine, open a ticket with our support team and we'll figure something out (i.e., ignore you)."
@Yamikuronue said in No thread about the GitLab fuckup yet?:
He thinks they'll go under because of this incident.
It's possible, but I hope they don't.
@flabdablet said in No thread about the GitLab fuckup yet?:
This approach is of course in no way scalable; I can get away with it because I'm only backing up the one VM host.
Right. That'd be my counterargument to, "Why didn't anyone notice... x, y, or z?" Because they have a pretty large project as a GitHub competitor and appear to be, as 99% of companies are, operating on a shoestring Ops budget. (I think there were notations in the Google Doc from two, maybe three ops by initials? It's hardly a booming department.) Things fall through the cracks; as a lot of the opshugs said (what a cute concept), these things happen everywhere.
@JazzyJosh said in No thread about the GitLab fuckup yet?:
@Maciejasjmj Well not everything. Apparently the git repos were fine?
Yeah, those were stored on the filesystem, obviously, and not in Postgres.
-
@Maciejasjmj I'm thinking “thank god”. We use their stuff, but in our own deployment because it has some things in it we need to keep confidential by law and it's easier to do things ourselves than figure out if some random company out there is doing it right. At the very least, it means we own our disasters instead of borrowing someone else's… ;)
-
@tufty said in No thread about the GitLab fuckup yet?:
95% of the data had been backed up to /dev/null
To be fair, you can write a shitload of data to /dev/null before it fills up.
-
@Rhywden I'm no DBA, but it seems like making a script that fetches the backups every week, restores them to a temporary database and then runs some sanity checks on that (e.g. make sure the number of rows is approximately equal to the production database's) would be an obvious first line of defense.
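In outline, something like this would do it. This is a rough sketch with assumed names; the dump path, the scratch database, the production database and the "projects" table are all stand-ins:

```bash
#!/bin/bash
set -euo pipefail

SCRATCH_DB=backup_verify
DUMP=/backups/latest.dump

# Restore the most recent dump into a throwaway database.
dropdb --if-exists "$SCRATCH_DB"
createdb "$SCRATCH_DB"
pg_restore --dbname="$SCRATCH_DB" "$DUMP"

# Compare a cheap metric against production: row count of a key table.
prod_rows=$(psql -At -d gitlabhq_production -c "SELECT count(*) FROM projects")
restored_rows=$(psql -At -d "$SCRATCH_DB" -c "SELECT count(*) FROM projects")

# Allow some drift, but scream if the restored copy is suspiciously small.
if [ "$restored_rows" -lt $(( prod_rows * 9 / 10 )) ]; then
    echo "Restored backup has $restored_rows rows in projects vs $prod_rows in production" \
        | mail -s "backup verification failed" sysadmin@example.com
    exit 1
fi
```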
...actually, screw this. Databases and backups should (in most cases) be a solved problem by now. It should be an automatic, foolproof, one-click process.
-
@anonymous234 said in No thread about the GitLab fuckup yet?:
It should be an automatic, foolproof, one-click process.
and yet somehow it so very rarely is.
-
@flabdablet Is there anything that is? Every so often we get a thread moaning about how "It's $current_year, why isn't this solved‽"
-
@boomzilla said in No thread about the GitLab fuckup yet?:
Is there anything that is?
Clicking a button is a one-click process. Does that count?
-
@flabdablet It's almost like EVERYTHING RELATED TO COMPUTERS IS COMPLETE SHIT.
-
@flabdablet said in No thread about the GitLab fuckup yet?:
To be fair, you can write a shitload of data to /dev/null before it fills up.
It's very quick too. Pity it's a write-only medium
-
@cartman82 said in No thread about the GitLab fuckup yet?:
It's sad. I'd really like to see someone challenge github's monopoly.
Has Bitbucket ever had a major catastrophe like this?
-
@Jaloopa said in No thread about the GitLab fuckup yet?:
Pity it's a write-only medium
I had a backup system like that. Back when I got my first computer, I had a cassette tape backup system (on MSDOS). Never tested restore. You can guess what happened when I needed it... (My backup now is manual - copy data files to multiple places. OS/Programs can be reinstalled)
-
@RaceProUK said in No thread about the GitLab fuckup yet?:
@boomzilla said in No thread about the GitLab fuckup yet?:
Is there anything that is?
Clicking a button is a one-click process. Does that count?
No: We've had threads (and I know I've started at least one of them) about buttons that didn't look like buttons so that you didn't know you could click them.
-
@Jaloopa said in No thread about the GitLab fuckup yet?:
It's very quick too
Infinite write capacity, read speed that's literally off the charts - what's not to like?