Development at Work

nonpartisan

I'm fortunate that there really aren't a lot of things at work that make me exclaim "WTF???" But I had one unfortunate experience on Tuesday that wasted four hours of my life. And yes, there are a couple of other WTFs in this post that I'm not going to expand on right now.

I got to work around 0800 and saw an e-mail from my manager. He was following up on a couple of issues, one of which I knew about. The first issue was that a power bump had taken out several closets on campus even though they had redundant power equipment (UPSes and ATSes), and he wanted an explanation as to why. The second was a possible networking problem in the data center, which is my primary domain. One of the mail servers was having a problem and needed investigating.

I found Rod; he's one of our primary mail server engineers. He'd solved his issue; it turned out the problem was with a (Java) service that performed double duty: it routed mail, but it was also responsible for sending out notifications when problems arose. The service died. The server kept accepting mail for delivery, but the mail just sat there queuing up. There were no notifications of problems, since the dead service was the one responsible for sending them. By the time it was found, 1+ million messages had queued.

So there was no network issue. But Rod said to me, "Have you heard about the FTP problems? Dean thinks there's a network problem."

Me: "No. What's going on?"

Rod: "They're getting intermittent FTP failures from an automated process. It started end of last week and they noticed it yesterday."

Me: "Forward me the e-mail and I'll look into it."

I got the e-mail with the servers involved and IP addresses. Looking at the interface graphs, I saw some anomalous readings but nothing that jumped out at me as being related. I e-mailed Dean and he replied. "Screen shot attached. Looks like sometime between 6:30AM and 9:00AM Friday." The screen shot of the log was from August 10 and showed 20 FTP failures and a few successes. All of the failures were for the same filename, and a couple of times the transfer had actually started before ending with an error. These hadn't occurred the day before.

There were four routers involved. I checked all of them. None of them had interface errors on or around the time of the transfers. No bandwidth issues. Nothing in their logs indicated any problems.

For the next four hours, Rod, Dean, Ben (a UNIX admin responsible for the system that generated the files), and I looked through FTP transaction information. Theories abounded. A 500 error for the "FEAT" command. A 550 error on the SIZE command: "550 filename.txt.zip: not a plain file." Except sometimes it worked. Then it would download the file, but when it came to DELEting it, "filename.txt.zip: file not found". I was convinced there was a race condition of some kind, but I didn't know the source.

After four hours of e-mail flying between everyone, Ben came up with a sensible suggestion:

Ben: "511 is available. Why don't we meet in the conference room?"

We gathered in the conference room about noon. Dean kept discounting the idea of a race condition; this was happening to multiple transfers, multiple files, and they'd all have to be encountering the same issue. The files were created daily and the file list changed daily. The job got the list using a wildcard, then spawned a separate FTP session for each file to download. So each file had to exist at some point during the process.

Finally, the question that had been asked multiple times, but always dismissed, was asked again: "What changed last week?"

Dean: "Nothing changed. Rod set up a development server last week so that I could test changes before putting them in production, but I didn't have to test anything with these jobs."

Rod: "Yeah. I built it and . . . OHHHHHHhhhhhhhhhhh . . ."

He logged into the development server. Log entries showed it downloading the very same files. Where it succeeded, the production server failed. And vice versa. Rod's explanation: when he had done this on a Windows system before, he thought the import automatically disabled all of the user accounts and jobs. But he'd done an import to a Linux system this time . . . and it didn't disable the accounts.

I left the room with a "Fuckin' A!!" and went to lunch.

mott555

@nonpartisan said:

He logged into the development server. Log entries showed it downloading the very same files. Where it succeeded, the production server failed. And vice versa. Rod's explanation: when he had done this on a Windows system before, he thought the import automatically disabled all of the user accounts and jobs. But he'd done an import to a Linux system this time . . . and it didn't disable the accounts.

I'm dense today. What exactly was the issue? The dots aren't connecting in my brain.

nonpartisan

@mott555 said:

I'm dense today. What exactly was the issue? The dots aren't connecting in my brain.

Production and development were running the same jobs. The race condition was between them. But Rod never looked at development to see if this was even a possibility.
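
If the dots still aren't connecting, here's a rough sketch of the race in Python. It is only an illustration, not the actual job: `FakeFtpServer`, `run_session`, `drive`, and the file names are all made up, and the error strings just mimic the ones we saw. Both machines list the same files, then each one opens a session per file that downloads and then deletes. Whoever gets there second either can't find the file at all, or downloads it and then fails on the delete.

```python
import fnmatch


class FakeFtpServer:
    """In-memory stand-in for the FTP server (hypothetical, not a real FTP API)."""

    def __init__(self, names):
        self.files = {name: b"data" for name in names}

    def nlst(self, pattern):
        # Wildcard listing, like the job's first step.
        return sorted(n for n in self.files if fnmatch.fnmatch(n, pattern))

    def retr(self, name):
        if name not in self.files:
            raise FileNotFoundError(f"550 {name}: not a plain file.")
        return self.files[name]

    def dele(self, name):
        if name not in self.files:
            raise FileNotFoundError(f"{name}: file not found")
        del self.files[name]


def run_session(server, name, log, who):
    """One FTP session as a generator: download, pause, then delete."""
    try:
        server.retr(name)
        yield  # the other machine may run between download and delete
        server.dele(name)
        log.append((who, name, "ok"))
    except FileNotFoundError as exc:
        log.append((who, name, str(exc)))


def drive(sessions):
    """Step the sessions round-robin, one step at a time, until all finish."""
    pending = list(sessions)
    while pending:
        for session in list(pending):
            try:
                next(session)
            except StopIteration:
                pending.remove(session)


def simulate():
    server = FakeFtpServer(["a.txt.zip", "b.txt.zip"])
    log = []
    # Both machines take the listing while every file still exists.
    prod_list = server.nlst("*.zip")
    dev_list = server.nlst("*.zip")
    assert prod_list == dev_list

    # File a: prod finishes before dev even starts its session.
    drive([run_session(server, "a.txt.zip", log, "prod")])
    drive([run_session(server, "a.txt.zip", log, "dev")])

    # File b: the two sessions overlap, with dev slightly ahead.
    drive([
        run_session(server, "b.txt.zip", log, "dev"),
        run_session(server, "b.txt.zip", log, "prod"),
    ])
    return log


if __name__ == "__main__":
    for entry in simulate():
        print(entry)
```

In this sketch prod wins the first file and dev wins the second, so each machine's log shows a mix of successes and failures, which is roughly what the production log looked like to us.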

RichP

@nonpartisan said:

@mott555 said:
I'm dense today. What exactly was the issue? The dots aren't connecting in my brain.
Production and development were running the same jobs. The race condition was between them. But Rod never looked at development to see if this was even a possibility.

This is why you should never test things. Testing just breaks stuff.

Cad_Delworth

Rant template: Why is it that (group) never have any fucking clue how (technology) fucking even works, and typically haven't even fucking heard of (technology)? Probably because people who actually know shit realize (technology) [is/are] (adjective).

FTFY.

Speakerphone_Dude

@nonpartisan said:

[snip long story]

The WTFs in a nutshell:

• Using email instead of a ticket system
• Having in-band monitoring for email
• Emailing IP addresses to other people in IT
• Having DEV and PROD machines on the same network
• Working with single-syllable-named people only

Ben L.

@Speakerphone Dude said:

@nonpartisan said:
[snip long story]

The WTFs in a nutshell:

• Using email instead of a ticket system
• Having in-band monitoring for email
• Emailing IP addresses to other people in IT
• Having DEV and PROD machines on the same network
• Working with single-syllable-named people only
• I have a job somehow

Speakerphone_Dude

@Ben L. said:

I have a job somehow

Ben there, done that

Ben L.

@Speakerphone Dude said:

@Ben L. said:
I have a job somehow

Ben there, done that

No I haven't!