Time Synching

snoofle

We have a bunch of production servers that are (allegedly) time synchronized. As such, I was tasked with writing something to show how long a message spent in various sections of our and other applications for the purpose of determining choke points.

Then we start getting messages with negative time spent between adjacent points along the way. They complain. I debug. My logic is correct.
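A minimal sketch of the kind of check that surfaces the problem, assuming each message carries a list of (point, timestamp) pairs stamped by whichever server handled it. The names `hop_durations`, `negative_spans` and the sample hosts are made up for illustration and are not the actual tool:

```python
from datetime import datetime, timedelta

def hop_durations(hops):
    """Return (from_point, to_point, duration) for each adjacent pair of hops.

    `hops` is a list of (point_name, timestamp) tuples in the order the
    message travelled. Each timestamp comes from that server's own clock.
    """
    return [
        (a_name, b_name, b_time - a_time)
        for (a_name, a_time), (b_name, b_time) in zip(hops, hops[1:])
    ]

def negative_spans(hops):
    """Spans with a negative duration, which points at clock skew."""
    return [span for span in hop_durations(hops) if span[2] < timedelta(0)]

# Hypothetical trace: app-b's clock runs a couple of seconds behind.
trace = [
    ("app-a", datetime(2012, 1, 1, 12, 0, 0)),
    ("app-b", datetime(2012, 1, 1, 11, 59, 58, 500000)),
    ("app-c", datetime(2012, 1, 1, 12, 0, 1)),
]
suspect = negative_spans(trace)
```

A negative span only proves that the two clocks disagree. It does not say which of the two servers is the wrong one.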

Long story short, the servers are not as synchronized as they thought. Apparently, they get the actual time from <somewhere>, but only sync some of the servers some of the time.

They asked me if I could compensate for this in software, but in my spare time as we're kind of busy with "real work" right now.

I suspect it may be possible, but not worth it, and definitely not in my spare time.
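For what it's worth, a rough sketch of what "compensating in software" might look like, assuming someone has already measured how far each server's clock is from a reference. The `CLOCK_OFFSETS` table and `corrected` function are hypothetical, and keeping that table accurate is exactly the part that is not worth doing by hand:

```python
from datetime import datetime, timedelta

# Hypothetical offsets: how far each host's clock is AHEAD of the reference.
# Measuring and maintaining these values is assumed to happen elsewhere.
CLOCK_OFFSETS = {
    "app-a": timedelta(0),
    "app-b": timedelta(seconds=-2),
}

def corrected(host, local_time, offsets=CLOCK_OFFSETS):
    """Map a host-local timestamp onto the reference timeline.

    Raises KeyError for hosts with no known offset, rather than guessing.
    """
    return local_time - offsets[host]

sent = corrected("app-a", datetime(2012, 1, 1, 12, 0, 0))
received = corrected("app-b", datetime(2012, 1, 1, 11, 59, 58, 500000))
elapsed = received - sent  # positive again once app-b's lag is removed
```

The offsets drift over time, so a static table like this goes stale unless something keeps re-measuring it, at which point you have reinvented a worse NTP.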

PJH

@snoofle said:

I suspect it may be possible, but not worth it, and definitely not in my spare time.

Simply install ntpd?

Xyro

@snoofle said:

Then we start getting messages with negative time spent between adjacent points along the way.

Centralized logging server that is defined as the authoritative source of timestamps?

esoterik

Why not just have the servers sync against something like time.nist.gov?

snoofle

Both ntpd and time.nist.gov were suggested (by others here) - it's just that they didn't do it everywhere, and where they did do it, they didn't do it correctly.

erikal

@snoofle said:

They asked me if I could compensate for this in software

My answer would be one of two lines:

1) If you try to automate chaos, you get automated chaos.

2) garbage in = garbage out

In this case I would probably pick #2 and keep it vague.

Cassidy

@snoofle said:

They asked me if I could compensate for this in software

"Yes. I can create a test file of servers that are time-synched and the software only interrogates servers listed in that file, since we know for definite that they will yield correct results. Once other servers are time-synched, the list can be updated."

@snoofle said:

but in my spare time as we're kind of busy with "real work" right now.

"so it's not really a priority then. Okay, back to my real work...."

DaedalusRaistlin

Would you believe NTP isn't always the answer?

Two of our main servers have 2500 IP addresses each. NTP tries to bind on every single address (eth0, eth1:0-2500) and runs out of file handles, then dies. Sure I found a few forum threads about the issue, even a patch to stop NTP trying to bind to everything, but it's not in the main branch yet. And the patch was from around 2006, but I'd rather not bother with it.

In the end, I decided to install a cron job that synced time with another server in the same datacenter that can run NTP. Does that make me TRWTF?

Cassidy

@DaedalusRaistlin said:

Two of our main servers have 2500 IP addresses each. NTP tries to bind on every single address (eth0, eth1:0-2500) and runs out of file handles, then dies.

Are you referring to NTP server or client? I didn't think the client needs to bind to an IP address.

@DaedalusRaistlin said:

In the end, I decided to install a cron job that synced time with another server in the same datacenter that can run NTP. Does that make me TRWTF?

Well, that makes two of us, then. Years back I cronned a script that periodically ran rdate to time-sync my server until someone pointed me to ntpd.

Either way, "NTP isn't always the answer" doesn't affect the original question - for reliable results, those servers need to be time-synched in some way, whether it be NTP/AboutTime/Windows Time/etc.

PJH

@DaedalusRaistlin said:

Would you believe NTP isn't always the answer?

Nope.

@DaedalusRaistlin said:

Two of our main servers have 2500 IP addresses each. NTP tries to bind on every single address (eth0, eth1:0-2500) and runs out of file handles, then dies.

ntpd might bind to all interfaces, but that's not what I was recommending - I suggested using the client. If you want to run a local ntp server, then you probably want it on a machine that doesn't have 2500 aliases.

Purely out of interest, what problem was 2500 aliases the solution for?

DaedalusRaistlin

@PJH said:

@DaedalusRaistlin said:
Would you believe NTP isn't always the answer?
Nope.
@DaedalusRaistlin said:
Two of our main servers have 2500 IP addresses each. NTP tries to bind on every single address (eth0, eth1:0-2500) and runs out of file handles, then dies.
ntpd might bind to all interfaces, but that's not what I was recommending - I suggested using the client. If you want to run a local ntp server, then you probably want it on a machine that doesn't have 2500 aliases.

Purely out of interest, what problem was 2500 aliases the solution for?

2500 IP addresses, of course ;)

wrack

@PJH said:

I suggested using the client. If you want to run a local ntp server, then you probably want it on a machine that doesn't have 2500 aliases.

With NTP, you have two options.

You can launch ntpdate from a cronjob. It starts, gets the current time from a server, sets the system clock, and exits. It may lead to such niceties as the clock going backwards, which may cause scheduled jobs to run twice, stuff to get processed out of order, and so on. It is not recommended unless you are sure your software can deal with it.

Or, you can run ntpd. It's a resident program. It CAN be a server to other machines, but that's optional. It does bind to all the interfaces by default (this is configurable with the "interfaces" command in /etc/ntp.conf - at least on the version installed on my production server, which apparently dates from 2008), because with UDP you have to open a port to receive replies, as there's no concept of an "established connection".

Ntpd connects to some peer servers, dropping unstable ones from time to time, calculates an average, and then "gently" adjusts the clock towards the right time by making "a second" take a bit more or a bit less time, while always guaranteeing that there are 60 seconds in a minute and that the clock never goes backwards. That's why it's the recommended solution.
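To make the difference concrete, here is a toy model of the two correction styles. It is only a sketch and not ntpd's actual algorithm: the 500 ppm cap is an illustrative figure, and `step`, `slew` and the sample numbers are invented for the example.

```python
MAX_SLEW_PPM = 500  # illustrative cap on the rate adjustment, parts per million

def step(clock, offset):
    """ntpdate-style correction: jump straight to the right time."""
    return clock + offset

def slew(clock, offset, elapsed, max_ppm=MAX_SLEW_PPM):
    """Advance `clock` by `elapsed` real seconds while absorbing part of `offset`.

    Returns (new_clock, remaining_offset). The rate changes by at most
    max_ppm, so the reading always moves forward.
    """
    max_adjust = elapsed * max_ppm / 1_000_000
    adjust = max(-max_adjust, min(max_adjust, offset))
    return clock + elapsed + adjust, offset - adjust

# A clock that is 0.01 s fast (offset -0.01 s), sampled once per second.
clock, offset = 1000.0, -0.01
readings = [clock]
while abs(offset) > 1e-12:
    clock, offset = slew(clock, offset, elapsed=1.0)
    readings.append(clock)

stepped = step(1000.0, -0.01)  # lands before 1000.0, i.e. time went backwards
```

In this model every slewed reading is later than the previous one, whereas the stepped clock reports a time earlier than one it has already shown. The price of slewing is that a large offset takes a long time to absorb.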

Cassidy

@DaedalusRaistlin said:

@PJH said:
Purely out of interest, what problem was 2500 aliases the solution for?
2500 IP addresses, of course ;)

You work in a datacentre, or a hosting company - something of that ilk?

I've seen large numbers of IP aliases bound to one (or several) NICs in hosting centres to separate out customer websites/virtual servers. Not sure if this is the case, mind.

DaedalusRaistlin

@bannedfromcoding said:

@PJH said:
I suggested using the client. If you want to run a local ntp server, then you probably want it on a machine that doesn't have 2500 aliases.

With NTP, you have two options.

You can launch ntpdate from a cronjob. It starts, gets the current time from a server, sets the system clock, and exits. It may lead to such niceties as the clock going backwards, which may cause scheduled jobs to run twice, stuff to get processed out of order, and so on. It is not recommended unless you are sure your software can deal with it.

Or, you can run ntpd. It's a resident program. It CAN be a server to other machines, but that's optional. It does bind to all the interfaces by default (this is configurable with the "interfaces" command in /etc/ntp.conf - at least on the version installed on my production server, which apparently dates from 2008), because with UDP you have to open a port to receive replies, as there's no concept of an "established connection".

Ntpd connects to some peer servers, dropping unstable ones from time to time, calculates an average, and then "gently" adjusts the clock towards the right time by making "a second" take a bit more or a bit less time, while always guaranteeing that there are 60 seconds in a minute and that the clock never goes backwards. That's why it's the recommended solution.

Maybe it was just me, but I couldn't get it to behave any differently with an interfaces line, which was one of the first things I checked.
Perhaps it ignores invalid ntp.conf lines? It never gave me an error and still tried to bind to everything.

Aye, though perhaps I should have just run ntpdate. It's probably better than the answer I ended up using: a script along the lines of 'date `ssh mainserver date`' with a particular date mode.
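One weakness of copying the remote date directly is that the time spent on the round trip ends up baked into the local clock. A hedged sketch of the usual workaround, which is to time the request and assume the remote clock was read halfway through it. The `fetch_remote_time` callable is hypothetical (it could wrap the ssh call), and the symmetric-delay assumption will not always hold:

```python
import time

def estimate_offset(fetch_remote_time, local_clock=time.time):
    """Estimate the (remote - local) clock offset in seconds, plus the round trip.

    `fetch_remote_time` is a hypothetical callable returning the remote
    server's time as a Unix timestamp. The remote clock is assumed to have
    been read halfway through the round trip.
    """
    sent = local_clock()
    remote = fetch_remote_time()
    received = local_clock()
    round_trip = received - sent
    # By the time the reply arrives, the remote clock has moved on by
    # roughly half the round trip.
    offset = remote + round_trip / 2 - received
    return offset, round_trip

# Demo with fake clocks: local time goes 0.0 -> 0.5 during the request,
# and the remote clock (5 s ahead) was read at the midpoint.
ticks = iter([0.0, 0.5])
offset, rtt = estimate_offset(lambda: 5.25, local_clock=lambda: next(ticks))
```

The error of the estimate is bounded by half the round trip, so a slow ssh handshake makes this noticeably less accurate than ntpdate or ntpd, which use a protocol built for the job.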

@Cassidy said:

@DaedalusRaistlin said:
@PJH said:
Purely out of interest, what problem was 2500 aliases the solution for?

2500 IP addresses, of course ;)
You work in a datacentre, or a hosting company - something of that ilk?
I've seen large numbers of IP aliases bound to one (or several) NICs in hosting centres to separate out customer websites/virtual servers. Not sure if this is the case, mind.

Heh, nope - we used to have "only" 2200 IPs on one server. Now we've upgraded to 5000ish IPs on two servers. We have some very particular needs :P

Cassidy

@DaedalusRaistlin said:

Heh, nope - we used to have "only" 2200 IPs on one server. Now we've upgraded to 5000ish IPs on two servers. We have some very particular needs :P

Oh, of course - Russian spammer with large blocks of spoofed IPs to avoid receiving backscatter! I can see it now. I should have guessed!

D&RFC....

Sutherlands

@Cassidy said:

@DaedalusRaistlin said:
@PJH said:
Purely out of interest, what problem was 2500 aliases the solution for?
2500 IP addresses, of course ;)

You work in a datacentre, or a hosting company - something of that ilk?

I've seen large numbers of IP aliases bound to one (or several) NICs in hosting centres to separate out customer websites/virtual servers. Not sure if this is the case, mind.

Our hosting servers don't need multiple IPs to serve multiple clients.

Also, I blame Daedalus for us running out of IPv4 addresses.

DaedalusRaistlin

@Cassidy said:

@DaedalusRaistlin said:
Heh, nope - we used to have "only" 2200 IPs on one server. Now we've upgraded to 5000ish IPs on two servers. We have some very particular needs :P

Oh, of course - Russian spammer with large blocks of spoofed IPs to avoid receiving backscatter! I can see it now. I should have guessed!
D&RFC....

Err no, Australian, and not sending any emails at all. But whatever, you sound like you've made up your mind so I'll leave it at that.

@Sutherlands said:

@Cassidy said:
@DaedalusRaistlin said:
@PJH said:
Purely out of interest, what problem was 2500 aliases the solution for?
2500 IP addresses, of course ;)

You work in a datacentre, or a hosting company - something of that ilk?

I've seen large numbers of IP aliases bound to one (or several) NICs in hosting centres to separate out customer websites/virtual servers. Not sure if this is the case, mind.

Our hosting servers don't need multiple IPs to serve multiple clients.

Also, I blame Daedalus for us running out of IPv4 addresses.

We're still using a very tiny amount of IPv4 addresses - according to a few estimates, we still have a while to go until IPv4 exhaustion. The addresses we've obtained have been through standard requests - we haven't used any illegal or obscure means to get these addresses, it's all within the current guidelines. You may be surprised as to how many IPs are still available.

Sutherlands

Your sarcasm detector is in SERIOUS need of repair.

Cassidy

@Sutherlands said:

Our hosting servers don't need multiple IPs to serve multiple clients.

I was thinking more a case of "Our infrastructure doesn't need multiple servers to serve on multiple IP addresses" - situations where customers have bought/leased IP address blocks with a hosting provider and the provider don't want physically different hardware per client.

@Sutherlands said:

Also, I blame Daedalus for us running out of IPv4 addresses.

I vote a midnight raid to reclaim some. "Im in ur NIC steeln ur IPs!"

@Sutherlands said:

Your sarcasm detector is in SERIOUS need of repair.

Mmm... in his defence, maybe I was too obscure with the satire and he missed it. Doesn't look like I've caused much offence.

PJH

@DaedalusRaistlin said:

@PJH said:
Purely out of interest, what problem was 2500 aliases the solution for?
2500 IP addresses, of course ;)

That's not answering my question. That's reaffirming the current solution as being 'correct.' Feel free to tell me to mind my own business (which, I suppose, that non-answer did), but you didn't answer the question: why are 2500 aliases required?

CentOS 5.x and a blatant lack of IP_TRANSPARENT, perhaps? (Something that bit me recently...)