Perl Just Can't Compare

nonpartisan

Last November, we moved over to a new set of DNS and DHCP servers. I set up monitoring in an open-source monitoring system called Cricket. It's written in Perl and uses an RRD backend. It's simple, it's not fancy, but it gets the job done.

We already had some scripts in place that would monitor the number of DHCP objects available in each subnet, both manual DHCP and dynamic DHCP. When the number of used dynamic DHCP objects broke a 90% threshold, we would get an alert. I rewrote those scripts so that the monitoring would be a bit more dynamic -- the original author just had it use a static set of subnets and didn't update them regularly.

The new scripts feed Cricket the number of dynamic DHCP objects in use, the total number of dynamic DHCP objects in the subnet, and the percentage in use as a whole number. I configured Cricket to keep the same 90% threshold as before.

Soon after, we started getting notifications that, for all intents and purposes, amounted to the DHCP monitoring flapping. One particular subnet would show up frequently, but it wasn't fixed on that subnet -- it did change. After tearing my hair out and not seeing anything wrong, I started debugging the Cricket code.

The code is as follows:

 
sub monValue {
    my($self,$target,$ds,$type,$args) = @_;
    my($min,$max,$minOK,$maxOK);
    my(@Thresholds) = split(/\s*:\s*/, $args);

    my($value) = $self->rrdFetch(
                                 $target->{'rrd-datafile'},
                                 $self->getDSNum($target, $ds), 0
                                 );

    if (!defined($value)) {
        Warn("Monitor: Couldn't fetch last $ds value from " .
             $target->{'rrd-datafile'}.".");
        return 1;
    }

    return ($nanErr, 'NaN') if isNaN($value);

    $min = shift(@Thresholds);
    $min = 'n' if (! defined($min));

    if (lc($min) eq 'n') {
        $minOK = 1;
    } else {
        $minOK = ($value > $min) ? 1 : 0;
    }

    $max = shift(@Thresholds);
    $max = 'n' if (! defined($max));

    if (lc($max) eq 'n') {
        $maxOK = 1;
    } else {
        $maxOK = ($value < $max) ? 1 : 0;
    }
    Debug ("Value is $value; min is $min; max is $max; return value " . ($maxOK && $minOK));
    return ($maxOK && $minOK,$value);
}

I added the Debug statement at the end. When the process ran, I came up with the following:

[09-Dec-2010 11:10:48 ] Processing /DHCP-info/10.194.247.0...
[09-Dec-2010 11:10:48 ] in rrdFetch: file is /app00/cricket/cricket-config/../cricket-data//DHCP-info/10.194.247.0.rrd
[09-Dec-2010 11:10:48 ] in rrdFetch: skipping RRA
[09-Dec-2010 11:10:48 ] in rrdFetch: rraNum is 0 rowNum is 0 dsNum is 4
[09-Dec-2010 11:10:48 ] in rrdFetch: return is 90
[09-Dec-2010 11:10:48 ] Value is 90; min is n; max is 90; return value 1
[09-Dec-2010 11:10:48 ] /DHCP-info/10.194.247.0 - PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu passed.
[09-Dec-2010 11:10:48 ] Triggering recovery for PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu.
[09-Dec-2010 11:10:48 ] |/usr/bin/mailx -s 'Cricket CLEAR: /DHCP-info/10.194.247.0' example@domain.edu

[09-Dec-2010 11:10:48 ] Monitor: Email sent to: example@domain.edu
type:           value -- threshold:     PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu -- target:   /DHCP-info/10.194.247.0 -- ds:         PercentDynamicDHCP -- val:              90

Five minutes later, it came up with this:

[09-Dec-2010 11:15:48 ] Processing /DHCP-info/10.194.247.0...
[09-Dec-2010 11:15:48 ] in rrdFetch: file is /app00/cricket/cricket-config/../cricket-data//DHCP-info/10.194.247.0.rrd
[09-Dec-2010 11:15:48 ] in rrdFetch: skipping RRA
[09-Dec-2010 11:15:48 ] in rrdFetch: rraNum is 0 rowNum is 0 dsNum is 4
[09-Dec-2010 11:15:48 ] in rrdFetch: return is 90
[09-Dec-2010 11:15:48 ] Value is 90; min is n; max is 90; return value 0
[09-Dec-2010 11:15:48 ] /DHCP-info/10.194.247.0 -  1291922148 - PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu failed.
[09-Dec-2010 11:15:48 ] Triggering alarm for PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu.
[09-Dec-2010 11:15:48 ] |/usr/bin/mailx -s 'Cricket ADD: /DHCP-info/10.194.247.0' example@domain.edu

[09-Dec-2010 11:15:48 ] Monitor: Email sent to: example@domain.edu
type:           value -- threshold:     PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu -- target:   /DHCP-info/10.194.247.0 -- ds:         PercentDynamicDHCP -- val:              90

Notice the two lines where the values are exactly the same, but $maxOK && $minOK returns 1 in one instance, then $maxOK && $minOK returns 0 in the next instance.

The solution? I changed the monitoring threshold to 89.9. Problem solved.

delta5341

I assume value and max are both floating point numbers, correct? If so, then there is no WTF, and I would like you to spend some time reading about how floating point numbers are stored in memory and about their pitfalls.

nonpartisan

Perl is not a strictly-typed language. $max and $min can be integer, or string, or floating point, or whatever you happen to need at the time. And the old monitoring had the same threshold but didn't have this problem. Only difference was that it was on a different server.

The_Assimilator

@delta534 said:

I can assume value and max are both floating point numbers correct. If so then there is no WTF, and I would like you to spend some time reading about how floating point numbers are stored in memory and all the pitfalls about floating point numbers.

Congratulations, you understand how floating-point numbers work. The WTF is that whoever wrote Cricket's threshold monitoring code apparently doesn't.

delta5341

I don't do perl, and it just looked like it might have been using floating point, since floating point is the usual suspect when two values that should compare equal do not.

nonpartisan

@delta534 said:

I don't do perl and it just looked like it might have been using floating point since floating point is the usual suspect when 2 comparisons that should be equal are not.

To the best of my knowledge, Perl uses the most applicable comparison that it can in the situation. The percentage that I feed to Cricket is an integer -- I don't feed it 39.84 or 41.77. The value with which I originally programmed the threshold was an integer. I'm well aware of the inaccuracies of floating point. If I couldn't rely on Perl performing a correct integer comparison, then there would be no point to the language. There were other subnets that worked just fine and their alerts worked perfectly. It looked like this would happen when the alert came toward the end of the list of subnets, but I have no explanation for it except that it was straight out doing the comparison incorrectly.

TGV

@delta534 said:

I can assume value and max are both floating point numbers correct. If so then there is no WTF, and I would like you to spend some time reading about how floating point numbers are stored in memory and all the pitfalls about floating point numbers.

Isn't it unlikely that in DHCP you would actually get that close to 90, such as 89.999999 or 90.00001?

Iago

@nonpartisan said:

The percentage that I feed to Cricket is an integer -- I don't feed it 39.84 or 41.77. The value with which I originally programmed the threshold was an integer.

Yes, but just because you used integers doesn't mean everyone else will.

In practice, it looks like $cricket->rrdFetch() calls RRD::File::getDSRowValue(), which reads a block from a binary data file and calls unpack() on it with the format parameter taken by calling RRD::Format::format and requesting the "element" format, which is "d", which means that the value you are comparing is floating-point.

The only WTF is that you assumed you would be dealing with integers instead of actually, you know, checking. I have many times been faced with a strange problem in my code that I initially suspected of being a bug in the language. Of all those times, only once was that suspicion correct, and it involved an extremely uncommon edge case way out at the extremes of what floating-point numbers can represent. Every single other time? PEBKAC.
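
For anyone who wants to see the effect Iago describes without touching Cricket, here is a minimal sketch. It assumes a Perl built with ordinary 64-bit doubles; pack() only stands in for the RRD file, and the value one step below 90 is a hypothetical picked to show the behaviour, not something recovered from the real data file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the RRD row: a double one representable
# step (2**-46) below 90, stored as raw bytes.
my $bytes = pack("d", 90 - 2**-46);

# unpack("d", ...) always hands back a floating-point scalar.
my ($value) = unpack("d", $bytes);
my $max     = "90";    # threshold as split out of the config string

# Default stringification keeps about 15 significant digits,
# so the log line looks identical to an exact 90.
my $shown  = "$value";
my $passes = ($value < $max) ? 1 : 0;

print "Value is $shown; max is $max; maxOK is $passes\n";
```

If the stored value really were exactly 90, the same comparison would come out 0, which would match the two log excerpts in the original post.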

pjt33

@Iago said:

The only WTF is that you assumed you would be dealing with integers instead of actually, you know, checking.

Do I sense an impending flame war over static vs dynamic type systems?

derula

@pjt33 said:

@Iago said:
The only WTF is that you assumed you would be dealing with integers instead of actually, you know, checking.
Do I sense an impending flame war over static vs dynamic type systems?

(>^◡⁠^)> poyo, poyo!

topspin

@pjt33 said:

Do I sense an impending flame war over static vs dynamic type systems?

Aren't the arguments for that already made?

error

@topspin said:

@pjt33 said:
Do I sense an impending flame war over static vs dynamic type systems?

Aren't the arguments for that already made?

Hey, everybody, look at this dead horse lying over here.

nonpartisan

@Iago said:

@nonpartisan said:
The percentage that I feed to Cricket is an integer -- I don't feed it 39.84 or 41.77. The value with which I originally programmed the threshold was an integer.
Yes, but just because you used integers doesn't mean everyone else will.
In practice, it looks like $cricket->rrdFetch() calls RRD::File::getDSRowValue(), which reads a block from a binary data file and calls unpack() on it with the format parameter taken by calling RRD::Format::format and requesting the "element" format, which is "d", which means that the value you are comparing is floating-point.
The only WTF is that you assumed you would be dealing with integers instead of actually, you know, checking. I have many times been faced with a strange problem in my code that I initially suspected of being a bug in the language. Of all those times, only once was that suspicion correct, and it involved an extremely uncommon edge case way out at the extremes of what floating-point numbers can represent. Every single other time? PEBKAC.

It's easy to say that and I'm not disputing that floating point is an inexact science. However:

The integer value of 90 has an exact representation in floating point (that is, no rounding is necessary),
I'm feeding it integer values only, and
It's thoroughly inconsistent -- if this was a floating point representation issue, I'd expect the same problem to occur with all of my subnets that are hitting a value of 90.
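
The first point is easy to check. The sketch below (assuming standard IEEE 754 doubles) shows that 90 itself is stored exactly and that the text "90" numifies to the same value. So if floating point is involved, the question is whether the number that was stored is exactly 90 in the first place.

```perl
use strict;
use warnings;

# A whole number this small is stored exactly as a double, so printing
# it with far more digits than needed still shows no fractional part.
my $exact = sprintf("%.20g", 90);

# The text "90" read from a file numifies to the same value.
my $from_file = "90\n";
chomp $from_file;
my $same = ($from_file == 90) ? 1 : 0;

# Contrast: 0.1 has no exact binary representation.
my $tenth = sprintf("%.20g", 0.1);

print "90 -> $exact\n";
print "'90' == 90 -> $same\n";
print "0.1 -> $tenth\n";
```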

This is what gets fed as input:

10
18
30
59
50

In order: number of M-DHCP objects in use, number of M-DHCP objects in the subnet, number of D-DHCP objects in use, number of D-DHCP objects in the subnet, and percentage of objects in use. 30/59 is 50.847457somethingsomethingsomething%, but I just feed in the integer portion of it, just truncating it. I don't even round it.

There's no reason I can see that Perl is choking in this comparison. If it chokes on this comparison, it should be choking on them all.

Xyro

@nonpartisan said:

It's easy to say that and I'm not disputing that floating point is an inexact science.

Wha?
@nonpartisan said:

There's no reason I can see that Perl is choking in this comparison. If it chokes on this comparison, it should be choking on them all.

Whe?
@nonpartisan said:

but I just feed in the integer portion of it, just truncating it.

Perl's not doing anything fancy. You're comparing $value with $max, right? Where does $value get truncated? Maybe misunderstanding you here.

nonpartisan

@Xyro said:

@nonpartisan said:
It's easy to say that and I'm not disputing that floating point is an inexact science.
Wha?
@nonpartisan said:
There's no reason I can see that Perl is choking in this comparison. If it chokes on this comparison, it should be choking on them all.
Whe?
@nonpartisan said:
but I just feed in the integer portion of it, just truncating it.
Perl's not doing anything fancy. You're comparing $value with $max, right? Where does $value get truncated? Maybe misunderstanding you here.

Floating point, in general, is an inexact science. But there are floating point values that have a definite representation in binary (i.e. no rounding is needed).

The simplest background is that Cricket can read values from SNMP OIDs or from text files. (Pedantic: there are other options, but these are the only options that matter to me at this time.) So I have one script that runs every 30 minutes and generates new statistics files. This script runs on the :16 and the :46 of the hour.

The first script creates a generic text file with the previously-listed fields (let's say the file is 10.191.247.0_255.255.255.0.txt):

10
18
30
59
50

That script is in no way, shape, or form related to Cricket. Cricket doesn't force it to run; this first script is run by a cron job and its only function in life is to update the DHCP text files.

Every 5 minutes, Cricket comes along and reads the latest (well, at least last updated by the first script) DHCP values. (Why have Cricket run every 5 minutes when the values are only updated every 30 minutes? It was just simpler that way and I was throwing together the updated monitoring pretty quickly.) Cricket is programmed to read that text file and record the values into the different fields (M-DHCP in use, M-DHCP total objects, D-DHCP in use, D-DHCP total objects, percent D-DHCP in use).

So it is a separate script that is actually gathering the data and writing it into text files. And when it writes the data into text files, it calculates the percent D-DHCP in use so that Cricket doesn't need to. When Cricket comes around and reads the data, the percentage is already in an integer form and is read straight from the file. It's not doing any calculations -- just reading the data and recording it. So when Cricket reads the data, it is getting a pristine integer "90" out of the file. (In the example above, it's getting a pristine "50" out of the file.)

So Cricket doesn't truncate $value anywhere. The information I feed into $value is an integer. My argument is: even if Cricket (Perl) is somehow reading and treating my "90" as a floating point value, (a) there's no fractional part to the number I'm feeding it, (b) there is an exact (non-rounded) floating point representation of the value "90", and (c) I had other subnets that were reporting as "90" and their comparisons weren't getting screwed up and flapping like I described in the original post.

boomzilla

@nonpartisan said:

Floating point, in general, is an inexact science.

I think you're confusing the words "inexact science" with "commonly misunderstood and easily misused."

Xyro

If $value is a whole number, then what you're comparing it to must not be, or else the other way around. What I mean is, there's no way Perl or any other major language will be making a fundamental mistake on a simple comparison operation. You can compare 90 < 90 a billion times and always get false 100% of the time, so if you're seeing a true then it's far more reasonable to conclude that those 90 values aren't quite what you expect. I don't have any additional insight into what's going on without more debugging, but I can assure you, floating points or otherwise, the inequality comparison is not using quantum mechanics math.

nonpartisan

@Xyro said:

If $value is a whole number, then what you're comparing it to must not be, or else the other way around. What I mean is, there's no way Perl or any other major language will be making a fundamental mistake on a simple comparison operation. You can compare 90 < 90 a billion times and always get false 100% of the time, so if you're seeing a true then it's far more reasonable to conclude that those 90 values aren't quite what you expect. I don't have any additional insight into what's going on without more debugging, but I can assure you, floating points or otherwise, the inequality comparison is not using quantum mechanics math.

And that, my friend, is the whole source of the WTF. If you go back and look at my original output, the debug line, I am explicitly outputting $value, $min, and $max after Cricket has read $value from the text file. $min and $max are defined by the threshold line (you can see that line where it says "PercentDynamicDHCP:value:n:90"). None of them are anything different than I expect. In the description of the system I posted yesterday, I noted that the files only get rewritten on the :16 and :46 of the hour. The example log entries were from 11:10 and 11:15, meaning the file was not rewritten during that time. Nothing changed, and yet Perl is coming back with a different answer. And yet there were other comparisons, with the same threshold, on other subnets reporting 90% (since that's what the first script wrote to the file), and those worked just fine. Best I can tell it happened on the last subnet that reported 90%, and that subnet would change periodically (so this wasn't failing on the same subnet every time), but not necessarily the last subnet it checked.

I have no explanation, which is why I posted the WTF.

Xyro

@nonpartisan said:

And that, my friend, is the whole source of the WTF.

I know, but I demand an explanation!
@nonpartisan said:

I have no explanation, which is why I posted the WTF.

I just said I demand one!

No, really, it looks like this WTF is confined enough to be isolated to a small patch of code. We must test it! Enlightenment will follow.

thosrtanner

You could always try outputting the result of $value - $max.

I suspect some confirmation of floatingpointedness would follow.
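
Something along these lines would do it. This is only a sketch; threshold_gap is a made-up helper, not part of Cricket, and the second call uses a hypothetical value one step below 90 to show what a non-zero gap looks like.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Cricket: describe how far a value
# sits from its threshold, in a format that cannot round the gap away.
sub threshold_gap {
    my ($value, $max) = @_;
    return sprintf("%.3e", $value - $max);
}

print threshold_gap(90, "90"), "\n";            # exact 90
print threshold_gap(90 - 2**-46, "90"), "\n";   # one step below 90
```

Scientific notation is used because a tiny difference would otherwise print as 0 or get lost in rounding.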

tgape

@nonpartisan said:

Perl is not a strictly-typed language. $max and $min can be integer, or string, or floating point, or whatever you happen to need at the time. And the old monitoring had the same threshold but didn't have this problem. Only difference was that it was on a different server.

Thus speaks one who does not really know perl. I don't *really* know it, but I believe I know it a bit better.

$max and $min are string, boolean, and floating point, since no magic has been worked to eliminate any of them, and integer context has not been explicitly specified. However, when it prints out the value, it uses the string version, if it's available. (And, as $value was set with a mechanism that would set both, and not adjusted further, and $max got a string version, and was then evaluated in numeric context to give it float, both float and string versions are available.)

I have noticed a few processors over the years that have a problem like this - either less than or greater than (so far, not ever both - at least, not for the same numbers) would make inconsistent floating point evaluations with certain numbers. I have not yet found less-than-or-equal-to or greater-than-or-equal-to to have this issue.

That having been said, while I do think it's a processor issue, perl seems to be more susceptible to it than most languages. Triggering it in C, for example, is usually quite tricky. However, since I've seen it appear and disappear with processor upgrades, I don't see how the instance I investigated could be something that wasn't processor related.

(Btw, yes, the multiple simultaneous values thing means with a little magic you can make a 0 value which is true. However, it's usually simpler to just use '0e0', which is also true. I've also seen code that used the string '0 but true'. Additionally, $! contains a numeric errno in numeric context, and the corresponding system error string in string context.)
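
A short sketch of the cases in that parenthetical, for the curious. It uses dualvar() from the core Scalar::Util module to build a scalar with separate numeric and string values; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(dualvar);

# Strings that are numerically zero yet true in boolean context.
my @zero_but_true = ("0e0", "0 but true");
for my $candidate (@zero_but_true) {
    my $truth = $candidate ? "true" : "false";
    printf "%-12s is %s and equals %d\n", "'$candidate'", $truth, $candidate + 0;
}

# dualvar() builds a scalar with independent numeric and string values.
my $status = dualvar(0, "ok");
print "numeric: ", $status + 0, "; string: $status\n";

# $! behaves the same way: errno as a number, message as a string.
$! = 2;
print "errno: ", $! + 0, "; message: $!\n";
```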

nonpartisan

@tgape said:

@nonpartisan said:
Perl is not a strictly-typed language. $max and $min can be integer, or string, or floating point, or whatever you happen to need at the time. And the old monitoring had the same threshold but didn't have this problem. Only difference was that it was on a different server.

Thus speaks one who does not really know perl. I don't *really* know it, but I believe I know it a bit better.
$max and $min are string, boolean, and floating point, since no magic has been worked to eliminate any of them, and integer context has not been explicitly specified. However, when it prints out the value, it uses the string version, if it's available. (And, as $value was set with a mechanism that would set both, and not adjusted further, and $max got a string version, and was then evaluated in numeric context to give it float, both float and string versions are available.)

Ummm, no.

See perlguts (http://perldoc.perl.org/perlguts.html#Variables).

Quoting one paragraph:

@perlguts said:

An SV can be created and loaded with one command. There are five types of values that can be loaded: an integer value (IV), an unsigned integer value (UV), a double (NV), a string (PV), and another scalar (SV).

Boolean is not one of those. And it is not stored as all of those representations simultaneously. But it can convert between them as needed.

There are several macros that I just found out about that can tell me what kind of value is stored in $value, $min, and $max. I'll add those to my debug statement and see what comes out.

nonpartisan

@nonpartisan said:

There are several macros that I just found out about that can tell me what kind of value is stored in $value, $min, and $max. I'll add those to my debug statement and see what comes out.

Crap. While the information about the types it stores is still correct, it turns out those macros are for C and not within the Perl language itself. Still looking . . .

boomzilla

@nonpartisan said:

@nonpartisan said:
There are several macros that I just found out about that can tell me what kind of value is stored in $value, $min, and $max. I'll add those to my debug statement and see what comes out.

Crap. While the information about the types it stores is still correct, it turns out those macros are for C and not within the Perl language itself. Still looking . . .

I've only dabbled in perl, but can't you pack the value as a floating point? Then you should be able to see if / when you're getting something a little different than exactly 90.
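
A sketch of that idea, assuming IEEE 754 doubles and Perl 5.10 or later for the ">" byte-order modifier. double_hex is a made-up helper name, and the second value is a hypothetical one step below 90.

```perl
use strict;
use warnings;

# Hypothetical helper: the 8 bytes of a double, most significant first,
# as hex. Requires the ">" byte-order modifier (Perl 5.10 or later).
sub double_hex {
    my ($number) = @_;
    return unpack("H*", pack("d>", $number));
}

print double_hex(90), "\n";            # 4056800000000000
print double_hex(90 - 2**-46), "\n";   # 40567fffffffffff
```

Any value that is not bit-for-bit 4056800000000000 is not exactly 90, whatever print says.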

Lord_abletran

Is there some reason int() isn't good enough?

boomzilla

@Lord abletran said:

Is there some reason int() isn't good enough?

Well, that would presumably fix the problem, but you wouldn't be able to find it. Maybe use printf along with %g to detect if rounding is going on.
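
One caveat, sketched below with a hypothetical value just under 90: plain %g only keeps 6 significant digits, so it would hide the same thing print hides. Asking for 17 significant digits should be enough on a build with ordinary doubles.

```perl
use strict;
use warnings;

my $suspect = 90 - 2**-46;   # hypothetical value just under 90

# Plain %g keeps 6 significant digits, fewer than print does.
my $short = sprintf("%g", $suspect);

# 17 significant digits are enough to tell any two doubles apart.
my $full = sprintf("%.17g", $suspect);

print "%g     -> $short\n";
print "%.17g  -> $full\n";
```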

nonpartisan

@Lord abletran said:

Is there some reason int() isn't good enough?

Because I am supposed to be able to use floating point values as thresholds and I hate to Band-Aid(tm) a problem without knowing the source of the issue. One of the capabilities it has is to be able to say something like "if this value is 30% higher than the previous value, then send an alert." The 30% is specified as a floating point if I recall correctly. No, I don't currently use it, but I'd like to be able to retain that capability should I ever find a need for it without putting an artificial limitation on it (masking all values as ints).

I found a wonderful module, Devel::Peek, that outputs the SV information as referenced by perlguts. I'm currently having it log that data to a text file and I will be perusing it shortly.
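
For reference, a minimal sketch of that kind of inspection. Dump() is Devel::Peek's own function; has_float_slot is a made-up helper that reads the same flags through the core B module, so the result can be tested in code as well as read by eye. The pack/unpack round trip is only a stand-in for however the value actually reaches $value.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use B ();

# Hypothetical helper: true if the scalar currently holds a valid
# floating-point (NV) value, judged from its internal flags.
sub has_float_slot {
    my ($scalar_ref) = @_;
    my $flags = B::svref_2object($scalar_ref)->FLAGS;
    return ($flags & B::SVf_NOK) ? 1 : 0;
}

my $from_text   = "90";                          # as read from a text file
my ($from_pack) = unpack("d", pack("d", 90));    # as read via unpack "d"

print "text:   ", has_float_slot(\$from_text), "\n";
print "unpack: ", has_float_slot(\$from_pack), "\n";

# Dump() writes the full SV structure, flags included, to STDERR.
Dump($from_pack);
```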