Last November, we moved over to a new set of DNS and DHCP servers. I set up monitoring in an open-source monitoring system called Cricket. It's written in Perl and uses an RRD backend. It's simple, it's not fancy, but it gets the job done.
We already had some scripts in place that would monitor the number of DHCP objects available in each subnet, both manual DHCP and dynamic DHCP. When the number of used dynamic DHCP objects broke a 90% threshold, we would get an alert. I rewrote those scripts so that the monitoring would be a bit more dynamic -- the original author just had it use a static set of subnets and didn't update them regularly.
The new scripts feed Cricket the number of dynamic DHCP objects in use, the total number of dynamic DHCP objects in the subnet, and the percentage in use, rounded to the nearest percent. I configured Cricket to keep the same 90% threshold as before.
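The percentage calculation can be sketched roughly as follows. This is not the actual script; the helper name percent_used and the sample counts are made up for illustration, and the guard against an empty subnet is an assumption about how such a case should be handled.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of dynamic DHCP objects in use,
# rounded to the nearest whole percent.
sub percent_used {
    my ($used, $total) = @_;
    return 0 unless $total;    # assumption: report 0 for a subnet with no dynamic range
    return int($used / $total * 100 + 0.5);
}

# Made-up sample counts, not real data from the subnets in this post.
print percent_used(226, 251), "\n";
```

The script hands Cricket a whole number, but what the monitor later compares against the threshold is whatever comes back out of the RRD file.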
Soon after, we started getting notifications that, for all intents and purposes, amounted to the DHCP monitoring flapping. One particular subnet showed up frequently, but the problem wasn't confined to it -- other subnets flapped too. After tearing my hair out and not seeing anything wrong, I started debugging the Cricket code.
The relevant Cricket code is as follows:
sub monValue {
    my($self,$target,$ds,$type,$args) = @_;
    my($min,$max,$minOK,$maxOK);
    my(@Thresholds) = split(/\s*:\s*/, $args);
    my($value) = $self->rrdFetch(
        $target->{'rrd-datafile'},
        $self->getDSNum($target, $ds), 0
    );
    if (!defined($value)) {
        Warn("Monitor: Couldn't fetch last $ds value from " .
            $target->{'rrd-datafile'}.".");
        return 1;
    }
    return ($nanErr, 'NaN') if isNaN($value);
    $min = shift(@Thresholds);
    $min = 'n' if (! defined($min));
    if (lc($min) eq 'n') {
        $minOK = 1;
    } else {
        $minOK = ($value > $min) ? 1 : 0;
    }
    $max = shift(@Thresholds);
    $max = 'n' if (! defined($max));
    if (lc($max) eq 'n') {
        $maxOK = 1;
    } else {
        $maxOK = ($value < $max) ? 1 : 0;
    }
    Debug ("Value is $value; min is $min; max is $max; return value " . ($maxOK && $minOK));
    return ($maxOK && $minOK,$value);
}
I added the Debug statement at the end. When the process ran, it logged the following:
[09-Dec-2010 11:10:48 ] Processing /DHCP-info/10.194.247.0...
[09-Dec-2010 11:10:48 ] in rrdFetch: file is /app00/cricket/cricket-config/../cricket-data//DHCP-info/10.194.247.0.rrd
[09-Dec-2010 11:10:48 ] in rrdFetch: skipping RRA
[09-Dec-2010 11:10:48 ] in rrdFetch: rraNum is 0 rowNum is 0 dsNum is 4
[09-Dec-2010 11:10:48 ] in rrdFetch: return is 90
[09-Dec-2010 11:10:48 ] Value is 90; min is n; max is 90; return value 1
[09-Dec-2010 11:10:48 ] /DHCP-info/10.194.247.0 - PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu passed.
[09-Dec-2010 11:10:48 ] Triggering recovery for PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu.
[09-Dec-2010 11:10:48 ] |/usr/bin/mailx -s 'Cricket CLEAR: /DHCP-info/10.194.247.0' example@domain.edu
[09-Dec-2010 11:10:48 ] Monitor: Email sent to: example@domain.edu
type: value -- threshold: PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu -- target: /DHCP-info/10.194.247.0 -- ds: PercentDynamicDHCP -- val: 90
Five minutes later, the same target logged this:
[09-Dec-2010 11:15:48 ] Processing /DHCP-info/10.194.247.0...
[09-Dec-2010 11:15:48 ] in rrdFetch: file is /app00/cricket/cricket-config/../cricket-data//DHCP-info/10.194.247.0.rrd
[09-Dec-2010 11:15:48 ] in rrdFetch: skipping RRA
[09-Dec-2010 11:15:48 ] in rrdFetch: rraNum is 0 rowNum is 0 dsNum is 4
[09-Dec-2010 11:15:48 ] in rrdFetch: return is 90
[09-Dec-2010 11:15:48 ] Value is 90; min is n; max is 90; return value 0
[09-Dec-2010 11:15:48 ] /DHCP-info/10.194.247.0 - 1291922148 - PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu failed.
[09-Dec-2010 11:15:48 ] Triggering alarm for PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu.
[09-Dec-2010 11:15:48 ] |/usr/bin/mailx -s 'Cricket ADD: /DHCP-info/10.194.247.0' example@domain.edu
[09-Dec-2010 11:15:48 ] Monitor: Email sent to: example@domain.edu
type: value -- threshold: PercentDynamicDHCP:value:n:90:MAIL:/usr/bin/mailx:example@domain.edu -- target: /DHCP-info/10.194.247.0 -- ds: PercentDynamicDHCP -- val: 90
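Notice the two "Value is" lines: the value (90), min (n) and max (90) are exactly the same in both runs, but $maxOK && $minOK returns 1 in the first and 0 in the second. Since min is n, $minOK is always 1, so the difference has to come from the $value < $max comparison.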
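One plausible explanation, which the debug output alone can't confirm, is floating point. RRD files store values as doubles, and Perl (on a standard build) prints numbers with about 15 significant digits, so a stored value a hair under 90 would still show up as "90" in the log while comparing as less than 90. The sketch below illustrates the effect; the value 90 - 1e-14 is invented for the demonstration and is not taken from the actual RRD file.

```perl
use strict;
use warnings;

my $exact = 90;
my $below = 90 - 1e-14;    # hypothetical stored value, just under 90

# Default stringification hides the difference between the two values.
print "exact prints as $exact, below prints as $below\n";

# The numeric comparison used by monValue still sees it.
print "exact < 90: ", ($exact < 90 ? 1 : 0), "\n";
print "below < 90: ", ($below < 90 ? 1 : 0), "\n";

# Asking for more digits reveals what is really stored.
printf "below at 17 digits: %.17g\n", $below;
```

If this is what was happening, changing the Debug call to print the value with printf-style "%.17g" formatting would have exposed the difference between the two runs.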
The solution? I changed the monitoring threshold to 89.9. Problem solved.
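Here is a rough sketch of why moving the threshold off the whole number would stop the flapping, assuming the fetched values were landing a tiny fraction above or below 90 (an assumption, since the log only ever shows "90"). The helper max_ok is hypothetical and simply mirrors the strict $value < $max test in monValue; the sample values are invented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the upper-bound test in monValue:
# the check passes only when the value is strictly below max.
sub max_ok {
    my ($value, $max) = @_;
    return ($value < $max) ? 1 : 0;
}

# Invented values that all print as "90" but are not all equal to 90.
for my $value (90 - 1e-14, 90, 90 + 1e-14) {
    printf "value prints as %s: max=90 -> %d, max=89.9 -> %d\n",
        $value, max_ok($value, 90), max_ok($value, 89.9);
}
```

With max set to 90, the result depends on which side of 90 the stored value happens to fall. With max set to 89.9, anything that rounds to 90 fails consistently and anything that rounds to 89 passes consistently, so the alert stays in one state until usage really changes.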