Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System that performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics, as well as an overall metric, for each device. This article explains what those metrics are and what they mean.

Summary

Consider this in the context that a network device offers a service, and the service it offers is connectivity. While a router or switch is up and all its interfaces are available, it is truly up, and when it has no CPU load it is healthy. As the interfaces become utilised and the CPU gets busy, it has less capacity remaining. The following statistics are considered part of the health of the device:

...

Code Block
'metrics' => {
  'weight_availability' => '0.1',
  'weight_cpu' => '0.2',
  'weight_int' => '0.3',
  'weight_mem' => '0.1',
  'weight_response' => '0.2',
  'weight_reachability' => '0.1',
  'metric_health' => '0.4',
  'metric_availability' => '0.2',
  'metric_reachability' => '0.4',
  'average_decimals' => '2',
  'average_diff' => '0.1',
},

...

If more weight should be given to interface utilisation and less to interface availability, these metrics can be tuned. For example, weight_availability could be reduced from 0.1 to 0.05 and weight_int raised by the same 0.05. The resulting weights (weight_*) should still add up to 1, that is, 100%.
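The tuning rule above can be sketched in a few lines of Python. This is only an illustration, not NMIS code: the starting weights are assumed to be 0.1 (availability), 0.2 (cpu), 0.3 (int), 0.1 (mem), 0.2 (response) and 0.1 (reachability), and the helper names shift_weight and weights_are_balanced are hypothetical.

```python
import math

# Assumed starting weights (an assumption for this sketch, not read from NMIS).
weights = {
    "weight_availability": 0.1,
    "weight_cpu": 0.2,
    "weight_int": 0.3,
    "weight_mem": 0.1,
    "weight_response": 0.2,
    "weight_reachability": 0.1,
}


def shift_weight(current, source, target, amount):
    """Return a copy of the weights with `amount` moved from source to target."""
    adjusted = dict(current)
    adjusted[source] -= amount
    adjusted[target] += amount
    return adjusted


def weights_are_balanced(current, tolerance=1e-9):
    """Check that the weight_* values still add up to 1 (100%)."""
    return math.isclose(sum(current.values()), 1.0, abs_tol=tolerance)


# Give interface utilisation more weight at the expense of availability.
adjusted = shift_weight(weights, "weight_availability", "weight_int", 0.05)
print(adjusted["weight_availability"], adjusted["weight_int"])
print(weights_are_balanced(adjusted))
```

Moving weight between two entries, rather than editing one value alone, keeps the total unchanged, which is the property the configuration relies on.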

Other Metrics Configuration Options

NMIS 8.5.2G introduced some additional configuration options that control how this works and make the metrics more or less responsive. The first two options are metric_comparison_first_period and metric_comparison_second_period, which default to -8 hours and -16 hours.

These are the two main variables which control the comparisons you see in NMIS, the real-time health baselining. NMIS compares calculations made over the period from now back to metric_comparison_first_period (8 hours ago) with calculations made over the period from metric_comparison_first_period (8 hours ago) back to metric_comparison_second_period (16 hours ago).

This means NMIS is comparing, in real time, data from the last 8 hours with the 8 hour period before that. You can make these periods shorter or longer. In the lab I am running -4 hours and -8 hours, which makes the metrics a little more responsive to load and change.
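As a rough illustration of the two comparison windows, the following Python sketch derives them from a given "now". It is not NMIS code: the function name comparison_windows is hypothetical, and the periods are passed as plain negative hour counts rather than the "-8 hours" strings used in the configuration.

```python
from datetime import datetime, timedelta


def comparison_windows(now, first_period_hours=-8, second_period_hours=-16):
    """Return (current, previous) windows as (start, end) tuples.

    current  runs from first_period ago up to now
    previous runs from second_period ago up to first_period ago
    """
    first_start = now + timedelta(hours=first_period_hours)
    second_start = now + timedelta(hours=second_period_hours)
    return (first_start, now), (second_start, first_start)


# Example timestamp chosen only for illustration.
now = datetime(2024, 1, 1, 16, 0)
current, previous = comparison_windows(now)
print("current: ", current[0], "to", current[1])
print("previous:", previous[0], "to", previous[1])

# The more responsive lab settings mentioned above.
lab_current, lab_previous = comparison_windows(now, -4, -8)
```

The two windows are adjacent and of equal length with the default values, so the metric for the most recent period is always compared with the period immediately before it.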

The other new configuration option is metric_int_utilisation_above, which is -1 by default. With this default, interfaces with 0 (zero) utilisation are counted in the overall interface utilisation metric. So if you have a switch with 48 interfaces, all active but with basically no utilisation, and two uplinks with 5 to 10% load, the average utilisation across the interfaces is very low. NMIS now picks the higher of input and output utilisation for each interface and only includes interfaces with utilisation above the configured amount. Setting this to 0.5 should produce more dynamic health metrics.
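The effect of the threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the NMIS implementation: the function name is hypothetical, the comparison is assumed to be strictly "greater than", the result is assumed to be a plain average, and returning 0.0 when no interface qualifies is a choice made for this sketch.

```python
def average_interface_utilisation(interfaces, utilisation_above=-1):
    """Average the per-interface peak utilisation, skipping quiet interfaces.

    interfaces: list of (input_util, output_util) percentages.
    Only interfaces whose higher direction exceeds `utilisation_above`
    are included in the average.
    """
    peaks = [max(input_util, output_util) for input_util, output_util in interfaces]
    counted = [peak for peak in peaks if peak > utilisation_above]
    if not counted:
        return 0.0  # assumption: nothing qualifies, report zero utilisation
    return sum(counted) / len(counted)


# Hypothetical switch: 46 idle access ports and two loaded uplinks.
switch = [(0.0, 0.0)] * 46 + [(5.0, 10.0), (8.0, 6.0)]

with_default = average_interface_utilisation(switch)          # idle ports counted
with_threshold = average_interface_utilisation(switch, 0.5)   # idle ports skipped
print(with_default, with_threshold)
```

With the default of -1 the idle ports drag the average down to well under 1%, while a threshold of 0.5 leaves only the two uplinks in the calculation.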

Metric Calculations Examples

...

weight_cpu * 90 + weight_availability * 90 + weight_int * 90 + weight_mem * 60 + weight_response * 100 + weight_reachability * 100

which becomes "0.2 * 90 + 0.1 * 90 + 0.3 * 90 + 0.1 * 60 + 0.2 * 100 + 0.1 * 100", resulting in 90% for the health metric.
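The same arithmetic can be reproduced with a short Python sketch. It assumes the weights 0.2 (cpu), 0.1 (availability), 0.3 (int), 0.1 (mem), 0.2 (response) and 0.1 (reachability) together with the example values from the formula; the function name health_metric is hypothetical and the rounding to two decimals simply mirrors average_decimals.

```python
# Assumed weights for this sketch; they add up to 1.
weights = {
    "cpu": 0.2,
    "availability": 0.1,
    "int": 0.3,
    "mem": 0.1,
    "response": 0.2,
    "reachability": 0.1,
}

# Example component values (percent) taken from the formula above.
values = {
    "cpu": 90,
    "availability": 90,
    "int": 90,
    "mem": 60,
    "response": 100,
    "reachability": 100,
}


def health_metric(weights, values, decimals=2):
    """Weighted sum of the component values, rounded for display."""
    total = sum(weights[name] * values[name] for name in weights)
    return round(total, decimals)


health = health_metric(weights, values)
print(health)
```

Each term contributes 18, 9, 27, 6, 20 and 10 respectively, which is how the individual statistics combine into a single health figure.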

The calculations can be seen in the collect debug output: nmis.pl type=collect node=<NODENAME> debug=true

...

So based on our example before, the node would have the following values:

  • Health = 90%
  • Availability = 90%
  • Reachability = 100%

The formula would become "metric_health * 90 + metric_availability * 90 + metric_reachability * 100", resulting in "0.4 * 90 + 0.2 * 90 + 0.4 * 100 = 94". So the metric for this node is 94, which is averaged with all the other nodes in the group, or in the whole network, to produce the metric for each group and for the entire network.
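A minimal Python sketch of this second step, assuming metric_health = 0.4, metric_availability = 0.2 and metric_reachability = 0.4, and node values of 90, 90 and 100. The function names and the other node metrics in the group are hypothetical, and a simple arithmetic mean is assumed for the group figure.

```python
def node_metric(health, availability, reachability,
                metric_health=0.4, metric_availability=0.2,
                metric_reachability=0.4, decimals=2):
    """Combine the three node values into one overall metric."""
    total = (metric_health * health
             + metric_availability * availability
             + metric_reachability * reachability)
    return round(total, decimals)


def group_metric(node_metrics, decimals=2):
    """Assumed simple mean of the node metrics in a group."""
    return round(sum(node_metrics) / len(node_metrics), decimals)


this_node = node_metric(health=90, availability=90, reachability=100)
print(this_node)

# Two further node metrics invented purely for illustration.
group = group_metric([this_node, 80.0, 99.0])
print(group)
```

The node contributes 36 from health, 18 from availability and 40 from reachability, so a node that is fully reachable still loses points when its health or availability drops.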

...

Interface Availability Reporting

How NMIS reports interface availability can be a little confusing, as some people expect it to be 0 when the node is unreachable and others expect it to be undefined. NMIS introduced an option to give this control to the user of the system. The configuration option is interface_availability_value_when_down, and it is U (undefined) by default.

U is used by default because when the node is down it is not possible to observe the metrics from the node. The scientific method states that you should record "unobservable", or nothing, when you do not have a valid observation for that time period.

When a node is DOWN (unreachable) and interface_availability_value_when_down = U, NMIS saves U to the overall interface availability. This means the node could be down for 2 hours and the interface availability metric for the node would still be 100%. In the same scenario with interface_availability_value_when_down = 0, the interface availability metric over a 48 hour period would be ( 1 - 2/48 ) * 100 = 95.83% available.

During normal operation, when a node is UP, interfaces are polled and the operational state (ifOperStatus) determines the availability of an interface: when ifOperStatus is up the result is 100, when down it is 0.

When a node is DOWN, no interface-specific data is processed, so nothing is saved for the interface for that period of time. By default this is treated as U, so when a graph or calculation is made the result will be 100% available.
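The difference between the two settings can be sketched in Python. This is a simplified model, not NMIS code: it assumes one sample per hour over 48 hours, represents U as None, assumes undefined samples are simply left out of the average, and assumes the interface is operationally up whenever the node is up. The function names are hypothetical.

```python
def record_samples(node_up_by_hour, value_when_down=None):
    """Build hourly availability samples; None stands in for U (undefined)."""
    return [100.0 if node_up else value_when_down for node_up in node_up_by_hour]


def interface_availability(samples, decimals=2):
    """Average the defined samples; undefined (None) samples are ignored."""
    known = [sample for sample in samples if sample is not None]
    if not known:
        return None  # nothing was observed at all
    return round(sum(known) / len(known), decimals)


# 48 hourly polls, with the node unreachable for the last 2 hours.
node_up_by_hour = [True] * 46 + [False] * 2

with_u = interface_availability(record_samples(node_up_by_hour, None))
with_zero = interface_availability(record_samples(node_up_by_hour, 0.0))
print(with_u, with_zero)
```

With U, the two unobserved hours do not count for or against the interface, so the figure stays at 100%. With 0, they are counted as downtime and the figure drops to match the ( 1 - 2/48 ) * 100 calculation.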