Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System which performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics as well as an overall metric for each device. This article explains what those metrics are and what they mean.

Summary

Consider this in the context that a network device offers a service, and the service it offers is connectivity. While a router or switch is up and all its interfaces are available, it is truly up, and when it has no CPU load it is healthy. As the interfaces get utilised and the CPU gets busy, it has less capacity remaining. The following statistics are considered part of the health of the device:

  • Reachability - is it up or not;
  • Availability - interface availability of all interfaces which are supposed to be up;
  • Response Time;
  • CPU;
  • Memory;

All of these metrics are weighted and a health metric is created. This metric, when compared over time, should always indicate the relative health of the device. Interfaces which aren't being used should be shut down so that the health metric remains realistic. The exact calculations can be seen in the runReachability subroutine.

Metric Details

Many people wanted network availability, and many tools generated availability based on ping statistics and claimed success. This, however, was a poor solution. For example, the switch connecting the management server would go down and the management server would report that the whole network was down, which of course it wasn't. Or worse, a device would be responding to a ping but many of its interfaces were down, so while it was reachable, it wasn't really available.

So it was determined that NMIS would use Reachability, Availability and Health to represent the network. Reachability is the pingability of the device. Availability (in the context of network gear) is whether the interfaces which should be up are actually up: interfaces which are "no shutdown" (ifAdminStatus = up) should be up. For example, a device with 10 interfaces of ifAdminStatus = up, of which 9 have ifOperStatus = up, would be 90% available.

Health is a composite metric, made up of many things depending on the device; for a router, for example, CPU and memory. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, while an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.
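The inverse relationship between interface utilisation and health can be sketched as follows. This is only an illustration: it assumes utilisation is expressed as a percentage and that "inverse" means 100 minus that percentage. The function name is hypothetical, and the actual calculation is in the runReachability subroutine.

```python
def interface_health_component(utilisation_percent):
    """Return a 0-100 score that falls as interface utilisation rises.

    Assumption: "inverse of utilisation" is modelled as 100 - utilisation.
    """
    # Clamp the input so out-of-range samples cannot push the score outside 0-100.
    clamped = max(0.0, min(100.0, float(utilisation_percent)))
    return 100.0 - clamped


idle = interface_health_component(0)    # unused interface, highest score
busy = interface_health_component(85)   # heavily utilised interface, low score
print(idle, busy)
```

Under this assumption an idle interface contributes the maximum score and a busy one drags the health metric down, which is why health tracks load so closely.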

The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is configurable, so you can weight Reachability higher or lower than it currently is; the choice is yours.

Availability, ifAdminStatus and ifOperStatus

Availability is the interface availability, which is reflected in the SNMP metric ifOperStatus. If an interface has ifAdminStatus = up and ifOperStatus = up, that is 100% for that interface. If a device has 10 interfaces and all have ifAdminStatus = up and ifOperStatus = up, that is 100% for the device.

If a device has 9 interfaces with ifAdminStatus = up and ifOperStatus = up, and 1 interface with ifAdminStatus = up and ifOperStatus = down, that is 90% availability. It is the availability of the network services which the router/switch offers.
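The availability rule described above can be sketched in a few lines. This is an illustrative sketch, not the NMIS implementation: the function name and the dictionary layout are hypothetical, and returning None for a device with no administratively up interfaces is an assumption.

```python
def interface_availability(interfaces):
    """Percentage of administratively up interfaces that are operationally up."""
    # Only interfaces which are supposed to be up (ifAdminStatus = up) count.
    expected_up = [i for i in interfaces if i["ifAdminStatus"] == "up"]
    if not expected_up:
        return None  # assumption: nothing to measure on this device
    oper_up = sum(1 for i in expected_up if i["ifOperStatus"] == "up")
    return 100.0 * oper_up / len(expected_up)


# 9 interfaces up/up, 1 interface up/down, 1 interface shut down by the administrator.
interfaces = (
    [{"ifAdminStatus": "up", "ifOperStatus": "up"}] * 9
    + [{"ifAdminStatus": "up", "ifOperStatus": "down"}]
    + [{"ifAdminStatus": "down", "ifOperStatus": "down"}]
)
print(interface_availability(interfaces))  # 90.0
```

The interface which has been shut down is excluded from the calculation, which is why unused interfaces should be shut down rather than left administratively up.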

Configuring Metrics Weights

In the NMIS configuration file, Config.nmis, there are several configuration items for the metrics. These are as follows:

'metrics' => {
  'weight_cpu' => '0.1',
  'weight_availability' => '0.1',
  'weight_int' => '0.2',
  'weight_mem' => '0.1',
  'weight_response' => '0.2',
  'weight_reachability' => '0.3',
  'metric_health' => '0.4',
  'metric_availability' => '0.2',
  'metric_reachability' => '0.4',
  'average_decimals' => '2',
  'average_diff' => '0.1',
},

The health metric uses items starting with "weight_" to weight the values into the health metric.  The overall metric combines health, availability and reachability into a single metric for each device and for each group and ultimately the entire network.  
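As a rough illustration of how these weights could combine, the sketch below applies a plain weighted sum using the values from the configuration above. It assumes every input is already normalised to a 0-100 score where higher is better; how NMIS scales raw CPU, memory and response time values is not covered here, and the function names are hypothetical. The exact calculations are in the runReachability subroutine.

```python
# Weights copied from the 'metrics' section of Config.nmis shown above.
METRICS = {
    "weight_cpu": "0.1",
    "weight_availability": "0.1",
    "weight_int": "0.2",
    "weight_mem": "0.1",
    "weight_response": "0.2",
    "weight_reachability": "0.3",
    "metric_health": "0.4",
    "metric_availability": "0.2",
    "metric_reachability": "0.4",
}


def weighted_sum(scores, prefix):
    """Sum score * weight for every config item whose name starts with prefix."""
    total = 0.0
    for key, weight in METRICS.items():
        if key.startswith(prefix):
            total += scores[key[len(prefix):]] * float(weight)
    return total


def health_metric(scores):
    """Assumed form: weighted sum over the weight_* items."""
    return weighted_sum(scores, "weight_")


def overall_metric(health, availability, reachability):
    """Assumed form: weighted sum over the metric_* items."""
    scores = {"health": health, "availability": availability,
              "reachability": reachability}
    return weighted_sum(scores, "metric_")


# Hypothetical, already normalised scores for one device.
scores = {"cpu": 80, "availability": 90, "int": 60,
          "mem": 70, "response": 100, "reachability": 100}
health = health_metric(scores)
overall = overall_metric(health, scores["availability"], scores["reachability"])
print(round(health, 2), round(overall, 2))  # approximately 86.0 and 92.4
```

Both sets of weights in the default configuration add up to 1.0, so under this assumption the results stay on the same 0-100 scale as the inputs.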