
NMIS 8 Processes

The main NMIS 8 process is called from different cron jobs to run different operations: collect, update, summary, clean jobs, etc. As an example:

* * * * * root /usr/local/nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;

The cron configuration can be found in /etc/cron.d/nmis.
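Entries in /etc/cron.d use the usual five schedule fields, followed by the user the job runs as, followed by the command. The Python sketch below splits the collect entry shown above into those parts and reads the nmis.pl options as key=value pairs. parse_cron_d_entry is a hypothetical helper written for illustration. It is not part of NMIS, and it only handles the five-field form, not @-style shortcuts such as @reboot.

```python
import shlex

def parse_cron_d_entry(line):
    """Split an /etc/cron.d line into schedule, user, program and key=value options."""
    # Five time fields, then the user; the rest of the line is the command.
    fields = line.split(None, 6)
    schedule, user, command = " ".join(fields[:5]), fields[5], fields[6]
    # Drop the trailing ';' and split the command the way a shell would.
    words = shlex.split(command.strip().rstrip(";"))
    options = dict(word.split("=", 1) for word in words[1:] if "=" in word)
    return {"schedule": schedule, "user": user, "program": words[0], "args": options}

entry = parse_cron_d_entry(
    "* * * * * root /usr/local/nmis8/bin/nmis.pl "
    "type=collect abort_after=60 mthread=true ignore_running=true;"
)
print(entry["user"], entry["args"]["type"], entry["args"]["abort_after"])  # root collect 60
```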

For collect and update operations, the main process is set up by default to fork worker processes (the mthread=true option in the example above). These workers carry out the requested operation in parallel, which improves performance. By default, each of these operations runs once a minute and processes as many nodes as the collect polling cycle is configured to handle.
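The general idea of a main process that hands nodes to a pool of workers and stops waiting after a time limit can be sketched in Python as follows. This sketch assumes that abort_after limits how long the main process waits for its workers, and that 0 disables the limit (as in the tests further down this page). It uses threads for simplicity. The names poll_node and run_cycle, and the sleep-based timing model, are hypothetical. This is not how nmis.pl is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def poll_node(name, seconds):
    """Hypothetical stand-in for collecting one node: it just sleeps."""
    time.sleep(seconds)
    return name

def run_cycle(node_times, max_workers=4, abort_after=0):
    """Poll nodes in parallel and return (collected, not_collected).

    abort_after is how many seconds the main process waits for its workers;
    0 means wait for every node (the "disabled" case).
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(poll_node, n, t): n for n, t in node_times.items()}
    done, not_done = wait(futures, timeout=abort_after or None)
    for f in not_done:
        f.cancel()  # queued nodes that never started are skipped this cycle
    pool.shutdown(wait=False)  # stop waiting; nodes already being polled finish in the background
    collected = sorted(futures[f] for f in done)
    not_collected = sorted(futures[f] for f in not_done)
    return collected, not_collected
```

For example, run_cycle({"fast1": 0.01, "fast2": 0.01, "slow": 1.0}, max_workers=3, abort_after=0.5) returns (["fast1", "fast2"], ["slow"]). With abort_after=0, it waits for all three nodes. This mirrors the trade-off explored in the tests below: a short limit leaves nodes uncollected, while no limit lets a cycle run as long as the slowest nodes need.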

Configurations that affect performance

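Several configuration options have an important effect on performance, notably abort_after (passed to nmis.pl in the cron entry) and demote_faulty_nodes. Their impact is tested later in this article.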

Gaps in Graphs

If the server takes a long time to collect and cannot complete its operations, a useful tool is nmis8/admin/polling_summary.pl. It shows how many nodes have late collects, together with a summary of the nodes that are and are not being collected:

nmis8/admin> ./polling_summary.pl 

An example output:

node                     attempt   status    ping  snmp  policy     delta  snmp avgdel  poll   update  pollmessage   
u18_poller               23:55:02  pingonly  down  down  default    ---    300  0.00    0.00   0.03    no snmp collect 
uburnto                  13:14:03  pingonly  down  down  default    ---    300  0.00    0.00   0.03    no snmp collect 
unreachablenode          23:55:02  demoted   down  down  default    ---    300  0.00    0.00   0.01    snmp polling demoted
virtual_elf              23:56:03  pingonly  down  down  default    ---    300  0.00    0.00   0.02    no snmp collect 
vrouter-host             16:44:01  ontime    up    up    default    299    300  300.04  1.44   1.59                    
vyos-p1                  16:44:02  ontime    up    up    default    299    300  300.04  1.27   2.79                    
vyos-p2                  16:44:01  ontime    up    up    default    299    300  300.04  1.99   1.91                    
vyos-p3                  16:44:04  ontime    up    up    default    299    300  300.05  1.79   1.86                    
vyos-p4                  16:44:03  ontime    up    up    default    300    300  300.04  1.81   1.84                    
vyos-pe1                 16:44:04  ontime    up    up    default    300    300  300.05  1.81   1.91                    
vyos-pe2                 16:44:04  ontime    up    up    default    299    300  300.05  1.78   1.90                    
vyos-rr1                 16:47:02  ontime    up    up    default    300    300  300.00  2.23   2.22                    
vyos-rr2                 16:44:01  ontime    up    up    default    299    300  300.05  1.95   1.79                    
wifi                     16:46:02  ontime    up    up    default    300    300  300.00  0.56   0.34                    

totalNodes=59 totalPoll=52 ontime=38 pingOnly=14 1x_late=0 3x_late=0 12x_late=0 144x_late=0
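On a large installation, the per-node listing can run to thousands of lines. A short script can tally the status column and list the nodes that were not polled on time. The following Python sketch assumes the column layout shown above: one node per line, the node name in the first column with no spaces, and the status in the third column. status_counts is a hypothetical helper name. Adjust the sketch if your output differs.

```python
from collections import Counter

# A few rows copied from the example output above, plus the totals line.
SAMPLE = """\
node                     attempt   status    ping  snmp  policy     delta  snmp avgdel  poll   update  pollmessage
u18_poller               23:55:02  pingonly  down  down  default    ---    300  0.00    0.00   0.03    no snmp collect
unreachablenode          23:55:02  demoted   down  down  default    ---    300  0.00    0.00   0.01    snmp polling demoted
vyos-p1                  16:44:02  ontime    up    up    default    299    300  300.04  1.27   2.79
wifi                     16:46:02  ontime    up    up    default    300    300  300.00  0.56   0.34

totalNodes=59 totalPoll=52 ontime=38 pingOnly=14 1x_late=0 3x_late=0 12x_late=0 144x_late=0
"""

def status_counts(report):
    """Count nodes per status and list the nodes whose status is not 'ontime'."""
    counts = Counter()
    not_ontime = []
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the totals line.
        if not fields or fields[0] == "node" or fields[0].startswith("totalNodes="):
            continue
        node, status = fields[0], fields[2]
        counts[status] += 1
        if status != "ontime":
            not_ontime.append(node)
    return counts, not_ontime
```

On a live system, the tool's output could be piped into a small script that calls status_counts(sys.stdin.read()).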


A symptom of an overloaded server can be gaps in the graphs.
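Whether a late poll shows up as a gap depends on how long the graphing back end tolerates a missing update. As a rough illustration only, the Python sketch below flags a gap whenever two consecutive collects for a node are more than twice the expected interval apart. The 300 in the snmp column of the output above is taken, presumably, as the polling interval in seconds. The factor of 2 and the function name find_gaps are hypothetical; neither is an NMIS setting.

```python
def find_gaps(poll_times, interval=300, tolerance=2.0):
    """Return (previous, next) pairs of poll timestamps (in seconds) that are
    more than tolerance * interval apart.

    interval is the expected polling interval; tolerance is a hypothetical
    allowance for late polls before a gap would appear on a graph.
    """
    times = sorted(poll_times)
    limit = interval * tolerance
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > limit]
```

For example, find_gaps([0, 300, 600, 1800, 2100]) returns [(600, 1800)]: the node was skipped for three cycles, so the graph would be blank between those two collects.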

The following example shows how these parameters affected performance on a server with 64 CPUs and more than 3700 nodes:

When                   abort_after (s)  demote_faulty_nodes  CPU (approx.)  Nodes not collected  Other
Initial configuration  Default (60)     false                <50%           ~1100                totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265
Test 1                 120              true                 <50%           ~500                 N/A
Test 2                 240              true                 <60%           ~240                 totalPoll=1229 ontime=998 no_snmp=14 demoted=0 1x_late=217 3x_late=0 12x_late=0 144x_late=0
Test 3                 0 (disabled)     true                 Around 100%    0                    Took 7 minutes. Processed >3000 nodes. Cron disabled.
Test 4                 0 (disabled)     true                 100%           N/A                  Commented out the while loop that waits for children in nmis.pl.
Test 5                 0 (disabled)     false                100%           N/A                  N/A
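The counters in the Other column can be checked with a little arithmetic. In both rows that report them, the buckets (ontime, no_snmp, demoted and the *_late counts) add up exactly to totalPoll, so the on-time share is simply ontime / totalPoll. The Python sketch below performs that check. parse_counters and ontime_share are hypothetical helper names. For the figures above, it gives 24.0% on time for the initial configuration and 81.2% for Test 2. Note that the two runs polled different numbers of nodes (3713 versus 1229), so each percentage describes that run's own polled nodes.

```python
def parse_counters(line):
    """Turn a 'key=value key=value ...' summary line into a dict of ints."""
    return {key: int(value) for key, value in (field.split("=", 1) for field in line.split())}

def ontime_share(line):
    """Return the on-time percentage of polled nodes (one decimal place).

    Raises ValueError if the buckets (every counter except totalNodes and
    totalPoll) do not add up to totalPoll.
    """
    counters = parse_counters(line)
    buckets = sum(v for k, v in counters.items() if k not in ("totalNodes", "totalPoll"))
    if buckets != counters["totalPoll"]:
        raise ValueError(f"buckets sum to {buckets}, totalPoll is {counters['totalPoll']}")
    return round(100 * counters["ontime"] / counters["totalPoll"], 1)

print(ontime_share("totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265"))  # 24.0
print(ontime_share("totalPoll=1229 ontime=998 no_snmp=14 demoted=0 1x_late=217 3x_late=0 12x_late=0 144x_late=0"))  # 81.2
```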

Note that modelling problems that throw errors in the logs can also slow the system down. They increase the polling time for each node, so the polling cycle takes longer to run. Depending on the configuration options, the process can then be aborted before some nodes have been polled.
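A back-of-the-envelope model makes the effect of slower per-node polling concrete. The Python sketch below assumes that every node takes the same time and that a fixed number of workers are always busy. The function names, the 10 workers, and the 2 s versus 5 s per-node times are hypothetical; only the 60-second abort_after comes from this page. Real polling times vary, so treat the results as rough estimates.

```python
import math

def nodes_polled_before_abort(abort_after, workers, seconds_per_node):
    """Upper bound on nodes one cycle can poll before abort_after seconds,
    assuming every node takes seconds_per_node and all workers stay busy."""
    if abort_after <= 0:
        # abort_after=0 disables the limit, so there is no cap to estimate.
        raise ValueError("abort_after must be positive")
    return math.floor(abort_after / seconds_per_node) * workers

def cycle_seconds(nodes, workers, seconds_per_node):
    """Time needed to poll every node once under the same assumptions."""
    return math.ceil(nodes / workers) * seconds_per_node

# Hypothetical: 10 workers, healthy nodes at 2 s each vs. 5 s with modelling errors.
print(nodes_polled_before_abort(60, 10, 2))  # 300
print(nodes_polled_before_abort(60, 10, 5))  # 120
print(cycle_seconds(3700, 10, 5))            # 1850
```

Under these assumptions, stretching each poll from 2 s to 5 s cuts the nodes reachable within a 60-second limit from 300 to 120. This is why errors that slow individual nodes show up as nodes left uncollected.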

(Internal case reference: SUPPORT-6976)