NMIS 9

Many factors determine the system health of a server. Hardware capabilities (CPU, memory, and disk) are important, but so is the server load: the number of devices (nodes to be polled, updated, audited, synchronised), the number of products (NMIS, OAE, opCharts, opHA, each running different processes), and the number of concurrent users.

...

Stressed system (poller-nine)

System information:

| Name | Value |
|---|---|
| nmisd_max_workers | 10 |
| omkd_workers | 4 |
| omkd_max_requests | 500 |
| Nodes | 406 |
| Active Nodes | 507 |
| OS | Ubuntu 18.04.3 LTS |
| role | poller |

This is how the server memory graphs look on a stressed system. We focus on memory because that is where the bottleneck is:

...

Healthy system (master-nine)

System information:

| Name | Value |
|---|---|
| nmisd_max_workers | 5 |
| omkd_workers | 10 |
| omkd_max_requests | undef |
| Nodes | 2 |
| Poller Nodes | 536 |
| OS | Ubuntu 18.04.3 LTS |
| role | master |

This is how the server memory graphs look on a normal system:

...

Daemons graphs:

omk:

mongo:

NMIS 8

The main NMIS 8 process is called from different cron jobs to run different operations: collect, update, summary, master, ... 

For a collect or an update, the main thread creates forks to perform the operation requested. 
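The fork-and-wait pattern described above can be sketched in a few lines of shell. This is only an illustration of the general idea, not NMIS code: `collect_node` and `collect_all` are hypothetical names, and the per-node work is replaced by a simple `echo`.

```shell
#!/bin/sh
# Hypothetical stand-in for the per-node work done by a forked child.
collect_node() {
    echo "collected $1"
}

# Fork one child per node, then block until every child has exited.
collect_all() {
    for node in "$@"; do
        collect_node "$node" &
    done
    wait
}

collect_all node-a node-b node-c
```

Because the children run concurrently, the order of the output lines may differ between runs.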

Configurations that affect performance

Some important configuration options affect performance:

  • abort_after: From NMIS 8.6.8G there is a new command line option, abort_after, that stops the main thread from running for too long, so that it does not collide with the next cron job. By default, this parameter is 60 seconds, as the cron job is set to run every minute.

    This option must always be used together with the option mthread=true.

    nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;


  • max_thread: The other important configuration option is max_thread, which keeps the number of children of the main process from growing too large. Considerations:
    • If the collect has a lot of nodes to process, the number of children won't reach the limit instantly. While the main thread is forking, children complete their jobs and exit. The main process also waits for them to change state, so the number increases slowly.
    • NMIS can have more than one instance of the main process running, and the total number of children can be higher than max_thread, as the limit applies only per instance.
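Since the limit applies per instance, it can help to check how many nmis.pl processes are actually running in total. The following is a minimal sketch, assuming the processes show up with `nmis.pl` in their command line; `check_nmis_procs` is a hypothetical helper, and the count includes the main process of each instance as well as its children.

```shell
#!/bin/sh
# Read a process listing on stdin and compare the number of nmis.pl
# processes against a limit passed as the first argument.
check_nmis_procs() {
    limit=$1
    # grep -c exits non-zero when nothing matches; "|| true" keeps going.
    count=$(grep -c 'nmis\.pl' || true)
    if [ "$count" -gt "$limit" ]; then
        echo "WARN: $count nmis.pl processes, above the limit of $limit"
    else
        echo "OK: $count nmis.pl processes, limit $limit"
    fi
}

# Example (the limit of 10 is a placeholder, use your own setting):
#   ps -eo args | check_nmis_procs 10
```

A count that stays well above the configured limit suggests that several instances of the main process are overlapping.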

Gaps in Graphs

If the server takes a long time to collect and cannot complete any operation, a useful tool is nmis8/admin/polling_summary. It shows how many nodes have a late collect, and a summary of the nodes being collected and not collected.

A symptom of an overloaded server can be the gaps in the graphs. 

This example shows how these parameters can affect the performance of a server, in this case one with 64 CPUs and more than 3700 nodes:

| When | abort_after | demote_faulty_nodes | CPU | Nodes not collected | Other |
|---|---|---|---|---|---|
| Initial configuration | Default (60 s) | false | <50% (approx.) | ~1100 | totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265 |
| Test 1 | 120 | true | <50% (approx.) | ~500 | N/A |
| Test 2 | 240 | true | <60% (approx.) | ~240 | totalPoll=1229 ontime=998 no_snmp=14 demoted=0 1x_late=217 3x_late=0 12x_late=0 144x_late=0 |
| Test 3 | 0 (Disabled) | true | Around 100% (approx.) | 0 | Took 7 minutes. Processed >3000 nodes. Disabled cron |
| Test 4 | 0 (Disabled) | true | 100% (approx.) | N/A | Commented while (wait for children) in nmis.pl |
| Test 5 | 0 (Disabled) | false | 100% (approx.) | N/A | N/A |
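To compare runs like the ones in the table, the summary lines can be reduced to a single on-time percentage. The sketch below assumes the input is a line of space-separated key=value pairs in the shape shown in the table; `ontime_pct` is a hypothetical helper, not part of NMIS.

```shell
#!/bin/sh
# Print the percentage of polls that completed on time for each summary
# line read from stdin (fields are key=value pairs separated by spaces).
ontime_pct() {
    awk '{
        total = 0; ontime = 0
        for (i = 1; i <= NF; i++) {
            n = split($i, kv, "=")
            if (n != 2) continue
            if (kv[1] == "totalPoll") total = kv[2] + 0
            if (kv[1] == "ontime")    ontime = kv[2] + 0
        }
        if (total > 0) printf "%.1f\n", 100 * ontime / total
    }'
}

echo 'totalPoll=3713 ontime=891 1x_late=1460 3x_late=41 12x_late=56 144x_late=1265' | ontime_pct
echo 'totalPoll=1229 ontime=998 no_snmp=14 demoted=0 1x_late=217 3x_late=0 12x_late=0 144x_late=0' | ontime_pct
```

For the two summary lines in the table this gives roughly 24% on time for the initial configuration and roughly 81% for Test 2.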

Note that modelling problems that throw errors in the logs can also make the system slow.

SUPPORT-6976