
I'm having an issue with data collection. If I manually force it to run, it works, but then it stops collecting data on its own after a couple of minutes. I also keep receiving NMIS runtime exceeded warnings on all the nodes. Please help.

  1. Tung Nguyen

    This is a really odd situation.

    Node                     Collect Time (s)   Update Time (s)
    CRPHQ-MBC-R01            0.300              0.738
    Cisco CUCM               N/A                14.341
    CoreSwitch-SW01-01       N/A                2.551
    CoreSwitch-SW01-02       N/A                2.563
    CoreSwitch-SW01-03       N/A                2.542
    MSD024                   N/A                4.752
    MSD037                   N/A                2.253
    SDC-R01                  N/A                0.789
    localhost                N/A                3.491
    msd-sw1                  N/A                14.415
    msd-sw2                  N/A                15.202
    pfSense DMZ firewall     N/A                0.937
    vCenter Server MSD041    N/A                1.227

    CoreSwitch-SW01-02, MSD024, MSD037, localhost, and pfSense DMZ firewall are working properly and collect data without any problem, but the rest of the nodes do not collect data. If I manually run an update on an individual device, it collects some data, but it stops again after a few minutes.


2 answers

  1.  

    Thanks, Mark, for responding to my questions. In the NMIS log I found these errors, but I'm not sure what they mean.

    12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for msd-sw1 which has not finished!
    12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for localhost which has not finished!
    12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for SDC-R01 which has not finished!
    12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for MSD037 which has not finished!
    12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for MSD024 which has not finished!
    12-Jun-2017 07:55:06 nmis.pl::doCollect#986 WARNING collect lock exists for CoreSwitch-SW01-03 which has not finished!
    12-Jun-2017 07:55:06 nmis.pl::doCollect#986 WARNING collect lock exists for CoreSwitch-SW01-02 which has not finished!
    12-Jun-2017 07:55:06 nmis.pl::doCollect#986 WARNING collect lock exists for CoreSwitch-SW01-01 which has not finished!
    12-Jun-2017 07:55:06 nmis.pl::doCollect#986 WARNING collect lock exists for Cisco CUCM which has not finished!
    12-Jun-2017 07:55:06 nmis.pl::doCollect#986 WARNING collect lock exists for CRPHQ-MBC-R01 which has not finished!

     

    Could this error be what stops the data collection?
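A quick way to see how often each node hits this warning is to count the lock messages in the log. The sketch below is a hypothetical helper, not part of NMIS; it assumes the log lines follow exactly the format shown above, and the `\w+` used for the poll type is an assumption that any other lock warnings share the same wording.

```python
import re
from collections import Counter

# Matches warnings shaped like the ones in the post, e.g.
# 12-Jun-2017 07:55:07 nmis.pl::doCollect#986 WARNING collect lock exists for msd-sw1 which has not finished!
# Assumption: other poll types (if any) use the same wording, hence (?P<kind>\w+).
LOCK_RE = re.compile(
    r"^\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2} \S+ WARNING "
    r"(?P<kind>\w+) lock exists for (?P<node>.+?) which has not finished!$"
)


def count_lock_warnings(lines):
    """Return a Counter keyed by (poll kind, node name) for every lock warning seen."""
    counts = Counter()
    for line in lines:
        match = LOCK_RE.match(line.strip())
        if match:
            counts[(match.group("kind"), match.group("node"))] += 1
    return counts
```

Any iterable of lines works, so an open file such as `/usr/local/nmis8/logs/nmis.log` can be passed directly, and `.most_common()` on the result lists the nodes that hit a stale lock most often.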

     

    1. Mark Henry

      Tung, from our wiki: The Polling Lock feature was added in NMIS 8.5.10. Polling locks work as follows: when an update or collect poll runs on a node, a lock is created that prevents another NMIS process from starting an update or collect on that node. This is done because an update on a node with a high interface count can run for quite a while, and we don't want another update running on the node at the same time. We also don't want NMIS running a collect on a node where the collect would itself start an update (because an update has never been run), since the server might get caught with too many blocked processes.

      It sounds like you're having issues with collect time. In NMIS, select Reports->Current->Collect/Update Times. When the widget opens, sort on the Collect Time column: what are your average and high collect times? Re-sort on Update Time: what are your average and high update times? Anything that exceeds your polling cycle (default 5 min = 300 s) needs to be investigated. Why are those devices taking so long to respond? Are they responding at all? If you run an update and a collect on an individual device, does it complete?
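To make the lock behaviour concrete, here is a minimal sketch of a per-node lock file in Python. It only illustrates the idea described above and is not NMIS's implementation (NMIS is written in Perl, and its actual lock files, names, and locations may differ); the lock directory, the file naming scheme, and the `PollLock` and `poll` helpers are all hypothetical.

```python
import os
import tempfile


class PollLock:
    """Hypothetical per-node, per-poll-type lock file (illustration only)."""

    def __init__(self, lock_dir, node, kind):
        self.path = os.path.join(lock_dir, f"{node}.{kind}.lock")
        self.fd = None

    def acquire(self):
        try:
            # O_CREAT | O_EXCL creates the file atomically, or fails if it already exists
            self.fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        os.write(self.fd, str(os.getpid()).encode())  # record the owning process
        return True

    def release(self):
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.path)
            self.fd = None


def poll(lock_dir, node, kind, work):
    """Run work(node) only if no other poll of this kind holds the node's lock."""
    lock = PollLock(lock_dir, node, kind)
    if not lock.acquire():
        print(f"WARNING {kind} lock exists for {node} which has not finished!")
        return False
    try:
        work(node)
    finally:
        lock.release()  # not reached if the process hangs or is killed outright
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lock_dir:
        # A second collect started while the first is still running is refused
        poll(lock_dir, "msd-sw1", "collect",
             lambda node: poll(lock_dir, node, "collect", lambda n: None))
```

In this sketch, a poll that hangs, or is killed before `release()` runs, leaves its lock file in place, so every later cycle for that node is skipped with the warning until the file goes away.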

    2. Tung Nguyen

      Node                     Collect Time (s)   Update Time (s)
      CRPHQ-MBC-R01            0.300              0.738
      Cisco CUCM               N/A                14.341
      CoreSwitch-SW01-01       N/A                2.551
      CoreSwitch-SW01-02       N/A                2.563
      CoreSwitch-SW01-03       N/A                2.542
      MSD024                   N/A                4.752
      MSD037                   N/A                2.253
      SDC-R01                  N/A                0.789
      localhost                N/A                3.491
      msd-sw1                  N/A                14.415
      msd-sw2                  N/A                15.202
      pfSense DMZ firewall     N/A                0.937
      vCenter Server MSD041    N/A                1.227

      CoreSwitch-SW01-02, MSD024, MSD037, localhost, and pfSense DMZ firewall are working properly and collect data without any problem, but the rest of the nodes do not collect data. If I manually run an update on an individual device, it collects some data, but it stops again after a few minutes.
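Mark's rule of thumb (anything over the 300 s polling cycle needs investigating) can be applied mechanically to the numbers Tung posted. This sketch simply hard-codes that table, with `None` standing in for N/A; the `triage` function and its output are illustrative only.

```python
POLL_CYCLE_S = 300.0  # default NMIS polling cycle: 5 minutes

# Collect/Update Times from the report above; None means the report showed N/A
TIMES = {
    "CRPHQ-MBC-R01": (0.300, 0.738),
    "Cisco CUCM": (None, 14.341),
    "CoreSwitch-SW01-01": (None, 2.551),
    "CoreSwitch-SW01-02": (None, 2.563),
    "CoreSwitch-SW01-03": (None, 2.542),
    "MSD024": (None, 4.752),
    "MSD037": (None, 2.253),
    "SDC-R01": (None, 0.789),
    "localhost": (None, 3.491),
    "msd-sw1": (None, 14.415),
    "msd-sw2": (None, 15.202),
    "pfSense DMZ firewall": (None, 0.937),
    "vCenter Server MSD041": (None, 1.227),
}


def triage(times, cycle=POLL_CYCLE_S):
    """Split nodes into 'no collect time recorded' and 'slower than one cycle'."""
    no_collect = sorted(node for node, (collect, _) in times.items() if collect is None)
    too_slow = sorted(
        node
        for node, (collect, update) in times.items()
        if (collect or 0.0) > cycle or (update or 0.0) > cycle
    )
    return no_collect, too_slow


if __name__ == "__main__":
    no_collect, too_slow = triage(TIMES)
    print("No collect time:", ", ".join(no_collect))
    print("Over one polling cycle:", ", ".join(too_slow) or "none")
```

With these values, no node comes anywhere near 300 s (the slowest update is msd-sw2 at 15.202 s), while every node except CRPHQ-MBC-R01 reports N/A for its collect time.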

    3. Mark Henry

      Hello Tung, is this NMIS server Opmantek's VM or something you built in-house? I suggest you pick one node that is not collecting and run a manual collect from the command line (/usr/local/nmis8/bin/nmis.pl) using debug=9. This should give you enough diagnostic information to see why that device is not completing the collection cycle.
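If you end up repeating this for several nodes, a small wrapper can build the command and save the debug output to a file for each one. This is a hypothetical helper, and the `type=`, `node=`, and `debug=` arguments are an assumption based on the usual NMIS 8 key=value invocation; check the usage text printed by your nmis.pl before relying on them.

```python
import subprocess

NMIS_PL = "/usr/local/nmis8/bin/nmis.pl"  # default NMIS 8 install location


def nmis_command(node, poll_type="collect", debug=9):
    # Assumed key=value arguments; confirm against your nmis.pl's usage output
    return [NMIS_PL, f"type={poll_type}", f"node={node}", f"debug={debug}"]


def run_poll(node, poll_type="collect", debug=9, output_path=None, timeout=600):
    """Run one poll for one node, optionally save stdout/stderr, return the exit code."""
    cmd = nmis_command(node, poll_type, debug)
    # Passing a list (no shell) keeps node names with spaces, e.g. "Cisco CUCM", intact
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if output_path is not None:
        with open(output_path, "w") as out:
            out.write(result.stdout)
            out.write(result.stderr)
    return result.returncode
```

For example, `run_poll("msd-sw1", "update", output_path="msd-sw1-update.txt")` followed by a `"collect"` run answers Mark's earlier question of whether an update and a collect on one device complete. The 600 s timeout is an arbitrary choice; if the poll hangs, `subprocess.run` raises `subprocess.TimeoutExpired`, which is itself a useful signal.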

    4. Tung Nguyen

      The server was built in-house. I will run a manual collect and let you know. Thanks.

    5. Mark Henry

      OK, start by enabling the SNMP Agent on your Linux server, then get localhost collecting properly. You should also make sure you have completed the Basic Setup (Setup->Basic Setup). Once that's done you can check the built-in diagnostics available under System->Host Diagnostics for additional information about the NMIS server's state and health.

    6. Tung Nguyen

      localhost has been working properly: it is collecting data and showing graphs (Health Avg 98.137%, Reachability Avg 99.996%, Availability Avg 99.996%, Ping_loss Avg 0.000%). This is what I'm struggling with: some nodes are collecting data, but others are not being collected by the collect policy.

  2.  

    Hello Tung,

    Check the NMIS log file (/usr/local/nmis8/logs/nmis.log) for errors related to polling. You might also make sure all disks have adequate free space (df -h), check CPU and memory (top), and finally run fixperms.pl (in /usr/local/nmis8/admin) to make sure file and folder ownership is correct.
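The disk and load checks can also be scripted, for example to run from cron next to NMIS. This sketch uses only the Python standard library; the 10% free-space threshold and the list of paths are assumptions to adjust, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil

# Paths to check; /usr/local/nmis8 is the install path used in this thread
DEFAULT_PATHS = ("/", "/usr/local/nmis8")


def disk_report(paths=DEFAULT_PATHS, min_free_fraction=0.10):
    """Map each existing path to (free fraction, is_low) for its filesystem."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # skip paths that are not present on this server
        usage = shutil.disk_usage(path)
        free = usage.free / usage.total
        report[path] = (round(free, 3), free < min_free_fraction)
    return report


def load_per_cpu():
    """1-minute load average divided by the CPU count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


if __name__ == "__main__":
    for path, (free, low) in disk_report().items():
        print(f"{path}: {free:.1%} free{'  <-- LOW' if low else ''}")
    print(f"load per CPU: {load_per_cpu():.2f}")
```

A load per CPU that stays well above 1.0 suggests the server cannot keep up, which is worth ruling out before looking at individual nodes.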

    If all these things are working properly, I would then check network permissions. Can the NMIS server reach the devices? Are ICMP and SNMP enabled on the devices? Is there any chance ICMP and/or SNMP is being blocked?

    r/mark H
