Device Troubleshooting Process


snmpd daemon status validation

If the snmpd daemon terminates as soon as it is invoked, the following are possible reasons for failure and probable solutions:

  • The reason the snmpd daemon terminated will be logged in the snmpd log file or the configured syslogd log file. Check the log file to see the FATAL error message.

Solution: Correct the problem and restart the snmpd daemon.

...

Solution: Switch to the root user and restart the snmpd daemon.

...

Solution: Make sure you are the root user, change the ownership of the configuration file to the root user, and restart the snmpd daemon.
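The ownership check behind this solution can be sketched in shell. The function name config_owned_by_root is hypothetical, and the configuration file path is passed as an argument because its location differs between platforms (the path in the usage comment is only an assumption).

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the file given as $1
# is owned by UID 0 (root).
config_owned_by_root() {
    # ls -n prints numeric IDs; the third field is the owner's UID.
    [ "$(ls -ldn "$1" | awk '{print $3}')" = "0" ]
}

# Example usage as root (the path is an assumption, adjust for your system):
#   conf=/etc/snmp/snmpd.conf
#   config_owned_by_root "$conf" || chown root "$conf"
```

After changing the ownership, restart the snmpd daemon as described above.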

...

General Troubleshooting

Start with these basic checks:

  • Ensure that SNMPd is running.
  • If you’re using SNMPv1 or v2: Is the device configured with the correct community string in LogicMonitor (at the global, group, or device level)? If no community string is set, LogicMonitor defaults to public. Note: Some Linux distributions significantly restrict which metrics are exposed when the community string is “public”, so we recommend setting it to something else. See the section below to verify that your device has the correct community string set.
  • If you’re using SNMPv3: Is the device configured with the correct authpass, privpass and username (either at the global, group or device level)? See the section below to verify that your device has the correct v3 credentials set.
  • Can queries from the collector device reach the monitored device? You can check this by running tcpdump on the monitored host. If the queries are not reaching the device, there may be a firewall issue.
  • Is the monitored device replying to the queries from the collector?
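The reachability checks above can be sketched with tcpdump on the monitored host and a single SNMP query from the collector. The helper name snmp_capture_filter and the address 192.0.2.10 are placeholders, and the snmpget line assumes the Net-SNMP command-line tools are installed on the collector.

```shell
#!/bin/sh
# Hypothetical helper: build a tcpdump filter that matches SNMP traffic
# (UDP port 161) exchanged with one collector address.
snmp_capture_filter() {
    printf 'udp port 161 and host %s\n' "$1"
}

# On the monitored host, as root ("-i any" is Linux-specific):
#   tcpdump -n -i any "$(snmp_capture_filter 192.0.2.10)"
#
# From the collector, an SNMPv2c request for sysDescr.0:
#   snmpget -v2c -c <community> <monitored_host> 1.3.6.1.2.1.1.1.0
```

If the capture shows no requests arriving, suspect a firewall between the collector and the device. If requests arrive but no replies leave, continue with the agent configuration checks.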

If the queries are reaching the host, but the host is not replying, things to check are:

  • The access restrictions in snmpd.conf may not allow queries from the collector, or the community string is wrong.
    • The simplest SNMPd v1/v2 configuration would be the single line: rocommunity [community]
    • Note that SNMPd must be restarted after changing the configuration file contents. (/etc/init.d/snmpd restart)
  • SNMPd may only be listening on a loopback address.
    • On some Debian and Red Hat distributions, SNMPd listens only on 127.0.0.1 by default. You can correct this in /etc/default/snmpd or /etc/sysconfig/snmpd.options, then restart SNMPd.
    • If snmpd.conf contains the line agentAddress  udp:127.0.0.1:161, the host is listening for SNMP queries only on the loopback address. Comment out that line.
  • IP access restrictions may be blocking the SNMP requests from being accepted.
    • /etc/hosts.allow may be restricting the IP addresses that SNMP will respond to (you will see syslog messages about “Connection Refused”). Ensure the collector is listed in this file for SNMP access, if the file exists.
    • iptables rules may be preventing SNMP packets from the collector from being received.
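As a minimal sketch, the loopback-only case can be detected by searching the agent configuration for an active agentAddress line bound to 127.0.0.1. The function name listens_on_loopback_only is hypothetical, and the path /etc/snmp/snmpd.conf in the usage comment is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the config file given as $1
# has an uncommented agentAddress line bound only to 127.0.0.1.
listens_on_loopback_only() {
    grep -Eiq '^[[:space:]]*agentaddress[[:space:]]+(udp:)?127\.0\.0\.1(:[0-9]+)?[[:space:]]*$' "$1"
}

# Example usage (the path is an assumption, adjust for your distribution):
#   if listens_on_loopback_only /etc/snmp/snmpd.conf; then
#       echo "snmpd accepts queries only from the local host"
#   fi
```

Lines starting with # do not match, so a line that has already been commented out is not reported.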

Lexicographic order issues:

  • If you are receiving the common error message “Agent did not return variable bindings in lexicographic order”, set the snmp.ignore.lexicographic.order Collector setting to TRUE. As discussed in Editing the Collector Config Files, this setting must be updated from the Collector’s agent.conf file.
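To confirm that the setting is in place, the Collector's agent.conf can be searched for it. This sketch assumes agent.conf uses key=value lines. The function name lexicographic_check_disabled is hypothetical, and the file path is passed as an argument because it depends on where the Collector is installed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the agent.conf given as $1
# sets snmp.ignore.lexicographic.order to true (any letter case).
# Assumes key=value syntax in agent.conf.
lexicographic_check_disabled() {
    grep -Eiq '^[[:space:]]*snmp\.ignore\.lexicographic\.order[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Example usage:
#   lexicographic_check_disabled /path/to/agent.conf || echo "setting not enabled"
```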

...




NMIS server SNMP configuration

This tutorial shows how to configure SNMP so that the server can be monitored. It focuses on CentOS, one of the most widespread server distributions. Apart from the installation, the steps are similar on other distributions.

Configuration steps:






  1. Identify the problem. The first step in troubleshooting a device issue is to identify the problem and to determine whether it affects NMIS8 or NMIS9.
    1. Add the product version and the servers, devices, and models involved to the support case.
  2. Determine what kind of problem you are observing. A device issue can have the following causes:
    1. Network performance: network latency and layer 1, 2, and 3 issues.
    2. Device configuration: connectivity, SNMP configuration, and others.
    3. Server hardware requirements: high resource utilization on the server.
    4. Server configuration options: missing configuration items for server tuning.
    5. Disk performance: slow read/write times during device collection.
  3. Gather information. Collect all the graphs, images, and observed behaviors that help explain the problem.
    1. Collect the support tool files (see The Opmantek Support Tool).
      1. Execute the collect command for the support tool:

        Code Block
        # General collection.
        /usr/local/nmis8/admin/support.pl action=collect

        # If the file is large, add the maxzipsize parameter.
        /usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000

        # Collection for a single device.
        /usr/local/nmis8/admin/support.pl action=collect node=<node_name>


    2. If you are using NMIS8, provide the files from /usr/local/nmis8/var.
      1. Go to the /usr/local/nmis8/var directory and collect the following files:

        Code Block
        -rw-rw----   1 nmis   nmis    4292 Apr  5 18:26 <node_name>-node.json
        -rw-rw----   1 nmis   nmis    2695 Apr  5 18:26 <node_name>-view.json


      2. Capture the output of an update and a collect run, and upload it to the support case:

        Code Block
        /usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
        /usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log


    3. If you are using NMIS9, include the dump files.


      Code Block
      /usr/local/nmis9/admin/node_admin.pl act=dump {node=nodeX|uuid=nodeUUID} file=<MY PATH> everything=1


  4. Replicate the problem. If possible, define the steps needed to replicate it.
  5. Identify symptoms. At this point, you should be able to see the specific problem and its symptoms.
  6. Determine whether something has changed. Verify with your team whether anything has changed; a good way to spot a change is to review the performance graphs for the devices and the server.
  7. Is it an individual problem? Verify whether this behavior occurs on a single device or server.
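The NMIS8 update and collect captures from step 3 can be wrapped in a small script so that both logs are produced in one pass. This is only a sketch: the names debug_log_path and gather_node_debug and the NMIS_BASE variable are hypothetical. The nmis.pl arguments are the ones shown in step 3, with stderr also redirected into the log.

```shell
#!/bin/sh
# Install path used in this document; override NMIS_BASE if yours differs.
NMIS_BASE="${NMIS_BASE:-/usr/local/nmis8}"

# Hypothetical helper: build the log file name for a node and an
# operation type (update or collect).
debug_log_path() {
    printf '/tmp/%s_%s_%s.log\n' "$1" "$2" "$(hostname)"
}

# Hypothetical helper: run update and collect with debug=9 for one node,
# writing one log per operation (stderr is captured as well).
gather_node_debug() {
    node="$1"
    for type in update collect; do
        "$NMIS_BASE/bin/nmis.pl" type="$type" node="$node" model=true debug=9 force=true \
            > "$(debug_log_path "$node" "$type")" 2>&1
    done
}

# Example usage:
#   gather_node_debug <node_name>
```

Both log files can then be attached to the support case together with the support tool archive.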

...