...

  1. Identify the problem. The first step in troubleshooting a device issue is to identify the problem; consider whether the issue is in the NMIS8 or NMIS9 product.
    1. Add the product version and the servers/devices/models involved to the support case.
  2. Determine what kind of problem you are observing. A device issue can be caused by any of the following:
    1. Network performance: latency in the network; layer 1, 2, and 3 issues.
    2. Device configuration: connectivity, SNMP configuration, and others.
    3. Server hardware requirements: high resource utilization on the server.
    4. Server configuration options: missing configuration items for server tuning.
    5. Disk performance: slow write/read times for the device collection.
  3. Gather information. Collect all the graphs, images, and observed behaviors that can explain what the problem is.
    1. Collect the support tool files (see The Opmantek Support Tool).
      1. Execute the collect command for the support tool:

        Code Block
        #General collection.
        /usr/local/nmis8/admin/support.pl action=collect  
        
        #If the file is big, we can add the following parameter.
        /usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000
        
        #Device collection.
        /usr/local/nmis8/admin/support.pl action=collect node=<node_name> maxzipsize=900000000


    2. If you are using NMIS8, provide the /usr/local/nmis8/var files.
      1. Go to the /usr/local/nmis8/var directory and collect the following files:

        Code Block
        -rw-rw----   1 nmis   nmis    4292 Apr  5 18:26 <node_name>-node.json
        -rw-rw----   1 nmis   nmis    2695 Apr  5 18:26 <node_name>-view.json
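
        To make the upload easier, these two files can be bundled into a single archive first. This is only a sketch; the archive name and /tmp location are examples, and <node_name> must be replaced with the real node name.

        Code Block
        #Package the node state files for the support case (example archive name).
        cd /usr/local/nmis8/var
        tar czf /tmp/<node_name>_var.tar.gz <node_name>-node.json <node_name>-view.json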


      2. Obtain the update/collect outputs; this information will be uploaded to the support case:

        Code Block
        /usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
        /usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log


  4. Replicate the problem. If possible, define the steps needed to replicate the problem.
  5. Identify symptoms. At this point, you should be able to see a specific problem and what its symptoms are.
  6. Determine if something has changed. It is important to verify with your team whether something has changed; a good way to see this is by monitoring the performance graphs for the devices and the server.



  1. Is it an individual problem? Verify whether this behavior is happening on a single device/server.

...

This section is crucial for identifying or resolving device issues. You need to review some considerations depending on the number of nodes you will manage, the number of users that will be accessing the GUIs, and how often your data needs to be updated. If updates are required every 5 minutes, you will need hardware capable of meeting that requirement. The OS requirements also need to be well defined: a good rule of thumb is to reserve 1 GB of RAM for the OS by default, and to use high-speed drives for the data (SAN is ideal) with separate storage for the MongoDB database and temp files. Anywhere between 4 and 8 cores with high-performing processor(s) and 16-64 GB of RAM should perform well for 1k+ nodes.
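
As a quick sanity check against these guidelines, the standard Linux commands below report the server's core count, memory, and disk layout (a sketch; exact output varies by distribution):

Code Block
#Number of CPU cores visible to the OS.
nproc

#Total, used, and available RAM and swap.
free -h

#Filesystem usage and block-device layout (check the data and MongoDB volumes).
df -h
lsblk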

Using the top/htop command

Top: The top command shows all running processes on the server. It shows system information and process information such as uptime, average load, running tasks, number of users logged in, number of CPU processes, and RAM utilization, and it lists all the processes run by the users on your server.
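
For a support case it is often useful to capture a one-off snapshot of this output instead of the interactive view. A minimal sketch using top's batch mode (the /tmp file name is only an example):

Code Block
#Capture a single batch-mode snapshot of top for later review or upload.
top -b -n 1 > /tmp/top_snapshot_$(hostname).log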

...

  • Total Processes running (412 total)
  • Running Processes (1 running)
  • Sleeping Processes (411 sleeping)
  • Stopped Processes (0 stopped)
  • Zombie processes waiting to be reaped by the parent process (15 zombies). Note: this is not good for the manager.
    Zombie process: a process that has completed execution but still has an entry in the process table. The entry is kept so that the parent process can read its child's exit status.
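
If the zombie count is high, the processes in that state and their parents can be listed with standard tools (a sketch; ps output columns vary slightly between distributions):

Code Block
#List zombie processes with their parent PID so the parent can be investigated.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'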

3. CPU states section.

Code Block
Cpu(s):  6.8%us,  3.8%sy,  0.2%ni, 74.4%id, 28.2%wa,  0.1%hi,  0.5%si,  0.0%st

...

Code Block
Swap:  6669720k total,  3536428k used,  3133292k free, 10767256k cached  

5. Process List

The last section is the process list, which shows the processes currently running and their CPU usage.

Code Block
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                     
26306 root      20   0  478m 257m 1900 S  3.9  1.3   0:08.21 nmis.pl                                                                     
15522 root      20   0  626m 373m 2776 S  2.0  1.9  71:45.09 opeventsd.pl                                                                
27285 root      20   0 15280 1444  884 R  2.0  0.0   0:00.01 top                                                                         

...

NMIS 9 - Configuration Options for Server Performance Tuning

Disk performance

...

review.

This section is dedicated to identifying when the server is not writing all the data for the devices. This can result in graphs with interruptions, which causes level 2 problems (Severe impact - Unreliable production system) or, on some occasions, even level 1 problems (Critical for the business, complete loss of service, loss of data) for the client, so it is essential to determine what is happening and provide a diagnosis.

Server status at the service level:
The monitoring service responds slowly when the GUI is accessed, and its main impact is centered on the failure to execute collects and updates for the nodes. The CPUs are saturated and, since the monitoring system collects information every minute or every 5 minutes, the overloaded system is forced to kill processes, affecting the storage of node information in the RRD files.
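
When these symptoms appear, it is worth confirming whether disk I/O is the bottleneck. A minimal sketch, assuming the sysstat package is installed, is to watch the iowait and per-device utilization figures while a collect cycle runs:

Code Block
#Extended device statistics, refreshed every 5 seconds, 3 samples.
#Sustained high iowait or %util during collect runs points to a disk bottleneck.
iostat -x 5 3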


Node View in NMIS:

You will be able to see device graphs with gaps; this is how you recognize this behavior.


...

Introduction.

This procedure is dedicated to determining the cause and providing possible solutions when graphs with gaps are seen in NMIS. The root problem is the display of graphs with interruptions, which causes level 2 problems (Severe impact - Unreliable production system) or, on some occasions, even level 1 problems (Critical for the business, complete loss of service, loss of data) for the client, so it is essential to determine what is happening and provide a diagnosis.

Process.

OMK will request evidence for the case in order to produce an immediate diagnosis and follow up. It is therefore necessary to determine the symptoms the server is showing; for this we will rely on the following points:

Node View in the GUI:

You can see that the node shows gaps in its graphs, but the node appears reachable and shows no packet loss failures.

The next thing to validate is the NMIS GUI, checking the behavior of the following graphs:


NMIS Runtime Graph (menu: System > Host Diagnostics > NMIS Runtime Graph)

In this graph, check that the values for Runtime, collect time, NMIS processes, and max parallel processes are normal (relative to the historical values and the configured values; also take the server resources into account).


NMIS Polling Summary (menu: System > Host Diagnostics > NMIS Polling Summary)

...

Where:
if = /dev/zero (if=/dev/input.file): The name of the input file you want dd to read from.
of = /data/omkTestFile (of=/path/to/output.file): The name of the output file you want dd to write the input file to.
bs = 10M (bs=block-size): Sets the block size you want dd to use. Note that Linux will need free RAM; if your test system does not have enough RAM available, use a smaller value for bs (such as 128 MB or 64 MB, or you can even try 1, 2, or up to 3 gigabytes).
count = 1 (count=number-of-blocks): The number of blocks you want dd to read.
oflag = dsync (oflag=dsync): Use synchronized I/O for data. Do not skip this option; it eliminates caching and gives you good, accurate results.
conv = fdatasync: Again, this tells dd to require one complete "sync" just before exiting. This option is equivalent to oflag=dsync.
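
Putting these parameters together, the write test described above can be run as shown below. This is only a sketch: /data/omkTestFile is the example output path from the parameter list and should sit on the same filesystem that holds the RRD files; adjust bs and count for the test size you want, and remove the test file afterwards.

Code Block
#Sequential write test with caching disabled (oflag=dsync gives accurate results).
dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=dsync

#Equivalent form using conv=fdatasync, as described above.
dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 conv=fdatasync

#Clean up the test file when finished.
rm -f /data/omkTestFile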

...



Execution of the script /usr/local/nmis8/admin/

...

polling_summary.pl


Code Block
totalNodes=985 totalPoll=962 ontime=850 pingOnly=0 1x_late=111 3x_late=0 12x_late=1 144x_late=0
time=17:53:14 pingDown=14 snmpDown=250 badSnmp=23 noSnmp=0 demoted=0
[root@SRVLXLIM32 ~]#
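
The output above is produced by running the script directly; a minimal invocation (assuming the default NMIS8 install path) looks like this:

Code Block
#Print the current polling summary for all nodes.
/usr/local/nmis8/admin/polling_summary.pl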









...