You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 14 Next »

This page is intended to provide a NMIS Device Troubleshooting Process to Identify bad behaviors in collection for NMIS8/9 products, you can break it down into clear steps that anyone can follow and identify what's wrong with the device collection also if we have Gaps In Graphs.


Network-Management-Information-System

Device Troubleshooting Process


Diagramas de Flujo



  1. Identify the problem. The first step in troubleshooting a device issue is to identify the problem, you have to consider if the issue is in NMIS8 or NMIS9 products.
    1. Add to the support the case the product version and the servers/devices/models involved.
  2. What kind of problem are you observing. A device issue can be affected for the next reasons.
    1. Network performance, latency in the network, layer 1,2, and 3 issues.
    2. Device configuration, connectivity, SNMP configuration, and others. 
    3. Server hardware requirements, high resource utilization parameters in the server.
    4. Server configuration options, missing configuration items for server tunning.
    5. Disk performance, slow write/read times for the device collection. 
  3. Gather information, collect all the graphs, images, behaviors that can explain what the problem is.
    1. Collect support tool files The Opmantek Support Tool
      1. Execute the collect command for the support tool

        #General collection.
        /usr/local/nmis8/admin/support.pl action=collect  
        
        #If the file is big, we can add the next parameter.
        /usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000
        
        #Device collection.
        /usr/local/nmis8/admin/support.pl action=collect node=<node_name> maxzipsize=900000000
    2. If you are using NMIS8, provide the /usr/local/nmis8/var files
      1. go to /usr/local/nmis8/var directory and collect the next files

        -rw-rw----   1 nmis   nmis    4292 Apr  5 18:26 <node_name>-node.json
        -rw-rw----   1 nmis   nmis    2695 Apr  5 18:26 <node_name>-view.json


      2. obtain update/collect outputs this information will upload to the support case:

        /usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
        /usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log
  4. Replicate the problem. If possible you have to define, what the steps are to replicate the problem.
  5. Identify symptoms. To this point, you are able to see a specific problem and what the symptoms are.
  6. Determinate if something has changed, is important to verify with your team if something has changed, a good way to see this behavior is monitoring the performance graph for devices and server


  1. It is an individual problem?, verify if this behavior is happening in a single device/server.

Network performance - Server.

Introduction.

This section is focused on performing the review and validation of the server status in general, we will focus on verifying the historical behavior of the main metrics for the server, it is important to review all the metrics related to the good performance between the server and devices

Verifying Health Metrics

  • Metrics are important for the server,  NMIS would use Reachability, Availability and Health to represent the network. 
  • Reachability being the pingability of device,

  • Availability being (in the context of network gear) the interfaces which should be up, being up or not, e.g. interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up for 9 interfaces, the device would be 90% available.


  • Health is a composite metric, made up of many things depending on the device, router, CPU, memory. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.


  • The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is based is configurable, so you can have weight Reachability to be higher than it currently is, or lower, your choice.


For more references go to NMIS Metrics, Reachability, Availability and Health

  • It is important to validate the localhost heath, including the overall reachability, availability, and Health you will be able to see data not following the historical data pattern that can give us a clue where the problem can be happening or even if the abnormal behavior has started before a change request In the early hours.

  • Viewing the graphs referring to the network performance as (Response Time in milliseconds, IP Utilization, TCP Connection, TCP Segments) will help us to identify the behavior of the server/network in a period of 2 days, we can modify this period time to see more data if needed.

Device configuration.

It is important to validate if the problem occurs in the network or is something related to the device configuration, in order to identify what's happening we need to validate the next commands from the console server.

  1. Ping test, The Ping tool is used to test whether a particular host is reachable across an IP network. A Ping measures the time it takes for packets to be sent from the local host to a destination computer and back. 

    ping x.x.x.x #add the ip address you need to reach
  2. Traceroute, is a network diagnostic tool used to track in real-time the pathway taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it pinged in between

    traceroute <ip_Node>  #add the ip address you need to reach
  3. MTR, Mtr(my traceroute) is a command-line network diagnostic tool that provides the functionality of both the ping and traceroute commands


    sudo mtr -r 8.8.8.8
    
        [sample results below]
    
        HOST: endor                       Loss%   Snt   Last   Avg  Best  Wrst StDev
         1. 69.28.84.2                    0.0%    10    0.4   0.4   0.3   0.6   0.1
         2. 38.104.37.141                 0.0%    10    1.2   1.4   1.0   3.2   0.7
         3. te0-3-1-1.rcr21.dfw02.atlas.  0.0%    10    0.8   0.9   0.8   1.0   0.1
         4. be2285.ccr21.dfw01.atlas.cog  0.0%    10    1.1   1.1   0.9   1.4   0.1
         5. be2432.ccr21.mci01.atlas.cog  0.0%    10   10.8  11.1  10.8  11.5   0.2
         6. be2156.ccr41.ord01.atlas.cog  0.0%    10   22.9  23.1  22.9  23.3   0.1
         7. be2765.ccr41.ord03.atlas.cog  0.0%    10   22.8  22.9  22.8  23.1   0.1
         8. 38.88.204.78                  0.0%    10   22.9  23.0  22.8  23.9   0.4
         9. 209.85.143.186                0.0%    10   22.7  23.7  22.7  31.7   2.8
        10. 72.14.238.89                  0.0%    10   23.0  23.9  22.9  32.0   2.9
        11. 216.239.47.103                0.0%    10   50.4  61.9  50.4  92.0  11.9
        12. 216.239.46.191                0.0%    10   32.7  32.7  32.7  32.8   0.1
        13. ???                          100.0    10    0.0   0.0   0.0   0.0   0.0
        14. google-public-dns-a.google.c  0.0%    10   32.7  32.7  32.7  32.8   0.0
  4. snmpwalk, is a Simple Network Management Protocol (SNMP) application present on the Security Management System (SMS) CLI that uses SNMP GETNEXT requests to query a network device for information. An object identifier (OID) may be given on the command line.
    The following example CLI command will return the IPS temperature information:
    
    Command:snmpwalk -v 2c -c tinapc <IP address> 1.3.6.1.4.1.10734.3.5.2.5.5
    
    Command Explanation:
    
    In this case the CLI command breaks down as following;
    
    snmpwalk                             = SNMP application
    -v 2c                                     = specifies what SNMP version to use (1, 2c, 3)
    -c tinapc                               = specifies the community string. Note: The IPS has the SNMP read-only community string of "tinapc"
    <IP address>                       = specifies the IP address of the IPS device
    1.3.6.1.4.1.10734.3.5.2.5.5 = OID parameter for the IPS temperature information
    
    Results:
    
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85
    
    Results Explanation:
    
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27 = The chassis temperature (27° Celsius / 80.6° Fahrenheit)
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50 = The major threshold value for chassis temperature (50° Celsius / 122° Fahrenheit)
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55 = The critical threshold value of chassis temperature (55° Celsius / 131° Fahrenheit)
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0   = The minimum value of the chassis temperature range ( 0° Celsius / 32° Fahrenheit)
    SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85 = The maximum value of the chassis temperature range (85° Celsius / 185° Fahrenheit)
    It is important to see that the device is pingable, does not have latency, packet loss, and the SNMP data is been collected.

Service performance

NMIS is using some important services to make the solution work, sometimes devices stop working due to some of these services are interrupted, It is always a good idea to validate if those are running, to validate this you need to execute the next commands. This in order to provide even more security, as some of these services are crucial for the operation of the operating system. On the other hand, in systems like Unix or Linux, the services are also known as daemons. In this case, it is essential to validate the services that make up the OPMANTEK monitoring system (nmis).

service mongod status
service omkd status
service nmisd status
service httpd status
service opchartsd restart
service opeventsd status
service opconfigd status
service opflowd status
service crond status

#if someone of this daemons is stopped, you need to execute same commands with start/restart options.

Server hardware requirements.

This section is crucial to identify or resolve device issues, you need to review some considerations depending on the number of nodes you will manage, the number of users that will be accessing the GUI's, how often does your data need to be updated? If updates are required every 5 minutes, then you will need to have the hardware to be able to accomplish these requirements, also the OS Requirements need to be well defined a good rule of thumb is to reserve 1 GB of RAM for the OS by default, High-speed drives for the data (SAN is ideal) with separate storage for mongo database, and temp files. Anywhere between 4-8 cores with a high-performing processor(s), 16-64 GB RAM should be performing well for 1k+ Nodes.

Using top/htop command

The top command shows all running processes in the server. It shows you the system information and the processes information just like up-time, average load, tasks running, no. of users logged in, no. of CPU processes, RAM utilization and it lists all the processes running/utilized by the users in your server.

top


top - 12:50:01 up 62 days, 22:56,  5 users,  load average: 4.76, 8.03, 4.34
Tasks: 412 total,   1 running, 411 sleeping,   0 stopped,   15 zombie
Cpu(s):  6.8%us,  3.8%sy,  0.2%ni, 74.4%id, 28.2%wa,  0.1%hi,  0.5%si,  0.0%st
Mem:  20599548k total, 18622368k used,  1977180k free,   375212k buffers
Swap:  6669720k total,  3536428k used,  3133292k free, 10767256k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                     
26306 root      20   0  478m 257m 1900 S  3.9  1.3   0:08.21 nmis.pl                                                                     
15522 root      20   0  626m 373m 2776 S  2.0  1.9  71:45.09 opeventsd.pl                                                                
27285 root      20   0 15280 1444  884 R  2.0  0.0   0:00.01 top                                                                         
    1 root      20   0 19356  308  136 S  0.0  0.0   1:07.65 init                                                                        
    2 root      20   0     0    0    0 S  0.0  0.0   0:02.14 kthreadd                                                                    
    3 root      RT   0     0    0    0 S  0.0  0.0  17359:19 migration/0                                                                 
    4 root      20   0     0    0    0 S  0.0  0.0 252:25.86 ksoftirqd/0                                                                 
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/0                                                                   
    6 root      RT   0     0    0    0 S  0.0  0.0   2233:33 watchdog/0                                                                  
    7 root      RT   0     0    0    0 S  0.0  0.0 340:35.60 migration/1                                                                 
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 stopper/1                                                                   
    9 root      20   0     0    0    0 S  0.0  0.0   5:23.87 ksoftirqd/1                                                                 
   10 root      RT   0     0    0    0 S  0.0  0.0 214:57.35 watchdog/1    

1.First line

​The very first line of the top command indicates in the order below.

top - 12:50:01 up 62 days, 22:56,  5 users,  load average: 4.76, 8.03, 4.34
  • current time (12:50:01)
  • uptime of the machine (up 62 days, 22:56)
  • users sessions logged in (5 users)
  • average load on the system (load average: 4.76, 8.03, 4.34) the 3 values refer to the last minute, five minutes and 15 minutes

2. Second Row : task

The ​second row provides you the following information.

Tasks: 412 total,   1 running, 411 sleeping,   0 stopped,   15 zombie
  • Total Processes running (412 total)
  • Running Processes (1 running)
  • Sleeping Processes (411 sleeping)
  • Stopped Processes (0 stopped)
  • Processes waiting to be stopped from the parent process (15 zombies)  ####### This is not good for the manager
    Zombie Process: A process that has completed execution, but still has an entry in the process table. This entry still needs to allow the parent process to read its child exit status.

3. CPU section.

Cpu(s):  6.8%us,  3.8%sy,  0.2%ni, 74.4%id, 28.2%wa,  0.1%hi,  0.5%si,  0.0%st
  • User processes of CPU in percentage(6.8%us)

  • System processes of CPU in percentage(3.8%sy)

  • Priority upgrade nice of CPU in percentage(0.2%ni)

  • Percentage of the CPU not used (74.4%id)

  • Processes waiting for I/O operations of CPU in percentage(28.2%wa) ####### This is not good for the server performance.

  • Serving hardware interrupts of CPU in percentage(0.1% hi — Hardware IRQ

  • Percentage of the CPU serving software interrupts (0.0% si — Software Interrupts

The amount of CPU ‘stolen’ from this virtual machine by the hypervisor for other tasks (such as running another virtual machine) will be 0 on desktop and server without Virtual machine. (0.0%st — Steal Time)

4. Memory

These rows will provide you the information about RAM usage. It shows you total memory in use, free, buffers cached.


Mem:  20599548k total, 18622368k used,  1977180k free,   375212k buffers
Swap:  6669720k total,  3536428k used,  3133292k free, 10767256k cached  

5. Process List

There is the last row to discuss CPU usage which was running currently

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                     
26306 root      20   0  478m 257m 1900 S  3.9  1.3   0:08.21 nmis.pl                                                                     
15522 root      20   0  626m 373m 2776 S  2.0  1.9  71:45.09 opeventsd.pl                                                                
27285 root      20   0 15280 1444  884 R  2.0  0.0   0:00.01 top                                                                         
  • PID – ID of the process(26306)
  • USER – The user that is the owner of the process (root)
  • PR – priority of the process (20)
  • NI – The “NICE” value of the process (0)
  • VIRT – virtual memory used by the process (478m)
  • RES – physical memory used from the process (3.3g)
  • SHR – shared memory of the process (1900)
  • S – indicates the status of the process: S=sleep R=running Z=zombie (S)
  • %CPU – This is the percentage of CPU used by this process (3.9)
  • %MEM – This is the percentage of RAM used by the process (1.3)
  • TIME+ –This is the total time of activity of this process (0:08.21)
  • COMMAND – And this is the name of the process (exim)

It is important to monitor this commando to see if the server is working properly executing all the internal processes need.

Server configuration options.

In order to tell the server, how to manage the devices configured we need to validate that all the configuration items are well set, you can see the server performances while collecting information going to the section, system>Host Diagnostics> NMIS Runtime Graph

if the total runtime/collect time is too high, we need to adjust the collect parameters depending on the manager version you are using.

NMIS8

NMIS 8 - Configuration Options for Server Performance Tuning

NIMS9

NMIS 9 - Configuration Options for Server Performance Tuning

Disk performance review.

This section is dedicated to identifying when the server is not writing all the data for the devices, this can have as a result graph with interruptions, so this causes level 2 problems (Severe impact - Unreliable production system) or even in some occasions level 1 (Critical for the business, complete loss of service, loss of data) to the client, so it is essential to determine what is happening and provide a diagnosis.

Server status at Service level.
The monitoring service is affected slowly when accessing the GUI, and its main impact is centered on the failure to execute collect and updates to the nodes, the CPUs are saturated and the monitoring system executes the collection of information every minute or 5 minutes, the system being overloaded is forced to kill the processes affecting the storage of the information of the nodes in the RRD's files

Node View in NMIS:

You will be able to visualize device graphs with gaps, this is an example of how to recognize this behavior.

 

 

NMIS Polling Summary (menú: System > Host Diagnostics> NMIS Polling Summary)

El resumen de poleo que proporciona Nmis es muy útil, ya que en el podremos ver los detalles de tiempo de colección de los nodos, nodos activos, nodos colectados, etc,. Estos valores deben de estar acorde a los numero de nodos monitoreados, asi mismo el tiempo de colección debe estar entre el rango de minutos configurados en el cron de nmis.

Network Metrics and Health (menú: Network Status > Network Metrics and Health.)

Aquí podremos validar la disponibilidad del servidor, estado de salud y alcance, estos valores debe de estar por encima de 60 para considerar que se esta trabajando bien, aquí se ven cortes por lo que indica que no se colecto información en ese lapso de tiempo.

Configuracion de archivo Cron (nmis) y Config.nmis

Aquí se procederá a verificar la configuracion de colección de datos hacia los dispositivos, por lo que validamos los parámetros de Collect, maxthreads y mthread.

En el archivo Cron de nmis vemos lo siguiente:


Crond NMIS
######################################################
# NMIS8 Config
######################################################

# Run Full Statistics Collection
*/5 * * * *     root     /usr/local/nmis8/bin/nmis.pl type=collect maxthreads=100 mthread=true
*/5 * * * *     root     /usr/local/nmis8/bin/nmis.pl type=services mthread=true
# ######################################################

# Optionally run a more frequent Services-only Collection
# */3 * * * *   root     /usr/local/nmis8/bin/nmis.pl type=services mthread=true

######################################################

# Run Summary Update every 2 minutes
*/2 * * * *     root     /usr/local/nmis8/bin/nmis.pl type=summary



Procedemos a verificar que el valor del mthread este activado y que el maxthreads tenga en mismo valor en el archivo Config.nmis


Sección Config.nmis
    'nmis_group' => 'nmis',
    'nmis_host' => 'nmissTest_OMK.omk.com',
    'nmis_host_protocol' => 'http',
    'nmis_maxthreads' => '100',
    'nmis_mthread' => 'false',
    'nmis_summary_poll_cycle' => 'false',
    'nmis_user' => 'nmis',



Podemos ver que el valor mthread esta desactivado y que el valor maxthreads si corresponde al mismo declarado en el cron de nmis, por lo que se procede a activarlo y a realizar un update y collect al nodo.

Update_Collect
/usr/local/nmis8/bin/nmis.pl type=update node=<Name_Node> force=true

/usr/local/nmis8/bin/nmis.pl type=collect node=<Name_Node> force=true


Nota: Si estos valores declarados en el cron y en el archivo Conf.nmis no funcionan se recomienda realizar lo siguiente:

Example Crond
# Ejemplo 1:
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=300 mthread=true maxthreads=100 ignore_running=true

# Ejemplo 2
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=240 mthread=true maxthreads=100 ignore_running=true

El valor del parámetro maxthreads (se recomienda probar entre 50, 80 y 100) debe ser el mismo en ambos archivos (cron nmis y conf.nmis)

Aplicar los comandos de Update y Collect al termino de cada prueba y verificar el comportamiento en la GUI de NMIS, esto consiste en revisar las gráficas de NMIS Runtime Graph, Network_summary y Polling_summary.


Check de servidor

Se requiere realizar una revisión general del hardware del servidor, para obtener lo siguiente:

Sistema Operativo.

Verificación del sistema operativo en donde se encuentra trabajando el sistema de monitoreo.

Comando:

cat /etc/*release* && uname -osmr && uname -v

Resultado:

[root@opmantek ~]# cat /etc/*release* && uname -osmr && uname -v
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
cpe:/o:centos:linux:6:GA
Linux 2.6.32-754.28.1.el6.x86_64 x86_64 GNU/Linux
#1 SMP Wed Mar 11 18:38:45 UTC 2020
[root@opmantek ~]#


Detalles de almacenamiento:

Verificación del estado de almacenamiento, memoria libre y detalles de las particiones del S.O.

Comando:

echo -e "\n \e[31m Información de espacio en el disco \e[0m" && df -h && echo -e "\n\n \e[31m Información de uso de RAM \e[0m" && free -m && echo -e "\n\n \e[31m Detalle de discos \e[0m" && fdisk -l

Resultado:

[root@opmantek ~]# echo -e "\n \e[31m Información de espacio en el disco \e[0m" && df -h && echo -e "\n\n \e[31m Información de uso de RAM \e[0m" && free -m && echo -e "\n\n \e[31m Detalle de discos \e[0m" && fdisk -l

  Información de espacio en el disco

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_nmis64-lv_root
                       59G  2.7G   54G   5% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             477M  109M  343M  25% /boot
/dev/mapper/vg_nmis64_data-lv_data
                      321G   11G  295G   4% /data
/dev/mapper/vg_nmis64-lv_var
                      147G  1.5G  138G   2% /var

  Información de uso de RAM
             total       used       free     shared    buffers     cached
Mem:          7984       6891       1093          0        216       1077
-/+ buffers/cache:       5596       2387
Swap:         4071       1589       2482

  Detalle de discos
Disk /dev/sda: 536.9 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0008cec3

Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          64      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              64        5222    41430016   8e  Linux LVM
/dev/sda3            5222       42570   299997810   8e  Linux LVM
/dev/sda4           42570       65256   182225295   8e  Linux LVM

Disk /dev/sdb: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_root: 64.4 GB, 64432898048 bytes
255 heads, 63 sectors/track, 7833 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_swap: 4269 MB, 4269801472 bytes
255 heads, 63 sectors/track, 519 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64_data-lv_data: 350.1 GB, 350140497920 bytes
255 heads, 63 sectors/track, 42568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_var: 160.3 GB, 160314687488 bytes
255 heads, 63 sectors/track, 19490 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

[root@opmantek ~]#

Análisis de TOP.

Se ejecuta el comando top y se visualiza que el servidor presenta lentitud, alto uso de CPU, valores elevados en el load average y iowait, este valor muestra cuánto tiempo pierde su CPU mientras espera que se completen las operaciones de E / S.


 


Análisis de operaciones lentas de lectura / escritura en disco, red, IPC

El comando dd es muy sensible respecto a los parámetros que maneja, ya que puede ocasionar serios problemas en su servidor, OMK emplea este comando para obtener y medir el rendimiento del servidor y la latencia, por lo que con esto determinamos que la velocidad de escritura y lectura del disco.


[root@SRVLXLIM32 ~]# dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.980106 s, 15.0 MB/s
[root@SRVLXLIM32 ~]# dd if=/data/omkTestFile of=/dev/null 2>&1
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 6.23595 s, 1.7 MB/s
[root@SRVLXLIM32 ~]#


Tenga en cuenta que se escribió 10 Megabytes para la prueba y 47 MB / s fue el rendimiento y el tiempo que le tomo escribir el bloque fue de 0.223301 segundos del servidor para esta prueba.

Dónde:
if = /dev/zero (if=/dev/input.file): El nombre del archivo de entrada desde el que desea leer.
of = /data/omkTestFile (of=/path/to/output.file): El nombre del archivo de salida en el que desea escribir el archivo de entrada.
bs =  10M (bs=block-size): establezca el tamaño del bloque que desea que use dd. Tenga en cuenta que Linux necesitará espacio libre en RAM. Si su sistema de prueba no tiene suficiente RAM disponible, use un parámetro más pequeño para bs (como 128 MB o 64 MB, etc. o incluso puede probar con 1, 2 o hasta 3 gigabyte).
count = 1 (count=number-of-blocks): El número de bloques que desea que dd lea.
oflag = dsync (oflag=dsync): utilice E / S sincronizadas para los datos. No omita esta opción. Esta opción elimina el almacenamiento en caché y le brinda resultados buenos y precisos
conv = fdatasyn: De nuevo, esto le dice a dd que requiera una “sincronización” completa una vez, justo antes de salir. Esta opción es equivalente a oflag=dsync.



Ejecución de script /usr/local/nmis8/admin/polling_summary.pl


totalNodes=985 totalPoll=962 ontime=850 pingOnly=0 1x_late=111 3x_late=0 12x_late=1 144x_late=0
time=17:53:14 pingDown=14 snmpDown=250 badSnmp=23 noSnmp=0 demoted=0
[root@SRVLXLIM32 ~]#





  1. #General collection.
    /usr/local/nmis8/admin/support.pl action=collect  
    
    #If the file is big, we can add the next parameter.
    /usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000
    
    #Device collection.
    /usr/local/nmis8/admin/support.pl action=collect node=<node_name> maxzipsize=900000000





  • No labels