Hardware monitoring

We have introduced the usual monitoring data source in section of Data Collection. As a monitoring frame, open-falcon can collect monitoring index data in any system and it just need to organize the monitoring data to the normative format of open-falcon.

The data collection of hardware can be done by HWCheck.

HWCheck

Rvadmin hardware monitoring needs to install falcon-agent, only dell machines supported, and the monitoring index: CPU, memory, array card, magnetic disk, virtual disk, array card battery, BIOS, mainboard battery, fan, voltage, mainboard temperature, CPU temperature.

Install:

1.Deploy dell official repo, install srvadmin and other dependecies. You may also pack rpm to simplify the deployment.

#参考: http://linux.dell.com/repo/hardware/latest/
wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

yum install srvadmin-omacore srvadmin-omcommon srvadmin-storage-cli smbios-utils-bin lm_sensors dmidecode cronie
# 启动srvadmin服务
/opt/dell/srvadmin/sbin/srvadmin-services.sh enable
/opt/dell/srvadmin/sbin/srvadmin-services.sh restart
# 配置lm-sensors
echo yes | /usr/sbin/sensors-detect

How to use

Parameter specification:

Direct execution hwcheck with no parameters will print out the detailed monitoring data by default.

hwcheck -d      # print metrics information, ie. data pushed to falcon-agent
        -p      # push data to falcon-agent
        -s      # set the value of STEP in push data,referring to monitoring frequency, 600s by default 
        -m      # single metric

Deploy crontab

Deploy cron to detect on a regular basis, for example:

cat /etc/cron.d/hwcheck
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/dell/srvadmin/sbin:/opt/dell/srvadmin/bin
SHELL=/bin/bash

18 * * * * root /usr/bin/hwcheck -s 3600 -p >/dev/null 2>&1 &

referring to detecting per hour, the corresponding STEP value is set 3600.

Configure alarm strategy in falcon-portal

The metric pushed to falcon-agent by hwcheck all begin with hw, such as hw.cpu_temp. Except for the actual temperature value, the value 0 in metric means fault, 1 warning, 2 OK. For example, deploy the following strategy in portal:

metric/tags/note condition max P
hw.bios [C1E/Cstate is not forbidden in BIOS] all(#2)<2 1 4
hw.board_temp [Motherboard temperature is too high] all(#3)>=35 1 4
hw.cmos_bat [Motherboard battery has a problem] all(#3)<2 1 4
hw.cpu [CPU possible faults] all(#2)==1 1 4
hw.cpu [Major: CPU major fault] all(#2)==0 2 0
hw.fan [fan failure] all(#3)<2 1 4
hw.memory [Memory may be failure] all(#1)==1 1 4
hw.memory [Major: major fault memory] all(#1)==0 2 0
hw.pdisk [Major: magnetic disk major fault] all(#1)==0 2 0
hw.raidcard [Array card warnings] all(#2)==1 1 4
hw.raidcard [Major: array card major fault] all(#1)==0 2 0
hw.raidcard_bat [Array card battery warnings] all(#2)==1 1 4
hw.raidcard_bat [Major: array card battery major fault] all(#2)==0 2 0
hw.vdisk [Disk array warnings] all(#2)==1 1 4
hw.vdisk [Major: disk array major fault] all(#2)==0 2 0
Copyright 2015 - 2018 Xiaomi Inc. all right reserved,powered by Gitbook该文件修订时间: 2017-10-31 11:08:03