Hardware monitoring

1.1. Hardware monitoring

The Data Collection section introduced the common monitoring data sources. As a monitoring framework, open-falcon can collect metrics from any system; the data only needs to be organized into open-falcon's standard format.
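As a rough illustration of that format, the sketch below builds one metric as a JSON payload and shows, in a comment, how it could be pushed to a local falcon-agent. The helper name `build_metric`, the sample metric and tag values, and the agent address `http://127.0.0.1:1988/v1/push` (the agent's usual default HTTP port) are assumptions for illustration; check them against your own agent configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one metric in the JSON layout accepted by
# falcon-agent's push interface.
# Arguments: endpoint metric value step [tags]
build_metric() {
    local endpoint=$1 metric=$2 value=$3 step=$4 tags=${5:-}
    printf '[{"endpoint":"%s","metric":"%s","timestamp":%d,"step":%d,"value":%s,"counterType":"GAUGE","tags":"%s"}]\n' \
        "$endpoint" "$metric" "$(date +%s)" "$step" "$value" "$tags"
}

# Example push (assumes falcon-agent listens on 127.0.0.1:1988):
# build_metric "$(hostname)" custom.queue_length 42 60 "queue=mail" |
#     curl -s -X POST -d @- http://127.0.0.1:1988/v1/push
```

The `step` field should match how often the value is actually reported, since open-falcon uses it to decide when data points are missing.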

Hardware metrics can be collected with HWCheck.

1.2. HWCheck

HWCheck hardware monitoring relies on Dell's srvadmin tools and requires falcon-agent to be installed. Only Dell machines are supported. The monitored items are: CPU, memory, RAID card, physical disk, virtual disk, RAID card battery, BIOS, motherboard battery, fan, voltage, motherboard temperature, and CPU temperature.

1.3. Install

1. Set up the official Dell repository, then install srvadmin and the other dependencies. You may also build an RPM package to simplify deployment.

# Reference: http://linux.dell.com/repo/hardware/latest/
wget -q -O - http://linux.dell.com/repo/hardware/latest/bootstrap.cgi | bash

yum install srvadmin-omacore srvadmin-omcommon srvadmin-storage-cli smbios-utils-bin lm_sensors dmidecode cronie
# Enable and start the srvadmin services
/opt/dell/srvadmin/sbin/srvadmin-services.sh enable
/opt/dell/srvadmin/sbin/srvadmin-services.sh restart
# Configure lm-sensors
echo yes | /usr/sbin/sensors-detect
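After installation it can be useful to confirm that the command-line tools are actually on the PATH before scheduling any checks. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and the tool names in the commented example (`omreport`, `sensors`, `dmidecode`) are assumptions about what the packages above provide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example check (tool names are assumptions, adjust to your installation):
# PATH=$PATH:/opt/dell/srvadmin/sbin:/opt/dell/srvadmin/bin
# missing_tools omreport sensors dmidecode
```

An empty result means every listed tool was found.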

1.4. How to use

Parameter specification:

Running hwcheck with no parameters prints the detailed monitoring data by default.

hwcheck -d      # print metric information, i.e. the data pushed to falcon-agent
        -p      # push data to falcon-agent
        -s      # set the STEP value of the pushed data, i.e. the monitoring interval in seconds (600 by default)
        -m      # single metric

1.5. Deploy crontab

Deploy a cron job to run the check on a regular basis, for example:

cat /etc/cron.d/hwcheck
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/dell/srvadmin/sbin:/opt/dell/srvadmin/bin
SHELL=/bin/bash

18 * * * * root /usr/bin/hwcheck -s 3600 -p >/dev/null 2>&1 &

This runs the check once per hour, so the corresponding STEP value is set to 3600.
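The key point is that the cron interval and the STEP value must describe the same period. The sketch below generates a cron entry for an interval given in hours and derives STEP from it; the function name `render_hwcheck_cron` is hypothetical, and only the hwcheck flags shown above (`-s`, `-p`) are used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a cron.d entry that runs hwcheck every N hours
# at the given minute, with STEP set to the same interval in seconds.
# Arguments: minute [hours]   (hours defaults to 1)
render_hwcheck_cron() {
    local minute=$1 hours=${2:-1}
    local step=$(( hours * 3600 ))
    local hour_field='*'
    if (( hours > 1 )); then
        hour_field="*/$hours"
    fi
    echo "$minute $hour_field * * * root /usr/bin/hwcheck -s $step -p >/dev/null 2>&1"
}

# Example: check every 6 hours at minute 5.
# render_hwcheck_cron 5 6
```

Intervals that divide 24 evenly (2, 3, 4, 6, 8, 12) keep the runs evenly spaced across days, which keeps the reported STEP accurate.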

1.6. Configure alarm strategy in falcon-portal

All metrics pushed to falcon-agent by hwcheck begin with hw, such as hw.cpu_temp. Except for metrics that report an actual temperature, a value of 0 means fault, 1 means warning, and 2 means OK. For example, configure the following strategies in the portal:

metric/tags/note	condition	max	P
hw.bios [C1E/C-states are not disabled in BIOS]	all(#2)<2	1	4
hw.board_temp [Motherboard temperature is too high]	all(#3)>=35	1	4
hw.cmos_bat [Motherboard battery has a problem]	all(#3)<2	1	4
hw.cpu [Possible CPU fault]	all(#2)==1	1	4
hw.cpu [Major: CPU major fault]	all(#2)==0	2	0
hw.fan [Fan failure]	all(#3)<2	1	4
hw.memory [Possible memory fault]	all(#1)==1	1	4
hw.memory [Major: memory major fault]	all(#1)==0	2	0
hw.pdisk [Major: physical disk major fault]	all(#1)==0	2	0
hw.raidcard [RAID card warning]	all(#2)==1	1	4
hw.raidcard [Major: RAID card major fault]	all(#1)==0	2	0
hw.raidcard_bat [RAID card battery warning]	all(#2)==1	1	4
hw.raidcard_bat [Major: RAID card battery major fault]	all(#2)==0	2	0
hw.vdisk [Disk array warning]	all(#2)==1	1	4
hw.vdisk [Major: disk array major fault]	all(#2)==0	2	0
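To make the value encoding and the conditions concrete, the sketch below maps a status value to its meaning and evaluates a condition of the form all(#N)<limit over a list of recent values. Both function names are hypothetical, and the second one is only an approximation of how the portal evaluates such a condition (the last N points must all satisfy it); it handles integer status values only.

```shell
#!/usr/bin/env bash
# Map a hwcheck status value to its meaning (0 fault, 1 warning, 2 OK).
hw_status_label() {
    case "$1" in
        0) echo "fault" ;;
        1) echo "warning" ;;
        2) echo "ok" ;;
        *) echo "unknown" ;;
    esac
}

# Hypothetical helper: succeed if the last N values (listed oldest to newest)
# are all below LIMIT, mimicking a condition such as all(#3)<2.
# Arguments: N LIMIT value...
all_last_below() {
    local n=$1 limit=$2
    shift 2
    local values=("$@")
    local count=${#values[@]}
    # Not enough history yet: the condition cannot hold.
    if (( count < n )); then
        return 1
    fi
    local i
    for (( i = count - n; i < count; i++ )); do
        if (( values[i] >= limit )); then
            return 1
        fi
    done
    return 0
}

# Example: the last three fan readings were 1, 1, 0, so all(#3)<2 would fire.
# all_last_below 3 2 2 1 1 0 && echo "alarm"
```

Requiring several consecutive points, as in all(#3), avoids alarms on a single transient reading, which matters more when checks run only once per hour.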