Zabbix template & script to monitor ECC errors on Linux.
It uses the kernel's EDAC infrastructure to read the data directly from SysFS.
A single Python script emits all the information needed for discovery and data gathering as one JSON document.
All items are defined as dependent items and extract their values with JSONPath queries.
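
To illustrate the idea, here is a minimal sketch of how such a collector could walk the EDAC tree. It is not the shipped edac.py: the function names (`read_attr`, `collect_edac`) are hypothetical, and it assumes the standard EDAC layout under /sys/devices/system/edac/mc with per-DIMM attribute files named like the fields in this README's JSON example.

```python
import json
from pathlib import Path

# Standard EDAC location in SysFS (assumed default for this sketch).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

# Per-DIMM attribute files; names mirror the fields in the JSON example.
TEXT_ATTRS = ("dimm_label", "dimm_location", "dimm_mem_type")
INT_ATTRS = ("dimm_ce_count", "dimm_ue_count", "size")


def read_attr(path):
    """Return the stripped contents of a SysFS file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def collect_edac(root=EDAC_ROOT):
    """Build a dict keyed by '<controller>_<dimm>' from the EDAC tree."""
    root = Path(root)
    if not root.is_dir():
        # EDAC driver not loaded, or not a Linux host.
        return {}
    result = {}
    # Controllers show up as mc0, mc1, ...; each holds dimmN or rankN entries.
    for mc in sorted(root.glob("mc[0-9]*")):
        entries = list(mc.glob("dimm[0-9]*")) + list(mc.glob("rank[0-9]*"))
        for dimm in sorted(entries):
            name = f"{mc.name}_{dimm.name}"
            entry = {"name": name}
            for attr in TEXT_ATTRS:
                entry[attr] = read_attr(dimm / attr)
            for attr in INT_ATTRS:
                raw = read_attr(dimm / attr)
                # Counters and size are plain decimal integers when present.
                entry[attr] = int(raw) if raw is not None and raw.isdigit() else None
            result[name] = entry
    return result


if __name__ == "__main__":
    print(json.dumps(collect_edac(), indent=2, sort_keys=True))
```

Emitting everything in one document means the agent runs the script once per polling interval, and every dependent item reuses that single result.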
JSON example:

    {
        "mc0_dimm0": {
            "dimm_ce_count": 0,
            "dimm_label": "CPU_SrcID#0_MC#0_Chan#0_DIMM#0",
            "dimm_location": "channel 0 slot 0",
            "dimm_mem_type": "Unbuffered-DDR4",
            "dimm_ue_count": 0,
            "name": "mc0_dimm0",
            "size": 16384
        },
        "mc0_dimm2": {
            "dimm_ce_count": 0,
            "dimm_label": "CPU_SrcID#0_MC#0_Chan#1_DIMM#0",
            "dimm_location": "channel 1 slot 0",
            "dimm_mem_type": "Unbuffered-DDR4",
            "dimm_ue_count": 0,
            "name": "mc0_dimm2",
            "size": 16384
        },
        "mc1_dimm0": {
            "dimm_ce_count": 0,
            "dimm_label": "CPU_SrcID#0_MC#1_Chan#0_DIMM#0",
            "dimm_location": "channel 0 slot 0",
            "dimm_mem_type": "Unbuffered-DDR4",
            "dimm_ue_count": 0,
            "name": "mc1_dimm0",
            "size": 16384
        },
        "mc1_dimm2": {
            "dimm_ce_count": 0,
            "dimm_label": "CPU_SrcID#0_MC#1_Chan#1_DIMM#0",
            "dimm_location": "channel 1 slot 0",
            "dimm_mem_type": "Unbuffered-DDR4",
            "dimm_ue_count": 0,
            "name": "mc1_dimm2",
            "size": 16384
        }
    }

- Low level discovery of:
  - DIMMs/Ranks
- Items:
  - Correctable Errors
  - Uncorrectable Errors
  - Location
  - Size
- Triggers:
  - Correctable Errors
  - Uncorrectable Errors
- Zabbix agent passive checks. Can be converted to active if needed.
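
The relationship between discovery and the dependent items can be modelled in a few lines of Python. This is only a sketch of the concept, not the template's actual preprocessing: the macro name `{#DIMM}` and the helpers `discover` and `extract` are hypothetical, and the sample is a trimmed, altered copy of the JSON example (one correctable error count is set to 2 purely for illustration).

```python
import json

# Trimmed, illustrative copy of the master item value (values altered).
SAMPLE = """
{
  "mc0_dimm0": {"dimm_ce_count": 0, "dimm_ue_count": 0, "size": 16384},
  "mc0_dimm2": {"dimm_ce_count": 2, "dimm_ue_count": 0, "size": 16384}
}
"""


def discover(data):
    """One discovery row per top-level key; '{#DIMM}' is a hypothetical macro."""
    return [{"{#DIMM}": name} for name in sorted(data)]


def extract(data, dimm, field):
    """Python equivalent of a JSONPath of the form $.<dimm>.<field>."""
    return data[dimm][field]


data = json.loads(SAMPLE)
for row in discover(data):
    dimm = row["{#DIMM}"]
    # Each dependent item picks one field out of the shared JSON document.
    print(dimm, extract(data, dimm, "dimm_ce_count"), extract(data, dimm, "dimm_ue_count"))
```

Because both discovery and item values come from the same document, a newly installed DIMM appears as a new top-level key and is picked up without any change on the agent side.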
The template defines two user macros:
- {$EDAC_CE_THRESH}: number of correctable errors to trigger on (>0 by default)
- {$EDAC_UE_THRESH}: number of uncorrectable errors to trigger on (>0 by default)
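
As a rough illustration of what the two thresholds control, the sketch below flags DIMMs whose counters exceed them. It assumes a strict greater-than comparison with both thresholds defaulting to 0, which is one reading of ">0 by default"; the authoritative trigger expressions are the ones in template_edac.xml. The function name `failing_dimms` and the sample counts are hypothetical.

```python
def failing_dimms(data, ce_thresh=0, ue_thresh=0):
    """Return {name: [reasons]} for DIMMs whose counters exceed the thresholds."""
    problems = {}
    for name, dimm in sorted(data.items()):
        reasons = []
        # Assumed semantics: fire when the counter is strictly above the macro.
        if dimm["dimm_ce_count"] > ce_thresh:
            reasons.append("correctable")
        if dimm["dimm_ue_count"] > ue_thresh:
            reasons.append("uncorrectable")
        if reasons:
            problems[name] = reasons
    return problems


# Illustrative counts, not real measurements.
sample = {
    "mc0_dimm0": {"dimm_ce_count": 0, "dimm_ue_count": 0},
    "mc0_dimm2": {"dimm_ce_count": 5, "dimm_ue_count": 1},
}
print(failing_dimms(sample))
print(failing_dimms(sample, ce_thresh=10))
```

Raising the correctable threshold can quiet a DIMM with a known, stable error count while keeping uncorrectable errors at zero tolerance.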
Requirements:
- Tested on Zabbix 5.2, but should work on 4.2+
- Python 3
Installation:
- Place edac.conf in /etc/zabbix/zabbix_agentd.d
- Place edac.py in /etc/zabbix/scripts. You can put it anywhere else, but then you'll have to adjust edac.conf
- Restart zabbix-agentd
- Import template_edac.xml
- You're good to go