@TProhofsky TProhofsky commented Jan 27, 2021

Managing disk metrics indexed by a temporary block device name currently causes several issues:

  1. Difficult or impossible to associate metrics of multiple drives before and after a reboot
  2. No methods to retain drive life status as drives are swapped between servers
  3. Difficult to aggregate statistics from nodes or virtual machines simultaneously accessing the same drive

This change addresses these issues by using udevadm to look up the drive serial number and, if one is available, using it in place of the block device name in the device label. Before the change, the output looks like this:
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="sda"} 2.2736896e+07
node_disk_read_bytes_total{device="sdb"} 2.2736896e+07
node_disk_read_bytes_total{device="dm-0"} 1.630419968e+09
node_disk_read_bytes_total{device="sr0"} 0
node_disk_read_bytes_total{device="vda"} 1.682236416e+09
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="sda"} 0.499
node_disk_read_time_seconds_total{device="sdb"} 0.518
node_disk_read_time_seconds_total{device="dm-0"} 130.733
node_disk_read_time_seconds_total{device="sr0"} 0
node_disk_read_time_seconds_total{device="vda"} 132.032

After the change, the output looks like this:
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="HLN03000"} 2.2736896e+07
node_disk_read_bytes_total{device="HLN03002"} 2.2736896e+07
node_disk_read_bytes_total{device="dm-0"} 1.630419968e+09
node_disk_read_bytes_total{device="sr0"} 0
node_disk_read_bytes_total{device="vda"} 1.682236416e+09
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="HLN03000"} 0.499
node_disk_read_time_seconds_total{device="HLN03002"} 0.518
node_disk_read_time_seconds_total{device="dm-0"} 130.733
node_disk_read_time_seconds_total{device="sr0"} 0
node_disk_read_time_seconds_total{device="vda"} 132.032

The test fixture updates are not included yet; I am waiting for general agreement before proceeding with the change.

@discordianfish
Member

discordianfish commented Jan 29, 2021

Seems like a reasonable request but:

  • It's a breaking change
  • Many people care about the block device name, not the serial number
  • We don't allow executing commands in the node-exporter, so we'd need to gather the serial number some other way (one that doesn't require root)

So instead I would propose:

  • Try getting the serial number as non-root by using procfs or syscalls (if it requires parsing, etc., it should go into https://github.com/prometheus/procfs)
  • If that's possible, expose it as an info time series (e.g. node_disk_info{device="sda", serial_number="HLN03000"}); this will allow you to join other disk metrics with it and filter based on the serial number

@SuperQ
Member

SuperQ commented Feb 3, 2021

There are also a bunch of different udev aliases worth considering.

There was a proposal a while back to add a node_disk_info metric that would allow for joins against the disk stats, depending on whether people wanted /dev/disk by-label, by-partlabel, by-partuuid, by-uuid, etc.

Original issue: #304

We should probably open a new proposal issue to continue on the ideas from that thread.

@SuperQ
Member

SuperQ commented Feb 5, 2021

Sadly, this isn't something we can use due to the issues above.
