We have a daemon script that collects correctable/uncorrectable errors from EDAC sysfs and reports them to an Amazon service that allows us to take action on specific error thresholds.
Yep, I think we're in agreement here. I believe the important question
is whether you need to get error information from multiple sources
together in order to do proper recovery, or whether doing it per error
source suffices.
And I think the actual use cases could/should dictate our
driver/orchestrator design.
Thus my question: how are you guys planning on tying all that error info
the drivers report into the whole system design?
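
For concreteness, the per-error-source collection described at the top
(a daemon reading EDAC sysfs counters and acting on thresholds) could be
sketched roughly as below. This is only an illustration, not the actual
daemon: the function names and the threshold values are made up, and the
reporting step to the external service is left out entirely. The only
thing taken as given is the standard EDAC layout, i.e. ce_count and
ue_count files under /sys/devices/system/edac/mc/mcN/.

```python
import glob
import os

# Standard EDAC sysfs location for memory controllers.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_counter(path):
    """Return the integer stored in a sysfs counter file, or 0 if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def collect_counts(root=EDAC_MC_ROOT):
    """Map each memory controller (mc0, mc1, ...) to its CE/UE counts."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        counts[os.path.basename(mc_dir)] = {
            "ce": read_counter(os.path.join(mc_dir, "ce_count")),
            "ue": read_counter(os.path.join(mc_dir, "ue_count")),
        }
    return counts


def over_threshold(counts, ce_limit=100, ue_limit=1):
    """Return controllers whose counters reached the (hypothetical) limits."""
    return [
        mc
        for mc, c in counts.items()
        if c["ce"] >= ce_limit or c["ue"] >= ue_limit
    ]


if __name__ == "__main__":
    flagged = over_threshold(collect_counts())
    print("controllers over threshold:", flagged)
```

Note that this looks at each memory controller in isolation, which is
exactly the per-error-source model; correlating several sources before
deciding on recovery would need a layer on top of it.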