Re: [RFC 07/10] platform/x86/intel/ifs: Create kthreads for online cpus for scan test

From: Kok, Auke
Date: Thu Mar 10 2022 - 16:42:11 EST



On 3/7/22 09:46, Luck, Tony wrote:
> These are software (driver) defined error codes. The rest of the error codes are supplied by
> the hardware. Software-defined error codes were kept at the other end to provide ample space
> in case (future) hardware decides to provide extended error codes.
>
> Why put them in the same number space? Separate software results from
> the raw hardware results and have a separate mechanism to convey each.
>
> We wanted to include them in the "details" file, which is otherwise a direct copy of
> the SCAN_STATUS MSR. Making sure the software error codes didn't overlap
> with any h/w generated codes seemed like a good idea.
>
> But maybe we should have done this with additional string values in the status
> file:
>
> Current:
>
> pass
> untested
> fail
>
> Add a couple of new options for the s/w cases:
>
> sw_timeout
> sw_retries_exceeded

We've already made a userspace implementation that uses this API as part of opendcdiag:

https://github.com/opendcdiag/opendcdiag/commit/0cbfcee30e0666b0f79a2e452d7f8167d2a0cb90

What I really like is that with this proposed API, we can unambiguously determine whether "the core failed" or "everything is fine, for now" by reading a single file. I'd hate to see this file become unusable because its content changes from "pass" to "sw_timeout" or, even worse, from "fail" to "sw_timeout". That would render it useless for the purpose I think our users will be reading it for.
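To illustrate why a fixed set of values matters to consumers, here is a minimal sketch of a reader that classifies one per-CPU status file. It assumes the file holds exactly one of "pass", "untested" or "fail" (plus a trailing newline); the CoreState names are hypothetical and not taken from opendcdiag. Note how a value like "sw_timeout" can only land in UNKNOWN, so a core that previously read "fail" would no longer be reported as failed.

```python
from enum import Enum, auto


class CoreState(Enum):
    HEALTHY = auto()   # status file reads "pass"
    FAILED = auto()    # status file reads "fail"
    UNTESTED = auto()  # status file reads "untested"
    UNKNOWN = auto()   # any value this reader was not written for


# The three values currently defined for the status file.
_KNOWN_VALUES = {
    "pass": CoreState.HEALTHY,
    "fail": CoreState.FAILED,
    "untested": CoreState.UNTESTED,
}


def classify_status(raw: str) -> CoreState:
    """Map the raw contents of a status file to a CoreState."""
    # e.g. "sw_timeout" is neither pass nor fail, so the answer is lost.
    return _KNOWN_VALUES.get(raw.strip(), CoreState.UNKNOWN)


def read_core_state(path: str) -> CoreState:
    """Read and classify one per-CPU status file (path supplied by caller)."""
    with open(path) as f:
        return classify_status(f.read())
```

A caller would pass the per-CPU status path defined by the patch series to read_core_state(); any path used for testing is a placeholder, not the driver's actual sysfs layout.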

So, my preference would be to keep this file functioning as-is in this patch series.

I would think that some sort of expandable "statistics" file would be a better way to output various metrics:

```
sw_timeout: 0
sw_retries_exceeded: 2
runs: 42
first_run: 1405529347
last_run: 1646948140
<etc..>
```

This is just a suggested alternative for additional or incompatible output values, or for a more complex, dynamic format.
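The "statistics" format above is only a suggestion in this thread, so the following is a hedged sketch of how a userspace reader might parse it. It assumes one "name: integer" pair per line and skips blank or malformed lines; parse_statistics is a hypothetical helper, not an existing opendcdiag or kernel interface.

```python
from __future__ import annotations


def parse_statistics(text: str) -> dict[str, int]:
    """Parse "key: value" lines into a dict, ignoring anything malformed."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line; skip rather than fail
        try:
            stats[key.strip()] = int(value.strip())
        except ValueError:
            continue  # non-integer value; skip it
    return stats
```

Because a reader only looks up the keys it knows (e.g. stats.get("runs")), new counters can be appended later without breaking existing consumers, which is the expandability this alternative is meant to provide.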

I don't have any use for these values in opendcdiag. If someone does, perhaps they should chime in.


Auke