Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem

From: Borislav Petkov
Date: Mon May 27 2024 - 05:22:57 EST


On Mon, May 20, 2024 at 12:58:57PM +0100, Jonathan Cameron wrote:
> > Following are some of the use cases of the generic scrub control
> > subsystem as given in the cover letter. Please add any other use
> > cases that I missed.
> >
> > 1. There are several types of interfaces to HW memory scrubbers
> > identified, such as ACPI NVDIMM ARS (Address Range Scrub), CXL
> > memory device patrol scrub, CXL DDR5 ECS, ACPI RAS2 memory
> > scrubbing features and software based memory scrubber (discussed
> > in the community, Reference [5] in the cover letter). Also, some
> > scrubbers support controlling (background) patrol scrubbing (ACPI
> > RAS2, CXL) and/or on-demand scrubbing (ACPI RAS2, ACPI ARS).
> > However, the scrub controls vary between memory scrubbers. Thus
> > there is a need for a standard generic ABI and sysfs scrub
> > controls for the userspace tools, which control HW and SW
> > scrubbers in the system, for ease of use.

This is all talking about what hw functionality there is. I'm more
interested in the "there is a need" thing. What need? How?

In order to support something like this upstream, I'd like to know how
it is going to be used and whether the major use cases are covered. So
that everyone can benefit from it - not only your employer.

> > 2. Scrub controls in user space allow a user space tool to disable
> > and enable the feature, or change the scrub rate, when that is
> > needed for other purposes, such as performance-aware operations
> > which require background operations to be turned off or reduced.

Who's going to use those scrub controls? Tools? Admins? Scripts?

> > 3. Allows performing on-demand scrubbing of a specific address
> > range, if supported by the scrubber.
> > 4. User space tools that scrub the memory DIMMs regularly at
> > a configurable scrub rate, using the sysfs scrub controls
> > discussed, help to:
> > - detect uncorrectable memory errors early, before the user
> > accesses the memory, which helps to recover from the detected
> > memory errors.
> > - reduce the chance of a correctable error becoming
> > uncorrectable.

Yah, that's not my question. My question is: this new thing is exposed
to userspace, which means it will have to be supported forever. So how
is it going to be used?

And the next question is: is that interface sufficient for those use
cases?

Are we covering the majority of the usage scenarios?

> Just to add one more reason a user space interface is needed.
> 5. Policy control for hotplugged memory. There is not necessarily
> a system wide bios or similar in the loop to control the scrub
> settings on a CXL device that wasn't there at boot. What that
> setting should be is a policy decision as we are trading off
> reliability vs performance - hence it should be in control of
> userspace.
> As such, 'an' interface is needed. Seems more sensible to try and
> unify it with other similar interfaces than spin yet another one.

Yes, I get that: question is, let's say you have that interface. Now
what do you do?

Do you go and start a scrub cycle by hand?

Do you have a script which does that based on some system reports?

Do you automate it? I wanna say yes because that's miles better than
having to explain yet another set of knobs to users.

And so on and so on...
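
To make the automation question a bit more concrete, here is a rough
sketch of the kind of policy a userspace agent might end up
implementing. It is purely illustrative: the function name, the error
threshold and the rate values are all made up, and nothing here
corresponds to the proposed sysfs ABI or to any existing tool. It only
shows the reliability vs. performance trade-off being decided
automatically instead of by hand.

```python
# Hypothetical policy sketch. None of these names, thresholds or rate
# values come from the patch set or from any real kernel interface.

def choose_scrub_rate(corrected_errors_last_hour, latency_sensitive,
                      min_rate=1, default_rate=4, max_rate=12):
    """Pick a scrub rate in arbitrary, made-up units (higher = scrub more often)."""
    if corrected_errors_last_hour >= 10:
        # Correctable errors are piling up: favor reliability over
        # performance, even if a latency-sensitive job is running.
        return max_rate
    if latency_sensitive:
        # Memory looks healthy and the workload is performance
        # critical: back off background scrubbing.
        return min_rate
    # Nothing special going on: stay at the default.
    return default_rate
```

If something like that is the expected consumer, then the question is
whether the interface gives it everything it needs as input (error
reports, workload hints) and as output (per-device rate control).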

I'm trying to get you to imagine the *full* solution and then ask
yourselves whether that new interface is adequate.

Makes more sense?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette