Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem

From: Jonathan Cameron
Date: Mon May 20 2024 - 07:59:20 EST


On Mon, 20 May 2024 11:54:50 +0100
Shiju Jose <shiju.jose@xxxxxxxxxx> wrote:

> >-----Original Message-----
> >From: Borislav Petkov <bp@xxxxxxxxx>
> >Sent: 11 May 2024 11:17
> >To: Dan Williams <dan.j.williams@xxxxxxxxx>
> >Subject: Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem
> >
> >On Fri, May 10, 2024 at 10:13:41AM -0700, Dan Williams wrote:
> >> In fact this question matches my reaction to the last posting [1], and
> >> led to a much improved cover letter and the "Comparison of scrubbing
> >> features". To your point there are scrub capabilities already in the
> >> kernel and we would need to make a decision about what to do about them.
> >
> >The answer to that question is whether this new userspace usage is going to
> >want to control those too.
> >
> >So
> >
> >"Use case of scrub control feature"
> >
> >from the cover letter is giving two short sentences about what one would do but
> >I'm still meh. A whole subsystem needing a bunch of effort would need a lot
> >more justification.
> >
> >So can anyone please elaborate more on the use cases and why this new thing is
> >needed?
>
> The following are some of the use cases for a generic scrub control subsystem, as given in the cover letter.
> Please add any other use cases that I missed.
>
> 1. Several types of interfaces to HW memory scrubbers have been identified, such as ACPI NVDIMM ARS (Address Range Scrub), CXL memory device patrol scrub, CXL DDR5 ECS, ACPI RAS2 memory scrubbing features, and a software-based memory scrubber (discussed in the community, Reference [5] in the cover letter). Some scrubbers also support controlling (background) patrol scrubbing (ACPI RAS2, CXL) and/or on-demand scrubbing (ACPI RAS2, ACPI ARS). However, the scrub controls vary between memory scrubbers. Thus there is a need for a standard generic ABI and sysfs scrub controls for userspace tools that control the HW and SW scrubbers in the system, for ease of use.
> 2. Scrub controls in user space allow a user space tool to disable and enable background patrol scrubbing, or to change the scrub rate, when that is needed for other purposes, such as performance-aware operations that require background operations to be turned off or reduced.
> 3. On-demand scrubbing can be performed for a specific address range, if the scrubber supports it.
> 4. User space tools can scrub the memory DIMMs regularly, at a configurable scrub rate, using the sysfs scrub controls discussed. This helps
> - to detect uncorrectable memory errors early, before the user accesses the memory, which helps recovery from the detected errors.
> - to reduce the chance of a correctable error becoming uncorrectable.

Just to add one more reason a user space interface is needed.
5. Policy control for hotplugged memory. There is not necessarily a system-wide BIOS
or similar in the loop to control the scrub settings on a CXL device that wasn't
there at boot. What that setting should be is a policy decision, as we are trading
off reliability vs performance - hence it should be under the control of userspace.
As such, 'an' interface is needed. It seems more sensible to try to unify it with
other similar interfaces than to spin yet another one.

>
> Regards,
> Shiju
>