Re: [RFC PATCH 1/2] ARM: mm: support memory-failure

From: David Hildenbrand (Red Hat)
Date: Mon Nov 03 2025 - 11:54:44 EST


On 23.09.25 06:10, Xie Yuanbin wrote:
Arnd Bergmann wrote:
It would be helpful to be more specific about what you
want to do with this.

Are you working on a driver that would actually make use of
the exported interface?

Thanks for your reply.

Yes. In fact, we have developed a hardware component that detects DDR bit
transitions (the detection itself is invisible to software). Once a bit
transition is detected, an interrupt is raised to the CPU.

On the software side, we have developed a driver module (ko) that registers
the interrupt callback and soft-offlines the corresponding physical pages.

In fact, we will export `soft_offline_page` for the module to use (we can
ensure that it is not called in interrupt context), but I have looked at the
code and found that `memory_failure_queue` and `memory_failure`, which are
already exported, can also be used.

Ok

I see only a very small number of
drivers that call memory_failure(), and none of them are
usable on Arm.

I think that not all drivers are in the open source kernel code.
As far as I know, there should be similar third-party drivers in other
architectures that use memory-failure functions, like x86 or arm64.
I am not a specialist in drivers, so if I have made any mistakes,
please correct me.

I'm not familiar with the memory-failure support, but this sounds
like something that is usually done with a drivers/edac/ driver.
There are many SoC specific drivers, including for 32-bit Arm
SoCs.

Have you considered adding an EDAC driver first? I don't know
how the other platforms that have EDAC drivers handle failures,
but I would assume that either that subsystem already contains
functionality for taking pages offline,

I'm very sorry, I tried my best to do this,
but it seems impossible to achieve.
I am a kernel developer rather than a driver developer. I have tried to
communicate with the driver developers, but open-sourcing the driver is very
difficult because proprietary hardware and algorithms are involved.

or this is something
that should be done in a way that works for all of them without
requiring an extra driver.

Yes, I think that the memory-failure feature should not be associated with
specific architectures or drivers.

I have read the memory-failure documentation and code,
and found the following features, which are usable from user space
and are not associated with specific drivers:

1. `/sys/devices/system/memory/soft_offline_page`:
see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-memory-page-offline

This interface only exists when CONFIG_MEMORY_HOTPLUG is enabled, but
ARM cannot enable it.
However, I have read the code and believe that it should not require a
lot of effort to decouple these two, allowing the interface to exist
even if mem-hotplug is disabled.

It's all about the /sys/devices/system/memory/ directory, which traditionally only made sense for memory hotplug. Well, it still does to a large degree.

Not sure whether some user space (chmem?) checks for /sys/devices/system/memory/ to detect memory hotplug capabilities.

But given soft_offline_page is a pure testing mechanism, I wouldn't be too concerned about that for now.
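
For reference, a minimal user-space sketch of driving that sysfs file, assuming the interface is present and the caller is root. The helper names (`format_paddr`, `soft_offline_paddr`) are made up for illustration and are not kernel or libc APIs; the path and the hex input format are the ones given in the ABI document linked above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Path from Documentation/ABI/testing/sysfs-memory-page-offline. */
#define SOFT_OFFLINE_PATH "/sys/devices/system/memory/soft_offline_page"

/*
 * Hypothetical helper: format a physical address as a hex string with a
 * 0x prefix. Returns the string length, or -1 if buf is too small.
 */
int format_paddr(char *buf, size_t len, unsigned long long paddr)
{
    int n = snprintf(buf, len, "0x%llx", paddr);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: write paddr to the given sysfs file to request
 * soft-offlining of the page that contains it.
 * Returns 0 on success or a negative errno value.
 */
int soft_offline_paddr(const char *path, unsigned long long paddr)
{
    char buf[32];
    int n = format_paddr(buf, sizeof(buf), paddr);
    ssize_t written;
    int err;
    int fd;

    assert(n > 0); /* 32 bytes always hold a 64-bit hex value */

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    written = write(fd, buf, (size_t)n);
    if (written == n)
        err = 0;
    else
        err = (written < 0) ? -errno : -EIO;

    close(fd);
    return err;
}
```

A caller would use something like `soft_offline_paddr(SOFT_OFFLINE_PATH, paddr)` with a physical address of its choosing; without the sysfs file (for example, with CONFIG_MEMORY_HOTPLUG disabled) the open fails and the function returns -ENOENT.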


2. The syscall madvise with `MADV_SOFT_OFFLINE/MADV_HWPOISON` flags:

According to the documentation, this interface is currently only used for
testing. However, if the user program can map the specified physical
address, it can actually be used for memory-failure.

It's mostly a testing-only interface. It could be used for other things, but detecting MCEs and handling them properly is really the kernel's responsibility.
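
To make the madvise path concrete, here is a minimal user-space sketch, assuming a kernel built with CONFIG_MEMORY_FAILURE and a caller with CAP_SYS_ADMIN. The helper names (`page_base`, `soft_offline_mapped_page`) are hypothetical, and the fallback value 101 for MADV_SOFT_OFFLINE is the asm-generic one, so it is an assumption on architectures with their own mman definitions.

```c
#define _DEFAULT_SOURCE /* exposes MADV_SOFT_OFFLINE on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101 /* assumption: asm-generic uapi value */
#endif

/* Hypothetical helper: round addr down to the start of its page. */
uintptr_t page_base(uintptr_t addr, size_t page_size)
{
    /* page_size must be a non-zero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~((uintptr_t)page_size - 1);
}

/*
 * Hypothetical helper: ask the kernel to soft-offline the page backing
 * the mapped virtual address addr.
 * Returns 0 on success or a negative errno value.
 */
int soft_offline_mapped_page(void *addr, size_t page_size)
{
    void *base = (void *)page_base((uintptr_t)addr, page_size);

    if (madvise(base, page_size, MADV_SOFT_OFFLINE) != 0)
        return -errno;
    return 0;
}
```

Note that this works on a virtual address in the caller's own address space, so it only helps when the affected physical page happens to be mapped there.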


3. The CONFIG_HWPOISON_INJECT which depends on CONFIG_MEMORY_FAILURE:
see https://docs.kernel.org/mm/hwpoison.html

It seems to allow input of physical addresses and trigger memory-failure,
but according to the doc, it seems to be used only for testing.

Right, all these interfaces are testing only.



Additionally, I noticed that in the memory-failure doc
https://docs.kernel.org/mm/hwpoison.html, it mentions that
"The main target right now is KVM guests, but it works for all kinds of
applications." This seems to confirm my speculation that the
memory-failure feature should not be associated with specific
architectures or drivers.

Can you go into more detail about which exact functionality in memory-failure.c you would be interested in using?

Only soft-offlining or also the other (possibly architecture-specific) handling?

--
Cheers

David