Re: [RFC PATCH -next V2 7/7] arm64: add pagecache reading to machine check safe

From: Robin Murphy
Date: Fri Apr 08 2022 - 07:11:19 EST

Next message: Christian König: "Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some games stopped working."
Previous message: Krzysztof Kozlowski: "[PATCH] PM: opp: simplify with dev_err_probe()"
Next in thread: Tong Tiangen: "Re: [RFC PATCH -next V2 7/7] arm64: add pagecache reading to machine check safe"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2022-04-08 03:43, Tong Tiangen wrote:

在 2022/4/7 23:53, Robin Murphy 写道:

On 2022-04-07 15:56, Tong Tiangen wrote:

在 2022/4/6 19:27, Mark Rutland 写道:

On Wed, Apr 06, 2022 at 09:13:11AM +0000, Tong Tiangen wrote:

When user process reading file, the data is cached in pagecache and
the data belongs to the user process, When machine check error is
encountered during pagecache reading, killing the user process and
isolate the user page with hardware memory errors is a more reasonable
choice than kernel panic.

The __arch_copy_mc_to_user() in copy_to_user_mc.S is largely borrows
from __arch_copy_to_user() in copy_to_user.S and the main difference
is __arch_copy_mc_to_user() add the extable entry to support machine
check safe.

As with prior patches, *why* is the distinction necessary?

This patch adds a bunch of conditional logic, but *structurally* it doesn't
alter the handling to be substantially different for the MC and non-MC cases.

This seems like pointless duplication that just makes it harder to maintain
this code.

Thanks,
Mark.

Agreed, The implementation here looks a little ugly and harder to maintain.

The purpose of my doing this is not all copy_to_user can be recovered.

A memory error is consumed when reading pagecache using copy_to_user.
I think in this scenario, only the process is affected because it can't read
pagecache data correctly. Just kill the process and don't need the whole
kernel panic.

So I need two different copy_to_user implementation, one is existing __arch_copy_to_user,
this function will panic when consuming memory errors. The other one is this new helper
__arch_copy_mc_to_user, this interface is used when reading pagecache. It can recover from
consume memory error.

OK, but do we really need two almost-identical implementations of every function where the only difference is how the exception table entries are annotated? Could the exception handler itself just figure out who owns the page where the fault occurred and decide what action to take as appropriate?

Robin.

Thank you, Robin.

I added this call path in this patchset: do_sea() -> fixup_exception(), the purpose is to provide a chance for sea fault to fixup (refer patch 3/7).

If fixup successful, panic can be avoided. Otherwise, panic can be eliminated according to the original logic.

fixup_exception() will set regs->pc and jump to regs->pc when the context recovery of an exception occurs.

If mc-safe-fixup added to __arch_copy_to_user(), in *non pagecache reading* scenario encount memory error when call __arch_copy_to_user() , do_sea() -> fixup_exception() will not return fail and will miss the panic logic in do_sea().

So I add new helper __arch_copy_mc_to_user() and add mc-safe-fixup to this helper, which can be used in the required scenarios. At present, there is only one pagecache reading scenario, other scenarios need to be developed.

This is my current confusion. Of course, I will think about the solution to solve the duplicate code problem.

Right, but if the point is that faults in pagecahe pages are special, surely __do_kernel_fault() could ultimately figure out from the address whether that's the case or not?

In general, if the principle is that whether a fault is recoverable or not depends on what was being accessed, then to me it seems fundamentally more robust to base that decision on details of the fault that actually occurred, rather than what the caller thought it was supposed to be doing at the time.

Thanks,
Robin.

Next message: Christian König: "Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some games stopped working."
Previous message: Krzysztof Kozlowski: "[PATCH] PM: opp: simplify with dev_err_probe()"
Next in thread: Tong Tiangen: "Re: [RFC PATCH -next V2 7/7] arm64: add pagecache reading to machine check safe"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]