Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

From: Jue Wang
Date: Thu Jul 22 2021 - 23:48:01 EST


On Thu, Jul 22, 2021 at 5:14 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> I'm not aware of, nor expecting to find, places where the kernel
> tries to access user address A and hits poison, and then tries to
> access user address B (without returning to user between access
> A and access B).
This seems like a reasonably easy scenario to hit.

A user space app allocates a buffer of xyz KB/MB/GB.

Unfortunately the DIMMs are bad, and multiple cache lines on different
pages have uncorrectable errors in them.

Then the user space app tries to write the contents of the entire buffer
to some file with a single write(2) call.

We have some test cases like this that reliably reproduce the infinite MCE loop.
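
For illustration only, the user space side of such a scenario could look
roughly like the sketch below. This is not the actual test case: the
helper name write_whole_buffer(), the fill pattern, and the path and size
passed to it are hypothetical, and the step that makes several pages of
the buffer poisoned (bad hardware or error injection) is not shown.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate len bytes, touch them so the pages are
 * actually backed by memory, then hand the whole buffer to one write(2).
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t write_whole_buffer(const char *path, size_t len)
{
	char *buf = malloc(len);
	if (!buf)
		return -1;

	memset(buf, 0x5a, len);	/* fault in every page of the buffer */

	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	/*
	 * One syscall covering the entire buffer: the kernel copies from
	 * many user pages without returning to user space in between.
	 */
	ssize_t n = write(fd, buf, len);

	close(fd);
	free(buf);
	return n;
}
```

On healthy memory this simply returns len. The interesting case is when
more than one page of buf is poisoned, because all of those pages are
then read by the kernel inside the same write(2) call.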

I believe the key point here is that this will happen in the real world.
In particular, bit flips tend to be clustered physically: same DIMM row,
same DIMM column, same rank, same device, etc.
>
> -Tony