Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

From: Luck, Tony
Date: Thu Jul 22 2021 - 11:21:39 EST


On Thu, Jul 22, 2021 at 06:54:37AM -0700, Jue Wang wrote:
> This patch assumes the UC error consumed in kernel is always the same UC.
>
> Yet it's possible two UCs on different pages are consumed in a row.
> The patch below will panic on the 2nd MCE. How can we make the code works
> on multiple UC errors?
>
>
> > + int count = ++current->mce_count;
> > +
> > + /* First call, save all the details */
> > + if (count == 1) {
> > + current->mce_addr = m->addr;
> > + current->mce_kflags = m->kflags;
> > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > + current->mce_whole_page = whole_page(m);
> > + current->mce_kill_me.func = func;
> > + }
> > ......
> > + /* Second or later call, make sure page address matches the one from first call */
> > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > + mce_panic("Machine checks to different user pages", m, msg);

The issue is getting the information about the location
of the error from the machine check handler to the "task_work"
function that processes it. Currently there is a single place
to store the address of the error in the task structure:

current->mce_addr = m->addr;

Plausibly that could be made into an array, indexed by
current->mce_count, to save multiple addresses (mce_kflags,
mce_ripv, etc. would perhaps need to become arrays too).
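
As a rough illustration of that array idea, here is a user-space
model in C, not kernel code. Everything named here is hypothetical:
the limit MCE_MAX_PENDING, struct mce_record, struct task_mce_state
and mce_save_record() do not exist in the patch. Unlike the patch,
mce_count in this sketch counts distinct pages rather than calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define MCE_MAX_PENDING 4   /* hypothetical limit, not from the patch */

/* Per-error details that the patch keeps as single fields in the task. */
struct mce_record {
	uint64_t addr;
	uint64_t kflags;
	bool     ripv;
	bool     whole_page;
};

/* Stand-in for the relevant part of the task structure. */
struct task_mce_state {
	int               mce_count;   /* distinct pages recorded so far */
	struct mce_record rec[MCE_MAX_PENDING];
};

/*
 * Save the details of one machine check.
 * Returns the slot index used, or -1 when all slots are taken
 * (the point where real code would have to panic).
 */
static int mce_save_record(struct task_mce_state *t,
			   const struct mce_record *r)
{
	assert(t && r);

	/* A repeat hit on an already recorded page reuses its slot. */
	for (int i = 0; i < t->mce_count; i++) {
		if ((t->rec[i].addr >> PAGE_SHIFT) == (r->addr >> PAGE_SHIFT))
			return i;
	}

	if (t->mce_count >= MCE_MAX_PENDING)
		return -1;

	t->rec[t->mce_count] = *r;
	return t->mce_count++;
}
```

The task_work function would then have to walk rec[0..mce_count-1]
instead of looking at a single address. Whether that is enough for
recovery is exactly what would need testing.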

But I don't want to pre-emptively make such a change without
some data to show that cases with multiple errors to different
addresses:
1) Actually occur
2) Would be recovered if we made the change.

The first would be indicated by seeing the:

"Machine checks to different user pages"

panic. For the second, you'd have to code up the change to
use arrays and confirm that it fixes the problem.

-Tony