Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user recovery

From: Luck, Tony
Date: Tue Jan 19 2021 - 18:59:49 EST


On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > > static void kill_me_now(struct callback_head *ch)
> > > {
> > > + p->mce_count = 0;
> > > force_sig(SIGBUS);
> > > }
> >
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> >
> > error: ‘p’ undeclared (first use in this function)
> >
> > Option a) (just like kill_me_maybe)
> >
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> >
> > Option b) (simpler ... not sure why PeterZ did the container_of thing
> >
> > current->mce_count = 0;
>
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a
dozen machine checks in each of:
1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted.

But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/conumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery
of one machine cehck before injecting the next error). So there aren't
any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll
be poking around at the patch to figure out what is wrong.

-Tony