Re: [PATCH v13] x86, mce: Add memcpy_trap()

From: Linus Torvalds
Date: Thu Feb 25 2016 - 19:58:25 EST


On Thu, Feb 25, 2016 at 2:11 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> do_machine_check uses IST, the memory failure code can sleep, and you
> can't sleep in IST context. There's a special escape that lets
> memory_failure sleep *if* it came from user mode.

So?

Just save it away in a list (we've got the NMI-safe lists and
everything), and make sure to cause a work to be scheduled eventually
or something. People do things in NMI's, we can do this in a machine
check.

That's what we always do for any other interrupt-time thing that want
to sleep, it's not even like this is unusual or special.

But at NO point does it make sense to try to handle it from
"copy_from_user()". There are so many reason *not* to do it there, but
the most basic one is that the error doesn't necessarily happen at
copy_to_user() time in the first place. If user space has a piece of
RAM that can cause machine checks, then normal user memory access may
have caused the MC. And that's much _much_ more likely than
"copy_from_user()", which is such a rare little special case.

copy_from_user() is simply not that special. Having it do some odd
magic error handling that nobody else does is just insane.

> Here are different some ideas I don't like.:
>
> 1. The machine check does an IPI-to-self and the failure code runs in
> IRQ context.

I really think that's way too over-engineered.

> 2. The machine check code rewrites the return stack to inject a
> function call. I don't love this.

We already *have* this, for special cases like user accesses. That's
what our exception tables are all about.

So for our bog-standard normal copy_from_user(), we already can abort
the copy in the middle. Absolutely no new infrastructure needed. No
change to copy_from_user(), no nothing.

For other cases, when we don't have an exception table entry? Do the
same thing the page fault code does. If it's in the kernel, we're
going to be in trouble. If it's in user space, send a SIGBUS.

But again, none of that has anything to do with changing the existing
copy_from_user(). The only change I see that makes sense is to have
some way to change the error code.

> 3. Drop the idea of sending an immediate sigbus and do it with
> task_work. Maybe this is bad for some reason other than code
> messiness.

For copy_from_user(), we get the "immediate" reaction exactly thanks
to the whole exception table mechanism that that function already
uses.

For everything else, it's going to have to be delayed or killed some
way. And yes, the SIGBUS will be delayed until it gets back to user
space.

The kernel getting a machine check while it's accessing kernel data
structures? It's going to be messy. It's going to fail. Tough. That's
kind of inevitable. There is no sane way to recover. The only way to
recover is to have reliable hardware, or just killing the machine
entirely and starting again (in a data center model). In many other
situations, you're likely going to just force an oops and kill the
machine. Or maybe force an oops, and decide to just hope for the best,
and ignore the error entirely.

Because what choice do you have?

None of this is new.

> 4. Change the entry code so machine check runs on the normal stack if
> it hits with IRQs on.

I really think you concentrate too much on the small details. I don't
see why an IST couldn't use the exception handling mechanism, for
example. I don't see why an IST couldn't just add the list entry and
make some later thing happen. We use ist's for debug traps already,
it's not like sending a signal or looking up the exception table is
fundamentally hard there, or queueing a list or whatever.

Linus