Re: [tip: x86/fpu] x86/fpu: Deactivate FPU state after failure during state load

From: Sebastian Andrzej Siewior
Date: Tue Jan 07 2020 - 16:14:18 EST


On 2020-01-07 10:41:52 [-1000], Andy Lutomirski wrote:
> Wow, __fpu__restore_sig is a mess. We have __copy_from... that is
> Obviously Incorrect (tm) even though it's not obviously exploitable.
> (It's wrong because the *wrong pointer* is checked with access_ok().)
> We have a fast path that will execute just enough of the time to make
> debugging the slow path really annoying. (We should probably delete
> the fast path.) There is a pagefault_disable() call in there mostly to
> confuse people. (So we take a fault and sleep -- big deal. We have
> temporarily corrupt state, but no one will ever read it. The retry
> after sleeping will clobber xstate, but lazy save is long gone and
> this should be fine now. The real issue is that, if we're preempted
> after a successful restore, then the new state will get
> lost.)

There is preempt_disable() as part of FPU locking since we can't change
the content of the FPU registers (the CPU's or those within the task's
state) and get interrupted by task preemption. With preemption disabled
we can't take a page fault.

We need to load the page from userland, which may fault. The context
switch saves the _current_ FPU state only if TIF_NEED_FPU_LOAD is cleared.
This needs to happen atomically.

The fast path may fail if the stack is not faulted in (custom stack,
madvise(,, MADV_DONTNEED)).
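
To illustrate the madvise() case, here is a minimal userland sketch (not
part of the thread, and not a reproducer for the signal restore bug). It
only shows how a page that was already touched can become non-resident
again, so that the next access has to fault. The struct and the function
name dontneed_demo() are made up for this example. Note that for a
private anonymous mapping the old contents are gone after
MADV_DONTNEED and the page reads back as zeros.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type for this sketch; not a kernel or libc structure. */
struct dontneed_result {
	int resident_before;      /* mincore() residency bit after touching the page */
	int resident_after;       /* residency bit right after MADV_DONTNEED */
	unsigned char byte_after; /* first byte read back after the refault */
};

/* Returns 0 on success, -1 if any system call fails. */
int dontneed_demo(struct dontneed_result *out)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;
	size_t len;
	char *p;

	assert(pagesz > 0);
	len = (size_t)pagesz;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Touch the page so that it is faulted in. */
	memset(p, 0xaa, len);
	if (mincore(p, len, &vec) != 0)
		goto fail;
	out->resident_before = vec & 1;

	/* Drop the page; the mapping stays valid but is no longer backed. */
	if (madvise(p, len, MADV_DONTNEED) != 0)
		goto fail;
	if (mincore(p, len, &vec) != 0)
		goto fail;
	out->resident_after = vec & 1;

	/* This read faults in a fresh zero-filled page. */
	out->byte_after = *(volatile unsigned char *)p;

	munmap(p, len);
	return 0;

fail:
	munmap(p, len);
	return -1;
}
```

Calling dontneed_demo() from a small main() and printing the three
fields is enough to watch the residency bit change. Some sandboxed
environments report every mapped page as resident, so resident_after
should be treated as informational.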

> So either we should delete the fast path or we should make it work
> reliably and delete the slow path. And we should get rid of the
> __copy. And we should have some test cases.

Without the fastpath the average case is too slow.
People-complained-about-this-slow. That is why we ended up with the
fastpath in the last revision of the series.

The Go people contributed a testcase. Maybe I should hack it up so
that we trigger each path and post it, since that obviously did not happen.
Boris, do you remember why we did not include their testcase yet?

> BTW, how was the bug in here discovered? It looks like it only
> affects signal restore failure, which is usually not survivable unless
> the user program is really trying.

The glibc test suite.

Sebastian