Re: [PATCH v2] x86/nmi: Fix some races in NMI uaccess

From: Andy Lutomirski
Date: Wed Aug 29 2018 - 12:14:26 EST


On Wed, Aug 29, 2018 at 8:49 AM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
> On Wed, 2018-08-29 at 08:36 -0700, Andy Lutomirski wrote:
>> On Wed, Aug 29, 2018 at 8:17 AM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>> > On Tue, 2018-08-28 at 20:46 -0700, Andy Lutomirski wrote:
>> > > On Tue, Aug 28, 2018 at 10:56 AM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>> > > > On Mon, 27 Aug 2018 16:04:16 -0700
>> > > > Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> > > >
>> > > > > The 0day bot is still chewing on this, but I've tested it a bit
>> > > > > locally and it seems to do the right thing.
>> > > >
>> > > > Hi Andy,
>> > > >
>> > > > the version of the patch below should fix the bug we talked about
>> > > > in email yesterday. It should automatically cover kernel threads
>> > > > in lazy TLB mode, because current->mm will be NULL, while
>> > > > cpu_tlbstate.loaded_mm should never be NULL.
>> > > >
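(For reference, that check boils down to something like the sketch
below; the names follow this thread, so details may differ from the
actual patch:)

static inline bool nmi_uaccess_ok(void)
{
	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

	/*
	 * A kernel thread in lazy TLB mode has current->mm == NULL,
	 * and loaded_mm is never NULL, so the comparison rejects
	 * NMI uaccess there automatically.
	 */
	return loaded_mm == current->mm;
}
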
>> > >
>> > > That's better than mine. I tweaked it a bit and added some
>> > > debugging, and I got this:
>> > >
>> > >
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/fixes&id=dd956eba16646fd0b15c3c0741269dfd84452dac
>> > >
>> > > I made the loaded_mm handling a little more conservative to make it
>> > > more obvious that switch_mm_irqs_off() is safe regardless of exactly
>> > > when it gets called relative to switching current.
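
(Concretely, the conservative part brackets the CR3 write roughly like
this -- a condensed sketch of the commit above, not the verbatim diff:)

	/*
	 * Tell nmi_uaccess_ok() that we are changing CR3.  The
	 * sentinel can never match current->mm, so uaccess from
	 * NMI fails while the switch is in progress.
	 */
	this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
	barrier();

	/* ... write the new CR3 ... */

	barrier();
	this_cpu_write(cpu_tlbstate.loaded_mm, next);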
>> >
>> > I am not convinced that the dance of writing
>> > cpu_tlbstate.loaded_mm twice, with a barrier on
>> > each end, is useful or necessary.
>> >
>> > At the time switch_mm_irqs_off returns, nmi_uaccess_ok()
>> > will still return false, because we have not switched
>> > "current" to the task that owns the next mm_struct yet.
>> >
>> > We just have to make sure to:
>> > 1) Change cpu_tlbstate.loaded_mm before we manipulate
>> > CR3, and
>> > 2) Change "current" only once enough of the mm stuff
>> > has been switched; __switch_to seems to get that right.
>> >
>> > Between the time switch_mm_irqs_off() sets cpu_tlbstate.loaded_mm
>> > to the next mm and __switch_to() moves over current,
>> > nmi_uaccess_ok() will return false.
>>
>> All true, but I think it stops working as soon as someone starts
>> calling switch_mm_irqs_off() for some other reason, such as during
>> text_poke(). And that was the original motivation for this patch.
>
> How does calling switch_mm_irqs_off() for text_poke()
> affect current->mm and cpu_tlbstate.loaded_mm?
>
> Does current->mm stay the same throughout the entire
> text_poke() chain, while cpu_tlbstate.loaded_mm is the
> only thing that is changed out?

This is exactly what happens. It seemed considerably more complicated
and error-prone to fiddle with current->mm. Instead, the idea was to
turn off IRQs, keep NMIs out of the way, and put everything back the
way it was before the scheduler, any syscalls, etc. could notice that
we were messing around.
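
In outline it would look something like this (hypothetical names --
poking_mm stands in for whatever temporary mm text_poke() ends up
using, and the real sequence will differ in detail):

	unsigned long flags;
	struct mm_struct *prev_mm;

	local_irq_save(flags);

	/*
	 * current->mm is left alone, so nmi_uaccess_ok() returns
	 * false and NMIs keep their hands off user memory.
	 */
	prev_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
	switch_mm_irqs_off(NULL, poking_mm, current);

	/* ... patch the kernel text through the temporary mapping ... */

	switch_mm_irqs_off(NULL, prev_mm, current);
	local_irq_restore(flags);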

--Andy