Re: [PATCH RFC V2 17/17] x86/entry: Preserve PKRS MSR across exceptions

From: Ira Weiny
Date: Thu Jul 23 2020 - 18:04:39 EST


On Thu, Jul 23, 2020 at 09:53:20PM +0200, Thomas Gleixner wrote:
> Ira,
>
> ira.weiny@xxxxxxxxx writes:
>
> > ...
> > // ref == 0
> > dev_access_enable() // ref += 1 ==> disable protection
> > irq()
> > // enable protection
> > // ref = 0
> > _handler()
> > dev_access_enable() // ref += 1 ==> disable protection
> > dev_access_disable() // ref -= 1 ==> enable protection
> > // WARN_ON(ref != 0)
> > // disable protection
> > do_pmem_thing() // all good here
> > dev_access_disable() // ref -= 1 ==> 0 ==> enable protection
>
> ...
>
> > First I'm not sure if adding this state to idtentry_state and having
> > that state copied is the right way to go.
>
> Adding the state to idtentry_state is fine at least for most interrupts
> and exceptions. Emphasis on most.
>
> #PF does not work because #PF can schedule.


Merging with your other response:

<quote: https://lore.kernel.org/lkml/87o8o6vvt0.fsf@xxxxxxxxxxxxxxxxxxxxxxx/>
Only from #PF, but after the fault has been resolved and the tasks is
scheduled in again then the task returns through idtentry_exit() to the
place where it took the fault. That's not guaranteed to be on the same
CPU. If schedule is not aware of the fact that the exception turned off
stuff then you surely get into trouble. So you really want to store it
in the task itself then the context switch code can actually see the
state and act accordingly.
</quote>

I think, after fixing my code (see below), using idtentry_state could still
work. If the per-cpu cache and the MSR is updated in idtentry_exit() that
should carry the state to the new cpu, correct?

The exception may have turned off stuff but it is a bug if it did not also turn
it back on.

Ie the irq should not be doing:

kmap_atomic()
return;

... without a kunmap_atomic().

That is why I put the WARN_ON_ONCE() to ensure the ref count has been properly
returned to 0 by the exception handler.

FWIW the fixed code is working for my tests...

>
> > It seems like we should start passing this by reference instead of
> > value. But for now this works as an RFC. Comments?
>
> Works as in compiles, right?
>
> static void noinstr idt_save_pkrs(idtentry_state_t state)
> {
> state.foo = 1;
> }
>
> How is that supposed to change the caller state? C programming basics.

<sigh> I am so stupid. I was not looking at this particular case but you are
100% correct... I can't believe I did not see this.

In the above statement I was only thinking about the extra overhead I was
adding to idtentry_enter() and the callers of it.

"C programming basics" indeed... Once again sorry...

>
> > Second, I'm not 100% happy with having to save the reference count in
> > the exception handler. It seems like a very ugly layering violation but
> > I don't see a way around it at the moment.
>
> That state is strict per task, right? So why do you want to store it
> anywhere else that in task/thread storage. That solves your problem of
> #PF scheduling nicely.

The problem is with the kmap code. If an exception handler calls kmap_atomic()
[more importantly if they nest kmap_atomic() calls] the kmap code will need to
keep track of that re-entrant call. With a separate reference counter, the
kmap code would have to know it is in irq context or not.

Here I'm attempting to make the task 'current->dev_page_access_ref' counter
serve double duty so that the kmap code is agnostic to the context it is in.

>
> > Third, this patch has gone through a couple of revisions as I've had
> > crashes which just don't make sense to me. One particular issue I've
> > had is taking a MCE during memcpy_mcsafe causing my WARN_ON() to fire.
> > The code path was a pmem copy and the ref count should have been
> > elevated due to dev_access_enable() but why was
> > idtentry_enter()->idt_save_pkrs() not called I don't know.
>
> Because #MC does not go through idtentry_enter(). Neither do #NMI, #DB, #BP.

And the above probably would work if I knew how to code in C... :-/ I'm so
embarrassed.

>
> > Finally, it looks like the entry/exit code is being refactored into
> > common code. So likely this is best handled somewhat there. Because
> > this can be predicated on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS and handled
> > in a generic fashion. But that is a ways off I think.
>
> The invocation of save/restore might be placed in generic code at least
> for the common exception and interrupt entries.

I think you are correct. I'm testing now...

>
> > +static void noinstr idt_save_pkrs(idtentry_state_t state)
>
> *state. See above.

Yep... :-/

>
> > +#else
> > +/* Define as macros to prevent conflict of inline and noinstr */
> > +#define idt_save_pkrs(state)
> > +#define idt_restore_pkrs(state)
>
> Empty inlines do not need noinstr because they are optimized out. If you
> want inlines in a noinstr section then use __always_inline

Peter made the same comment so I've changed to __always_inline.

>
> > /**
> > * idtentry_enter - Handle state tracking on ordinary idtentries
> > * @regs: Pointer to pt_regs of interrupted context
> > @@ -604,6 +671,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
> > return ret;
> > }
> >
> > + idt_save_pkrs(ret);
>
> No. This really has no business to be called before the state is
> established. It's not something horribly urgent and write_pkrs() is NOT
> noinstr and invokes wrmsrl() which is subject to tracing.

I don't understand. I'm not calling it within intrumentation_{begin,end}() so
does that mean I can remove the noinstr? I think I can actually.

Or do you mean call it at the end of idtentry_enter()? Like this:

@@ -672,8 +672,6 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
return ret;
}

- idt_save_pkrs(ret);
-
/*
* If this entry hit the idle task invoke rcu_irq_enter() whether
* RCU is watching or not.
@@ -710,7 +708,7 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
instrumentation_end();

ret.exit_rcu = true;
- return ret;
+ goto done;
}

/*
@@ -725,6 +723,8 @@ idtentry_state_t noinstr idtentry_enter(struct pt_regs *regs)
trace_hardirqs_off();
instrumentation_end();

+done:
+ idt_save_pkrs(&ret);
return ret;
}


>
> > +
> > + idt_restore_pkrs(state);
>
> This one is placed correctly.
>
> Thanks,

No thank you... I'm really sorry for wasting every ones time on this one.

I'm still processing all your responses so forgive me if I've missed something.
And thank you so much for pointing out my mistake. That seems to have fixed
all the problems I have seen thus far. But I want to think on if there may be
more issues.

Thank you,
Ira