Re: Requirements to control kernel isolation/nohz_full at runtime

From: Frederic Weisbecker
Date: Wed Sep 09 2020 - 22:42:55 EST

On Mon, Sep 07, 2020 at 05:34:17PM +0200, peterz@xxxxxxxxxxxxx wrote:
> (your mailer broke and forgot to keep lines shorter than 78 chars)

I manually reordered the lines and that's indeed quite a mess :o)

> On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > == TIF_NOHZ ==
> >
> > Need to get rid of that in order not to trigger syscall slowpath on
> > CPUs that don't want nohz_full. Also we don't want to iterate all
> > threads and clear the flag when the last nohz_full CPU exits nohz_full
> > mode. Prefer static keys to call context tracking on archs. x86 does
> > that well.
> Build on the common entry code I suppose. Then any arch that uses that
> gets to have the new features.

Yep, eventually I hope we can put all these crucial pieces in the common entry
code.
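
To make the TIF_NOHZ point concrete, here is a small userspace model, not kernel code. It contrasts a per-thread flag, which has to be written on every thread, with one global switch that gates the context tracking call. Every name in it (`model_thread`, `ct_key_enabled`, `model_syscall_entry()` and so on) is hypothetical, and a plain bool only approximates a real static key, which is patched into the instruction stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a static key: one global switch. */
static bool ct_key_enabled;

/* Counts how often the modelled context tracking call ran. */
static unsigned long ct_calls;

/* Hypothetical per-thread state carrying a TIF_NOHZ-like flag. */
struct model_thread {
    bool tif_nohz;
};

/*
 * Flag-based scheme: turning tracking on or off means walking all
 * threads. Returns the number of threads that had to be touched.
 */
static size_t model_set_flag_all(struct model_thread *t, size_t n, bool on)
{
    size_t touched = 0;

    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        t[i].tif_nohz = on;
        touched++;
    }
    return touched;
}

/* Key-based scheme: a single store, no thread iteration. */
static void model_set_key(bool on)
{
    ct_key_enabled = on;
}

/* Syscall entry under the key-based scheme. */
static void model_syscall_entry(void)
{
    if (ct_key_enabled)
        ct_calls++;     /* stands in for the context tracking call */
}
```

In this model, flipping the key costs the same regardless of how many threads exist, which is the property wanted when the last nohz_full CPU leaves nohz_full mode.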

> > == Proper entry code ==
> >
> > We must make sure that a given arch never calls exception_enter() /
> > exception_exit(). This saves the previous state of context tracking
> > and switches to kernel mode (from context tracking POV) temporarily.
> > Since this state is saved on the stack, this prevents us from turning
> > off context tracking entirely on a CPU: The tracking must be done on
> > all CPUs and that takes some cycles.
> >
> > This means that, considering early entry code (before the call to
> > context tracking upon kernel entry, and after the call to context
> > tracking upon kernel exit), we must take care of a few things:
> >
> > 1) Make sure early entry code can't trigger exceptions. Or if it does,
> > the given exception can't schedule or use RCU (unless it calls
> > rcu_nmi_enter()). Otherwise the exception must call
> > exception_enter()/exception_exit() which we don't want.
> I think this is true for x86. Early entry has interrupts disabled, any
> exception that can still happen is NMI-like and will thus use
> rcu_nmi_enter().
> On x86 that now includes #DB (which is also excluded due to us refusing
> to set execution breakpoints on entry code), #BP, NMI and MCE.

Perfect! That's what I assumed as well.
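
The save/restore behaviour of exception_enter()/exception_exit() described in the quoted text can be sketched in userspace as follows. This is a model under assumptions, not the kernel implementation: the enum, the `model_` names and the single global state variable are all hypothetical. What it shows is that the previous state lives in a local variable of the handler, so the restore is only meaningful if tracking stayed active the whole time.

```c
#include <assert.h>

/* Hypothetical context tracking states. */
enum model_ctx { MODEL_CTX_KERNEL, MODEL_CTX_USER };

/* Hypothetical per-CPU tracking state, reduced to one global. */
static enum model_ctx model_state = MODEL_CTX_KERNEL;

/* Save the current state and switch to kernel mode temporarily. */
static enum model_ctx model_exception_enter(void)
{
    enum model_ctx prev = model_state;

    model_state = MODEL_CTX_KERNEL;
    return prev;
}

/* Restore whatever state was saved on entry. */
static void model_exception_exit(enum model_ctx prev)
{
    model_state = prev;
}

/*
 * A modelled exception handler. 'prev' sits on this function's stack
 * for the duration of the handler. Returns the state the handler body
 * observed.
 */
static enum model_ctx model_handle_exception(void)
{
    enum model_ctx prev = model_exception_enter();
    enum model_ctx seen = model_state;  /* handler body runs here */

    assert(seen == MODEL_CTX_KERNEL);
    model_exception_exit(prev);
    return seen;
}
```

If tracking were switched off between the enter and exit calls in this model, the exit would still write back the stale saved value, which is why the pair has to go before tracking can be disabled per CPU.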

> > 2) No call to schedule_user().
> I'm not sure what that is supposed to do, but x86 doesn't appear to have
> it, so all good :-)

I think it was there in case an exception would schedule after context
tracking has already exited kernel mode but before we actually exit the
kernel. But we removed that (Andy, probably) when we made sure the early
entry was not interruptible. Some other archs still use it, though; I'm
just not sure whether they do so for a good reason...

> > 3) Make sure early entry code is not interruptible or
> > preempt_schedule_irq() would rely on
> > exception_enter()/exception_exit()
> This is so for x86.


> > 4) Make sure early entry code can't be traced (no call to
> > preempt_schedule_notrace()), or if it can, that it can't schedule
> noinstr is your friend.

Right. My concern was rather about special areas that temporarily
allow instrumentation (instrumentation_begin()...instrumentation_end()),
but in entry code those should only run with interrupts disabled,
where preempt_schedule_notrace() has no effect.
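
The reasoning about preempt_schedule_notrace() can be checked against a tiny userspace model. It is only a sketch: the `model_` variables and functions are hypothetical, and the one assumption carried over from the kernel is that preempt_schedule_notrace() bails out when the context is not preemptible, that is when the preempt count is nonzero or interrupts are disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical CPU state. */
static bool model_irqs_disabled;
static int model_preempt_count;
static bool model_need_resched;

/* Counts how often the model actually rescheduled. */
static unsigned long model_schedule_calls;

/* Preemptible only with a zero preempt count and interrupts enabled. */
static bool model_preemptible(void)
{
    return model_preempt_count == 0 && !model_irqs_disabled;
}

/* Modelled preempt_schedule_notrace(): no-op unless preemptible. */
static void model_preempt_schedule_notrace(void)
{
    if (!model_preemptible())
        return;
    if (model_need_resched) {
        model_need_resched = false;
        model_schedule_calls++;
    }
}

/*
 * An instrumentable window inside early entry code. The model insists
 * that it only ever runs with interrupts disabled.
 */
static void model_entry_instrumented_window(void)
{
    assert(model_irqs_disabled);
    /* A tracer re-enabling preemption would end up here. */
    model_preempt_schedule_notrace();
}
```

With interrupts disabled, the window leaves the pending reschedule untouched; the same call made from a preemptible context does reschedule.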

> > I believe x86 does most of that well.
> It does now.

Thanks a lot for confirming! I guess I can remove
exception_enter()/exit() on x86. Fortunately any issue
will be very easily spotted.