Re: TIF_NOHZ can escape nonhz mask? (Was: [PATCH v3 6/8] x86: Split syscall_trace_enter into two phases)
From: Frederic Weisbecker
Date: Thu Jul 31 2014 - 14:47:43 EST
On Thu, Jul 31, 2014 at 08:12:30PM +0200, Oleg Nesterov wrote:
> On 07/31, Frederic Weisbecker wrote:
> >
> > On Thu, Jul 31, 2014 at 06:03:53PM +0200, Oleg Nesterov wrote:
> > > On 07/31, Frederic Weisbecker wrote:
> > > >
> > > > I was convinced that the very first kernel init task is PID 0 then
> > > > it forks on rest_init() to launch the userspace init with PID 1. Then init/0 becomes the
> > > > idle task of the boot CPU.
> > >
> > > Yes sure. But context_tracking_cpu_set() is called by init task with PID 1, not
> > > by "swapper".
> >
> > Are you sure? It's called from start_kernel() which is init/0.
>
> But do_initcalls() is called by kernel_init(), this is the init process which is
> going to exec /sbin/init later.
>
> But this doesn't really matter,
Yeah but tick_nohz_init() is not an initcall, it's a function called from start_kernel(),
before initcalls.
>
> > Maybe we should just enable it everywhere.
>
> Yes, yes, this doesn't really matter. We can even add set(TIF_NOHZ) at the start
> of start_kernel(). The question is, I still can't understand why do we want to
> have the global TIF_NOHZ.
Because then the flags is inherited in forks. It's better than inheriting it on
context switch due to context switch being called much more often than fork.
>
> > > OK. To simplify, lets discuss user_enter() only. So, it is actually a nop on
> > > CPU_0, and CPU_1 can miss it anyway.
> > >
> > > > But global TIF_NOHZ should enforce context tracking everywhere.
> > >
> > > And this is what I can't understand. Lets return to my initial question, why
> > > we can't change __context_tracking_task_switch()
> > >
> > > void __context_tracking_task_switch(struct task_struct *prev,
> > > struct task_struct *next)
> > > {
> > > if (context_tracking_cpu_is_enabled())
> > > set_tsk_thread_flag(next, TIF_NOHZ);
> > > else
> > > clear_tsk_thread_flag(next, TIF_NOHZ);
> > > }
> > >
> > > ?
> >
> > Well we can change it to global TIF_NOHZ
> >
> > > How can the global TIF_NOHZ help?
> >
> > It avoids that flag swap on task_switch.
>
> Ah, you probably meant that we can kill context_tracking_task_switch() as
> we discussed.
Yeah.
>
> But I meant another thing, TIF_NOHZ is already global because it is always
> set after context_tracking_cpu_set().
>
> Performance-wise, this set/clear code above can be better because it avoids
> the slow paths on the untracked CPU's.
But all CPUs are tracked when context tracking is enabled. So there is no such
per CPU granularity.
> But let's ignore this, the question is
> why the change above is not correct? Or why it can make the things worse?
Which change above?
>
>
> > > OK, OK, a task can return to usermode on CPU_0, notice TIF_NOHZ, take the
> > > slow path, and do the "right" thing if it migrates to CPU_1 _before_ it
> > > comes to user_enter(). But this case is very unlikely, certainly this can't
> > > explain why do we penalize the untracked CPU's ?
> >
> > It's rather that CPU 0 calls user_enter() and then migrate to CPU 1 and resume
> > to userspace.
>
> And in this case a) user_enter() is pointless on CPU_0, and b) CPU_1 misses
> the necessary user_enter().
No, user_enter() is necessary on CPU 0 so that CPU 1 sees that it is in userspace
from context tracking POV.
>
> > It's unlikely but possible. I actually observed that very easily on early testing.
>
> Sure. And this can happen anyway? Why the change in __context_tracking_task_switch()
> is wrong?
Which change?
>
> > And it's a big problem because then the CPU runs in userspace, possibly for a long while
> > in HPC case, and context tracking thinks it is in kernelspace. As a result, RCU waits
> > for that CPU to complete grace periods and cputime is accounted to kernelspace instead of
> > userspace.
> >
> > It looks like a harmless race but it can have big consequences.
>
> I see. Again, does the global TIF_NOHZ really help?
Yes, to remove the context switch overhead. But it doesn't change anything on those races.
> > Because calling context_switch_task_switch() on every context switch is costly.
>
> See above. This depends, but forcing the slow paths on all CPU's can be more
> costly.
Yeah but I don't have a safe solution that avoids that.
>
> > > And of course I can't understand exception_enter/exit(). Not to mention that
> > > (afaics) "prev_ctx == IN_USER" in exception_exit() can be false positive even
> > > if we forget that the caller can migrate in between. Just because, once again,
> > > a tracked CPU can miss user_exit().
> >
> > You lost me on this. How can a tracked CPU miss user_exit()?
>
> I am lost too ;) Didn't we already discuss this? A task calls user_exit() on
> CPU_0 for no reason, migrates to the tracked CPU_1 and finally returns to user
> space leaving this cpu in IN_KERNEL state?
Yeah, so? :)
I'm pretty sure there is a small but important element here that makes us misunderstanding
what each others says. It's like we aren't taking about the same thing :)
> > That's how I implemented it first. But then I changed it the way it is now:
> > 6c1e0256fad84a843d915414e4b5973b7443d48d
> > ("context_tracking: Restore correct previous context state on exception exit")
> >
> > This is again due to the shift between hard and soft userspace boundaries.
> > user_mode(regs) checks hard boundaries only.
> >
> > Lets get back to our beloved example:
> >
> > CPU 0 CPU 1
> > ---------------------------------------------
> >
> > returning from syscall {
> > user_enter();
> > exception {
> > exception_enter()
> > PREEMPT!
> > ----------------------->
> > //resume exception
> > exception_exit();
> > return to userspace
>
> OK, thanks, so in this case we miss user_enter().
>
> But again, we can miss it anyway if preemptions comes before "exception" ?
> if CPU 1 was in IN_KERNEL state.
No, because preempt_schedule_irq() does the ctx_state save and restore with
exception_enter/exception_exit.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/