Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
From: Andy Lutomirski
Date: Fri May 08 2015 - 06:59:45 EST
On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@xxxxxxxxxx> wrote:
>
>
> * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> > > So do you mean:
> > >
> > > this_cpu_write(rcu_state, IN_KERNEL);
> > > ...
> > > this_cpu_inc(rcu_qs_ctr);
> > > this_cpu_write(rcu_state, IN_USER);
> > >
> > > ?
> > >
> > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > it just two simple stores into a byte flag, one on entry and one on
> > > exit:
> > >
> > > this_cpu_write(rcu_state, IN_KERNEL);
> > > ...
> > > this_cpu_write(rcu_state, IN_USER);
> > >
> >
> > I was thinking that either a counter or a state flag could make sense.
> > Doing both would be pointless. The counter could use the low bits to
> > indicate the state. The benefit of the counter would be that the
> > RCU-waiting CPU could observe that the counter has incremented and
> > that therefore a grace period has elapsed. Getting it right would
> > require lots of care.
>
> So if you mean:
>
> <syscall entry>
> ...
> this_cpu_inc(rcu_qs_ctr);
> <syscall exit>
>
> I don't see how this would work reliably: how do you handle the case
> of a SCHED_FIFO task never returning from user-space (common technique
> in RT apps)? synchronize_rcu() would block indefinitely as it would
> never see rcu_qs_ctr increase.
>
> We have to be able to observe user-mode anyway, for system-time
> statistics purposes, and that flag could IMHO also drive the RCU GC
> machinery.
The counter would have to be fancier than that to work. We could say
that an even value means user mode, for example. IOW the high bits of
the counter would count transitions to quiescent and the low bit
would indicate the state.
Actually, I'd count backwards. We could use an andl instruction to go
to user mode and a decl to go to kernel mode.
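Something like this, as a rough sketch (every name here is made up
for illustration, not an existing API):

	/*
	 * Even value: user mode, i.e. extended quiescent state.
	 * Odd value: kernel mode.  Counting downward, so each trip
	 * through the kernel moves the counter by 2 in total.
	 */
	DEFINE_PER_CPU(unsigned int, rcu_qs_ctr);

	static inline void rcu_qs_enter_kernel(void)
	{
		/* even -> odd: a single decl on x86 */
		this_cpu_dec(rcu_qs_ctr);
	}

	static inline void rcu_qs_exit_to_user(void)
	{
		/* odd -> even: a single andl with ~1 on x86 */
		this_cpu_and(rcu_qs_ctr, ~1u);
	}

A waiter that snapshots the counter can then treat "value is even" or
"value differs from the snapshot" as evidence of a quiescent state.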
>
> > > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > > reliably. If the remote CPU sets it, it'll be too late and we'll
> > > > still enter user mode without seeing it. If it's just an
> > > > optimization, though, then it should be fine.
> > >
> > > Well, after setting it, the remote CPU has to re-check whether the
> > > RT CPU has entered user-mode - before it goes to wait.
> >
> > How?
> >
> > Suppose the exit path looked like:
> >
> > this_cpu_write(rcu_state, IN_USER);
> >
> > if (ti->flags & _TIF_RCU_NOTIFY) {
> > if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> > slow_notify_rcu_that_we_are_exiting();
> > }
> >
> > iret or sysret;
>
> No, it would look like this:
>
> this_cpu_write(rcu_state, IN_USER);
> iret or sysret;
>
> I.e. IN_USER is set well after all notifications are checked. No
> kernel execution happens afterwards. (No extra checks added - the
> regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)
>
> ( Same goes for idle: we just mark it IN_IDLE and move it back to
> IN_KERNEL after the idling ends. )
>
> > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
> > because stores can be delayed. (It could even happen after sysret,
> > IIRC, but IRET is serializing.)
>
> All it has to do in the synchronize_rcu() slowpath is something like:
I don't think this works. Suppose rt_cpu starts in kernel mode. Then
it checks its TI flags and sees that TIF_RCU_NOTIFY is not (yet) set.
>
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> smp_mb__before_atomic();
> set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> smp_rmb();
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
Now it sets rcu_state and exits to user mode. It never notices
TIF_RCU_NOTIFY. We could fix this by sending an IPI to kick it.
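To spell out the interleaving (and the kick; rcu_kick_cpu and its
no-op callback below are hypothetical, smp_call_function_single is
just one way to deliver the IPI):

	/*
	 * rt_cpu (exit path)              waiter (synchronize_rcu())
	 * ------------------              --------------------------
	 * checks ti->flags: nothing set
	 *                                 reads rcu_state == IN_KERNEL
	 *                                 sets TIF_RCU_NOTIFY
	 *                                 re-reads rcu_state: still
	 *                                   IN_KERNEL -> goes to wait
	 * this_cpu_write(rcu_state, IN_USER);
	 * sysret                          (the flag is never acted on)
	 */

	static void rcu_kick_func(void *info)
	{
		/* Nothing to do: the interrupt itself forces a kernel
		 * entry/exit on the target, which either handles
		 * TIF_RCU_NOTIFY or is itself a quiescent state. */
	}

	static void rcu_kick_cpu(int cpu)
	{
		smp_call_function_single(cpu, rcu_kick_func, NULL, 0);
	}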
> ... go wait ...
> }
> /* Cool, we observed quiescent state: */
>
> The cost of the trivial barrier is nothing compared to the 'go wait'
> cost which we will pay in 99.9% of the cases!
>
> > If we put an mfence after this_cpu_set or did an unconditional
> > test_and_clear_bit on ti->flags then this problem goes away, but
> > that would probably be slower than we'd like.
>
> We are talking about a dozen cycles, while a typical synchronize_rcu()
> will wait millions (sometimes billions) of cycles. There's absolutely
> zero performance concern here and it's all CPU local in any case.
>
The barrier would affect all syscalls, though. Admittedly that's
still much cheaper than the current thing, but given that we can round
trip a syscall in 110 cycles, we'd take at least 10% overhead.
My preference would be to use a counter and skip the barrier. If the
waiting CPU polled the counter, then, even if it lost this race, it
would be guaranteed to see the counter change reasonably soon as long
as no CPU implementation sits on its store buffer forever.
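Roughly this, reusing the illustrative rcu_qs_ctr from above (note
there is deliberately no barrier on the syscall side):

	static void wait_for_qs_on(int cpu)
	{
		unsigned int snap = READ_ONCE(per_cpu(rcu_qs_ctr, cpu));

		/* even: the CPU is in user mode, already quiescent */
		if (!(snap & 1))
			return;

		/*
		 * Odd: poll until the counter moves.  The target's
		 * store may be delayed, but it becomes visible
		 * eventually, so the loop ends once that CPU passes
		 * through a quiescent state.  A real implementation
		 * would sleep and re-poll rather than spin.
		 */
		while (READ_ONCE(per_cpu(rcu_qs_ctr, cpu)) == snap)
			cpu_relax();
	}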
> In fact a user-mode/kernel-mode flag speeds up naive implementations
> of synchronize_rcu(): because it's able to observe extended quiescent
> state immediately, without having to wait for a counter to increase
> (which was always the classic way to observe grace periods).
>
> If all CPUs are in user mode or are idle (which is rather common!)
> then synchronize_rcu() could return almost immediately - while
> previously it had to wait for scheduling or periodic timer irqs to
> trigger on all CPUs - adding many millisecs of delay even in the best
> of cases.
I agree absolutely, except in the whole-tons-of-cpus case, where I
think the RCU tree stuff might preclude this. I'm not sure how well
that works with full nohz.
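For concreteness, the naive fast path would be a flat scan along
these lines (illustrative only; scaling this to huge machines is
exactly what the tree machinery is there for):

	static bool all_cpus_quiescent(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			u8 state = READ_ONCE(per_cpu(rcu_state, cpu));

			if (state != IN_USER && state != IN_IDLE)
				return false;
		}
		return true;
	}

synchronize_rcu() could return right away when that scan succeeds,
instead of waiting for timer ticks on every CPU.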
--Andy