Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Ingo Molnar
Date: Fri May 08 2015 - 02:37:26 EST



* Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:

> > So do you mean:
> >
> > this_cpu_set(rcu_state) = IN_KERNEL;
> > ...
> > this_cpu_inc(rcu_qs_ctr);
> > this_cpu_set(rcu_state) = IN_USER;
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can make
> > it just two simple stores into a byte flag, one on entry and one on
> > exit:
> >
> > this_cpu_set(rcu_state) = IN_KERNEL;
> > ...
> > this_cpu_set(rcu_state) = IN_USER;
> >
>
> I was thinking that either a counter or a state flag could make sense.
> Doing both would be pointless. The counter could use the low bits to
> indicate the state. The benefit of the counter would be that the
> RCU-waiting CPU could observe that the counter has incremented and
> that therefore a grace period has elapsed. Getting it right would
> require lots of care.

So if you mean:

<syscall entry>
...
this_cpu_inc(rcu_qs_ctr);
<syscall exit>

I don't see how this would work reliably: how do you handle the case of
a SCHED_FIFO task that never returns from user-space (a common technique
in RT apps)? synchronize_rcu() would block indefinitely, as it would
never see rcu_qs_ctr increase.
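
With just a counter the waiting side can only do something like this
(rough sketch only - 'start' and the polling loop are purely for
illustration):

/* counter-only wait: snapshot the counter, then wait for it to move */
unsigned long start = per_cpu(rcu_qs_ctr, rt_cpu);

while (per_cpu(rcu_qs_ctr, rt_cpu) == start)
        cpu_relax();    /* never terminates if the task stays in user-space */

and nothing ever bumps the counter if the RT task just keeps looping in
user-space.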

We have to be able to observe user mode anyway, for system-time
statistics purposes, and that flag could IMHO also drive the RCU
grace-period machinery.
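
I.e. something like this (rough sketch only - the exact names and
encoding are of course up for grabs):

/* per-CPU context flag: */
enum { IN_KERNEL = 0, IN_USER, IN_IDLE };

DEFINE_PER_CPU(int, rcu_state);

/* syscall entry: */
this_cpu_write(rcu_state, IN_KERNEL);
...
/* syscall exit, last thing before we return to user-space: */
this_cpu_write(rcu_state, IN_USER);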

> > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > reliably. If the remote CPU sets it, it'll be too late and we'll
> > > still enter user mode without seeing it. If it's just an
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether the
> > RT CPU has entered user-mode - before it goes to wait.
>
> How?
>
> Suppose the exit path looked like:
>
> this_cpu_write(rcu_state, IN_USER);
>
> if (ti->flags & _TIF_RCU_NOTIFY) {
>         if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
>                 slow_notify_rcu_that_we_are_exiting();
> }
>
> iret or sysret;

No, it would look like this:

this_cpu_write(rcu_state, IN_USER);
iret or sysret;

I.e. IN_USER is set well after all notifications have been checked, and
no kernel execution happens afterwards. (No extra checks are added: the
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to
IN_KERNEL after the idling ends. )
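
I.e. the ordering on the return-to-user path would be something like
this (rough sketch - the work-flag loop and do_work() just stand in for
whatever the entry code really does there):

/* handle all pending work, including a possible TIF_RCU_NOTIFY: */
while (ti->flags & _TIF_WORK_MASK)
        do_work(ti->flags);

/* publish the state change - no kernel code runs after this: */
this_cpu_write(rcu_state, IN_USER);
iret or sysret;

/* and the idle side: */
this_cpu_write(rcu_state, IN_IDLE);
... idle ...
this_cpu_write(rcu_state, IN_KERNEL);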

> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
> because stores can be delayed. (It could even happen after sysret,
> IIRC, but IRET is serializing.)

All it has to do in the synchronize_rcu() slowpath is something like:

if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
        smp_mb__before_atomic();
        set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
        smp_rmb();
        if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
                ... go wait ...
}
/* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait'
cost which we will pay in 99.9% of the cases!

> If we put an mfence after this_cpu_set or did an unconditional
> test_and_clear_bit on ti->flags then this problem goes away, but
> that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical synchronize_rcu()
will wait millions (sometimes billions) of cycles. There's absolutely
zero performance concern here and it's all CPU local in any case.
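
If we did want the full-barrier variant Andy describes, it would be
something like this (rough sketch, mirroring his exit-path snippet
above):

this_cpu_write(rcu_state, IN_USER);
smp_mb();       /* pairs with the barriers in the synchronize_rcu() slowpath */

if (ti->flags & _TIF_RCU_NOTIFY) {
        if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                slow_notify_rcu_that_we_are_exiting();
}

iret or sysret;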

In fact a user-mode/kernel-mode flag speeds up naive implementations of
synchronize_rcu(), because it can observe an extended quiescent state
immediately, without having to wait for a counter to increase (which was
always the classic way to observe grace periods).

If all CPUs are in user mode or idle (which is rather common!) then
synchronize_rcu() could return almost immediately - while previously it
had to wait for scheduling or periodic timer irqs to trigger on all
CPUs - adding many milliseconds of delay even in the best of cases.
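
I.e. the fast path of such a naive synchronize_rcu() could simply be
(rough sketch):

/* naive synchronize_rcu() fast path: */
int cpu;

for_each_online_cpu(cpu) {
        if (per_cpu(rcu_state, cpu) != IN_KERNEL)
                continue;       /* user mode or idle: extended quiescent state */
        /* otherwise fall back to the slow path for this CPU ... */
}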

Thanks,

Ingo