Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

From: Ingo Molnar
Date: Fri May 08 2015 - 07:27:24 EST



* Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:

> On May 8, 2015 12:07 PM, "Ingo Molnar" <mingo@xxxxxxxxxx> wrote:
> >
> >
> > * Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> > > > So do you mean:
> > > >
> > > > this_cpu_set(rcu_state) = IN_KERNEL;
> > > > ...
> > > > this_cpu_inc(rcu_qs_ctr);
> > > > this_cpu_set(rcu_state) = IN_USER;
> > > >
> > > > ?
> > > >
> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > > it just two simple stores into a byte flag, one on entry and one on
> > > > exit:
> > > >
> > > > this_cpu_set(rcu_state) = IN_KERNEL;
> > > > ...
> > > > this_cpu_set(rcu_state) = IN_USER;
> > > >
> > >
> > > I was thinking that either a counter or a state flag could make sense.
> > > Doing both would be pointless. The counter could use the low bits to
> > > indicate the state. The benefit of the counter would be that the
> > > RCU-waiting CPU could observe that the counter has incremented and
> > > that therefore a grace period has elapsed. Getting it right would
> > > require lots of care.
> >
> > So if you mean:
> >
> > <syscall entry>
> > ...
> > this_cpu_inc(rcu_qs_ctr);
> > <syscall exit>
> >
> > I don't see how this would work reliably: how do you handle the case
> > of a SCHED_FIFO task never returning from user-space (common technique
> > in RT apps)? synchronize_rcu() would block indefinitely as it would
> > never see rcu_qs_ctr increase.
> >
> > We have to be able to observe user-mode anyway, for system-time
> > statistics purposes, and that flag could IMHO also drive the RCU GC
> > machinery.
>
> The counter would have to be fancier than that to work. We could
> say that an even value means user mode, for example. IOW the high
> bits of the counter would count transitions to quiescent and the low
> bits would indicate the state.
>
> Actually, I'd count backwards. We could use an andl instruction to
> go to user mode and a decl to go to kernel mode.

at which point it's not really a monotonic quiescent state counter -
which your naming suggested before.

Also it would have to be done both at entry and at exit.

But yes, if you replace a state flag with a recursive state flag that
is INC/DEC maintained that would work as well. That's not a counter
though.

> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
> > > because stores can be delayed. (It could even happen after
> > > sysret, IIRC, but IRET is serializing.)
> >
> > All it has to do in the synchronize_rcu() slowpath is something
> > like:
>
> I don't think this works. Suppose rt_cpu starts in kernel mode.
> Then it checks ti flags.
>
> >
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> Now it sets rcu_state and exits to user mode. It never notices
> TIF_RCU_NOTIFY.

Indeed, you are right, that's racy.

> [...] We could fix this by sending an IPI to kick it.

With an IPI we won't need any flags or counters on the RT CPU - it's
the IPI we are trying to avoid.

So how about the following way to fix the race: simply do a poll loop
with a fixed timeout sleep: like it would do anyway if
synchronize_rcu() was waiting for the timer irq to end the grace
period on the RT-CPU.

The TIF flag would cause an RCU grace period to lapse, no need to wake
up the synchronize_rcu() side: it will notice the (regular) increased
RCU counter.

> > We are talking about a dozen cycles, while a typical
> > synchronize_rcu() will wait millions (sometimes billions) of
> > cycles. There's absolutely zero performance concern here and it's
> > all CPU local in any case.
>
> The barrier would affect all syscalls, though. [...]

What barrier? I never suggested any barrier in the syscall fast path,
this bit:

> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

runs inside synchronize_rcu() or so.

But none of that is needed if we do:

- simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context
entry/exit time, straight in the assembly

- TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
nothing but: this_cpu_inc(rcu_qs_ctr).

- synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
then does a timeout-poll-loop with jiffies granular
timeouts, simulating a timer IRQ in essence. It will either
observe IN_USER or will see the regular RCU qs counter
increase.

this should be race free.

Alternatively we can even get rid of the TIF flag by merging the
percpu counter with the percpu state. Quiescent states are recorded
via a counter shifted by two bits:

rcu_qs_ctr += 4;

while user/kernel/idle mode is recorded in the lower 2 bits.

So on entering the kernel we'd do:

rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */

on returning to user-space we'd do:

rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */

on going idle we'd do:

rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */

on return from idle we'd do:

rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */

etc. On all boundary transitions we can use a constant ADD on a
suitable percpu variable.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/