Re: [PATCH] kcsan: Add option to allow watcher interruptions

From: Peter Zijlstra
Date: Sat Jul 25 2020 - 13:44:39 EST


On Sat, Jul 25, 2020 at 05:17:43PM +0200, Marco Elver wrote:
> On Sat, 25 Jul 2020 at 16:56, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > On Thu, Feb 20, 2020 at 10:33:17PM +0100, Marco Elver wrote:
> > > On Thu, 20 Feb 2020, Paul E. McKenney wrote:
> > > > On Thu, Feb 20, 2020 at 03:15:51PM +0100, Marco Elver wrote:
> > > > > Add option to allow interrupts while a watchpoint is set up. This can be
> > > > > enabled either via CONFIG_KCSAN_INTERRUPT_WATCHER or via the boot
> > > > > parameter 'kcsan.interrupt_watcher=1'.
> [...]
> > > > > As an example, the first data race that this found:
> > > > >
> > > > > write to 0xffff88806b3324b8 of 4 bytes by interrupt on cpu 0:
> > > > > rcu_preempt_read_enter kernel/rcu/tree_plugin.h:353 [inline]
> > > > > __rcu_read_lock+0x3c/0x50 kernel/rcu/tree_plugin.h:373
> [...]
> > > > > read to 0xffff88806b3324b8 of 4 bytes by task 6131 on cpu 0: |
> > > > > rcu_preempt_read_enter kernel/rcu/tree_plugin.h:353 [inline] ----+
> [...]
> > > > >
> > > > > The writer is doing 'current->rcu_read_lock_nesting++'. The plain
> > > > > read is equally vulnerable to compiler optimizations, and KCSAN
> > > > > therefore concludes this is a valid data race.
> > > >
> > > > Heh! That one is a fun one! It is on a very hot fastpath. READ_ONCE()
> > > > and WRITE_ONCE() are likely to be measurable at the system level.
> > > >
> > > > Thoughts on other options?

> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index c6ea81cd41890..e0595abd50c0f 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -350,17 +350,17 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > >
> > > static void rcu_preempt_read_enter(void)
> > > {
> > > - current->rcu_read_lock_nesting++;
> > > + local_inc(&current->rcu_read_lock_nesting);
> > > }
> > >
> > > static void rcu_preempt_read_exit(void)
> > > {
> > > - current->rcu_read_lock_nesting--;
> > > + local_dec(&current->rcu_read_lock_nesting);
> > > }
> > >
> > > static void rcu_preempt_depth_set(int val)
> > > {
> > > - current->rcu_read_lock_nesting = val;
> > > + local_set(&current->rcu_read_lock_nesting, val);
>
> > I agree that this removes the data races, and that the code for x86 is
> > quite nice, but aren't rcu_read_lock() and rcu_read_unlock() going to
> > have heavyweight atomic operations on many CPUs?
> >
> > Maybe I am stuck with arch-specific code in rcu_read_lock() and
> > rcu_read_unlock(). I suppose worse things could happen.
>
> Peter also mentioned to me that while local_t on x86 generates
> reasonable code, on other architectures it's terrible. So I think
> something else is needed, and feel free to discard the above idea.
> With sufficient reasoning, how bad would a 'data_race(..)' be?

Right, so local_t is atrocious on many architectures; they fall back to
atomic_t.

Even on architectures that have optimized variants (e.g. Power), they're
quite a lot more expensive than what we actually need here.

Only architectures like x86, which have single-instruction memops, can
generate anything close to the code we'd want here.
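
For reference, here's roughly what the generic fallback looks like (a
paraphrase of include/asm-generic/local.h; details vary by tree):

	/* Generic fallback: local_t is just an atomic_long_t, so the
	 * local_*() ops become full atomic RMWs on any architecture
	 * without a cheaper arch-specific implementation. */
	typedef struct { atomic_long_t a; } local_t;

	#define local_inc(l)	atomic_long_inc(&(l)->a)
	#define local_dec(l)	atomic_long_dec(&(l)->a)
	#define local_set(l, i)	atomic_long_set(&(l)->a, (i))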

So the thing is, since the RCU nesting count is balanced per context (an
IRQ must do exactly as many rcu_read_unlock()s as it does
rcu_read_lock()s before returning), interrupts are not in fact a
problem, even on load-store (RISC) architectures (preempt_count relies
on the same property).

So the addition/subtraction in rcu_preempt_read_{enter,exit}() doesn't
need to be atomic vs interrupts. The only thing we really need is for
the loads and stores to be single-copy atomic, i.e. free from tearing.
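
To illustrate (a hand-expanded sketch, not actual kernel code), on a
load-store architecture the plain increment becomes:

	static void rcu_preempt_read_enter(void)
	{
		int r = current->rcu_read_lock_nesting;	/* load, r == N */
		/*
		 * <IRQ> -- any rcu_read_lock()/unlock() pairs in the
		 * handler take the counter N -> N+k -> back to N, so the
		 * value we loaded is still current when we return here.
		 */
		current->rcu_read_lock_nesting = r + 1;	/* store N+1 */
	}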

The problem with READ_ONCE()/WRITE_ONCE() is that if we were to use
them, we'd end up with a separate load and store, even on x86, which is
sub-optimal.
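
That is, something like the following sketch (my guess at the shape,
not a proposed patch):

	static void rcu_preempt_read_enter(void)
	{
		/* Two volatile accesses force a separate load and store,
		 * so even x86 cannot emit a single inc-memop here. */
		WRITE_ONCE(current->rcu_read_lock_nesting,
			   READ_ONCE(current->rcu_read_lock_nesting) + 1);
	}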

I suppose the 'correct' code here would be something like:

	(*(volatile int *)&current->rcu_read_lock_nesting)++;

then the compiler can either emit a single memop (on x86 and the like)
or a load/store pair that is free from tearing.
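
So the helpers might end up looking something like this (sketch only):

	static void rcu_preempt_read_enter(void)
	{
		/* volatile access: no tearing, but the compiler is still
		 * free to use a single inc-memop where the ISA has one. */
		(*(volatile int *)&current->rcu_read_lock_nesting)++;
	}

	static void rcu_preempt_read_exit(void)
	{
		(*(volatile int *)&current->rcu_read_lock_nesting)--;
	}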