Re: [PATCH RFC] percpu: add data dependency barrier in percpu accessors and operations

From: Paul E. McKenney
Date: Tue Jun 17 2014 - 15:40:27 EST


On Tue, Jun 17, 2014 at 02:27:43PM -0500, Christoph Lameter wrote:
> On Thu, 12 Jun 2014, Tejun Heo wrote:
>
> > percpu areas are zeroed on allocation and, by their nature, accessed
> > from multiple cpus. Consider the following scenario.
>
> I am not sure that the premise is actually right. Percpu areas are
> designed to be accessed from a single cpu and we provide instances
> of variables for each cpu.
>
> There is no synchronization guarantee for accesses from other cpus. If
> these accesses occur then we tolerate some fuzziness and usually only do
> read accesses. F.e. for statistics, where we loop over all cpus to get a
> sum of percpu counters (which is a classic use case for percpu data).
>
> But there are numerous uses where no accesses from other cpus are required
> (mostly when percpu stuff is not used for statistics but for cpu local
> lists and status).
>
> Cross cpu write accesses typically occur only after the allocation and
> before the code that actually does something is aware of the existence of
> the percpu area allocated, or if the processor is being offlined/onlined.
>
> >     p = NULL;
> >
> >     CPU-1                           CPU-2
> >     p = alloc_percpu()              if (p)
> >                                             WARN_ON(this_cpu_read(*p));
>
> p is an offset into the per cpu area of the processor. The value of p
> first has to be made available to cpu2 somehow and this usually provides
> the opportunity for synchronization that avoids the above scenario.
>
> And so it is typical that these offsets are stored in larger structs that
> also have other means of synchronization.
>
> F.e. Allocators take a global lock and then instantiate a new
> structure with the associated per cpu area allocation which is added to a
> global list after it is ready. The address of the allocator structure
> is then made available to other processors.
>
> Another method is to perform this allocation on bootup which then also
> does not require synchronization (page allocator).
>
> Similarly in swapon(). The percpu allocation is performed before access to
> the containing structure (via enable_swap_info).
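
To make the publication point in the quoted text concrete, here is a
minimal userspace C11 sketch, not kernel code. It assumes the pointer is
published with a release store and picked up with an acquire load, which
is stronger than the data dependency barrier the patch proposes. The
names struct slots, shared_p, publish_zeroed() and read_slot() are
hypothetical; in the kernel the rough analogs would be
smp_store_release()/smp_load_acquire() or
rcu_assign_pointer()/rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

struct slots {
	long val[NR_SLOTS];
};

/* Static storage, so this starts out NULL, like "p = NULL" above. */
static _Atomic(struct slots *) shared_p;

/* "CPU-1": allocate zeroed memory, then publish the pointer. */
static void publish_zeroed(void)
{
	struct slots *p = calloc(1, sizeof(*p));

	assert(p);
	/* Release: the zeroing is ordered before the pointer store. */
	atomic_store_explicit(&shared_p, p, memory_order_release);
}

/* "CPU-2": returns -1 if nothing is published yet, else the slot value. */
static long read_slot(int slot)
{
	/* Acquire: pairs with the release store in publish_zeroed(). */
	struct slots *p = atomic_load_explicit(&shared_p,
					       memory_order_acquire);

	return p ? p->val[slot] : -1;
}
```

With this pairing, a reader that sees a non-NULL pointer also sees the
zeroed contents, so the WARN_ON() in the scenario could not fire. In
real use publish_zeroed() and read_slot() would run on different
threads; the sketch only shows the ordering on each side.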

Those are indeed common use cases. However...

There is code where one CPU writes to another CPU's per-CPU variables.
One example is RCU callback offloading, where a kernel thread (which
might be running anywhere) dequeues a given CPU's RCU callbacks and
processes them. The act of dequeuing requires write access to that
CPU's per-CPU rcu_data structure. And yes, atomic operations and memory
barriers are of course required to make this work.
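
As a rough illustration only, and emphatically not RCU's actual
implementation, a remote dequeue of this kind can be sketched in
userspace C11. The assumptions: each CPU has a lock-free singly linked
list, the owner pushes with a compare-and-swap, and the offload thread
takes the whole list with an atomic exchange. All names (struct cb,
cbs[], cb_enqueue(), cb_offload(), count_invocation()) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical */

struct cb {
	struct cb *next;
	void (*func)(struct cb *cb);
};

/* One list head per "CPU", standing in for a per-CPU structure. */
static struct percpu_cbs {
	_Atomic(struct cb *) head;
} cbs[NR_CPUS];

static int invoked;	/* callbacks run so far, for demonstration */

static void count_invocation(struct cb *cb)
{
	(void)cb;
	invoked++;
}

/* Owner side: push a callback onto this CPU's list. */
static void cb_enqueue(int cpu, struct cb *node)
{
	struct cb *old = atomic_load_explicit(&cbs[cpu].head,
					      memory_order_relaxed);

	do {
		node->next = old;
		/* Release: node's fields are visible before it is linked. */
	} while (!atomic_compare_exchange_weak_explicit(&cbs[cpu].head,
			&old, node,
			memory_order_release, memory_order_relaxed));
}

/* Remote side: write to another CPU's head to take its whole list. */
static int cb_offload(int cpu)
{
	/* Acquire: pairs with the release in cb_enqueue(). */
	struct cb *list = atomic_exchange_explicit(&cbs[cpu].head, NULL,
						   memory_order_acquire);
	int n = 0;

	while (list) {
		struct cb *next = list->next;

		list->func(list);
		list = next;
		n++;
	}
	return n;
}
```

Note that this sketch invokes callbacks in LIFO order, which a real
callback queue would not want. The point is only that the remote
thread's write to the head is an atomic operation with ordering, rather
than a plain store.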

Thanx, Paul
