Re: [PATCH 2/2 RFC] srcu: implement Peter's checking algorithm

From: Paul E. McKenney
Date: Wed Feb 29 2012 - 08:56:11 EST


On Wed, Feb 29, 2012 at 06:07:32PM +0800, Lai Jiangshan wrote:
> On 02/28/2012 09:47 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2012 at 09:51:22AM +0800, Lai Jiangshan wrote:
> >> On 02/28/2012 02:30 AM, Paul E. McKenney wrote:
> >>> On Mon, Feb 27, 2012 at 04:01:04PM +0800, Lai Jiangshan wrote:
> >>>> >From 40724998e2d121c2b5a5bd75114625cfd9d4f9a9 Mon Sep 17 00:00:00 2001
> >>>> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
> >>>> Date: Mon, 27 Feb 2012 14:22:47 +0800
> >>>> Subject: [PATCH 2/2] srcu: implement Peter's checking algorithm
> >>>>
> >>>> This patch implement the algorithm as Peter's:
> >>>> https://lkml.org/lkml/2012/2/1/119
> >>>>
> >>>> o Make the checking lock-free and we can perform parallel checking,
> >>>> Although almost parallel checking makes no sense, but we need it
> >>>> when 1) the original checking task is preempted for long, 2)
> >>>> sychronize_srcu_expedited(), 3) avoid lock(see next)
> >>>>
> >>>> o Since it is lock-free, we save a mutex in state machine for
> >>>> call_srcu().
> >>>>
> >>>> o Remove the SRCU_REF_MASK and remove the coupling with the flipping.
> >>>> (so we can remove the preempt_disable() in future, but use
> >>>> __this_cpu_inc() instead.)
> >>>>
> >>>> o reduce a smp_mb(), simplify the comments and make the smp_mb() pairs
> >>>> more intuitive.
> >>>
> >>> Hello, Lai,
> >>>
> >>> Interesting approach!
> >>>
> >>> What happens given the following sequence of events?
> >>>
> >>> o CPU 0 in srcu_readers_active_idx_check() invokes
> >>> srcu_readers_seq_idx(), getting some number back.
> >>>
> >>> o CPU 0 invokes srcu_readers_active_idx(), summing the
> >>> ->c[] array up through CPU 3.
> >>>
> >>> o CPU 1 invokes __srcu_read_lock(), and increments its counter
> >>> but not yet its ->seq[] element.
> >>
> >>
> >> Any __srcu_read_lock() whose increment of active counter is not seen
> >> by srcu_readers_active_idx() is considerred as
> >> "reader-started-after-this-srcu_readers_active_idx_check()",
> >> We don't need to wait.
> >>
> >> As you said, this srcu C.S 's increment seq is not seen by above
> >> srcu_readers_seq_idx().
> >>
> >>>
> >>> o CPU 0 completes its summing of the ->c[] array, incorrectly
> >>> obtaining zero.
> >>>
> >>> o CPU 0 invokes srcu_readers_seq_idx(), getting the same
> >>> number back that it got last time.
> >>
> >> If it incorrectly get zero, it means __srcu_read_unlock() is seen
> >> in srcu_readers_active_idx(), and it means the increment of
> >> seq is seen in this srcu_readers_seq_idx(), it is different
> >> from the above seq that it got last time.
> >>
> >> increment of seq is not seen by above srcu_readers_seq_idx(),
> >> but is seen by later one, so the two returned seq is different,
> >> this is the core of Peter's algorithm, and this was written
> >> in the comments(Sorry for my bad English). Or maybe I miss
> >> your means in this mail.
> >
> > OK, good, this analysis agrees with what I was thinking.
> >
> > So my next question is about the lock freedom. This lock freedom has to
> > be limited in nature and carefully implemented. The reasons for this are:
> >
> > 1. Readers can block in any case, which can of course block both
> > synchronize_srcu_expedited() and synchronize_srcu().
> >
> > 2. Because only one CPU at a time can be incrementing ->completed,
> > some sort of lock with preemption disabling will of course be
> > needed. Alternatively, an rt_mutex could be used for its
> > priority-inheritance properties.
> >
> > 3. Once some CPU has incremented ->completed, all CPUs that might
> > still be summing up the old indexes must stop. If they don't,
> > they might incorrectly call a too-short grace period in case of
> > ->seq[]-sum overflow on 32-bit systems.
> >
> > Or did you have something else in mind?
>
> When flip happens when check_zero, this check_zero will no be
> committed even it is success.

But if the CPU in check_zero isn't blocking the grace period, then
->completed could overflow while that CPU was preempted. Then how
would this CPU know that the flip had happened?

> I play too much with lock-free for call_srcu(), the code becomes complicated,
> I just give up lock-free for call_srcu(), the main aim of call_srcu() is simple.

Makes sense to me!

> (But I still like Peter's approach, it has some other good thing
> besides lock-free-checking, if you don't like it, I will send
> another patch to fix srcu_readers_active())

Try them both and check their performance &c. If within espilon of
each other, pick whichever one you prefer.

Thanx, Paul

> Thanks,
> Lai
>
> >
> > Thanx, Paul
> >
> >> Thanks,
> >> Lai
> >>
> >>>
> >>> o In parallel with the previous step, CPU 1 executes out of order
> >>> (as permitted by the lack of a second memory barrier in
> >>> __srcu_read_lock()), starting up the critical section before
> >>> incrementing its ->seq[] element.
> >>>
> >>> o Because CPU 0 is not aware that CPU 1 is an SRCU reader, it
> >>> completes the SRCU grace period before CPU 1 completes its
> >>> SRCU read-side critical section.
> >>>
> >>> This actually might be safe, but I need to think more about it. In the
> >>> meantime, I figured I should ask your thoughts.
> >>>
> >>> Thanx, Paul
> >>>
> >>>> Inspired-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >>>> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
> >>>> ---
> >>>> include/linux/srcu.h | 7 +--
> >>>> kernel/srcu.c | 137 ++++++++++++++++++++-----------------------------
> >>>> 2 files changed, 57 insertions(+), 87 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> >>>> index 5b49d41..15354db 100644
> >>>> --- a/include/linux/srcu.h
> >>>> +++ b/include/linux/srcu.h
> >>>> @@ -32,18 +32,13 @@
> >>>>
> >>>> struct srcu_struct_array {
> >>>> unsigned long c[2];
> >>>> + unsigned long seq[2];
> >>>> };
> >>>>
> >>>> -/* Bit definitions for field ->c above and ->snap below. */
> >>>> -#define SRCU_USAGE_BITS 1
> >>>> -#define SRCU_REF_MASK (ULONG_MAX >> SRCU_USAGE_BITS)
> >>>> -#define SRCU_USAGE_COUNT (SRCU_REF_MASK + 1)
> >>>> -
> >>>> struct srcu_struct {
> >>>> unsigned completed;
> >>>> struct srcu_struct_array __percpu *per_cpu_ref;
> >>>> struct mutex mutex;
> >>>> - unsigned long snap[NR_CPUS];
> >>>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> >>>> struct lockdep_map dep_map;
> >>>> #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>> diff --git a/kernel/srcu.c b/kernel/srcu.c
> >>>> index 47ee35d..376b583 100644
> >>>> --- a/kernel/srcu.c
> >>>> +++ b/kernel/srcu.c
> >>>> @@ -73,10 +73,25 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >>>> #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
> >>>>
> >>>> /*
> >>>> + * Returns approximate total sequence of readers on the specified rank
> >>>> + * of per-CPU counters.
> >>>> + */
> >>>> +static unsigned long srcu_readers_seq_idx(struct srcu_struct *sp, int idx)
> >>>> +{
> >>>> + int cpu;
> >>>> + unsigned long sum = 0;
> >>>> + unsigned long t;
> >>>> +
> >>>> + for_each_possible_cpu(cpu) {
> >>>> + t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->seq[idx]);
> >>>> + sum += t;
> >>>> + }
> >>>> + return sum;
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> * Returns approximate number of readers active on the specified rank
> >>>> - * of per-CPU counters. Also snapshots each counter's value in the
> >>>> - * corresponding element of sp->snap[] for later use validating
> >>>> - * the sum.
> >>>> + * of per-CPU counters.
> >>>> */
> >>>> static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>> {
> >>>> @@ -87,26 +102,36 @@ static unsigned long srcu_readers_active_idx(struct srcu_struct *sp, int idx)
> >>>> for_each_possible_cpu(cpu) {
> >>>> t = ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]);
> >>>> sum += t;
> >>>> - sp->snap[cpu] = t;
> >>>> }
> >>>> - return sum & SRCU_REF_MASK;
> >>>> + return sum;
> >>>> }
> >>>>
> >>>> -/*
> >>>> - * To be called from the update side after an index flip. Returns true
> >>>> - * if the modulo sum of the counters is stably zero, false if there is
> >>>> - * some possibility of non-zero.
> >>>> - */
> >>>> static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>> {
> >>>> int cpu;
> >>>> + unsigned long seq;
> >>>> +
> >>>> + seq = srcu_readers_seq_idx(sp, idx);
> >>>> +
> >>>> + /*
> >>>> + * smp_mb() A pairs with smp_mb() B for critical section.
> >>>> + * It ensures that the SRCU read-side critical section whose
> >>>> + * read-lock is not seen by the following srcu_readers_active_idx()
> >>>> + * will see any updates that before the current task performed before.
> >>>> + * (So we don't need to care these readers this time)
> >>>> + *
> >>>> + * Also, if we see the increment of the seq, we must see the
> >>>> + * increment of the active counter in the following
> >>>> + * srcu_readers_active_idx().
> >>>> + */
> >>>> + smp_mb(); /* A */
> >>>>
> >>>> /*
> >>>> * Note that srcu_readers_active_idx() can incorrectly return
> >>>> * zero even though there is a pre-existing reader throughout.
> >>>> * To see this, suppose that task A is in a very long SRCU
> >>>> * read-side critical section that started on CPU 0, and that
> >>>> - * no other reader exists, so that the modulo sum of the counters
> >>>> + * no other reader exists, so that the sum of the counters
> >>>> * is equal to one. Then suppose that task B starts executing
> >>>> * srcu_readers_active_idx(), summing up to CPU 1, and then that
> >>>> * task C starts reading on CPU 0, so that its increment is not
> >>>> @@ -122,53 +147,26 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
> >>>> return false;
> >>>>
> >>>> /*
> >>>> - * Since the caller recently flipped ->completed, we can see at
> >>>> - * most one increment of each CPU's counter from this point
> >>>> - * forward. The reason for this is that the reader CPU must have
> >>>> - * fetched the index before srcu_readers_active_idx checked
> >>>> - * that CPU's counter, but not yet incremented its counter.
> >>>> - * Its eventual counter increment will follow the read in
> >>>> - * srcu_readers_active_idx(), and that increment is immediately
> >>>> - * followed by smp_mb() B. Because smp_mb() D is between
> >>>> - * the ->completed flip and srcu_readers_active_idx()'s read,
> >>>> - * that CPU's subsequent load of ->completed must see the new
> >>>> - * value, and therefore increment the counter in the other rank.
> >>>> - */
> >>>> - smp_mb(); /* A */
> >>>> -
> >>>> - /*
> >>>> - * Now, we check the ->snap array that srcu_readers_active_idx()
> >>>> - * filled in from the per-CPU counter values. Since
> >>>> - * __srcu_read_lock() increments the upper bits of the per-CPU
> >>>> - * counter, an increment/decrement pair will change the value
> >>>> - * of the counter. Since there is only one possible increment,
> >>>> - * the only way to wrap the counter is to have a huge number of
> >>>> - * counter decrements, which requires a huge number of tasks and
> >>>> - * huge SRCU read-side critical-section nesting levels, even on
> >>>> - * 32-bit systems.
> >>>> - *
> >>>> - * All of the ways of confusing the readings require that the scan
> >>>> - * in srcu_readers_active_idx() see the read-side task's decrement,
> >>>> - * but not its increment. However, between that decrement and
> >>>> - * increment are smb_mb() B and C. Either or both of these pair
> >>>> - * with smp_mb() A above to ensure that the scan below will see
> >>>> - * the read-side tasks's increment, thus noting a difference in
> >>>> - * the counter values between the two passes.
> >>>> + * Validation step, smp_mb() D pairs with smp_mb() C. If the above
> >>>> + * srcu_readers_active_idx() see a decrement of the active counter
> >>>> + * in srcu_read_unlock(), it should see one of these for corresponding
> >>>> + * srcu_read_lock():
> >>>> + * See the increment of the active counter,
> >>>> + * Failed to see the increment of the active counter.
> >>>> + * The second one can cause srcu_readers_active_idx() incorrectly
> >>>> + * return zero, but it means the above srcu_readers_seq_idx() does not
> >>>> + * see the increment of the seq(ref: comments of smp_mb() A),
> >>>> + * and the following srcu_readers_seq_idx() sees the increment of
> >>>> + * the seq. The seq is changed.
> >>>> *
> >>>> - * Therefore, if srcu_readers_active_idx() returned zero, and
> >>>> - * none of the counters changed, we know that the zero was the
> >>>> - * correct sum.
> >>>> - *
> >>>> - * Of course, it is possible that a task might be delayed
> >>>> - * for a very long time in __srcu_read_lock() after fetching
> >>>> - * the index but before incrementing its counter. This
> >>>> - * possibility will be dealt with in __synchronize_srcu().
> >>>> + * This smp_mb() D pairs with smp_mb() C for critical section.
> >>>> + * then any of the current task's subsequent code will happen after
> >>>> + * that SRCU read-side critical section whose read-unlock is seen in
> >>>> + * srcu_readers_active_idx().
> >>>> */
> >>>> - for_each_possible_cpu(cpu)
> >>>> - if (sp->snap[cpu] !=
> >>>> - ACCESS_ONCE(per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx]))
> >>>> - return false; /* False zero reading! */
> >>>> - return true;
> >>>> + smp_mb(); /* D */
> >>>> +
> >>>> + return srcu_readers_seq_idx(sp, idx) == seq;
> >>>> }
> >>>>
> >>>> /**
> >>>> @@ -216,9 +214,9 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >>>> preempt_disable();
> >>>> idx = rcu_dereference_index_check(sp->completed,
> >>>> rcu_read_lock_sched_held()) & 0x1;
> >>>> - ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) +=
> >>>> - SRCU_USAGE_COUNT + 1;
> >>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
> >>>> smp_mb(); /* B */ /* Avoid leaking the critical section. */
> >>>> + ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
> >>>> preempt_enable();
> >>>> return idx;
> >>>> }
> >>>> @@ -258,17 +256,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>> int trycount = 0;
> >>>>
> >>>> /*
> >>>> - * If a reader fetches the index before the ->completed increment,
> >>>> - * but increments its counter after srcu_readers_active_idx_check()
> >>>> - * sums it, then smp_mb() D will pair with __srcu_read_lock()'s
> >>>> - * smp_mb() B to ensure that the SRCU read-side critical section
> >>>> - * will see any updates that the current task performed before its
> >>>> - * call to synchronize_srcu(), or to synchronize_srcu_expedited(),
> >>>> - * as the case may be.
> >>>> - */
> >>>> - smp_mb(); /* D */
> >>>> -
> >>>> - /*
> >>>> * SRCU read-side critical sections are normally short, so wait
> >>>> * a small amount of time before possibly blocking.
> >>>> */
> >>>> @@ -281,18 +268,6 @@ static void wait_idx(struct srcu_struct *sp, int idx, bool expedited)
> >>>> schedule_timeout_interruptible(1);
> >>>> }
> >>>> }
> >>>> -
> >>>> - /*
> >>>> - * The following smp_mb() E pairs with srcu_read_unlock()'s
> >>>> - * smp_mb C to ensure that if srcu_readers_active_idx_check()
> >>>> - * sees srcu_read_unlock()'s counter decrement, then any
> >>>> - * of the current task's subsequent code will happen after
> >>>> - * that SRCU read-side critical section.
> >>>> - *
> >>>> - * It also ensures the order between the above waiting and
> >>>> - * the next flipping.
> >>>> - */
> >>>> - smp_mb(); /* E */
> >>>> }
> >>>>
> >>>> static void srcu_flip(struct srcu_struct *sp)
> >>>> --
> >>>> 1.7.4.4
> >>>>
> >>>
> >>>
> >>
> >
> >
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/