Re: tasks-trace RCU: question about grace period forward progress

From: Paul E. McKenney
Date: Thu Feb 25 2021 - 13:36:21 EST


On Thu, Feb 25, 2021 at 10:47:32AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 25, 2021, at 10:36 AM, paulmck paulmck@xxxxxxxxxx wrote:
>
> > On Thu, Feb 25, 2021 at 09:22:48AM -0500, Mathieu Desnoyers wrote:
> >> Hi Paul,
> >>
> >> Answering a question from Peter on IRC got me to look at rcu_read_lock_trace(),
> >> and I see this:
> >>
> >> static inline void rcu_read_lock_trace(void)
> >> {
> >> struct task_struct *t = current;
> >>
> >> WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
> >> barrier();
> >> if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
> >> t->trc_reader_special.b.need_mb)
> >> smp_mb(); // Pairs with update-side barriers
> >> rcu_lock_acquire(&rcu_trace_lock_map);
> >> }
> >>
> >> static inline void rcu_read_unlock_trace(void)
> >> {
> >> int nesting;
> >> struct task_struct *t = current;
> >>
> >> rcu_lock_release(&rcu_trace_lock_map);
> >> nesting = READ_ONCE(t->trc_reader_nesting) - 1;
> >> barrier(); // Critical section before disabling.
> >> // Disable IPI-based setting of .need_qs.
> >> WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
> >> if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
> >> WRITE_ONCE(t->trc_reader_nesting, nesting);
> >> return; // We assume shallow reader nesting.
> >> }
> >> rcu_read_unlock_trace_special(t, nesting);
> >> }
> >>
> >> AFAIU, each thread keeps track of whether it is nested within a RCU read-side
> >> critical
> >> section with a counter, and grace periods iterate over all threads to make sure
> >> they
> >> are not within a read-side critical section before they can complete:
> >>
> >> # define rcu_tasks_trace_qs(t) \
> >> do { \
> >> if (!likely(READ_ONCE((t)->trc_reader_checked)) && \
> >> !unlikely(READ_ONCE((t)->trc_reader_nesting))) { \
> >> smp_store_release(&(t)->trc_reader_checked, true); \
> >> smp_mb(); /* Readers partitioned by store. */ \
> >> } \
> >> } while (0)
> >>
> >> It reminds me of the liburcu urcu-mb flavor which also deals with per-thread
> >> state to track whether threads are nested within a critical section:
> >>
> >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L90
> >> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-mb.h#L125
> >>
> >> static inline void _urcu_mb_read_lock_update(unsigned long tmp)
> >> {
> >> if (caa_likely(!(tmp & URCU_GP_CTR_NEST_MASK))) {
> >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr,
> >> _CMM_LOAD_SHARED(urcu_mb_gp.ctr));
> >> cmm_smp_mb();
> >> } else
> >> _CMM_STORE_SHARED(URCU_TLS(urcu_mb_reader).ctr, tmp + URCU_GP_COUNT);
> >> }
> >>
> >> static inline void _urcu_mb_read_lock(void)
> >> {
> >> unsigned long tmp;
> >>
> >> urcu_assert(URCU_TLS(urcu_mb_reader).registered);
> >> cmm_barrier();
> >> tmp = URCU_TLS(urcu_mb_reader).ctr;
> >> urcu_assert((tmp & URCU_GP_CTR_NEST_MASK) != URCU_GP_CTR_NEST_MASK);
> >> _urcu_mb_read_lock_update(tmp);
> >> }
> >>
> >> The main difference between the two algorithm is that task-trace within the
> >> kernel lacks the global "urcu_mb_gp.ctr" state snapshot, which is either
> >> incremented or flipped between 0 and 1 by the grace period. This allow RCU
> >> readers
> >> outermost nesting starting after the beginning of the grace period not to
> >> prevent
> >> progress of the grace period.
> >>
> >> Without this, a steady flow of incoming tasks-trace-RCU readers can prevent the
> >> grace period from ever completing.
> >>
> >> Or is this handled in a clever way that I am missing here ?
> >
> > There are several mechanisms designed to handle this. The following
> > paragraphs describe these at a high level.
> >
> > The trc_wait_for_one_reader() is invoked on each task. It uses the
> > try_invoke_on_locked_down_task(), which, if the task is currently not
> > running, keeps it that way and invokes trc_inspect_reader(). If the
> > locked-down task is in a read-side critical section, the need_qs field
> > is set, which will cause the task's next rcu_read_lock_trace() to report
> > the quiescent state.
>
> I suspect you meant "rcu_read_unlock_trace()" here.

You are quite correct, apologies for my early morning confusion!

> > If read-side memory barriers have been enabled, trc_inspect_reader()
> > is able to check for a reader being active, and if not, reports the
> > quiescent state. If there is a reader, trc_inspect_reader() reports
> > failure, which is another path to the following paragraph.
> >
> > If the task could not be locked down due its currently running,
> > then trc_wait_for_one_reader() attempts to send an IPI, which results in
> > trc_read_check_handler() rechecking for a read-side critical section
> > and either reporting the quiescent state immediately or proceding in the
> > same way that trc_inspect_reader() does. The trc_read_check_handler()
> > of course checks to make sure that the target task is still running
> > before doing anything. If the attempt to send the IPI fails, then
> > the task is rechecked in a later pass.
> >
> > So what sequence of events did you find that causes these mechanisms
> > to fail?
>
> The explanation you provide takes care of my concerns, so I don't have
> any remaining problematic scenario in mind.

Would the block comment added by the below patch have helped?

One question for Peter... Does each and every context switch imply a
full barrier?

I am pretty sure that it does, but figured that this was a good time
to double-check, given that RCU Tasks Trace assumes this. ;-)

Thanx, Paul

------------------------------------------------------------------------

commit 581f79546b6be406a9c7280b2d3511b60821efe0
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date: Thu Feb 25 10:26:00 2021 -0800

rcu-tasks: Add block comment laying out RCU Tasks Trace design

This commit adds a block comment that gives a high-level overview of
how RCU tasks trace grace periods progress. It also adds a note about
how exiting tasks are handles, plus it gives an overview of the memory
ordering.

Reported-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 17c8ebe..f818357 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -726,6 +726,42 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread);
// flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace
// readers can operate from idle, offline, and exception entry/exit in no
// way allows rcu_preempt and rcu_sched readers to also do so.
+//
+// The implementation uses rcu_tasks_wait_gp(), which relies on function
+// pointers in the rcu_tasks structure. The rcu_spawn_tasks_trace_kthread()
+// function sets these function pointers up so that rcu_tasks_wait_gp()
+// invokes these functions in this order:
+//
+// rcu_tasks_trace_pregp_step():
+// Initialize the count of readers and block CPU-hotplug operations.
+// rcu_tasks_trace_pertask(), invoked on every non-idle task:
+// Initialize per-task state and attempt to identify an immediate
+// quiescent state for that task, or, failing that, attempt to set
+// that task's .need_qs flag so that that task's next outermost
+// rcu_read_unlock_trace() will report the quiescent state (in which
+// case the count of readers is incremented). If both attempts fail,
+// the task is added to a "holdout" list.
+// rcu_tasks_trace_postscan():
+// Initialize state and attempt to identify an immediate quiescent
+// state as above (but only for idle tasks), unblock CPU-hotplug
+// operations, and wait for an RCU grace period to avoid races with
+// tasks that are in the process of exiting.
+// check_all_holdout_tasks_trace(), repeatedly until holdout list is empty:
+// Scans the holdout list, attempting to identify a quiescent state
+// for each task on the list. If there is a quiescent state, the
+// corresponding task is removed from the holdout list.
+// rcu_tasks_trace_postgp():
+// Wait for the count of readers do drop to zero, reporting any stalls.
+// Also execute full memory barriers to maintain ordering with code
+// executing after the grace period.
+//
+// The exit_tasks_rcu_finish_trace() synchronizes with exiting tasks.
+//
+// Pre-grace-period update-side code is ordered before the grace
+// period via the ->cbs_lock and barriers in rcu_tasks_kthread().
+// Pre-grace-period read-side code is ordered before the grace period by
+// atomic_dec_and_test() of the count of readers (for IPIed readers) and by
+// scheduler context-switch ordering (for locked-down non-running readers).

// The lockdep state must be outside of #ifdef to be useful.
#ifdef CONFIG_DEBUG_LOCK_ALLOC