Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle statemachine
From: Frederic Weisbecker
Date: Wed Jul 17 2013 - 21:33:13 EST
On Wed, Jul 17, 2013 at 05:41:41PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 18, 2013 at 01:31:21AM +0200, Frederic Weisbecker wrote:
> > I'm missing a key here.
> >
> > Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
> > with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
> > for example?
>
> Good question! Let's see if I have a reasonable answer. ;-)
>
> I am going to start with the large-CPU case, so that the state is advanced
> only by the grace-period kthread.
>
> 1. Case 1: the timekeeper CPU invoked rcu_sysidle_force_exit().
> In this case, this is the same CPU that set full_sysidle_state
> to RCU_SYSIDLE_FULL_NOTED, so it is guaranteed not to see a
> stale value.
>
> 2. Case 2: Some CPU came out of idle, and invoked rcu_sysidle_exit().
> In this case, if this CPU reads a RCU_SYSIDLE_SHORT from
> full_sysidle_state, this read must have come before the
> cmpxchg() (on some other CPU) that set it to RCU_SYSIDLE_LONG.
> Because this CPU's read from full_sysidle_state was preceded by
> an atomic_inc() that updated this CPU's ->dynticks_idle, that
> update must also precede the cmpxchg() that set full_sysidle_state
> to RCU_SYSIDLE_LONG. Because the state advancing is done from
> within a single thread, the subsequent scan is guaranteed to see
> the first CPU's update of ->dynticks_idle, and will therefore
> refrain from advancing full_sysidle_state to RCU_SYSIDLE_FULL.
>
> This will in turn prevent the timekeeping thread from advancing
> the state to RCU_SYSIDLE_FULL_NOTED, so this scenario cannot
> happen.
Ok, so IIUC the safety is guaranteed in the following ordering:
CPU 0 CPU 1
idle = 1 smp_mb()
//for each cpu
if (atomic_read(rdp(1)->dyntick_idle) & 1) atomic_inc(rdp->dyntick_idle)
idle = 0 smp_mb()
if (idle)
cmpxchg(full_sysidle_state, SHORT, LONG) while (full_sysidle_state > SHORT)
//reset with cmpxchg
So it's like:
CPU 0 CPU 1
read I write I
smp_mb() smp_mb()
cmpxchg S read S
I still can't find what guarantees we don't read a value in CPU 1 that is way below
what we want.
>
> Unfortunately, the reasoning in #2 above does not hold in the small-CPU
> case because there is the possibility of both the timekeeping CPU and
> the RCU grace-period kthread concurrently advancing the state machine.
> This would be bad, good catch!!!
It's not like I spotted anything myself but you're welcome :)
>
> The patch below (untested) is an attempt to fix this. If it actually
> works, I will merge it in with 6/7.
>
> Anything else I missed? ;-)
Well I guess I'll wait one more night before trying to understand
the below ;)
Thanks!
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index aa3f525..fe83085 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1364,7 +1364,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
> }
> force_qs_rnp(rsp, dyntick_save_progress_counter,
> &isidle, &maxj);
> - rcu_sysidle_report(rsp, isidle, maxj);
> + rcu_sysidle_report_gp(rsp, isidle, maxj);
> fqs_state = RCU_FORCE_QS;
> } else {
> /* Handle dyntick-idle and offline CPUs. */
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 1602c21..657b415 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -559,8 +559,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
> unsigned long *maxj);
> static bool is_sysidle_rcu_state(struct rcu_state *rsp);
> static void rcu_bind_gp_kthread(void);
> -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> - unsigned long maxj);
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj);
> static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
>
> #endif /* #ifndef RCU_TREE_NONCORE */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index a4d44c3..f65d9c2 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -2655,14 +2655,22 @@ static void rcu_sysidle_cancel(void)
> * scan of the CPUs' dyntick-idle state.
> */
> static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> - unsigned long maxj)
> + unsigned long maxj, bool gpkt)
> {
> if (rsp != rcu_sysidle_state)
> return; /* Wrong flavor, ignore. */
> - if (isidle)
> - rcu_sysidle(maxj); /* More idle! */
> - else
> + if (isidle) {
> + if (gpkt && nr_cpu_ids > RCU_SYSIDLE_SMALL)
> + rcu_sysidle(maxj); /* More idle! */
> + } else {
> rcu_sysidle_cancel(); /* Idle is over. */
> + }
> +}
> +
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj)
> +{
> + rcu_sysidle_report(rsp, isidle, maxj, true);
> }
>
> /* Callback and function for forcing an RCU grace period. */
> @@ -2713,7 +2721,8 @@ bool rcu_sys_is_idle(void)
> if (!isidle)
> break;
> }
> - rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
> + rcu_sysidle_report(rcu_sysidle_state,
> + isidle, maxj, false);
> oldrss = rss;
> rss = ACCESS_ONCE(full_sysidle_state);
> }
> @@ -2776,8 +2785,8 @@ static void rcu_bind_gp_kthread(void)
> {
> }
>
> -static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
> - unsigned long maxj)
> +static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
> + unsigned long maxj)
> {
> }
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/