Re: RCU qsmask !=0 warnings on large-SMP...

From: Steffen Persvold
Date: Wed Jan 25 2012 - 11:29:42 EST


On 1/25/2012 15:00, Paul E. McKenney wrote:
> On Wed, Jan 25, 2012 at 05:44:34PM +0800, Daniel J Blueman wrote:
[]
>
> I do test three-level trees regularly, but on relatively small systems
> by limiting CONFIG_RCU_FANOUT. If you are running 64-bit systems
> with default values of CONFIG_RCU_FANOUT, a 1024-CPU system fits into a
> two-level tree, so I would guess that you either have more than 1024 CPUs
> (at least as far as NR_CPUS is concerned) or gaps in your CPU-numbering
> scheme.

NR_CPU is set to 4096 (i.e CONFIG_MAXSMP=y) and CONFIG_FCU_FANOUT is 64. We see 3 levels of RCU nodes with this configuration.

>
>> Has this been encountered previously?
>
> I haven't seen that warning since I was originally developing
> TREE_PREEMPT_RCU some years back.
>
> Numascale NumaConnect is cache-coherent, right?

Correct.

[]
>>
>> WARNING: at kernel/rcutree_plugin.h:990
>
> OK, if I have the right version, RCU-sched is trying to do a new grace
> period, but finds that one of the rcu_node structure's ->qsmask fields
> indicates that a CPU (or group of CPUs) failed to check into the previous
> grace period.
>
[]

>>
>> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
>> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d000,
>> qsmask is 0x1
>> 0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=279
>> ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>> 12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=10
>> ql=2 qs=.R.. b=10 ci=561 co=0 ca=0
>> 24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=14
>> ql=0 qs=.... b=10 ci=406 co=0 ca=0
>> 36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=22
>> ql=0 qs=.... b=10 ci=215 co=0 ca=0
>> 48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=14
>> ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0
>
> And it looks like CPU 48 is the one that did not check in. What is
> CONFIG_NO_HZ set to? If CONFIG_NO_HZ=y, this can legitimately
> happen, and printing the values of rdp->dynticks->dynticks,
> rdp->dynticks->dynticks_nesting, rdp->dynticks->dynticks_nmi_nesting,
> and rdp->dynticks_fqs will determine whether or not the situation is
> legitimate or not.
>
> If CONFIG_NO_HZ=n, than this should not happen. In this case, I should
> be able to provide you with a more focused debug print.

CONFIG_NO_HZ is not set, so it should not happen. We see that the behavior is the same with CONFIG_NO_HZ=y though, but it takes longer to reproduce usually.

>
>> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16
>> ql=0 qs=.... b=10 ci=50 co=0 ca=0
>> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26
>> ql=0 qs=.... b=10 ci=22 co=0 ca=0
[]

>> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
>> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d100,
>> qsmask is 0x8
>> 0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=279
>> ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>> 12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=10
>> ql=3 qs=NR.. b=10 ci=561 co=0 ca=0
>> 24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=14
>> ql=0 qs=.... b=10 ci=406 co=0 ca=0
>> 36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=22
>> ql=0 qs=.... b=10 ci=215 co=0 ca=0
>> 48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=19
>> ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0
>
> Hmmmm... Same CPU and grace-period number. This is a repeat of the
> above printout, correct?

Because the RCU tree is 3 levels, the printout function we added in the patch gets called 3 times each time with the same RDP but with different RNPs (in rcu_start_gp()).

>
>> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16
>> ql=0 qs=.... b=10 ci=50 co=0 ca=0
>> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26
>> ql=0 qs=.... b=10 ci=22 co=0 ca=0
>> CPU 48, treason uncloaked, rsp @ ffffffff81a1d000 (rcu_sched),
>> rsp->gpnum is 4058, rsp->completed is 4057, rnp @ ffffffff81a1d800,
>> qsmask is 0x1
>> 0 ffff8803f840d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=0 ri=280
>> ql=1 qs=..W. b=10 ci=145949 co=0 ca=0
>> 12 ffff880bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=7 ri=11
>> ql=3 qs=NR.. b=10 ci=561 co=0 ca=0
>> 24 ffff8813d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=15
>> ql=0 qs=.... b=10 ci=406 co=0 ca=0
>> 36 ffff881bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=5 ri=23
>> ql=0 qs=.... b=10 ci=215 co=0 ca=0
>> 48 ffff8823d040d4e0 c=4057 g=4057 pq=1 pgp=4057 qp=0 of=6 ri=21
>> ql=26 qs=.RWD b=10 ci=7345 co=0 ca=0
>
> Same here, but most of the ql= values are larger. Later printout?

The loop in rcu_start_gp() releases the node lock between each time it gets a new level in the RCU tree (it has to) :

rcu_for_each_node_breadth_first(rsp, rnp) {
raw_spin_lock(&rnp->lock); /* irqs already disabled. */
rcu_debug_print(rsp, rnp);

so I guess this allows ql= values to increase maybe, no ?

>
> Thanx, Paul
>
>> 60 ffff882bd040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=16
>> ql=0 qs=.... b=10 ci=50 co=0 ca=0
>> 72 ffff8833d040d4e0 c=4057 g=4058 pq=1 pgp=4058 qp=0 of=6 ri=26
>> ql=0 qs=.... b=10 ci=22 co=0 ca=0
>>

Thanks for looking into this Paul, we'd be more than happy to test out theories and patches.


Kind regards,
--
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/