Re: RCU scaling on large systems

From: Jack Steiner
Date: Fri May 07 2004 - 15:54:21 EST



On Sat, May 01, 2004 at 02:17:04PM -0700, William Lee Irwin III wrote:
> On Sat, May 01, 2004 at 07:08:05AM -0500, Jack Steiner wrote:
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Would something like this help cacheline contention? This uses the
> per_cpu data areas to hold per-cpu booleans for needing switches.
> Untested/uncompiled.
>
> The global lock is unfortunately still there.
>
>
> -- wli
>
...
>
> +#if RCU_CPU_SCATTER
> +#define rcu_need_switch(cpu) (!!atomic_read(&per_cpu(rcu_data, cpu).need_switch))
> +#define rcu_clear_need_switch(cpu) atomic_set(&per_cpu(rcu_data, cpu).need_switch, 0)
> +static inline int rcu_any_cpu_need_switch(void)
> +{
> +	int cpu;
> +	for_each_online_cpu(cpu) {
> +		if (rcu_need_switch(cpu))
> +			return 1;
> +	}
> +	return 0;
> +}
..

This fixes only part of the problem.

Referencing percpu data for every cpu is not particularly efficient - at
least on our platform. Percpu data is allocated so that it is on the
local node of each cpu.

We use 64MB granules. The percpu data structures on the individual nodes
are separated by addresses that differ by many GB. A scan of all percpu
data structures requires a TLB entry for each node. This is costly and
thrashes the TLB. (Our max system size is currently 256 nodes.)
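
To see why, note that each per_cpu() reference resolves to roughly
"variable address + __per_cpu_offset[cpu]" (simplified sketch below, not
the exact kernel macros; "some_flag" is a hypothetical per-cpu variable).
The point is that each iteration adds a different node's offset, so a
full scan dereferences one widely separated region per node:

	/* Simplified illustration only, not the real per_cpu() macros. */
	extern unsigned long __per_cpu_offset[NR_CPUS];
	extern int some_flag;		/* hypothetical per-cpu variable */

	#define percpu_addr(var, cpu) \
		((void *)((char *)&(var) + __per_cpu_offset[cpu]))

	static int any_cpu_flag_set(void)
	{
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			/* each read lands in a different node's local memory */
			if (*(int *)percpu_addr(some_flag, cpu))
				return 1;
		return 0;
	}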


For example, a 4 node, 2 cpus/node system shows:

mdb> m d __per_cpu_offset
<0xa000000100b357c8> 0xe000003001010000
<0xa000000100b357d0> 0xe000003001020000
<0xa000000100b357d8> 0xe00000b001020000
<0xa000000100b357e0> 0xe00000b001030000
<0xa000000100b357e8> 0xe000013001030000
<0xa000000100b357f0> 0xe000013001040000
<0xa000000100b357f8> 0xe00001b001040000
<0xa000000100b35800> 0xe00001b001050000

The node number is encoded in bits [48:39] of the virtual/physical
address.
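
So pulling the node number out of the offsets above is just a shift and
mask (sketch, assuming the [48:39] encoding applies directly to the
addresses in the dump, which it does here):

	/* bits [48:39]: shift right 39, keep 10 bits */
	#define addr_to_node(addr)	(((unsigned long)(addr) >> 39) & 0x3ff)

	/*
	 * Applied to the dump above:
	 *   0xe000003001010000 -> node 0    0xe00000b001020000 -> node 1
	 *   0xe000013001030000 -> node 2    0xe00001b001040000 -> node 3
	 */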

Unfortunately, our hardware does not provide a way to allocate node-local
memory for every node and have all of that memory covered by a single
TLB entry.


Moving "need_switch" to a single array with cacheline aligned
entries would work. I can give that a try.....
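
Something along these lines (rough, untested sketch - the names are made
up, and it just mirrors the macros from your patch rather than the real
RCU code):

	/* One flag per cpu, each padded to its own cacheline, all in a
	 * single contiguous array: one mapping (one TLB entry) covers the
	 * whole scan, and only the touched cachelines are paid for.
	 */
	struct rcu_switch_flag {
		atomic_t	need_switch;
	} ____cacheline_aligned;

	static struct rcu_switch_flag rcu_need_switch_map[NR_CPUS];

	#define rcu_need_switch(cpu) \
		(!!atomic_read(&rcu_need_switch_map[(cpu)].need_switch))
	#define rcu_clear_need_switch(cpu) \
		atomic_set(&rcu_need_switch_map[(cpu)].need_switch, 0)

I expect the tradeoff is that setting a flag now dirties a cacheline in
one globally placed array instead of node-local percpu memory, but for a
flag that is mostly read that looks like the right trade.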




--
Thanks

Jack Steiner (steiner@xxxxxxx) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.

