Re: [PATCH] percpu_counter: Fix __percpu_counter_sum()
From: Peter Zijlstra
Date: Tue Dec 09 2008 - 03:34:38 EST
On Tue, 2008-12-09 at 09:12 +0100, Eric Dumazet wrote:
> Peter Zijlstra a Ãcrit :
> > On Mon, 2008-12-08 at 18:00 -0500, Theodore Tso wrote:
> >> On Mon, Dec 08, 2008 at 11:20:35PM +0100, Peter Zijlstra wrote:
> >>> atomic_t is pretty good on all archs, but you get to keep the cacheline
> >>> ping-pong.
> >>>
> >> Stupid question --- if you're worried about cacheline ping-pongs, why
> >> aren't each cpu's delta counter cacheline aligned? With a 64-byte
> >> cache-line, and a 32-bit counters entry, with less than 16 CPU's we're
> >> going to be getting cache ping-pong effects with percpu_counter's,
> >> right? Or am I missing something?
> >
> > sorta - a new per-cpu allocator is in the works, but we do cacheline
> > align the per-cpu allocations (or used to), also, the allocations are
> > node affine.
> >
>
> I did work on a 'light weight percpu counter', aka percpu_lcounter, for
> all metrics that dont need 64 bits wide, but a plain 'long'
> (network, nr_files, nr_dentry, nr_inodes, ...)
>
> struct percpu_lcounter {
> atomic_long_t count;
> #ifdef CONFIG_SMP
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> #endif
> long *counters;
> #endif
> };
>
> (No more spinlock)
>
> Then I tried to have atomic_t (or atomic_long_t) for 'counters', but got a
> 10% slow down of __percpu_lcounter_add(), even if never hitting the 'slow path'
> atomic_long_add_return() is really expensiven, even on a non contended cache
> line.
>
> struct percpu_lcounter {
> atomic_long_t count;
> #ifdef CONFIG_SMP
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> #endif
> atomic_long_t *counters;
> #endif
> };
>
> So I believe the percpu_clounter_sum() that tries to reset to 0 all cpu local
> counts would be really too expensive, if it slows down _add() so much.
>
> long percpu_lcounter_sum(struct percpu_lcounter *fblc)
> {
> long acc = 0;
> int cpu;
>
> for_each_online_cpu(cpu)
> acc += atomic_long_xchg(per_cpu_ptr(fblc->counters, cpu), 0);
> return atomic_long_add_return(acc, &fblc->count);
> }
>
> void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
> {
> long count;
> atomic_long_t *pcount;
>
> pcount = per_cpu_ptr(flbc->counters, get_cpu());
> count = atomic_long_add_return(amount, pcount); /* way too expensive !!! */
Yeah, its an extra LOCK ins where there wasn't one before.
> if (unlikely(count >= batch || count <= -batch)) {
> atomic_long_add(count, &flbc->count);
> atomic_long_sub(count, pcount);
Also, this are two LOCKs where, with the spinlock, you'd likely only
have 1.
So yes, having the per-cpu variable an atomic seems like a way too
expensive idea. That xchg based _sum is cool though.
> }
> put_cpu();
> }
>
> Just forget about it and let percpu_lcounter_sum() only read the values, and
> let percpu_lcounter_add() not using atomic ops in fast path.
>
> void __percpu_lcounter_add(struct percpu_lcounter *flbc, long amount, s32 batch)
> {
> long count;
> long *pcount;
>
> pcount = per_cpu_ptr(flbc->counters, get_cpu());
> count = *pcount + amount;
> if (unlikely(count >= batch || count <= -batch)) {
> atomic_long_add(count, &flbc->count);
> count = 0;
> }
> *pcount = count;
> put_cpu();
> }
> EXPORT_SYMBOL(__percpu_lcounter_add);
>
>
> Also, with upcoming NR_CPUS=4096, it may be time to design a hierarchical percpu_counter,
> to avoid hitting one shared "fbc->count" all the time a local counter overflows.
So we'd normally write to the shared cacheline every cpus/batch.
Cascading this you'd get ln(cpus)/(batch^ln(cpus)) or something like
that, right? Won't just increasing batch give the same result - or are
we going to play funny games with the topology information?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/