Re: [patch V3] percpu_counter: scalability works

From: Shaohua Li
Date: Tue May 17 2011 - 01:22:56 EST

Next message: Linus Torvalds: "Re: [PATCH -v2] pci: Check bridge resources after resource allocation."
Previous message: Shreshtha Kumar SAHU: "[PATCHv2 1/2] amba pl011: workaround for uart registers lockup"
In reply to: Eric Dumazet: "Re: [patch V3] percpu_counter: scalability works"
Next in thread: Eric Dumazet: "Re: [patch V3] percpu_counter: scalability works"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 2011-05-17 at 12:56 +0800, Eric Dumazet wrote:
> Le mardi 17 mai 2011 Ã 08:55 +0800, Shaohua Li a Ãcrit :
>
> > I'm still interesting in improving percpu_counter itself. If we can
> > improve it, why we don't? My patch doesn't slow down anything for all
> > tests. Why didn't you ever look at it?
> >
>
> I did and said there were obvious problems in it.
>
> 1) 4096 cpus : size of one percpu_counter is 16Kbytes.
>
> After your last patch -> 20 kbytes for no good reason.
I don't know why you said there is no good reason. I posted a lot of
data which shows improvement, while you just ignore.

The size issue is completely pointless. If you have 4096 CPUs, how could
you worry about 16k bytes memory. Especially the extra memory makes the
API much faster.

> 2) Two separate alloc_percpu() -> two separate cache lines instead of
> one.
Might be in one cache line actually, but can be easily fixed if not
anyway. On the other hand, even touch two cache lines, it's still faster
than the original spinlock implementation, which I already posted data.

> But then, if one alloc_percpu() -> 32 kbytes per object.
the size issue is completely pointless

> 3) Focus on percpu_counter() implementation instead of making an
> analysis of callers.
>
> I did a lot of rwlocks removal in network stack because they are not the
> right synchronization primitive in many cases. I did not optimize
> rwlocks. If rwlocks were even slower, I suspect other people would have
> help me to convert things faster.
My original issue is mmap, but I already declaimed several times we can
make percpu_counter better, why won't we?

> 4) There is a possible way to solve your deviation case : add at _add()
> beginning a short cut for crazy 'amount' values. Its a bit expensive on
> 32bit arches, so might be added in a new helper to let _add() be fast
> for normal and gentle users.

+ if (unlikely(cmpxchg(ptr, old, 0) != old))
> + goto retry;
this doesn't change anything, you still have the deviation issue here

> + atomic64_add(count, &fbc->count);

> if (unlikely(amount >= batch || amount <= -batch)) {
> atomic64(amount, &fbc->count);
> return;
> }
why we just handle this special case, my patch can make the whole part
faster without deviation

so you didn't point out any obvious problem with my patch actually. This
is good.

Thanks,
Shaohua

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Linus Torvalds: "Re: [PATCH -v2] pci: Check bridge resources after resource allocation."
Previous message: Shreshtha Kumar SAHU: "[PATCHv2 1/2] amba pl011: workaround for uart registers lockup"
In reply to: Eric Dumazet: "Re: [patch V3] percpu_counter: scalability works"
Next in thread: Eric Dumazet: "Re: [patch V3] percpu_counter: scalability works"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]