Re: [PATCH] alloc_percpu() fails to allocate percpu data

From: Mike Snitzer
Date: Tue Mar 11 2008 - 15:39:21 EST


On 3/11/08, Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:
> Mike Snitzer wrote:
>
> > On 2/21/08, Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:
> >
> >> Some oprofile results obtained while running tbench on a 2x2 CPU machine
> >> were very surprising.
> >>
> >> For example, the loopback_xmit() function was using a high number of CPU
> >> cycles to perform the statistics updates, which are supposed to be really
> >> cheap since they use percpu data:
> >>
> >> pcpu_lstats = netdev_priv(dev);
> >> lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> >> lb_stats->packets++; /* HERE : serious contention */
> >> lb_stats->bytes += skb->len;
> >>
> >>
> >> struct pcpu_lstats is a small structure containing two longs. It appears
> >> that on my 32-bit platform, alloc_percpu(8) packs the per-CPU copies into
> >> a single cache line, instead of giving each CPU a separate cache line.
> >>
> >> Using the following patch gave me an impressive boost in various
> >> benchmarks (6% in tbench); all percpu_counters hit this bug too.
> >>
> >> The long-term fix (i.e. >= 2.6.26) would be to let each CPU allocate its
> >> own block of memory, so that we don't need to round up sizes to
> >> L1_CACHE_BYTES, or to merge the SGI stuff, of course...
> >>
> >> Note: SLUB vs SLAB is important here to *show* the improvement, since
> >> they don't have the same minimum allocation sizes (8 bytes vs 32 bytes).
> >> This could very well explain regressions some people reported when they
> >> switched to SLUB.
> >>
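
The effect is easy to reproduce outside the kernel, for what it's worth.
Below is a minimal user-space sketch (plain pthreads, not kernel code; the
64-byte line size is an assumption): two threads bump counters that sit next
to each other in one cache line, then counters padded out to a line each.
Timing the two phases (e.g. with time(1) or clock_gettime()) should show the
padded case running noticeably faster on SMP. The volatile is only there to
keep the compiler from collapsing the loops.

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64                    /* assumed L1 cache line size */
#define LOOPS      100000000UL

struct packed_stats {                    /* two small counters: both array  */
        volatile unsigned long packets;  /* elements typically land in the  */
        unsigned long bytes;             /* same 64-byte line               */
};

struct padded_stats {                    /* padded to one cache line each */
        volatile unsigned long packets;
        unsigned long bytes;
} __attribute__((aligned(CACHE_LINE)));

static struct packed_stats packed[2];    /* adjacent -> false sharing */
static struct padded_stats padded[2];    /* separate lines -> no sharing */

static void *bump_packed(void *arg)
{
        struct packed_stats *s = arg;
        for (unsigned long i = 0; i < LOOPS; i++)
                s->packets++;            /* cache line ping-pongs between CPUs */
        return NULL;
}

static void *bump_padded(void *arg)
{
        struct padded_stats *s = arg;
        for (unsigned long i = 0; i < LOOPS; i++)
                s->packets++;            /* each thread owns its own line */
        return NULL;
}

int main(void)
{
        pthread_t t[2];

        /* phase 1: both threads hammer counters in the shared cache line */
        pthread_create(&t[0], NULL, bump_packed, &packed[0]);
        pthread_create(&t[1], NULL, bump_packed, &packed[1]);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);

        /* phase 2: same work, but each thread's counter has its own line */
        pthread_create(&t[0], NULL, bump_padded, &padded[0]);
        pthread_create(&t[1], NULL, bump_padded, &padded[1]);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);

        printf("packed %lu/%lu, padded %lu/%lu\n",
               packed[0].packets, packed[1].packets,
               padded[0].packets, padded[1].packets);
        return 0;
}
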
> >
> >
> > I see that this fix was committed to mainline as commit
> > be852795e1c8d3829ddf3cb1ce806113611fa555
> >
> > The commit didn't carry a "Cc: <stable@xxxxxxxxxx>" tag, and it doesn't
> > appear to be queued for 2.6.24.x. Should it be?
> >
> >
>
> Yes, it should be queued for 2.6.24.x.

That means both of the following commits need to be cherry-picked into 2.6.24.x:
b3242151906372f30f57feaa43b4cac96a23edb1
be852795e1c8d3829ddf3cb1ce806113611fa555
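
For reference, pulling those into a local 2.6.24.y tree would be something
like:

  git cherry-pick -x b3242151906372f30f57feaa43b4cac96a23edb1
  git cherry-pick -x be852795e1c8d3829ddf3cb1ce806113611fa555

(the -x records the mainline commit id in the backported changelog; whether
the stable team prefers them submitted via stable@xxxxxxxxxx is their call).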

> > If I understand you correctly, SLAB doesn't create this particular
> > cache thrashing on 32-bit systems? Is SLAB OK on other architectures
> > too? Can you (or others) comment on the importance of this fix
> > relative to x86_64 (64-byte cache line) and SLAB?
> >
> >
>
>
> The fix is important for both 32-bit and 64-bit kernels, with SLAB or SLUB.
>
> SLAB does have this problem, but it is less prevalent than with SLUB,
> because the two allocators don't have the same minimum allocation size
> (32 bytes vs 8 bytes).
>
> So with SLUB it is possible that 8 CPUs share the same 64-byte cache line
> to store their percpu counters, while only 2 CPUs can share that same
> cache line with the SLAB allocator.

Thanks for the clarification.
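
For the archives: the idea behind the interim fix that went into mainline is
simply to round each per-CPU chunk up to a full cache line before allocating,
so that no two CPUs' counters can end up in the same line. A rough sketch of
that idea (illustrative names and an assumed 64-byte line; not the actual
mm/allocpercpu.c diff):

#include <stddef.h>

#define L1_CACHE_BYTES    64             /* assumed cache line size */
#define L1_CACHE_ALIGN(x) (((x) + L1_CACHE_BYTES - 1) & ~((size_t)L1_CACHE_BYTES - 1))

/* Illustrative helper, not the real kernel function: a request for the
 * 8-byte struct pcpu_lstats becomes a full 64-byte chunk per CPU, so the
 * allocator's minimum object size (8 bytes for SLUB, 32 for SLAB) no
 * longer decides how many CPUs end up sharing one cache line. */
static inline size_t percpu_chunk_size(size_t requested)
{
        return L1_CACHE_ALIGN(requested);
}

The trade-off is some wasted memory for small objects, which is why Eric
points at a real per-CPU allocator (each CPU allocating its own block) as
the better long-term answer.
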
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/