Re: Mainline kernel OLTP performance update

From: Zhang, Yanmin
Date: Tue Jan 20 2009 - 00:16:49 EST


On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin_zhang@xxxxxxxxxxxxxxx> writes:
>
>
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
>
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.
Andi,

Thanks for the information. I did more investigation of the netperf
UDP-U-4k issue with SLUB.

oprofile shows:
328058  30.1342  linux-2.6.29-rc2  copy_user_generic_string
134666  12.3699  linux-2.6.29-rc2  __free_pages_ok
125447  11.5231  linux-2.6.29-rc2  get_page_from_freelist
 22611   2.0770  linux-2.6.29-rc2  __sk_mem_reclaim
 21442   1.9696  linux-2.6.29-rc2  list_del
 21187   1.9462  linux-2.6.29-rc2  __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much CPU time
(about 24% of the samples combined). With SLQB, these 2 functions consume almost no time.

Command 'slabinfo -AD' shows:
Name        Objects     Alloc      Free  %Fast
:0000256       1685  29611065  29609548  99 99
:0000168       2987    164689    161859  94 39
:0004096       1471    114918    113490  99 97

So kmem_cache :0000256 is very active.

Kernel stack dump in __free_pages_ok shows:
[<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
[<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
[<ffffffff8060f387>] __kfree_skb+0x9/0x6f
[<ffffffff8061204b>] skb_free_datagram+0xc/0x31
[<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
[<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
[<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb =>
kfree_skbmem =>
kmem_cache_free(skbuff_head_cache, skb);

The object size of kmem_cache skbuff_head_cache is just 256, so it shares a kmem_cache
with :0000256. Its order is 1, which means every slab consists of 2 physical pages.
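
For illustration, the slab geometry above can be worked out as follows. This is
only a sketch, assuming 4 KiB pages and no metadata stored inside the slab; the
helper names are hypothetical and are not kernel functions.

```c
/* Assumption: 4 KiB pages (the usual x86-64 page size). */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical helper: physical pages in one slab of the given order. */
static unsigned long slab_pages(unsigned int order)
{
    return 1UL << order;
}

/* Hypothetical helper: objects that fit in one slab, ignoring any metadata. */
static unsigned long slab_objects(unsigned int order, unsigned long object_size)
{
    return (slab_pages(order) * SKETCH_PAGE_SIZE) / object_size;
}
```

Under these assumptions, slab_pages(1) is 2 and slab_objects(1, 256) is 32, so an
order-1 slab of 256-byte objects becomes empty after 32 frees.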

netperf UDP-U-4k is a UDP stream test. The client process keeps sending 4k-size packets
to the server process, and the server process just receives the packets one by one.

If we start CPU_NUM clients and the same number of servers, every client sends lots
of packets within one sched slice, then the process scheduler schedules the server to receive
many packets within one sched slice; then the client sends again. So there are many packets
in the queue. When the server receives the packets, it frees the skbuff_head_cache objects.
When all of a slab's objects are free, the slab is released by calling __free_pages. Such batch
sending/receiving creates lots of slab free activity.
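
A rough way to see the effect is to count slab-level page operations for one
send/receive cycle. The toy model below is only a sketch: it assumes the cache
starts with no slabs, that nothing else uses the cache, and that an empty slab
is returned to the page allocator immediately (no empty slabs are retained).
The names and the figure of 32 objects per slab (8192 / 256, assuming 4 KiB
pages) are assumptions for illustration.

```c
/* Toy counters; not kernel data structures. */
struct toy_slab_stats {
    unsigned long slab_allocs;  /* slabs requested from the page allocator */
    unsigned long slab_frees;   /* slabs handed back to the page allocator */
};

/*
 * One cycle: the client allocates `batch` objects, then the server frees
 * all of them. Every slab that was needed becomes empty again and is freed.
 */
static struct toy_slab_stats toy_batch_cycle(unsigned long batch,
                                             unsigned long objs_per_slab)
{
    struct toy_slab_stats st = { 0, 0 };
    unsigned long slabs = (batch + objs_per_slab - 1) / objs_per_slab; /* round up */

    st.slab_allocs = slabs;
    st.slab_frees = slabs;
    return st;
}
```

In this model a batch of 1000 packets with 32 objects per slab costs 32 slab
allocations and 32 slab frees per cycle, and the cost repeats on every cycle.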

The page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
But here the order of skbuff_head_cache is 1, so UDP-U-4k can't benefit from the page buffer.
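
The order check can be pictured with a small toy model. It only mirrors the
order test made on the free path (order 0 goes to the per-CPU page buffer,
anything larger goes through __free_pages_ok); the struct and function names
below are hypothetical and nothing here is kernel code.

```c
/* Toy counters standing in for the two free paths. */
struct toy_free_stats {
    unsigned long pcp_frees;    /* order-0 pages parked in the per-CPU buffer */
    unsigned long buddy_frees;  /* higher-order frees that bypass the buffer */
};

/* Dispatch one free by order, the way the text describes it. */
static void toy_free_pages(struct toy_free_stats *st, unsigned int order)
{
    if (order == 0)
        st->pcp_frees++;
    else
        st->buddy_frees++;
}
```

With order-1 slabs every slab free lands in buddy_frees, which matches
__free_pages_ok showing up so high in the profile.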

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Freed objects are put on the list first and can be picked up
later quickly without a lock. The batch parameter that controls free object recollection is mostly
1024.
2) The SLQB slab order is mostly 0, so although it sometimes calls alloc_pages/free_pages, it can
benefit from the zone_pcp(zone, cpu)->pcp page buffer.
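
The idea in point 1) can be sketched as a toy per-CPU freelist with a
watermark. This is not SLQB's code: the names are hypothetical, the value 1024
is taken from the "mostly 1024" figure above, and trimming half of the list at
the watermark is an arbitrary choice made for the example.

```c
#include <stddef.h>

#define SKETCH_BATCH 1024  /* assumption: the "mostly 1024" batch value */

struct toy_object {
    struct toy_object *next;
};

struct toy_cpu_cache {
    struct toy_object *freelist;  /* LIFO list of free objects, owned by one CPU */
    unsigned long nr_free;        /* objects currently on the list */
    unsigned long flushes;        /* how many times the list was trimmed */
};

/* Free: push onto the per-CPU list; trim only when the watermark is exceeded. */
static void toy_free(struct toy_cpu_cache *c, struct toy_object *obj)
{
    obj->next = c->freelist;
    c->freelist = obj;
    c->nr_free++;

    if (c->nr_free > SKETCH_BATCH) {
        /* A real allocator would return these objects to their slabs;
         * the toy just unlinks half of the list. */
        unsigned long drop = c->nr_free / 2;

        while (drop--) {
            c->freelist = c->freelist->next;
            c->nr_free--;
        }
        c->flushes++;
    }
}

/* Alloc: pop from the per-CPU list; NULL means "take the slow path". */
static struct toy_object *toy_alloc(struct toy_cpu_cache *c)
{
    struct toy_object *obj = c->freelist;

    if (obj) {
        c->freelist = obj->next;
        c->nr_free--;
    }
    return obj;
}
```

As long as a batch stays under the watermark, frees and the allocations that
follow never leave the per-CPU list, so no pages are returned at all.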

So SLUB needs to handle the case where one process allocates a batch of objects and another
process frees them in a batch.

yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/