OK, I warmed the code up a bit and did some more measurement.
Your locking patch has improved things significantly, and now
we're getting hurt by all the cache misses from walking the
pte chains.
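For context, the pte_chain in this kernel is a one-pte-per-node
singly linked list. Roughly (field names from memory, may not match
the 2.5.30 source exactly):

	struct pte_chain {
		struct pte_chain *next;	/* next node in this page's chain */
		pte_t *ptep;		/* the single pte this node covers */
	};

So a page shared N ways carries an N-node list, and every walk is one
dependent load, i.e. one potential cache miss, per pte.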
Here are some uniprocessor numbers:
up, 2.5.30+rmap-lock-speedup:
./daniel.sh 28.32s user 42.59s system 90% cpu 1:18.20 total
./daniel.sh 29.25s user 38.62s system 91% cpu 1:14.34 total
./daniel.sh 29.13s user 38.70s system 91% cpu 1:14.50 total
(profile columns: address, samples, % of total, function)
c01cdc88 149 0.965276 strnlen_user
c01341f4 181 1.17258 __page_add_rmap
c012d364 195 1.26328 rmqueue
c0147680 197 1.27624 __d_lookup
c010bb28 229 1.48354 timer_interrupt
c013f3b0 235 1.52242 link_path_walk
c01122cc 261 1.69085 do_page_fault
c0111fd0 291 1.8852 pte_alloc_one
c0124be4 292 1.89168 do_anonymous_page
c0123478 304 1.96942 clear_page_tables
c01236c8 369 2.39052 copy_page_range
c01078dc 520 3.36875 page_fault
c012b620 552 3.57606 kmem_cache_alloc
c0124d58 637 4.12672 do_no_page
c0123960 648 4.19798 zap_pte_range
c012b80c 686 4.44416 kmem_cache_free
c0134298 2077 13.4556 __page_remove_rmap
c0124540 2661 17.2389 do_wp_page
up, 2.5.26:
./daniel.sh 27.90s user 31.28s system 90% cpu 1:05.25 total
./daniel.sh 31.41s user 35.30s system 100% cpu 1:06.71 total
./daniel.sh 28.54s user 32.01s system 91% cpu 1:06.41 total
c0124f2c 167 1.21155 find_vma
c0131ea8 183 1.32763 do_page_cache_readahead
c012c07c 186 1.34939 rmqueue
c01c7dc8 192 1.39292 strnlen_user
c010ba78 210 1.52351 timer_interrupt
c0144c50 222 1.61056 __d_lookup
c01120b8 250 1.8137 do_page_fault
c013cc40 260 1.88624 link_path_walk
c0122cd0 282 2.04585 clear_page_tables
c0124128 337 2.44486 do_anonymous_page
c0122e7c 347 2.51741 copy_page_range
c0111e50 363 2.63349 pte_alloc_one
c01c94ac 429 3.1123 radix_tree_lookup
c01077cc 571 4.14248 page_fault
c0123070 620 4.49797 zap_pte_range
c0124280 715 5.18717 do_no_page
c0123afc 2957 21.4524 do_wp_page
So the pte_chain stuff seems to be costing 20% system time here
(~40s versus ~33s). But note that I made the do_page_cache_readahead
and radix_tree_lookup cost go away in 2.5.29. So it's more like 30%.
And it's all really in __page_remove_rmap, kmem_cache_alloc/free.
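To make that concrete, the remove path is a linear search with a
dependent load per node. A simplified sketch, not the actual 2.5.30
code: locking and the direct-pte case are omitted, and page->pte_chain
and pte_chain_free() are stand-ins for the real accessors:

	void __page_remove_rmap(struct page *page, pte_t *ptep)
	{
		struct pte_chain **prev = &page->pte_chain;
		struct pte_chain *pc;

		for (pc = *prev; pc != NULL; prev = &pc->next, pc = *prev) {
			if (pc->ptep == ptep) {		/* miss on each node until we hit */
				*prev = pc->next;	/* unlink */
				pte_chain_free(pc);	/* plus a slab free per unmap */
				return;
			}
		}
	}

Every unmap pays for the walk and then hands an 8-byte object back to
the slab, which is why kmem_cache_free shows up right next to
__page_remove_rmap in the profile.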
If we convert the pte_chain structure to

	struct pte_chain {
		struct pte_chain *next;
		pte_t *ptes[(L1_CACHE_BYTES - sizeof(struct pte_chain *)) / sizeof(pte_t *)];
	};
and take care to keep them compacted, we shall reduce the overhead
of both __page_remove_rmap and the slab functions by up to 7-, 15-
or 31-fold, depending on the L1 size: each node now fills exactly
one cache line, so with 4-byte pointers a 32-, 64- or 128-byte line
holds 7, 15 or 31 ptes. page_referenced() wins as well.
Plus we almost halve the memory consumption of the pte_chains in
the high sharing case: each shared pte costs one pointer instead of
a next/ptep pair. And if we have to kmap these suckers we reduce
the frequency of that by 7x, 15x, 31x, etc.
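To spell out "keep them compacted", here is one possible shape for
the remove path: plug the hole with the head block's last live entry,
and free a block only when it drains. Everything here (NRPTE,
page->pte_chain, pte_chain_free(), the invariant that only the head
block may be partially full) is illustrative, not final code:

	#define NRPTE ((L1_CACHE_BYTES - sizeof(struct pte_chain *)) / sizeof(pte_t *))

	struct pte_chain {
		struct pte_chain *next;
		pte_t *ptes[NRPTE];	/* live slots first, NULLs at the tail */
	};

	/* Invariant: only the head block may be partially full. */
	void page_remove_rmap_compacted(struct page *page, pte_t *ptep)
	{
		struct pte_chain *head = page->pte_chain;
		struct pte_chain *pc;
		int i, last;

		/* index of the head block's last live entry */
		for (last = NRPTE - 1; last > 0 && head->ptes[last] == NULL; last--)
			;

		for (pc = head; pc != NULL; pc = pc->next) {
			for (i = 0; i < NRPTE; i++) {
				if (pc->ptes[i] != ptep)
					continue;
				/* fill the hole from the head, keeping blocks dense */
				pc->ptes[i] = head->ptes[last];
				head->ptes[last] = NULL;
				if (last == 0) {
					/* head block drained: pop and free it */
					page->pte_chain = head->next;
					pte_chain_free(head);
				}
				return;
			}
		}
	}

That keeps every block except the head completely full, so the walk
in page_referenced() and friends touches NRPTE ptes per cache line
instead of one.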
I'll code it tomorrow.