On Wed, 1 Dec 2004, Christoph Lameter wrote:
Changes from V11->V12 of this patch:
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors allocating 32 GB with an increasing
number of threads (that are assigned a processor each).
Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
before) your mailer or whatever is eating trailing whitespace: trivial
patch attached to apply before yours, removing that whitespace so yours
apply. But what your patches need to apply to would be 2.6.10-mm.
Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
have tested out okay up to 4GB but not above: trivial patch attached.
Moving to the main patch, 1/7, the major issue I see there is the way
do_anonymous_page does update_mmu_cache after setting the pte, without
any page_table_lock to bracket them together. Obviously no problem on
architectures where update_mmu_cache is a no-op! But although there's
been plenty of discussion, particularly with Ben and Nick, I've not
noticed anything to guarantee that as safe on all architectures. I do
think it's fine for you to post your patches before completing hooks in
all the arches, but isn't this a significant issue which needs to be
sorted before your patches go into -mm? You hazily refer to such issues
in 0/7, but now you need to work with arch maintainers to settle them
and show the patches.
A lesser issue with the reordering in do_anonymous_page: don't you need
to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
the very slight chance that vmscan will pick the page off the LRU and
unmap it before you've counted it in, hitting page_remove_rmap's
BUG_ON(page_mapcount(page) < 0)?
(I do wonder why do_anonymous_page calls mark_page_accessed as well as
lru_cache_add_active. The other instances of lru_cache_add_active for
an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
why here? But that's nothing new with your patch, and although you've
reordered the calls, the final page state is the same as before.)
Where handle_pte_fault does "entry = *pte" without page_table_lock:
you're quite right to passing down precisely that entry to the fault
handlers below, but there's still a problem on the 32bit architectures
supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
of entry may be out of synch. Not a problem for do_anonymous_page, or
anything else relying on ptep_cmpxchg to check; but a problem for
do_wp_page (which could find !pfn_valid and kill the process) and
probably others (harder to think through). Your 4/7 patch for i386 has
an unused atomic get_64bit function from Nick, I think you'll have to
define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.
Hmm, that will only work if you're using atomic set_64bit rather than
relying on page_table_lock in the complementary places which matter.
Which I believe you are indeed doing in your 3level set_pte. Shouldn't
__set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?