Re: [PATCH] thp: tail page refcounting fix

From: Andrea Arcangeli
Date: Tue Aug 23 2011 - 10:56:18 EST

Next message: Felipe Balbi: "Re: [PATCHv3 2/4] usb: gadget: replace "is_dualspeed" with"max_speed""
Previous message: Matt Porter: "Re: [PATCH] DMAEngine: Define generic transfer request api"
In reply to: Andrea Arcangeli: "[PATCH] thp: tail page refcounting fix"
Next in thread: Minchan Kim: "Re: [PATCH] thp: tail page refcounting fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello everyone,

[ Chris CC'ed ]

Chris, could you run this patch on one of yours hugetlbfs benchmarks to
verify there is no scalability issue in the spinlock of
__get_page_tail in your environment?

With this patch O_DIRECT on hugetlbfs will take the compound_lock even
if it doesn't need to for hugetlbfs. We could have made this special
for THP only, and in fact we could avoid the "tail_page" reference
counting completely for tail pages with hugetlbfs, but we didn't do
that so far because it's safer to have a single code path for all
compound pages and not make hugetlbfs special case there and have all
compound pages behave the same regardless if it's THP or hugetlbfs or
some driver allocating/freeing it, it keeps things much simpler and
the overhead AFIK is unmeasurable (I ask to benchmark just to be 100%
sure). The refcounting path is tricky and the more testing the better
so if it's the same for all compound pages it's best in my view (at
least for the mid term).

Also note, even if we were to ultraoptimize hugetlbfs the SMP locking
scalability would remain identical because the atomic_inc would be
still needed on the "head" page which is the only "shared" item. The
"superflous" for hugetlbfs is _only_ the write on the
"head_page->flags" locked op, and the tail_page->_mapcount atomic
inc/dec. We can't possibly eliminate the head_page->_count atomic
inc/dec, even if we were to ultraoptimize for hugetlbfs and that's the
only possibly troublesome bit in terms of SMP scalability (the
tail_page is so finegrined it can't be a scalability issue). This is
why I exclude these changes can be measurable and they should work as
great as before.

Also it'd be nice if somebody would look in direct-io to stop doing
that get_page there and relay only on the reference taken by
get_user_pages (KVM and all other get_user_page users never take
additional references on pages returned by get_user_pages and relays
exclusively on the refcount taken by get_user_pages). get_user_pages
is capable of taking the reference on the tail pages without having to
take the compound lock perfectly safely through get_page_foll() (which
is taken while the page_table_lock is still held, so it automatically
serializes with split_huge_page through the page_table_lock). Only
additional get_page(page[i]) done on the tail pages taken with
get_user_pages requires the compound_lock to be safe vs
split_huge_page (and normally there is no way to ever call get_page on
any tail page, only after get_user_pages you could run into that, so
that can be usually easily avoided, and so the compound_lock can be
better optimized away by not calling get_page on tail pages in the
first place which also avoids all other locked ops of get_page, not
just the compound_lock!).

Also note, the compound_lock was already taken for the put_page of the
tail pages. We only could avoid it in get_user_pages. So this isn't
making things much difference. I still kept my priority to avoid any
comound_lock for head pages. head compound pages, or regular pages
(not compound) just have 1 atomic ops like always. That is way more
important for performance I think because the head page refcounting
and regular page refcounting is in the real CPU bound fast paths (not
more I/O bound dealing with pci devices like O_DIRECT). And I still
also avoid compound_lock for the secondary MMU page fault in
get_user_pages (can't avoid it in put_page run after the spte is
established, just no way around it but it's always been like that).

So I loaded this patch on all my systems so far so good, torture
testing is also going without problems.

a git tree including patch is here.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary

This is the actual patch you can apply to stock 3.0.0 (3.1 won't boot
here for me because I never use initrd on my development systems...).

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=41dc8190934cea22b8c8b3f89e82052610664fbb
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Felipe Balbi: "Re: [PATCHv3 2/4] usb: gadget: replace "is_dualspeed" with"max_speed""
Previous message: Matt Porter: "Re: [PATCH] DMAEngine: Define generic transfer request api"
In reply to: Andrea Arcangeli: "[PATCH] thp: tail page refcounting fix"
Next in thread: Minchan Kim: "Re: [PATCH] thp: tail page refcounting fix"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]