Re: [RFC PATCH 1/4] mm: madvise: read loop's step size beforehand in madvise_inject_error(), prepare for THP support.
From: Naoya Horiguchi
Date: Thu Aug 24 2017 - 00:33:13 EST
On Wed, Aug 23, 2017 at 10:20:02AM -0400, Zi Yan wrote:
> On 23 Aug 2017, at 3:49, Naoya Horiguchi wrote:
>
> > On Mon, Aug 14, 2017 at 09:52:13PM -0400, Zi Yan wrote:
> >> From: Zi Yan <zi.yan@xxxxxxxxxxxxxx>
> >>
> >> The loop in madvise_inject_error() reads its step size from a page
> >> after it is soft-offlined. It works because the page is:
> >> 1) a hugetlb page: the page size does not change;
> >> 2) a base page: the page size does not change;
> >> 3) a THP: soft-offline always splits THPs, thus, it is OK to use
> >> PAGE_SIZE as step size.
> >>
> >> It will be a problem when soft-offline supports THP migrations.
> >> When a THP is migrated without split during soft-offlining, the THP
> >> is split after migration, thus, before and after soft-offlining page
> >> sizes do not match. This causes a THP to be unnecessarily soft-lined,
> >> at most, 511 times, wasting free space.
> >
> > Hi Zi Yan,
> >
> > Thank you for the suggestion.
> >
> > I think that when madvise(MADV_SOFT_OFFLINE) is called with some range
> > over more than one 4kB page, the caller clearly intends to call
> > soft_offline_page() over all 4kB pages within the range in order to
> > simulate the multiple soft-offline events. Please note that the caller
> > only knows that specific pages are half-broken, and expect that all such
> > pages are offlined. So the end result should be same, whether the given
> > range is backed by thp or not.
> >
>
> But if the given virtual address is backed by a THP and the THP is soft-offlined
> without splitting (enabled by following patches), the old for-loop will cause extra
> 511 THPs being soft-offlined.
>
> For example, the caller wants to offline VPN 0-511, which is backed by a THP whose
> address range is PFN 0-511. In the first iteration of the for-loop,
> get_user_pages_fast(VPN0, ...) will return the THP and soft_offline_page() will offline the THP,
> replacing it with a new THP, say PFN 512-1023, so VPN 0-511 is backed by PFN 512-1023.
> But the original THP will be split after it is freed, thus, for-loop will not end
> at this moment, but continues to offline VPN1, which leads to PFN 512-1023 being offlined
> and replaced by another THP, say 1024-1535. This will go on and end up with
> 511 extra THPs are offlined. That is why we need to this patch to tell
> whether the THP is offlined as a whole or just its head page is offlined.
Thanks for elaborating this. I understand your point.
But I still not sure what the best behavior is.
madvise(MADV_SOFT_OFFLINE) is a test feature and giving multi-page range
on the call works like some stress testing. So multiple thp migrations
seem to me an expected behavior. At least it behaves in the same manner
as calling madvise(MADV_SOFT_OFFLINE) 512 times on VPN0-VPN511 separately,
which is consistent.
So I still feel like leaving the current behavior as long as your later
patches work without this change.
Thanks,
Naoya Horiguchi
>
> Let me know if it is still not clear to you. Or I missed something.
>
> >>
> >> Signed-off-by: Zi Yan <zi.yan@xxxxxxxxxxxxxx>
> >> ---
> >> mm/madvise.c | 21 ++++++++++++++++++---
> >> 1 file changed, 18 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 47d8d8a25eae..49f6774db259 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -612,19 +612,22 @@ static long madvise_remove(struct vm_area_struct *vma,
> >> static int madvise_inject_error(int behavior,
> >> unsigned long start, unsigned long end)
> >> {
> >> - struct page *page;
> >> + struct page *page = NULL;
> >> + unsigned long page_size = PAGE_SIZE;
> >>
> >> if (!capable(CAP_SYS_ADMIN))
> >> return -EPERM;
> >>
> >> - for (; start < end; start += PAGE_SIZE <<
> >> - compound_order(compound_head(page))) {
> >> + for (; start < end; start += page_size) {
> >> int ret;
> >>
> >> ret = get_user_pages_fast(start, 1, 0, &page);
> >> if (ret != 1)
> >> return ret;
> >>
> >> + page_size = (PAGE_SIZE << compound_order(compound_head(page))) -
> >> + (PAGE_SIZE * (page - compound_head(page)));
> >> +
> >
> > Assigning a value which is not 4kB or some hugepage size into page_size
> > might be confusing because that's not what the name says. You can introduce
> > 'next' virtual address and ALIGN() might be helpful to calculate it.
>
> Like:
>
> next = ALIGN(start, PAGE_SIZE<<compound_order(compound_head(page))) - start;
>
> I think it works. Thanks.
>
>
> --
> Best Regards
> Yan Zi