Re: [PATCH v3] mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()

From: Sasha Levin
Date: Sat Mar 01 2014 - 18:08:52 EST


On 02/28/2014 10:35 PM, Naoya Horiguchi wrote:
> On Fri, Feb 28, 2014 at 03:14:27PM -0800, Andrew Morton wrote:
>> On Fri, 28 Feb 2014 14:59:02 -0500 Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> wrote:
>>
>>> page->index stores pagecache index when the page is mapped into file mapping
>>> region, and the index is in pagecache size unit, so it depends on the page
>>> size. Some of users of reverse mapping obviously assumes that page->index
>>> is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous hugepage.
>>>
>>> For example, consider that we have 3-hugepage vma and try to mbind the 2nd
>>> hugepage to migrate to another node. Then the vma is split and migrate_page()
>>> is called for the 2nd hugepage (belonging to the middle vma.)
>>> In migrate operation, rmap_walk_anon() tries to find the relevant vma to
>>> which the target hugepage belongs, but here we miscalculate pgoff.
>>> So anon_vma_interval_tree_foreach() grabs invalid vma, which fires VM_BUG_ON.
>>>
>>> This patch introduces a new API that is usable both for normal page and
>>> hugepage to get PAGE_SIZE offset from page->index. Users should clearly
>>> distinguish page_index for pagecache index and page_pgoff for page offset.
>>>
>>> ..
>>>
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -307,6 +307,22 @@ static inline loff_t page_file_offset(struct page *page)
>>> return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
>>> }
>>>
>>> +static inline unsigned int page_size_order(struct page *page)
>>> +{
>>> + return unlikely(PageHuge(page)) ?
>>> + huge_page_size_order(page) :
>
> I found that we have compound_order(page) for the same purpose, so we don't
> have to define this new function.
>
>>> + (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>>> +}
>>
>> Could use some nice documentation, please. Why it exists, what it
>> does. Particularly: what sort of pages it can and can't operate on,
>> and why.
>
> OK.
>
>> The presence of PAGE_CACHE_SIZE is unfortunate - it at least implies
>> that the page is a pagecache page. I dunno, maybe just use "0"?
>
> Yes, PAGE_CACHE_SHIFT makes code messy if PAGE_CACHE_SHIFT is always PAGE_SHIFT.
> But I guess that recently people start to thinking of changing the size of
> pagecache (in the discussion around >4kB sector device.)
> And from readabilitie's perspective, "pagecache size" and "page size" are
> different things, so keeping it is better in a long run.
>
> Anyway, I revised the patch again, could you take a look?
>
> Thanks,
> Naoya
> ---
> From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Date: Fri, 28 Feb 2014 21:56:24 -0500
> Subject: [PATCH] mm, hugetlbfs: fix rmapping for anonymous hugepages with
> page_pgoff()
>
> page->index stores pagecache index when the page is mapped into file mapping
> region, and the index is in pagecache size unit, so it depends on the page
> size. Some of users of reverse mapping obviously assumes that page->index
> is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous hugepage.
>
> For example, consider that we have 3-hugepage vma and try to mbind the 2nd
> hugepage to migrate to another node. Then the vma is split and migrate_page()
> is called for the 2nd hugepage (belonging to the middle vma.)
> In migrate operation, rmap_walk_anon() tries to find the relevant vma to
> which the target hugepage belongs, but here we miscalculate pgoff.
> So anon_vma_interval_tree_foreach() grabs invalid vma, which fires VM_BUG_ON.
>
> This patch introduces a new API that is usable both for normal page and
> hugepage to get PAGE_SIZE offset from page->index. Users should clearly
> distinguish page_index for pagecache index and page_pgoff for page offset.
>
> ChangeLog v3:
> - add comment on page_size_order()
> - use compound_order(compound_head(page)) instead of huge_page_size_order()
> - use page_pgoff() in rmap_walk_file() too
> - use page_size_order() in kill_proc()
> - fix space indent
>
> ChangeLog v2:
> - fix wrong shift direction
> - introduce page_size_order() and huge_page_size_order()
> - move the declaration of PageHuge() to include/linux/hugetlb_inline.h
> to avoid macro definition.
>
> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx> # if the reported problem is fixed
> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx # 3.12+

I can confirm that with this patch the lockdep issue is gone. However, the NULL deref in
walk_pte_range() and the BUG at mm/hugemem.c:3580 still appear.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/