Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages andreturn page size

From: Dave Hansen
Date: Thu Feb 28 2008 - 20:05:04 EST



On Thu, 2008-02-28 at 13:00 +0100, Hans Rosenfeld wrote:
> On Wed, Feb 27, 2008 at 09:44:04AM -0800, Dave Hansen wrote:
> > On Tue, 2008-02-26 at 21:25 +0100, Hans Rosenfeld wrote:
> > I'm just worried that once we establish the format, we can't really
> > change it. We have enough room in the pseudo-pte now, but what happens
> > when the next group of people pop up that want something else from this
> > interface. Right now we have normal memory, swap, and hugetlb pages.
> >
> > What if people want migration ptes marked next? I'm not sure those fit
> > into what you have here.
> >
> > It all fits today, I'm just worried about tomorrow. :(
>
> We could change the interface to return just a pfn (which is aligned to
> the pshift returned), as it was before. That would free up some bits
> that we could reserve for future use.

Yeah, I think we should do that. No reason to zero-pad it.

> > > @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > > u64 swap_pte_to_pagemap_entry(pte_t pte)
> > > {
> > > swp_entry_t e = pte_to_swp_entry(pte);
> > > - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > }
> >
> > Is there any way to do unions of bitfields? It seems a bit silly that
> > we have this bitfield, and then subdivide the bitfield for the swap
> > entries.
>
> Having a union of bitfields is allowed, but having a union in a
> struct of bitfields or vice-versa will probably cause the compiler not
> to put all of this together in a single 64 bit entity.
>
> This whole swap thing still needs some thought. The swap file offset
> can take 59 bits, so there is a possibilty that this will break once
> someone uses a really huge swap file. I doubt that this will happen, but
> that doesn't mean it can't happen. Maybe there should be some completely
> different interface for the swap stuff, like /proc/pid/swapmap or
> something like that.

I wouldn't worry about overflowing it. I think there are plenty of
block layer things that will break, first. :)

> > > static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > @@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > pte_t *pte;
> > > int err = 0;
> > >
> > > - for (; addr != end; addr += PAGE_SIZE) {
> > > - u64 pfn = PM_NOT_PRESENT;
> > > + if (pmd_huge(*pmd))
> > > + add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);
> >
> > Could you make this work with other architectures' large pages as well?
> > I'd hate to leave ia64, MIPS and powerpc out in the cold. powerpc at
> > least has large pmds, it just doesn't really expose them to generic
> > code.
>
> Well, if some powerpc guy would implement pmd_huge() and pmd_pfn() for
> powerpc, the x86 specific pmd_to_ppte() won't be that x86 specific no
> more. I didn't know there were huge pmds on powerpc, as pmd_huge() is
> defined as zero for everything but x86.

OK, I'm now convinced that doing this with pmds is actually completely
wrong. :(

Take a look at this code from mm/hugetlb.c:

for (address = start; address < end; address += HPAGE_SIZE) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;

if (huge_pmd_unshare(mm, &address, ptep))
continue;

pte = huge_ptep_get_and_clear(mm, address, ptep);
if (pte_none(pte))
continue;

page = pte_page(pte);
if (pte_dirty(pte))
set_page_dirty(page);
list_add(&page->lru, &page_list);
}

The arch code is completely responsible for taking the mm and address
and giving you back a pte that you can do pte_page() on. This is a
nice, arch-abstracted interface that everybody can use regardless of how
their arch actually does it internally.

The only issue is that this is *after* the code has decided that a
particular virtual area is for huge pages. The best arch-generic
interface I know for that is: is_vm_hugetlb_page(), but that is
VMA-based. Perhaps we should change the pagemap walk to pass the VMA
around.

We should probably use a similar approach in the pagemap code. Or, if
we're really smart, we can actually share that code with the hugetlb
code.

> Does it have huge puds as well? Once we support 1G pages for x86 a new
> function has to be added to this file to handle that special case, too.

Yes, it does (or will soon have) huge puds. But, they're nicely wrapped
up in that pte_t interface I showed above like the rest of the large
pages.

> > > pte = pte_offset_map(pmd, addr);
> > > - if (is_swap_pte(*pte))
> > > - pfn = swap_pte_to_pagemap_entry(*pte);
> > > - else if (pte_present(*pte))
> > > - pfn = pte_pfn(*pte);
> > > + if (is_swap_pte(*pte)) {
> > > + ppte.swap = 1;
> > > + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> > > + } else if (pte_present(*pte)) {
> > > + ppte.present = 1;
> > > + ppte.pshift = PAGE_SHIFT;
> > > + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> > > + }
>
> This is the place where those architectures that define the page size in
> the pte should test for a huge page and put the correct page size in the
> pshift field. I looked at some of them and did not find a function or a
> macro to do this test, no generic one and no arch-dependent one.

To test a pte for its huge page size? Well, for now, we only support
one huge page size at a time, and that's HPAGE_SIZE.

> > The bitfields are nice, and I do see they've spread to generic code.
> > So, I won't object to them, but please do double-check that they don't
> > cause any problems, especially with compilers that you might not be
> > using.
>
> The standard says the ordering of bitfields is "implementation defined".
> I'm currently unsure whether this means the implementation of a machine
> or of the compiler. In the latter case, using a different compiler for
> a user space program than the one that was used to compile the kernel
> could create problems.

I'd hate to be the first ones to depend on a bitfield for a
user<->kernel interface. Can you look around for precedent in this
area, or convert them back?

-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/