Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pagesand return page size

From: Hans Rosenfeld
Date: Mon Feb 25 2008 - 07:13:06 EST


On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > > 8k on alpha, ...) and values 1-15 being specific to the architecture
> > > (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
>
> "Native page size" probably a bad idea. ppc64 can use 64k or 4k for its
> "native" page size and has 16MB large pages (as well as some others).
> To make it even more confusing, you can have a 64k kernel page size with
> 4k mmu mappings!
>
> That said, this is a decent idea as long as we know that nobody will
> ever have more than 16 page sizes.

Then a better way to encode the page size would be returning the page
shift. This would need 6 bits instead of 4, but it would probably be
enough for any 64 bit architecture.


> > This is ok-ish, but I can't say I like it much. Especially the page size
> > field.
> >
> > But I don't really have many ideas here. Perhaps having a bit saying
> > "this entry is really a continuation of the previous one". Then any page
> > size can be trivially represented. This might also make the code on both
> > sides simpler?

I don't like the idea of parsing thousands of entries just to find out
that I'm using a huge page. It would be much better to just get the page
size one way or the other in the first entry one reads.


> > > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > > +struct ppte {
> > > + uint64_t paddr:58;
> > > + uint64_t psize:4;
> > > + uint64_t swap:1;
> > > + uint64_t present:1;
> > > +};
>
> It'd be nice to keep the current convention, which is to stay away from
> bitfields.

I like them, they make the code much more readable.


> > > +#ifdef CONFIG_X86
> > > + if (pmd_huge(*pmd)) {
> > > + struct ppte ppte = {
> > > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > + .psize = (HPAGE_SHIFT == 22 ?
> > > + PM_PSIZE_4M : PM_PSIZE_2M),
> > > + .swap = 0,
> > > + .present = 1,
> > > + };
> > > +
> > > + for(; addr != end; addr += PAGE_SIZE) {
> > > + err = add_to_pagemap(addr, ppte, pm);
> > > + if (err)
> > > + return err;
> > > + }
> > > + } else
> > > +#endif
>
> It's great to make this support huge pages, but things like this
> probably need their own function. Putting an #ifdef in the middle of
> here makes it a lot harder to read. Just think of when powerpc, ia64
> and x86_64 get their grubby mitts in here. ;)

AFAIK the way huge pages are used on x86 differs much from other
architectures. While on x86 the address translation stops at some upper
table for a huge page, other architectures encode the page size in the
pte (at least the ones I looked at did). So pmd_huge() (and soon
pud_huge()) are very x86-specific and return just 0 on other archs, and
this code would be optimized away for them. All that would be necessary
for other archs is to eventually get the page size from the pte and put
it in the psize field.

The #ifdef could go away if pmd_pfn() was defined as 0 for !x86, it
wouldn't make sense to use it anyway.



--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/