Re: Proof-of-concept: better(?) page-table manipulation API

From: Kirill A. Shutemov
Date: Mon May 07 2018 - 07:31:32 EST


On Mon, May 07, 2018 at 04:51:57AM +0000, Andy Lutomirski wrote:
> On Tue, Apr 24, 2018 at 8:44 AM Kirill A. Shutemov <
> kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>
> > Hi everybody,
>
> > I've proposed to talk about page able manipulation API on the LSF/MM'2018,
> > so I need something material to talk about.
>
>
> I gave it a quick read. I like the concept a lot, and I have a few
> comments.

Thank you for the input.

> > +/*
> > + * How manu bottom level we account to mm->pgtables_bytes
> > + */
> > +#define PT_ACCOUNT_LVLS 3
> > +
> > +struct pt_ptr {
> > + unsigned long *ptr;
> > + int lvl;
> > +};
> > +
>
> I think you've inherited something that I consider to be a defect in the
> old code: you're conflating page *tables* with page table *entries*. Your
> 'struct pt_ptr' sounds like a pointer to an entire page table, but AFAICT
> you're using it to point to a specific entry within a table. I think that
> both the new core code and the code that uses it would be clearer and less
> error prone if you made the distinction explicit. I can think of two clean
> ways to do it:
>
> 1. Add a struct pt_entry_ptr, and make it so that get_ptv(), etc take a
> pt_entry_ptr instead of a pt_ptr. Add a helper to find a pt_entry_ptr
> given a pt_ptr and either an index or an address.
>
> 2. Don't allow pointers to page table entries at all. Instead, get_ptv()
> would take an address or an index parameter.

Well, I'm not sure how useful pointer to whole page tables are.
Where do you them useful?

How I see the picture so far:

- ptp_t represent a pointer to an entry in a page table.

In x86-64 case I pretend that CR3 is single-entry page table. It
requires a special threatement in ptp_page_vaddr(), but works fine
otherwise.

- ptv_t represents a value that dereferenced from ptp_t or can be set to
ptp_t.

It's trivial to find the start of page table if we would need it by
masking out botom bits from ptp->ptr. It works on x86 and should be
possible on any architecture.

> Also, what does lvl == 0 mean? Is it the top or the bottom? I think a
> comment would be helpful.

It is bottom. But it should be up to architecture to decide.

>
> > +/*
> > + * When walking page tables, get the address of the next boundary,
> > + * or the end address of the range if that comes earlier. Although no
> > + * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
> > + */
>
> I read this comment twice, and I still don't get it. Can you clarify what
> this function does and why you would use it?

That's basically ported variant of p?d_addr_end. It helps step address by
right value for the page table entry and handles wrapping properly.

See example in copy_pt_range().

> > +/* Operations on page table pointers */
> > +
> > +/* Initialize ptp_t with pointer to top page table level. */
> > +static inline ptp_t ptp_init(struct mm_struct *mm)
> > +{
> > + struct pt_ptr ptp ={
> > + .ptr = (unsigned long *)mm->pgd,
> > + .lvl = PT_TOP_LEVEL,
> > + };
> > +
> > + return ptp;
> > +}
> > +
>
> On some architectures, there are multiple page table roots. For example,
> ARM64 has a root for the kernel half of the address space and a root for
> the user half (at least -- I don't fully understand it). x86 PAE sort-of
> has four roots. Would it make sense to expose this in the API for
> real?

I will give it a thought.

Is there a reason not to threat it as an additional page table layer and
deal with it in a unified way?

> For example, ptp_init(mm) could be replaced with ptp_init(mm, addr). This
> would make it a bit cleaner to handle an separate user and kernel tables.
> (As it stands, what is supposed to happen on ARM if you do
> ptp_init(something that isn't init_mm) and then walk it to look for a
> kernel address?)

IIUC, we can handle it in ptp_walk() since we have all may handle root in
a special way as I do for x86-64.

> Also, ptp_init() seems oddly named for me. ptp_get_root_for_mm(),
> perhaps? There could also be ptp_get_kernel_root() to get the root for the
> init_mm's tables.

Yeah, sounds better.

> > +static inline void ptp_walk(ptp_t *ptp, unsigned long addr)
> > +{
> > + ptp->ptr = (unsigned long *)ptp_page_vaddr(ptp);
> > + ptp->ptr += __pt_index(addr, --ptp->lvl);
> > +}
>
> Can you add a comment that says what this function does?

Okay, I will.

> Why does it not change the level?

It does. --ptp->lvl.

> > +
> > +static void ptp_free(struct mm_struct *mm, ptv_t ptv)
> > +{
> > + if (ptv.lvl < PT_SPLIT_LOCK_LVLS)
> > + ptlock_free(pfn_to_page(ptv_pfn(ptv)));
> > +}
> > +
>
> As it stands, this is a function that seems easy easy to misuse given the
> confusion between page tables and page table entries.

Hm. I probably have a blind spot, but I don't see it.

The function has to be named better for sure.

> Finally, a general comment. Actually fully implementing this the way
> you've done it seems like a giant mess given that you need to support all
> architectures. But couldn't you implement the new API as a wrapper around
> the old API so you automatically get all architectures?

I will look into this. But I'm not sure if it possbile without measurable
overhead.

--
Kirill A. Shutemov