Re: [PATCH 01/36] mmu_notifier: add event information to address invalidation v7

From: John Hubbard
Date: Mon Jun 01 2015 - 19:10:57 EST


On Mon, 1 Jun 2015, Jerome Glisse wrote:

> On Fri, May 29, 2015 at 08:43:59PM -0700, John Hubbard wrote:
> > On Thu, 21 May 2015, j.glisse@xxxxxxxxx wrote:
> >
> > > From: Jérôme Glisse <jglisse@xxxxxxxxxx>
> > >
> > > The event information will be useful for new user of mmu_notifier API.
> > > The event argument differentiate between a vma disappearing, a page
> > > being write protected or simply a page being unmaped. This allow new
> > > user to take different path for different event for instance on unmap
> > > the resource used to track a vma are still valid and should stay around.
> > > While if the event is saying that a vma is being destroy it means that any
> > > resources used to track this vma can be free.
> > >
> > > Changed since v1:
> > > - renamed action into event (updated commit message too).
> > > - simplified the event names and clarified their usage
> > > also documenting what exceptation the listener can have in
> > > respect to each event.
> > >
> > > Changed since v2:
> > > - Avoid crazy name.
> > > - Do not move code that do not need to move.
> > >
> > > Changed since v3:
> > > - Separate hugue page split from mlock/munlock and softdirty.
> >
> > Do we care about fixing up patch comments? If so:
> >
> > s/hugue/huge/
>
> I am noting them down and will go over them.
>
>
> [...]
> > > + MMU_HSPLIT,
> >
> > Let's rename MMU_HSPLIT to one of the following, take your pick:
> >
> > MMU_HUGE_PAGE_SPLIT (too long, but you can't possibly misunderstand it)
> > MMU_PAGE_SPLIT (my favorite: only huge pages are ever split, so it works)
> > MMU_HUGE_SPLIT (ugly, but still hard to misunderstand)
>
> I will go with MMU_HUGE_PAGE_SPLIT
>
>
> [...]
> >
> > > + MMU_ISDIRTY,
> >
> > This MMU_ISDIRTY seems like a problem to me. First of all, it looks
> > backwards: the only place that invokes it is the clear_refs_write()
> > routine, for the soft-dirty tracking feature. And in that case, the pages
> > are *not* being made dirty! Rather, the kernel is actually making the
> > pages non-writable, in order to be able to trap the subsequent page fault
> > and figure out if the page is in active use.
> >
> > So, given that there is only one call site, and that call site should
> > actually be setting MMU_WRITE_PROTECT instead (I think), let's just delete
> > MMU_ISDIRTY.
> >
> > Come to think about it, there is no callback possible for "a page became
> > dirty", anyway. Because the dirty and accessed bits are actually set by
> > the hardware, and software is generally unable to know the current state.
> > So MMU_ISDIRTY just seems inappropriate to me, across the board.
> >
> > I'll take a look at the corresponding HMM_ISDIRTY, too.
>
> Ok i need to rename that one to CLEAR_SOFT_DIRTY, the idea is that
> for HMM i would rather not write protect the memory for the device
> and just rely on the regular and conservative dirtying of page. The
> soft dirty is really for migrating a process where you first clear
> the soft dirty bit, then copy memory while process still running,
> then freeze process an only copy memory that was dirtied since
> first copy. Point being that adding soft dirty to HMM is something
> that can be done down the road. We should have enough bit inside
> the device page table for that.
>

Yes, I think renaming it to CLEAR_SOFT_DIRTY will definitely allow more
accurate behavior in response to these events.

Looking ahead, a couple things:

1. This mechanism is also used for general memory utilization tracking (I
see that Vladimir DavyDov has an "idle memory tracking" proposal that
assumes this works, for example: https://lwn.net/Articles/642202/ and
https://lkml.org/lkml/2015/5/12/449).

2. It seems hard to avoid the need to eventually just write protect the
page, whether it is on the CPU or the remote device, if things like device
drivers or user space need to track write accesses to a virtual address.
Either you write protect the page, and trap the page faults, or you wait
until later and read the dirty bit (indirectly, via something like
unmap_mapping_range). Or did you have something else in mind?

Anyway, none of that needs to hold up this part of the patchset, because
the renaming fixes things up for the future code to do the right thing.

>
> >
> > > + MMU_MIGRATE,
> > > + MMU_MPROT,
> >
> > The MMU_PROT also looks questionable. Short answer: probably better to
> > read the protection, and pass either MMU_WRITE_PROTECT, MMU_READ_WRITE
> > (that's a new item, of course), or MMU_UNMAP.
> >
> > Here's why: the call site knows the protection, but by the time it filters
> > down to HMM (in later patches), that information is lost, and HMM ends up
> > doing (ouch!) another find_vma() call in order to retrieve it--and then
> > translates it into only three possible things:
> >
> > // hmm_mmu_mprot_to_etype() sets one of these:
> >
> > HMM_MUNMAP
> > HMM_WRITE_PROTECT
> > HMM_NONE
>
> Linus complained of my previous version where i differenciated the
> kind of protection change that was happening, hence why i only pass
> down mprot.
>
>
> >
> >
> > > + MMU_MUNLOCK,
> >
> > I think MMU_UNLOCK would be clearer. We already know the scope, so the
> > extra "M" isn't adding anything.
>
> I named it that way so it matches syscall name munlock(). I think
> it is clearer to use MUNLOCK, or maybe SYSCALL_MUNLOCK
>
> >
> > > + MMU_MUNMAP,
> >
> > Same thing here: MMU_UNMAP seems better.
>
> Well same idea here.

OK, sure.

>
>
> >
> > > + MMU_WRITE_BACK,
> > > + MMU_WRITE_PROTECT,
> >
> > We may have to add MMU_READ_WRITE (and maybe another one, I haven't
> > bottomed out on that), if you agree with the above approach of
> > always sending a precise event, instead of "protection changed".
>
> I think Linus point made sense last time, but i would need to read
> again the thread. The idea of that patch is really to provide context
> information on what kind of CPU page table changes is happening and
> why.
>

Shoot, I tried to find that conversation, but my search foo is too weak.
If you have a link to that thread, I'd appreciate it, so I can refresh my
memory.

I was hoping to re-read it and see if anything has changed. It's not
really a huge problem to call find_vma() again, but I do want to be sure
that there's a good reason for doing so.

Otherwise, I'll just rely on your memory that Linus preferred your current
approach, and call it good, then.

> In that respect i should probably change MMU_WRITE_PROTECT to
> MMU_KSM_WRITE_PROTECT.
>

Yes, that might help clarify to the reader, because otherwise it's not
always obvious why we have "MPROT" and "WRITE_PROTECT" (which seems at
first like merely a subset of MPROT).

thanks,
john h

>
> Cheers,
> Jérôme
>