Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

From: John Hubbard
Date: Wed Dec 12 2018 - 16:56:06 EST


On 12/12/18 1:30 PM, Jerome Glisse wrote:
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
>> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
>>>
>>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>>>> Another crazy idea, why not treating GUP as another mapping of the page
>>>>> and caller of GUP would have to provide either a fake anon_vma struct or
>>>>> a fake vma struct (or both for PRIVATE mapping of a file where you can
>>>>> have a mix of both private and file page thus only if it is a read only
>>>>> GUP) that would get added to the list of existing mapping.
>>>>>
>>>>> So the flow would be:
>>>>> somefunction_thatuse_gup()
>>>>> {
>>>>> ...
>>>>> GUP(_fast)(vma, ..., fake_anon, fake_vma);
>>>>> ...
>>>>> }
>>>>>
>>>>> GUP(vma, ..., fake_anon, fake_vma)
>>>>> {
>>>>> if (vma->flags == ANON) {
>>>>> // Add the fake anon vma to the anon vma chain as a child
>>>>> // of current vma
>>>>> } else {
>>>>> // Add the fake vma to the mapping tree
>>>>> }
>>>>>
>>>>> // The existing GUP except that now it inc mapcount and not
>>>>> // refcount
>>>>> GUP_old(..., &nanonymous, &nfiles);
>>>>>
>>>>> atomic_add(&fake_anon->refcount, nanonymous);
>>>>> atomic_add(&fake_vma->refcount, nfiles);
>>>>>
>>>>> return nanonymous + nfiles;
>>>>> }
>>>>
>>>> Thanks for your idea! This is actually something like I was suggesting back
>>>> at LSF/MM in Deer Valley. There were two downsides to this I remember
>>>> people pointing out:
>>>>
>>>> 1) This cannot really work with __get_user_pages_fast(). You're not allowed
>>>> to get necessary locks to insert new entry into the VMA tree in that
>>>> context. So essentially we'd loose get_user_pages_fast() functionality.
>>>>
>>>> 2) The overhead e.g. for direct IO may be noticeable. You need to allocate
>>>> the fake tracking VMA, get VMA interval tree lock, insert into the tree.
>>>> Then on IO completion you need to queue work to unpin the pages again as you
>>>> cannot remove the fake VMA directly from interrupt context where the IO is
>>>> completed.
>>>>
>>>> You are right that the cost could be amortized if gup() is called for
>>>> multiple consecutive pages however for small IOs there's no help...
>>>>
>>>> So this approach doesn't look like a win to me over using counter in struct
>>>> page and I'd rather try looking into squeezing HMM public page usage of
>>>> struct page so that we can fit that gup counter there as well. I know that
>>>> it may be easier said than done...
>>>
>>> So i want back to the drawing board and first i would like to ascertain
>>> that we all agree on what the objectives are:
>>>
>>> [O1] Avoid write back from a page still being written by either a
>>> device or some direct I/O or any other existing user of GUP.
>>> This would avoid possible file system corruption.
>>>
>>> [O2] Avoid crash when set_page_dirty() is call on a page that is
>>> considered clean by core mm (buffer head have been remove and
>>> with some file system this turns into an ugly mess).
>>>
>>> [O3] DAX and the device block problems, ie with DAX the page map in
>>> userspace is the same as the block (persistent memory) and no
>>> filesystem nor block device understand page as block or pinned
>>> block.
>>>
>>> For [O3] i don't think any pin count would help in anyway. I believe
>>> that the current long term GUP API that does not allow GUP of DAX is
>>> the only sane solution for now.
>>
>> No, that's not a sane solution, it's an emergency hack.
>>
>>> The real fix would be to teach file-
>>> system about DAX/pinned block so that a pinned block is not reuse
>>> by filesystem.
>>
>> We already have taught filesystems about pinned dax pages, see
>> dax_layout_busy_page(). As much as possible I want to eliminate the
>> concept of "dax pages" as a special case that gets sprinkled
>> throughout the mm.
>
> So thinking on O3 issues what about leveraging the recent change i
> did to mmu notifier. Add a event for truncate or any other file
> event that need to invalidate the file->page for a range of offset.
>
> Add mmu notifier listener to GUP user (except direct I/O) so that
> they invalidate they hardware mapping or switch the hardware mapping
> to use a crappy page. When such event happens what ever user do to
> the page through that driver is broken anyway. So it is better to
> be loud about it then trying to make it pass under the radar.
>
> This will put the burden on broken user and allow you to properly
> recycle your DAX page.
>
> Think of it as revoke through mmu notifier.
>
> So patchset would be:
> enum mmu_notifier_event {
> + MMU_NOTIFY_TRUNCATE,
> };
>
> + Change truncate code path to emit MMU_NOTIFY_TRUNCATE
>

That part looks good.

> Then for each user of GUP (except direct I/O or other very short
> term GUP):

but, why is there a difference between how we handle long- and
short-term callers? Aren't we just leaving a harder-to-reproduce race
condition, if we ignore the short-term gup callers?

So, how does activity (including direct IO and other short-term callers)
get quiesced (stopped, and guaranteed not to restart or continue), so
that truncate or umount can continue on?


>
> Patch 1: register mmu notifier
> Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP
> when that happens update the device page table or
> usage to point to a crappy page and do put_user_page
> on all previously held page

Minor point, this sequence should be done within a wrapper around existing
get_user_pages(), such as get_user_pages_revokable() or something.

thanks,
--
John Hubbard
NVIDIA

>
> So this would solve the revoke side of thing without adding a burden
> on GUP user like direct I/O. Many existing user of GUP already do
> listen to mmu notifier and already behave properly. It is just about
> making every body list to that. Then we can even add the mmu notifier
> pointer as argument to GUP just to make sure no new user of GUP forget
> about registering a notifier (argument as a teaching guide not as a
> something actively use).
>
>
> So does that sounds like a plan to solve your concern with long term
> GUP user ? This does not depend on DAX or anything it would apply to
> any file back pages.
>
>
> Cheers,
> JÃrÃme
>