Re: [PATCH v8 15/18] mm, fs, dax: handle layout changes to pinned dax mappings

From: Jan Kara
Date: Thu Apr 19 2018 - 06:44:42 EST


On Fri 13-04-18 15:03:51, Dan Williams wrote:
> On Mon, Apr 9, 2018 at 9:51 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > On Mon, Apr 9, 2018 at 9:49 AM, Jan Kara <jack@xxxxxxx> wrote:
> >> On Sat 07-04-18 12:38:24, Dan Williams wrote:
> > [..]
> >>> I wonder if this can be trivially solved by using srcu. I.e. we don't
> >>> need to wait for a global quiescent state, just a
> >>> get_user_pages_fast() quiescent state. ...or is that an abuse of the
> >>> srcu api?
> >>
> >> Well, I'd rather use the percpu rwsemaphore (linux/percpu-rwsem.h) than
> >> SRCU. It is a more-or-less standard locking mechanism rather than relying
> >> on implementation properties of SRCU which is a data structure protection
> >> method. And the overhead of percpu rwsemaphore for your use case should be
> >> about the same as that of SRCU.
> >
> > I was just about to ask that. Yes, it seems they would share similar
> > properties and it would be better to use the explicit implementation
> > rather than a side effect of srcu.
>
> ...unfortunately:
>
> BUG: sleeping function called from invalid context at
> ./include/linux/percpu-rwsem.h:34
> [..]
> Call Trace:
> dump_stack+0x85/0xcb
> ___might_sleep+0x15b/0x240
> dax_layout_lock+0x18/0x80
> get_user_pages_fast+0xf8/0x140
>
> ...and thinking about it more srcu is a better fit. We don't need the
> 100% exclusion provided by an rwsem we only need the guarantee that
> all cpus that might have been running get_user_pages_fast() have
> finished it at least once.
>
> In my tests synchronize_srcu is a bit slower than unpatched for the
> trivial 100 truncate test, but certainly not the 200x latency you were
> seeing with synchronize_rcu.
>
> Elapsed time:
> 0.006149178 unpatched
> 0.009426360 srcu

Hum, right. Yesterday I was looking into KSM for a different reason and
I've noticed it also write-protects pages and deals with races with GUP.
And what KSM relies on is:

write_protect_page()
  ...
  entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
  /*
   * Check that no O_DIRECT or similar I/O is in progress on the
   * page
   */
  if (page_mapcount(page) + 1 + swapped != page_count(page)) {
          page used -> bail
  }

And this really works because gup_pte_range() does:

  page = pte_page(pte);
  head = compound_head(page);

  if (!page_cache_get_speculative(head))
          goto pte_unmap;

  if (unlikely(pte_val(pte) != pte_val(*ptep))) {
          bail
  }

So either write_protect_page() sees the elevated reference or
gup_pte_range() bails because it will see the pte changed.

In the truncate path things are a bit different but in principle the same
should work - once truncate blocks page faults and unmaps pages from page
tables, we can be sure GUP will not grab the page anymore, or we'll see an
elevated page count. So IMO there's no need for any additional locking
against the GUP path (but a comment explaining this is highly desirable I
guess).
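
For illustration, a minimal sketch of what the truncate side could do
after blocking faults - this is not the code from the patch, and
dax_range_pfn() below is a made-up helper standing in for however the
filesystem resolves a file offset to the pfn of its DAX page:

/*
 * Sketch only: walk the to-be-truncated range after unmapping it and
 * look for references that GUP may still be holding.
 */
static bool dax_range_busy(struct address_space *mapping,
			   pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	/*
	 * Tear down all user mappings of the range so GUP can no longer
	 * reach these pages through the page tables (page faults are
	 * assumed to be blocked by the caller, e.g. under XFS_MMAPLOCK).
	 */
	unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
			    (loff_t)(end - start + 1) << PAGE_SHIFT, 1);

	for (index = start; index <= end; index++) {
		/* hypothetical offset -> pfn lookup */
		unsigned long pfn = dax_range_pfn(mapping, index);
		struct page *page;

		if (!pfn)
			continue;
		page = pfn_to_page(pfn);
		/*
		 * An idle DAX page sits at refcount 1; anything above that
		 * is a reference GUP handed out (e.g. O_DIRECT still in
		 * flight), so the layout change has to wait for it.
		 */
		if (page_ref_count(page) > 1)
			return true;
	}
	return false;
}

The point is the same as in KSM: by the time we look at the page count,
GUP either already holds its reference (and we see it), or it will notice
the cleared pte and bail.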

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR