Re: [PATCH, RFC 2/2] dax: use range_lock instead of i_mmap_lock

From: Kirill A. Shutemov
Date: Tue Aug 11 2015 - 11:28:59 EST


On Tue, Aug 11, 2015 at 05:31:04PM +0300, Boaz Harrosh wrote:
> On 08/11/2015 04:50 PM, Jan Kara wrote:
> > On Tue 11-08-15 19:37:08, Dave Chinner wrote:
> >>>> The patch below tries to recover some scalability for DAX by introducing
> >>>> per-mapping range lock.
> >>>
> >>> So this grows noticeably (3 longs if I'm right) struct address_space and
> >>> thus struct inode just for DAX. That looks like a waste but I don't see an
> >>> easy solution.
> >>>
> >>> OTOH filesystems in normal mode might want to use the range lock as well to
> >>> provide truncate / punch hole vs page fault exclusion (XFS already has a
> >>> private rwsem for this and ext4 needs something as well) and at that point
> >>> growing generic struct inode would be acceptable for me.
> >>
> >> It sounds to me like the way DAX has tried to solve this race is the
> >> wrong direction. We really need to drive the truncate/page fault
> >> serialisation higher up the stack towards the filesystem, not deeper
> >> into the mm subsystem where locking is greatly limited.
> >>
> >> As Jan mentions, we already have this serialisation in XFS, and I
> >> think it would be better first step to replicate that locking model
> >> in each filesystem that is supports DAX. I think this is a better
> >> direction because it moves towards solving a whole class of problems
> >> fileystem face with page fault serialisation, not just for DAX.
> >
> > Well, but at least in XFS you take XFS_MMAPLOCK in shared mode for the
> > fault / page_mkwrite callback so it doesn't provide the exclusion necessary
> > for DAX which needs exclusive access to the page given range in the page
> > cache. And replacing i_mmap_lock with fs-private mmap rwsem is a moot
> > excercise (at least from DAX POV).
> >
>
> Hi Jan. So you got me confused above. You say:
> "DAX which needs exclusive access to the page given range in the page cache"
>
> but DAX and page-cache are mutually exclusive. I guess you meant the VMA
> range, or the inode->mapping range (which one is it)

The second -- pgoff range within the inode->mapping.

> Actually I do not understand this race you guys found at all. (Please bear with
> me sorry for being slow)
>
> If two threads of the same VMA fault on the same pte
> (I'm not sure how you call it I mean a single 4k entry at each VMAs page-table)
> then the mm knows how to handle this just fine.

It does. But only if we have struct page. See lock_page_or_retry() in
filemap_fault(). Without lock_page() it's problematic.

> If two processes, ie two VMAs fault on the same inode->mapping. Then an inode
> wide lock like XFS's to protect against i_size-change / truncate is more than
> enough.

We also used lock_page() to make sure we shoot out all pages as we don't
exclude page faults during truncate. Consider this race:

<fault> <truncate>
get_block
check i_size
update i_size
unmap
setup pte

With normal page cache we make sure that all pages beyond i_size is
dropped using lock_page() in truncate_inode_pages_range().

For DAX we need a way to stop all page faults to the pgoff range before
doing unmap.

> Because with DAX there is no inode->mapping "mapping" at all. You have the call
> into the FS with get_block() to replace "holes" (zero pages) with real allocated
> blocks, on WRITE faults, but this conversion should be protected inside the FS
> already. Then there is the atomic exchange of the PTE which is fine.
> (And vis versa with holes mapping and writes)

Having unmap_mapping_range() in PMD fault handling is very unfortunate.
Go to rmap just to solve page fault is very wrong.
BTW, we need to do it in write path too.

I'm not convinced that all these "let's avoid backing storage allocation"
in DAX code is not layering violation. I think the right place to solve
this is filesystem. And we have almost all required handles for this in
place. We only need to change vm_ops->page_mkwrite() interface to be able
to return different page than what was given on input.

> > So regardless whether the lock will be a fs-private one or in
> > address_space, DAX needs something like the range lock Kirill suggested.
> > Having the range lock in fs-private part of inode has the advantage that
> > only filesystems supporting DAX / punch hole will pay the memory overhead.
> > OTOH most major filesystems need it so the savings would be IMO noticeable
>
> punch-hole is truncate for me. With the xfs model of read-write lock where
> truncate takes write, any fault taking read before executing the fault looks
> good for the FS side of things. I guess you mean the optimization of the
> radix-tree lock. But you see DAX does not have a radix-tree, ie it is empty.

Hm. Where does XFS take this read-write lock in fault path?

IIUC, truncation vs. page fault serialization relies on i_size being
updated before doing truncate_pagecache() and checking i_size under
page_lock() on fault side. We don't have i_size fence for punch hole.

BTW, how things like ext4_collapse_range() can be safe wrt parallel page
fault? Ted?

--
Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/