Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads

From: Jan Kara
Date: Wed Apr 26 2017 - 04:55:04 EST


On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> <>
> > Hum, but now thinking more about it I have hard time figuring out why write
> > vs fault cannot actually still race:
> >
> > CPU1 - write(2) CPU2 - read fault
> >
> > dax_iomap_pte_fault()
> > ->iomap_begin() - sees hole
> > dax_iomap_rw()
> > iomap_apply()
> > ->iomap_begin - allocates blocks
> > dax_iomap_actor()
> > invalidate_inode_pages2_range()
> > - there's nothing to invalidate
> > grab_mapping_entry()
> > - we add zero page in the radix
> > tree & map it to page tables
> >
> > Similarly read vs write fault may end up racing in a wrong way and try to
> > replace already existing exceptional entry with a hole page?
>
> Yep, this race seems real to me, too. This seems very much like the issues
> that exist when a thread is doing direct I/O. One thread is doing I/O to an
> intermediate buffer (page cache for direct I/O case, zero page for us), and
> the other is going around it directly to media, and they can get out of sync.
>
> IIRC the direct I/O code looked something like:
>
> 1/ invalidate existing mappings
> 2/ do direct I/O to media
> 3/ invalidate mappings again, just in case. Should be cheap if there weren't
> any conflicting faults. This makes sure any new allocations we made are
> faulted in.

Yeah, the problem is people generally expect weird behavior when they mix
direct and buffered IO (or let alone mmap) however everyone expects
standard read(2) and write(2) to be completely coherent with mmap(2).

> I guess one option would be to replicate that logic in the DAX I/O path, or we
> could try and enhance our locking so page faults can't race with I/O since
> both can allocate blocks.

In the abstract way, the problem is that we have radix tree (and page
tables) cache block mapping information and the operation: "read block
mapping information, store it in the radix tree" is not serialized in any
way against other block allocations so the information we store can be out
of date by the time we store it.

One way to solve this would be to move ->iomap_begin call in the fault
paths under entry lock although that would mean I have to redo how ext4
handles DAX faults because with current code it would create lock inversion
wrt transaction start.

Another solution would be to grab i_mmap_sem for write when doing write
fault of a page and similarly have it grabbed for writing when doing
write(2). This would scale rather poorly but if we later replaced it with a
range lock (Davidlohr has already posted a nice implementation of it) it
won't be as bad. But I guess option 1) is better...

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR