Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

From: Jan Kara
Date: Thu Feb 04 2016 - 04:16:27 EST


On Wed 03-02-16 13:13:28, Ross Zwisler wrote:
> On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote:
> > On Tue 02-02-16 10:34:56, Ross Zwisler wrote:
> <>
> > > Oh, no, the primary issue with calling get_block() in the fsync path isn't
> > > performance. It's that we don't have any idea what get_block() function to
> > > call.
> > >
> > > The fault handler calls all come from the filesystem directly, so they can
> > > easily give us an appropriate get_block() function pointer. But the
> > > dax_writeback_mapping_range() calls come from the generic code in
> > > mm/filemap.c, and don't know what get_block() to pass in.
> > >
> > > During one iteration I had the calls to dax_writeback_mapping_range()
> > > happening in the filesystem fsync code so that it could pass in get_block(),
> > > but Dave Chinner pointed out that this misses other paths in the filesystem
> > > that need to have things flushed via a call to filemap_write_and_wait_range().
> >
> > Let's clear this up a bit: The problem with using ->fsync() method is that
> > it doesn't get called for sync(2).
>
> Ugh, yep. We need to correct this.
>
> > We could use ->sync_fs() to flush caches
> > in case of sync(2) (that's what's happening for normal storage) but the
> > problem with PMEM is that "flush all cached data" operation effectively
> > means iterate through all modified pages and we didn't want to implement
> > this for DAX fsync code.
>
> I'm not sure what you mean by this - this is exactly what we implemented for
> the DAX fsync code? We iterate through all dirty pages, as recorded in the
> radix tree, and flush them to media.

By "iterate through all modified pages" I mean "iterate through all
inodes with modified pages and for each inode iterate through all modified
pages". The "iterate through all inodes" part is actually the one which
adds to complexity - either you resort to iterating over all inodes
attached to sb->s_inodes_list which is unnecessarily slow (there can be
millions of inodes there and you may need to flush only a couple of them)
or you have to somehow track which inodes need flushing and so you have
another inode list to manage.

By marking inodes that need flushing dirty and treating indexes to flush as
dirty pages, we can use all the writeback & dirty inode tracking machinery
to track and iterate through dirty inodes for us.

> It seems like we could and should do the same thing for sync(2) and syncfs()?
>
> > So we have decided to do cache flushing for DAX at a different point - mark
> > inodes which may have writes cached as dirty and use writeback code for the
> > cache flushing. But looking at it now, we have actually chosen a wrong
> > place to do the flushing in the writeback path - note that sync(2) writes
> > data via __writeback_single_inode() -> do_writepages() and thus doesn't
> > even get to filemap_write_and_wait().
>
> True - but I think we have to basically add another case for sync()
> regardless of what we do, since AFAICS the fsync() and sync() paths never
> intersect. So, if we call into the DAX code at the filesystem level
> we'll need to do it in both ->fsync and ->sync_fs/->writepages, and if we
> do it in common MM code we'll need to do it in
> filemap_write_and_wait_range() and do_writepages() or similar.

Both fsync(2) and sync(2) paths end up in do_writepages() and consequently
->writepages() callback. That's why I suggested moving calls to DAX code
into ->writepages(). The only trouble is that we sometimes have
performance-optimization checks for (mapping->nrpages > 0) which get in
the way. We need to update those to:

(mapping->nrpages > 0 || (IS_DAX(inode) && mapping->nrexceptional > 0))

Probably we should introduce a helper function for this check so that
people adding new checks won't forget about DAX.

> Here is the comment from Dave Chinner that had me move to having the calls to
> dax_writeback_mapping_range() into the generic mm code:
>
> > Lastly, this flushing really needs to be inside
> > filemap_write_and_wait_range(), because we call the writeback code
> > from many more places than just fsync to ensure ordering of various
> > operations such that files are in known state before proceeding
> > (e.g. hole punch).
>
> https://lkml.org/lkml/2015/11/16/847
>
> Both ext4 & XFS call filemap_write_and_wait_range() outside of ->fsync for
> hole punch, truncate, and block relocation (xfs_shift_file_space() &&
> ext4_collapse_range()/ext4_insert_range()).
>
> I think having DAX special case code sprinkled over all these call sites seems
> like an indication that we've made a bad choice, and that we should centralize
> in the mm layer with filemap_write_and_wait_range() and do_writepages().
>
> OTOH, I think that it might be true that we don't actually need to cover all
> these non-fsync/sync cases. In the page cache case when we have dirty data in
> the page cache, that data will be actively lost if we evict a dirty page cache
> page without flushing it to media first. For DAX, though, the data will
> remain consistent with the physical address to which it was written regardless
> of whether it's in the processor cache or not - really the only reason I see
> to flush is in response to a fsync/msync/sync/syncfs so that our data is
> durable on media in case of a power loss. The case where we could throw dirty
> data out of the page cache and essentially lose writes simply doesn't exist.
>
> Does this sound correct, or am I missing something?

Yes, you are right. filemap_write_and_wait_range() actually doesn't
guarantee data durability. That function only guarantees that all dirty
data has been sent to storage and the storage has acknowledged it, which
is a no-op for PMEM. So we are perfectly fine ignoring calls to
filemap_write_and_wait_range(). Only ->fsync() and ->sync_fs() calls
guarantee data durability. But some code could get upset by seeing
that filemap_write_and_wait_range() didn't actually get rid of dirty pages
(in some special cases like inode eviction or similar). That's why I'd
choose one of the two options for consistency:

1) Treat inode indexes to flush as close to dirty pages as we can - this
means inode is dirty with all the tracking associated with it, radix tree
entries have dirty tag, we get rid of these in ->writepages(). We are close
to this currently.

2) Completely avoid the dirty tracking and writeback code and reimplement
everything in DAX code.

Because some hybrid between these is IMHO bound to provoke weird (and very
hard to find) bugs.

> > So revisiting the decision I see two options:
> >
> > 1) Move the DAX flushing code from filemap_write_and_wait() into
> > ->writepages() fs callback. There the filesystem can provide all the
> > information it needs including bdev, get_block callback, or whatever.
>
> This seems fine as long as we add it to ->fsync as well since ->writepages is
> never called in that path, and as long as we are okay with skipping DAX
> writebacks on hole punch, truncate, and block relocation.

Look at ext4_sync_file() -> filemap_write_and_wait_range() ->
__filemap_fdatawrite_range() -> do_writepages() - so ->writepages() is in
fact called in the fsync path, except for those nrpages > 0 checks which
would need to be changed.

> > 2) Back out even further and implement own tracking and iteration of inodes
> > to write.
> >
> > So far I still think 2) is not worth the complexity (although it would
> > bring DAX code closer to how things behave with standard storage) so I
> > would go for 1).

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR