Re: [PATCH v8 6/9] dax: add support for fsync/msync

From: Jan Kara
Date: Fri Jan 15 2016 - 08:10:29 EST


On Wed 13-01-16 11:58:02, Ross Zwisler wrote:
> On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> > On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > > And secondly: You must write-protect all mappings of the flushed range so
> > > > that you get fault when the sector gets written-to again. We spoke about
> > > > this in the past already but somehow it got lost and I forgot about it as
> > > > well. You need something like rmap_walk_file()...
> > >
> > > The code that write protected mappings and then cleaned the radix tree entries
> > > did get written, and was part of v2:
> > >
> > > https://lkml.org/lkml/2015/11/13/759
> > >
> > > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > > The reason behind this was that there was a race that I couldn't figure out
> > > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > > tree entries.
> > >
> > > The race goes like this:
> > >
> > > Thread 1 (write) Thread 2 (fsync)
> > > ================ ================
> > > wp_pfn_shared()
> > > pfn_mkwrite()
> > > dax_radix_entry()
> > > radix_tree_tag_set(DIRTY)
> > > dax_writeback_mapping_range()
> > > dax_writeback_one()
> > > radix_tag_clear(DIRTY)
> > > pgoff_mkclean()
> > > ... return up to wp_pfn_shared()
> > > wp_page_reuse()
> > > pte_mkdirty()
> > >
> > > After this sequence we end up with a dirty PTE that is writeable, but with a
> > > clean radix tree entry. This means that users can write to the page, but that
> > > a follow-up fsync or msync won't flush this dirty data to media.
> > >
> > > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > > the DAX code has control over when the radix tree entry is dirtied but not
> > > when the PTE is made dirty and writeable. This happens up in wp_page_reuse().
> > > This means that we can't easily add locking, etc. to protect ourselves.
> > >
> > > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > > easy solutions presented themselves in the absence of a page lock. I do have
> > > one idea, but I think it's pretty invasive and will need to wait for another
> > > kernel cycle.
> > >
> > > The current code that leaves the radix tree entry will give us correct
> > > behavior - it'll just be less efficient because we will have an ever-growing
> > > dirty set to flush.
> >
> > Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> > tags but it does not (I should have known, I have written that functions
> > few years ago ;). Makes sense. Thanks for clarification.
> >
> > > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > > > * dax_pfn_mkwrite - handle first write to DAX page
> > > > > * @vma: The virtual memory area where the fault occurred
> > > > > * @vmf: The description of the fault
> > > > > - *
> > > > > */
> > > > > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > {
> > > > > - struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > > + struct file *file = vma->vm_file;
> > > > >
> > > > > - sb_start_pagefault(sb);
> > > > > - file_update_time(vma->vm_file);
> > > > > - sb_end_pagefault(sb);
> > > > > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > > >
> > > > Why is NO_SECTOR argument correct here?
> > >
> > > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > > tree - just make that entry dirty". This works because pfn_mkwrite() always
> > > follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls
> > > will insert the radix tree entry, regardless of whether the fault was for a
> > > read or a write. If the fault was for a write, the radix tree entry will also
> > > be made dirty.
> > >
> > > For reads the radix tree entry will be inserted but left clean. When the
> > > first write happens we will get a pfn_mkwrite() call, which will call
> > > dax_radix_entry() with the NO_SECTOR argument. This will look up the radix
> > > tree entry & set the dirty tag.
> >
> > So the explanation of this should be somewhere so that everyone knows that
> > we must have radix tree entries even for clean mapped blocks. Because upto
> > know that was not clear to me. Also __dax_pmd_fault() seems to insert
> > entries only for write fault so the assumption doesn't seem to hold there?
>
> Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
> 4k pages. You are right about our handling of 2MiB pages - for a read
> followed by a write we will just call into the normal __dax_pmd_fault() code
> again, which will do the get_block() call and insert a dirty radix tree entry.
> Because we have to go all the way through the fault handler again at write
> time there isn't a benefit to inserting a clean radix tree entry on read, so
> we just skip it.

Ouch, I wasn't aware of this asymetry between PMD and PTE faults. OK, so
please just document this all somewhere because I'm pretty sure casual
reader won't be able to figure this all out just from the code.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR