Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
From: Ross Zwisler
Date: Wed Jan 13 2016 - 14:02:12 EST
On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > Add support for tracking dirty DAX entries in the struct address_space
> > radix tree. This tree is already used for dirty page writeback, and it
> > already supports the use of exceptional (non struct page*) entries.
> >
> > In order to properly track dirty DAX pages we will insert new exceptional
> > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > These exceptional entries will also contain the writeback sectors for the
> > PTE or PMD faults that we can use at fsync/msync time.
> >
> > There are currently two types of exceptional entries (shmem and shadow)
> > that can be placed into the radix tree, and this adds a third. We rely on
> > the fact that only one type of exceptional entry can be found in a given
> > radix tree based on its usage. This happens for free with DAX vs shmem but
> > we explicitly prevent shadow entries from being added to radix trees for
> > DAX mappings.
> >
> > The only shadow entries that would be generated for DAX radix trees would
> > be to track zero page mappings that were created for holes. These pages
> > would receive minimal benefit from having shadow entries, and the choice
> > to have only one type of exceptional entry in a given radix tree makes the
> > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > Reviewed-by: Jan Kara <jack@xxxxxxx>
>
> I have realized there's one issue with this code. See below:
>
> > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > return;
> >
> > spin_lock_irq(&mapping->tree_lock);
> > - /*
> > - * Regular page slots are stabilized by the page lock even
> > - * without the tree itself locked. These unlocked entries
> > - * need verification under the tree lock.
> > - */
> > - if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > - goto unlock;
> > - if (*slot != entry)
> > - goto unlock;
> > - radix_tree_replace_slot(slot, NULL);
> > - mapping->nrshadows--;
> > - if (!node)
> > - goto unlock;
> > - workingset_node_shadows_dec(node);
> > - /*
> > - * Don't track node without shadow entries.
> > - *
> > - * Avoid acquiring the list_lru lock if already untracked.
> > - * The list_empty() test is safe as node->private_list is
> > - * protected by mapping->tree_lock.
> > - */
> > - if (!workingset_node_shadows(node) &&
> > - !list_empty(&node->private_list))
> > - list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > - __radix_tree_delete_node(&mapping->page_tree, node);
> > +
> > + if (dax_mapping(mapping)) {
> > + if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > + mapping->nrexceptional--;
>
> So when you punch hole in a file, you can delete a PMD entry from a radix
> tree which covers part of the file which still stays. So in this case you
> have to split the PMD entry into PTE entries (probably that needs to happen
> up in truncate_inode_pages_range()) or something similar...
I think (and will verify) that the DAX code just unmaps the entire PMD range
when we receive a hole punch request inside of the PMD. If this is true then
I think the radix tree code should behave the same way and just remove the PMD
entry in the radix tree.
This will cause new accesses that used to land in the PMD range to get new
page faults. These faults will call get_blocks(), where presumably the
filesystem will tell us that we don't have a contiguous 2MiB range anymore, so
we will fall back to PTE faults. These PTEs will fill in both the radix tree
and the page tables.
So, I think the work here is to verify the behavior of DAX wrt hole punches
for PMD ranges, and make the radix tree code match that behavior. Sound good?