Re: [PATCH 3/8] mm, truncate: Remove all exceptional entries from pagevec under one lock

From: Mel Gorman
Date: Thu Oct 12 2017 - 10:53:13 EST


On Thu, Oct 12, 2017 at 03:33:23PM +0200, Jan Kara wrote:
> > return;
> >
> > - if (dax_mapping(mapping)) {
> > - dax_delete_mapping_entry(mapping, index);
> > - return;
> > + dax = dax_mapping(mapping);
> > + if (!dax)
> > + spin_lock_irq(&mapping->tree_lock);
> > +
> > + for (i = ei, j = ei; i < pagevec_count(pvec); i++) {
> > + struct page *page = pvec->pages[i];
> > + pgoff_t index = indices[i];
> > +
> > + if (!radix_tree_exceptional_entry(page)) {
> > + pvec->pages[j++] = page;
> > + continue;
> > + }
> > +
> > + if (unlikely(dax)) {
> > + dax_delete_mapping_entry(mapping, index);
> > + continue;
> > + }
> > +
> > + __clear_shadow_entry(mapping, index, page);
> > }
> > - clear_shadow_entry(mapping, index, entry);
> > +
> > + if (!dax)
> > + spin_unlock_irq(&mapping->tree_lock);
> > + pvec->nr = j;
> > }
>
> When I look at this I think could make things cleaner. I have the following
> observations:
>
> 1) All truncate_inode_pages(), invalidate_mapping_pages(),
> invalidate_inode_pages2_range() essentially do very similar thing and would
> benefit from a similar kind of batching.
>

While this is true, the benefit is much more marginal that I didn't feel
the level of churn was justified. Primarily it would help fadvise() and
invalidating when buffered and direct IO is mixed. I didn't think it would
be that much cleaner as a result so I left it.

> 2) As you observed and measured, batching of radix tree operations makes
> sense both when removing pages and shadow entries, I'm very confident it
> would make sense for DAX exceptional entries as well.
>

True, but I didn't have a suitable setup for testing DAX so I wasn't
comfortable with making the change. dax_delete_mapping_entry can sleep but it
should be as simple as not taking the spinlock in dax_delete_mapping_entry
and always locking in truncate_exceptional_pvec_entries. dax is already
releasing the mapping->tree_lock if it needs to sleep and I didn't spot
any other gotcha but I'd prefer that change was done by someone that can
verify it works properly.

> 3) In all cases (i.e., those three functions and for all entry types) the
> workflow seems to be:
> * lockless lookup of entries
> * prepare entry for reclaim (or determine it is not elligible)
> * lock mapping->tree_lock
> * verify entry is still elligible for reclaim (otherwise bail)
> * clear radix tree entry
> * unlock mapping->tree_lock
> * final cleanup of the entry
>
> So I'm wondering whether we cannot somehow refactor stuff so that batching
> of radix tree operations could be shared and we wouldn't have to duplicate
> it in all those cases.
>
> But it would be rather large overhaul of the code so it may be a bit out of
> scope for these improvements...
>

I think it would be out of scope for this improvement but I can look into
it if the series is accepted. I think it would be a lot of churn for fairly
marginal benefit though.

> > @@ -409,8 +445,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > }
> >
> > if (radix_tree_exceptional_entry(page)) {
> > - truncate_exceptional_entry(mapping, index,
> > - page);
> > + if (ei != PAGEVEC_SIZE)
> > + ei = i;
>
> This should be ei == PAGEVEC_SIZE I think.
>
> Otherwise the patch looks good to me so feel free to add:
>

Fixed.

> Reviewed-by: Jan Kara <jack@xxxxxxx>

Thanks

--
Mel Gorman
SUSE Labs