Re: [PATCH 24/28] xfs: reclaim inodes from the LRU

From: Brian Foster
Date: Wed Nov 06 2019 - 12:21:17 EST


On Fri, Nov 01, 2019 at 10:46:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Replace the AG radix tree walking reclaim code with a list_lru
> walker, giving us both node-aware and memcg-aware inode reclaim
> at the XFS level. This requires adding an inode isolation function to
> determine if the inode can be reclaim, and a list walker to
> dispose of the inodes that were isolated.
>
> We want the isolation function to be non-blocking. If we can't
> grab an inode then we either skip it or rotate it. If it's clean
> then we skip it, if it's dirty then we rotate to give it time to be
> cleaned before it is scanned again.
>
> This congregates the dirty inodes at the tail of the LRU, which
> means that if we start hitting a majority of dirty inodes either
> there are lots of unlinked inodes in the reclaim list or we've
> reclaimed all the clean inodes and we're looped back on the dirty
> inodes. Either way, this is an indication we should tell kswapd to
> back off.
>
> The non-blocking isolation function introduces a complexity for the
> filesystem shutdown case. When the filesystem is shut down, we want
> to free the inode even if it is dirty, and this may require
> blocking. We already hold the locks needed to do this blocking, so
> what we do is that we leave inodes locked - both the ILOCK and the
> flush lock - while they are sitting on the dispose list to be freed
> after the LRU walk completes. This allows us to process the
> shutdown state outside the LRU walk where we can block safely.
>
> Because we now are reclaiming inodes from the context that it needs
> memory in (memcg and/or node), direct reclaim throttling within the
> high level reclaim code in now much more effective. Hence we don't
> wait on IO for either kswapd or direct reclaim. However, we have to
> tell kswapd to back off if we start hitting too many dirty inodes.
> This implies we've wrapped around the LRU and don't have many clean
> inodes left to reclaim, so it needs to wait a while for the AIL
> pushing to clean some of the remaining reclaimable inodes.
>
> Keep in mind we don't have to care about inode lock order or
> blocking with inode locks held here because a) we are using
> trylocks, and b) once marked with XFS_IRECLAIM they can't be found
> via the LRU and inode cache lookups will abort and retry. Hence
> nobody will try to lock them in any other context that might also be
> holding other inode locks.
>
> Also convert xfs_reclaim_all_inodes() to use a LRU walk to free all
> the reclaimable inodes in the filesystem.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

Looks fundamentally sane. Some logic quibbles..

> fs/xfs/xfs_icache.c | 404 +++++++++++++-------------------------------
> fs/xfs/xfs_icache.h | 18 +-
> fs/xfs/xfs_inode.h | 18 ++
> fs/xfs/xfs_super.c | 46 ++++-
> 4 files changed, 190 insertions(+), 296 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 350f42e7730b..05dd292bfdb6 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -968,160 +968,110 @@ xfs_inode_ag_iterator_tag(
> return last_error;
> }
>
> -/*
> - * Grab the inode for reclaim.
> - *
> - * Return false if we aren't going to reclaim it, true if it is a reclaim
> - * candidate.
> - *
> - * If the inode is clean or unreclaimable, return 0 to tell the caller it does
> - * not require flushing. Otherwise return the log item lsn of the inode so the
> - * caller can determine it's inode flush target. If we get the clean/dirty
> - * state wrong then it will be sorted in xfs_reclaim_inode() once we have locks
> - * held.
> - */
> -STATIC bool
> -xfs_reclaim_inode_grab(
> - struct xfs_inode *ip,
> - int flags,
> - xfs_lsn_t *lsn)
> +enum lru_status
> +xfs_inode_reclaim_isolate(
> + struct list_head *item,
> + struct list_lru_one *lru,
> + spinlock_t *lru_lock,

Did we ever establish whether we should cycle the lru_lock during long
running scans?

> + void *arg)
> {
> - ASSERT(rcu_read_lock_held());
> - *lsn = 0;
> + struct xfs_ireclaim_args *ra = arg;
> + struct inode *inode = container_of(item, struct inode,
> + i_lru);
> + struct xfs_inode *ip = XFS_I(inode);

Whitespace damage on the above lines (space indentation vs tabs).

> + enum lru_status ret;
> + xfs_lsn_t lsn = 0;
> +
> + /* Careful: inversion of iflags_lock and everything else here */
> + if (!spin_trylock(&ip->i_flags_lock))
> + return LRU_SKIP;
> +
> + /* if we are in shutdown, we'll reclaim it even if dirty */
> + ret = LRU_ROTATE;
> + if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE) &&
> + !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> + lsn = ip->i_itemp->ili_item.li_lsn;
> + ra->dirty_skipped++;
> + goto out_unlock_flags;
> + }
>
> - /* quick check for stale RCU freed inode */
> - if (!ip->i_ino)
> - return false;
> + ret = LRU_SKIP;
> + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> + goto out_unlock_flags;
>
> - /*
> - * Do unlocked checks to see if the inode already is being flushed or in
> - * reclaim to avoid lock traffic. If the inode is not clean, return the
> - * position in the AIL for the caller to push to.
> - */
> - if (!xfs_inode_clean(ip)) {
> - *lsn = ip->i_itemp->ili_item.li_lsn;
> - return false;
> + if (!__xfs_iflock_nowait(ip)) {
> + lsn = ip->i_itemp->ili_item.li_lsn;

This looks like a potential crash vector if we ever got here with a
clean inode.

> + ra->dirty_skipped++;
> + goto out_unlock_inode;
> }
>
> - if (__xfs_iflags_test(ip, XFS_IFLOCK | XFS_IRECLAIM))
> - return false;
> + if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> + goto reclaim;
>
> /*
> - * The radix tree lock here protects a thread in xfs_iget from racing
> - * with us starting reclaim on the inode. Once we have the
> - * XFS_IRECLAIM flag set it will not touch us.
> - *
> - * Due to RCU lookup, we may find inodes that have been freed and only
> - * have XFS_IRECLAIM set. Indeed, we may see reallocated inodes that
> - * aren't candidates for reclaim at all, so we must check the
> - * XFS_IRECLAIMABLE is set first before proceeding to reclaim.
> + * Now the inode is locked, we can actually determine if it is dirty
> + * without racing with anything.
> */
> - spin_lock(&ip->i_flags_lock);
> - if (!__xfs_iflags_test(ip, XFS_IRECLAIMABLE) ||
> - __xfs_iflags_test(ip, XFS_IRECLAIM)) {
> - /* not a reclaim candidate. */
> - spin_unlock(&ip->i_flags_lock);
> - return false;
> + ret = LRU_ROTATE;
> + if (xfs_ipincount(ip)) {
> + ra->dirty_skipped++;

Hmm.. didn't we have an LSN check here?

Altogether, I think the logic in this function would be a lot more
simple if we had something like the following:

...
/* ret == LRU_SKIP */
if (!xfs_inode_clean(ip)) {
ret = LRU_ROTATE;
lsn = ip->i_itemp->ili_item.li_lsn;
ra->dirty_skipped++;
}
if (lsn && XFS_LSN_CMP(lsn, ra->lowest_lsn) < 0)
ra->lowest_lsn = lsn;
return ret;

... as the non-reclaim exit path. Then the earlier logic simply dictates
how we process the inode instead of conflating lru processing with
lsn/dirty checks. Otherwise for example (based on the current logic),
it's not really clear to me whether ->dirty_skipped cares about dirty
inodes or just the fact that we skipped an inode.

> + goto out_ifunlock;
> + }
> + if (!xfs_inode_clean(ip) && !__xfs_iflags_test(ip, XFS_ISTALE)) {
> + lsn = ip->i_itemp->ili_item.li_lsn;
> + ra->dirty_skipped++;
> + goto out_ifunlock;
> }
> +
...
> @@ -1165,167 +1108,52 @@ xfs_reclaim_inode(
...
> void
> xfs_reclaim_all_inodes(
> struct xfs_mount *mp)
> {
...
> + while (list_lru_count(&mp->m_inode_lru)) {

It seems unnecessary to call this twice per-iter:

while ((to_free = list_lru_count(&mp->m_inode_lru))) {
...
}

Hm?

Brian

> + struct xfs_ireclaim_args ra;
> + long freed, to_free;
> +
> + xfs_ireclaim_args_init(&ra);
> +
> + to_free = list_lru_count(&mp->m_inode_lru);
> + freed = list_lru_walk(&mp->m_inode_lru,
> + xfs_inode_reclaim_isolate, &ra, to_free);
> + xfs_dispose_inodes(&ra.freeable);
> +
> + if (freed == 0) {
> + xfs_log_force(mp, XFS_LOG_SYNC);
> + xfs_ail_push_all(mp->m_ail);
> + } else if (ra.lowest_lsn != NULLCOMMITLSN) {
> + xfs_ail_push_sync(mp->m_ail, ra.lowest_lsn);
> + }
> + cond_resched();
> + }
> }
>
> STATIC int
> diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
> index afd692b06c13..86e858e4a281 100644
> --- a/fs/xfs/xfs_icache.h
> +++ b/fs/xfs/xfs_icache.h
> @@ -49,8 +49,24 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
> struct xfs_inode * xfs_inode_alloc(struct xfs_mount *mp, xfs_ino_t ino);
> void xfs_inode_free(struct xfs_inode *ip);
>
> +struct xfs_ireclaim_args {
> + struct list_head freeable;
> + xfs_lsn_t lowest_lsn;
> + unsigned long dirty_skipped;
> +};
> +
> +static inline void
> +xfs_ireclaim_args_init(struct xfs_ireclaim_args *ra)
> +{
> + INIT_LIST_HEAD(&ra->freeable);
> + ra->lowest_lsn = NULLCOMMITLSN;
> + ra->dirty_skipped = 0;
> +}
> +
> +enum lru_status xfs_inode_reclaim_isolate(struct list_head *item,
> + struct list_lru_one *lru, spinlock_t *lru_lock, void *arg);
> +void xfs_dispose_inodes(struct list_head *freeable);
> void xfs_reclaim_all_inodes(struct xfs_mount *mp);
> -long xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
>
> void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
>
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bcfb35a9c5ca..00145debf820 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -270,6 +270,15 @@ static inline int xfs_isiflocked(struct xfs_inode *ip)
>
> extern void __xfs_iflock(struct xfs_inode *ip);
>
> +static inline int __xfs_iflock_nowait(struct xfs_inode *ip)
> +{
> + lockdep_assert_held(&ip->i_flags_lock);
> + if (ip->i_flags & XFS_IFLOCK)
> + return false;
> + ip->i_flags |= XFS_IFLOCK;
> + return true;
> +}
> +
> static inline int xfs_iflock_nowait(struct xfs_inode *ip)
> {
> return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
> @@ -281,6 +290,15 @@ static inline void xfs_iflock(struct xfs_inode *ip)
> __xfs_iflock(ip);
> }
>
> +static inline void __xfs_ifunlock(struct xfs_inode *ip)
> +{
> + lockdep_assert_held(&ip->i_flags_lock);
> + ASSERT(ip->i_flags & XFS_IFLOCK);
> + ip->i_flags &= ~XFS_IFLOCK;
> + smp_mb();
> + wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
> +}
> +
> static inline void xfs_ifunlock(struct xfs_inode *ip)
> {
> ASSERT(xfs_isiflocked(ip));
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 98ffbe42f8ae..096ae31b5436 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -17,6 +17,7 @@
> #include "xfs_alloc.h"
> #include "xfs_fsops.h"
> #include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> #include "xfs_buf_item.h"
> #include "xfs_log.h"
> #include "xfs_log_priv.h"
> @@ -1772,23 +1773,54 @@ xfs_fs_mount(
> }
>
> static long
> -xfs_fs_nr_cached_objects(
> +xfs_fs_free_cached_objects(
> struct super_block *sb,
> struct shrink_control *sc)
> {
> - /* Paranoia: catch incorrect calls during mount setup or teardown */
> - if (WARN_ON_ONCE(!sb->s_fs_info))
> - return 0;
> + struct xfs_mount *mp = XFS_M(sb);
> + struct xfs_ireclaim_args ra;
> + long freed;
>
> - return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> + xfs_ireclaim_args_init(&ra);
> +
> + freed = list_lru_shrink_walk(&mp->m_inode_lru, sc,
> + xfs_inode_reclaim_isolate, &ra);
> + xfs_dispose_inodes(&ra.freeable);
> +
> + /*
> + * Deal with dirty inodes. We will have the LSN of
> + * the oldest dirty inode in our reclaim args if we skipped any.
> + *
> + * For kswapd, if we skipped too many dirty inodes (i.e. more dirty than
> + * we freed) then we need kswapd to back off once it's scan has been
> + * completed. That way it will have some clean inodes once it comes back
> + * and can make progress, but make sure we have inode cleaning in
> + * progress.
> + *
> + * Direct reclaim will be throttled by the caller as it winds the
> + * priority up. All we need to do is keep pushing on dirty inodes
> + * in the background so when we come back progress will be made.
> + */
> + if (current_is_kswapd() && ra.dirty_skipped >= freed) {
> + if (current->reclaim_state)
> + current->reclaim_state->need_backoff = true;
> + }
> + if (ra.lowest_lsn != NULLCOMMITLSN)
> + xfs_ail_push(mp->m_ail, ra.lowest_lsn);
> +
> + return freed;
> }
>
> static long
> -xfs_fs_free_cached_objects(
> +xfs_fs_nr_cached_objects(
> struct super_block *sb,
> struct shrink_control *sc)
> {
> - return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
> + /* Paranoia: catch incorrect calls during mount setup or teardown */
> + if (WARN_ON_ONCE(!sb->s_fs_info))
> + return 0;
> +
> + return list_lru_shrink_count(&XFS_M(sb)->m_inode_lru, sc);
> }
>
> static const struct super_operations xfs_super_operations = {
> --
> 2.24.0.rc0
>