Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"

From: Dave Chinner
Date: Mon Feb 04 2019 - 16:48:02 EST


On Fri, Feb 01, 2019 at 09:19:04AM +1100, Dave Chinner wrote:
> > So, assuming all this, can we, please, first check if Rik's patch is addressing
> > the regression?
>
> Nope, it's broken and buggy, and reintroduces problems with racing
> deferred counts that were fixed years ago when I originally
> wrote the numa aware shrinker infrastructure.

So, the first thing to do here is fix the non-deterministic deferred
reclaim behaviour of the shrinker. Any reclaim deferred because of
things like GFP_NOFS contexts gets landed on the very next
shrinker instance that runs, be it kswapd or direct reclaim,
regardless of whether that instance can even perform reclaim
(e.g. it might itself be running in a GFP_NOFS context).

As a result, there is no predicting when a shrinker instance might
get landed with a huge amount of work that isn't its own. We can't
even guarantee that the kswapd reclaim context sees this deferred
work, because if there is another reclaimer running on that node at
the same time, it will have scooped away the deferred work and
kswapd will only do a small amount of work.
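
The scooping is a direct result of how do_shrink_slab() currently
grabs the deferred count - whoever runs first takes the lot.
Paraphrasing the existing code (it's in the hunk removed by the
patch below):

    /* whoever gets here first takes *all* of the deferred work */
    nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
    total_scan = nr + delta;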

How small? Well, a node with 1.8m freeable items, reclaim priority
12 (i.e. lowest priority), and seeks = 2 will result in a scan count
of:

delta = 1,800,000 >> 12 = 439
delta *= 4 = 1756
delta /= 2 = 878

i.e. the shrinker will only scan ~880 objects when there is light
memory pressure. That's only ~0.05% of the cache, which is
insignificant.
That's fine for direct reclaim, but for kswapd we need it to do more
work when there is deferred work.
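
In code terms, that's the shrinker core's delta calculation (the
patch below moves it into shrink_scan_count(), but the arithmetic is
unchanged), with shrinker->seeks = 2:

    delta = freeable >> priority;      /* 1,800,000 >> 12 = 439 */
    delta *= 4;                        /* 1756 */
    do_div(delta, shrinker->seeks);    /* 878 */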

So, say we have another 1m objects of deferred work (very common
on filesystem-heavy, and therefore GFP_NOFS-heavy, workloads); we'll do:

total_scan = deferred_objects;
....
total_scan += delta;

(ignoring clamping)

which means that we'll actually try to scan just over 1,000,000
objects from the cache on the next pass. This may be clamped down to
(freeable_objects / 2), but that's still 900,000 objects in this
case.
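
The clamping mentioned there is the existing windup avoidance in
do_shrink_slab(), which only kicks in when the delta is small
relative to the cache, plus the hard cap on the total scan count:

    /* only clamp when the delta itself is small */
    if (delta < freeable / 4)
        total_scan = min(total_scan, freeable / 2);

    /* never try to free more than twice the freeable entries */
    if (total_scan > freeable * 2)
        total_scan = freeable * 2;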

IOWs, we just increased the reclaim work of this shrinker instance
by a factor of 1000x. That's where all the long tail shrinker
reclaim latencies are coming from, and where a large number of the
"XFS is doing inode IO from the shrinker" reports come from, as that
burst of work drives the shrinker straight through all the clean
inodes and into dirty reclaimable inodes. With the additional "small
cache" pressure being added and then deferred since 4.18-rc5, this
is much more likely to happen with direct reclaim because the
deferred count from GFP_NOFS allocations is wound up faster.

So, if we want to protect direct reclaim from this obvious long tail
latency problem, we have to stop direct reclaim from ever doing
deferred work. i.e. we need to move deferred work to kswapd and, for
XFS, we then have to ensure that kswapd will not block on dirty
inodes so it can do all this work as quickly as possible. And then
for kswapd, we need to limit the amount of deferred work so that it
doesn't spend all its time emptying a single cache at low
priorities, but will attempt to perform all the deferred work if
reclaim priority winds up far enough.
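
That's essentially what the patch below does. The key hunk in
do_shrink_slab(), paraphrased:

    if (current_is_kswapd()) {
        int64_t deferred_scan;

        /* kswapd takes on all the deferred work ... */
        deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);

        /* ... but scaled by reclaim priority, so light pressure
         * only does a slice of it per pass. */
        deferred_scan = deferred_count;
        if (priority)
            do_div(deferred_scan, priority);
        scan_count += deferred_scan;
    }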

This gives us a solid, predictable "deferred work" infrastructure
for the shrinkers. It gets rid of the nasty behaviour, and paves the
way for adding different sorts of deferred work (like Rik's "small
cache" pressure increase) to kswapd rather than in random reclaim
contexts. It also allows us to use a different "how much work should
we do" calculation for kswapd. i.e. one that is appropriate for
background, non-blocking scanning rather than being tailored to
limiting the work that any one direct reclaim context must do.

So, if I just set the XFS inode cache shrinker to skip inode
writeback (as everyone seems to want to do), fsmark is OOM-killed at
27M inodes, right when the log runs out of space, the tail-pushing
thread is close to being CPU bound, and inode writeback from
reclaim is necessary to retire dirty inodes from the log before
they can be reclaimed. It's easily reproducible, and sometimes the
oom-killer chooses daemons rather than fsmark; it has killed the
system on occasion. Reverting the two patches in this thread makes
the OOM kill problem go away - it just turns it back into a
performance issue.

So, on top of the reverts, the patch below - which moves deferred
shrinker work to kswapd - is the first "XFS inode shrinker doesn't
block kswapd" patch I've been able to get through my benchmarks and
memory stress workloads without triggering random OOM-kills or
seeing substantial performance regressions. Indeed, it appears to
behave better than the existing code (fsmark inode create is
marginally faster, and the simoops long tails have completely
gone(*)).

This indicates to me that we really should be considering fixing the
deferred work problems before adding new types of deferred work into
the shrinker infrastructure (for whatever reason). Get the
infrastructure reliable, predictable and somewhat deterministic,
then we can start trialling pressure/balance changes knowing exactly
where we are directing that extra work....

Cheers,

Dave.

(*) Chris, FYI, this is the last output before simoops died with
"too many open files" - p99 latency is nearly identical to p50 latency:

Run time: 10873 seconds
Read latency (p50: 3,084,288) (p95: 3,158,016) (p99: 3,256,320)
Write latency (p50: 7,479,296) (p95: 8,101,888) (p99: 8,437,760)
Allocation latency (p50: 479,744) (p95: 744,448) (p99: 1,016,832)
work rate = 8.63/sec (avg 9.05/sec) (p50: 9.19) (p95: 9.91) (p99: 11.42)
alloc stall rate = 0.02/sec (avg: 0.01) (p50: 0.01) (p95: 0.01) (p99: 0.01)

This is the same machine that I originally ran simoops on back in
~4.9 when you first proposed async kswapd behaviour for XFS. Typical
long tail latencies back then were:

[https://www.spinics.net/lists/linux-xfs/msg02235.html]

Run time: 1140 seconds
Read latency (p50: 3,035,136) (p95: 4,415,488) (p99: 6,119,424)
Write latency (p50: 27,557,888) (p95: 31,490,048) (p99: 45,285,376)
Allocation latency (p50: 247,552) (p95: 1,497,088) (p99: 19,496,960)
work rate = 3.68/sec (avg 3.71/sec) (p50: 3.71) (p95: 4.04) (p99: 4.04)
alloc stall rate = 1.65/sec (avg: 0.12) (p50: 0.00) (p95: 0.12) (p99: 0.12)

--
Dave Chinner
david@xxxxxxxxxxxxx

mm: shrinker deferral cleanup

From: Dave Chinner <dchinner@xxxxxxxxxx>

The shrinker defers work to a random GFP_KERNEL reclaim context,
which means some poor direct reclaimer could be loaded with huge
amounts of work just because it's the first reclaimer to come along
in a while.

This can be seen from the shrinker trace point output, where random
reclaim contexts take the entire deferred scan count and try to run
it, then put whatever is left back in the global pool when they are
done. Racing shrinkers see none of that deferred work, to the point
where kswapd may never see any deferred load on the shrinker at all
because it is always being held by a direct reclaimer.

So, first things first: only do deferred work in kswapd context. We
know that kswapd runs in GFP_KERNEL context, so work deferred from
GFP_NOFS contexts will always be able to run from kswapd.

This also gets rid of the need to specifically avoid windup, because
we only have one thread that will process the deferred work, and it
will be capped in what it can do in a single pass by the reclaim
priority it operates under. i.e. reclaim priority prevents deferred
work from being done all at once under light memory pressure. If we
have really heavy pressure, then we're aiming to kill as much cache
as we can, so at that point windup no longer matters.

Next, factor out the calculation of the amount of work to do from
the rest of the code. This makes it easier to see what is work
calculation and what are constraints, clamping and behavioural
restrictions. Rename the variables to be more meaningful, too,
and convert everything to uint64_t, because all the hoops we jump
through to keep things in 32 bits for 32 bit systems make this all
just a mess.

Next, allow the shrinker "freeable object count" callout to tell the
shrinker core that it won't be able to do any work, e.g. a GFP_NOFS
context on a filesystem shrinker. This means the core can simply
calculate the work to defer to kswapd and move on. Fast, and it
doesn't require calling into the scan code to find out that we can't
actually do any work.
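
In the patch this is the new ->will_defer flag: the count callout
sets it when it knows the context can't reclaim, and do_shrink_slab()
then defers the calculated work to kswapd without calling the scan
callout at all. Paraphrasing the fs/super.c and mm/vmscan.c hunks
below:

    /* super_cache_count(): no __GFP_FS means we can't reclaim */
    if (!(sc->gfp_mask & __GFP_FS))
        sc->will_defer = true;

    /* do_shrink_slab(): skip scanning, just wind up nr_deferred */
    if (shrinkctl->will_defer)
        goto done;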

Next, clean up the tracing to be less ... obtuse. We care about the
work we are being asked to do, the amount of work that was actually
done, and how much we still have deferred to do. The rest of it is
mostly useless.

Finally, remove the blocking SYNC_WAIT from kswapd context in the
XFS inode shrinker. Still block direct reclaim, but allow kswapd to
scan primarily for clean, immediately reclaimable inodes without
regard to any other reclaim that is ongoing. This means kswapd
won't get stuck behind blocked direct reclaim, nor will it issue IO
unless there ...

Further experiments:
- kick kswapd when deferred gets too big
- store a deferred priority rather than a count? Windup always ends up
with more deferred work than there are freeable items, so a
do_div(freeable, deferred_priority) setup might make sense.
- get the kswapd reclaim priority wound up if the shrinker
is not making enough progress on deferred work.
- factor out deferral code

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
fs/super.c | 8 +++
fs/xfs/xfs_icache.c | 7 +-
include/linux/shrinker.h | 2 +
include/trace/events/vmscan.h | 69 +++++++++----------
mm/vmscan.c | 156 +++++++++++++++++++++++++-----------------
5 files changed, 141 insertions(+), 101 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 48e25eba8465..59bfb285a856 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -139,6 +139,14 @@ static unsigned long super_cache_count(struct shrinker *shrink,
return 0;
smp_rmb();

+ /*
+ * If we know we can't reclaim, let the shrinker know so it can account
+ * for deferred reclaim that kswapd must do, but doesn't have to call
+ * super_cache_count to find this out.
+ */
+ if (!(sc->gfp_mask & __GFP_FS))
+ sc->will_defer = true;
+
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb, sc);

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 245483cc282b..60723ae79ec2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1373,11 +1373,16 @@ xfs_reclaim_inodes_nr(
struct xfs_mount *mp,
int nr_to_scan)
{
+ int flags = SYNC_TRYLOCK;
+
/* kick background reclaimer and push the AIL */
xfs_reclaim_work_queue(mp);
xfs_ail_push_all(mp->m_ail);

- return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+ if (!current_is_kswapd())
+ flags |= SYNC_WAIT;
+
+ return xfs_reclaim_inodes_ag(mp, flags, &nr_to_scan);
}

/*
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..a4216dcdd59e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,8 @@ struct shrink_control {

/* current memcg being shrunk (for memcg aware shrinkers) */
struct mem_cgroup *memcg;
+
+ bool will_defer;
};

#define SHRINK_STOP (~0UL)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a1cb91342231..a4f34cde779a 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -195,84 +195,81 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re

TRACE_EVENT(mm_shrink_slab_start,
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
- long nr_objects_to_shrink, unsigned long cache_items,
- unsigned long long delta, unsigned long total_scan,
- int priority),
+ int64_t deferred_count, int64_t freeable_objects,
+ int64_t scan_count, int priority),

- TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan,
+ TP_ARGS(shr, sc, deferred_count, freeable_objects, scan_count,
priority),

TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
- __field(long, nr_objects_to_shrink)
- __field(gfp_t, gfp_flags)
- __field(unsigned long, cache_items)
- __field(unsigned long long, delta)
- __field(unsigned long, total_scan)
+ __field(int64_t, deferred_count)
+ __field(int64_t, freeable_objects)
+ __field(int64_t, scan_count)
__field(int, priority)
+ __field(gfp_t, gfp_flags)
),

TP_fast_assign(
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
- __entry->nr_objects_to_shrink = nr_objects_to_shrink;
- __entry->gfp_flags = sc->gfp_mask;
- __entry->cache_items = cache_items;
- __entry->delta = delta;
- __entry->total_scan = total_scan;
+ __entry->deferred_count = deferred_count;
+ __entry->freeable_objects = freeable_objects;
+ __entry->scan_count = scan_count;
__entry->priority = priority;
+ __entry->gfp_flags = sc->gfp_mask;
),

- TP_printk("%pF %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+ TP_printk("%pF %p: nid: %d objects to scan %lld freeable items %lld deferred count %lld priority %d gfp_flags %s",
__entry->shrink,
__entry->shr,
__entry->nid,
- __entry->nr_objects_to_shrink,
- show_gfp_flags(__entry->gfp_flags),
- __entry->cache_items,
- __entry->delta,
- __entry->total_scan,
- __entry->priority)
+ __entry->scan_count,
+ __entry->freeable_objects,
+ __entry->deferred_count,
+ __entry->priority,
+ show_gfp_flags(__entry->gfp_flags))
);

TRACE_EVENT(mm_shrink_slab_end,
- TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
- long unused_scan_cnt, long new_scan_cnt, long total_scan),
+ TP_PROTO(struct shrinker *shr, int nid, int64_t freed_objects,
+ int64_t unused_scan_cnt, int64_t new_deferred_count,
+ int64_t old_deferred_count),

- TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
- total_scan),
+ TP_ARGS(shr, nid, freed_objects, unused_scan_cnt, new_deferred_count,
+ old_deferred_count),

TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(int, nid)
__field(void *, shrink)
- __field(long, unused_scan)
- __field(long, new_scan)
- __field(int, retval)
- __field(long, total_scan)
+ __field(long long, unused_scan)
+ __field(long long, new_deferred_count)
+ __field(long long, freed_objects)
+ __field(long long, old_deferred_count)
),

TP_fast_assign(
__entry->shr = shr;
__entry->nid = nid;
__entry->shrink = shr->scan_objects;
+ __entry->freed_objects = freed_objects;
__entry->unused_scan = unused_scan_cnt;
- __entry->new_scan = new_scan_cnt;
- __entry->retval = shrinker_retval;
- __entry->total_scan = total_scan;
+ __entry->new_deferred_count = new_deferred_count;
+ __entry->old_deferred_count = old_deferred_count;
),

- TP_printk("%pF %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ TP_printk("%pF %p: nid: %d freed objects %lld unused scan count %lld new deferred count %lld old deferred count %lld",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->freed_objects,
__entry->unused_scan,
- __entry->new_scan,
- __entry->total_scan,
- __entry->retval)
+ __entry->new_deferred_count,
+ __entry->old_deferred_count)
);

TRACE_EVENT(mm_vmscan_lru_isolate,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e979705bbf32..7db6d8242613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -447,37 +447,21 @@ void unregister_shrinker(struct shrinker *shrinker)
}
EXPORT_SYMBOL(unregister_shrinker);

-#define SHRINK_BATCH 128
-
-static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
- struct shrinker *shrinker, int priority)
+/*
+ * calculate the number of new objects to scan this time around
+ */
+static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker, int priority,
+ uint64_t *freeable_objects)
{
- unsigned long freed = 0;
- unsigned long long delta;
- long total_scan;
- long freeable;
- long nr;
- long new_nr;
int nid = shrinkctl->nid;
- long batch_size = shrinker->batch ? shrinker->batch
- : SHRINK_BATCH;
- long scanned = 0, next_deferred;
-
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
- nid = 0;
+ uint64_t delta;
+ uint64_t freeable;

freeable = shrinker->count_objects(shrinker, shrinkctl);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;

- /*
- * copy the current shrinker scan count into a local variable
- * and zero it so that other concurrent shrinker invocations
- * don't also do this scanning work.
- */
- nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-
- total_scan = nr;
if (shrinker->seeks) {
delta = freeable >> priority;
delta *= 4;
@@ -491,40 +475,81 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
delta = freeable / 2;
}

- total_scan += delta;
- if (total_scan < 0) {
- pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
- shrinker->scan_objects, total_scan);
- total_scan = freeable;
- next_deferred = nr;
- } else
- next_deferred = total_scan;
+ *freeable_objects = freeable;
+ return delta > 0 ? delta : 0;
+}

- /*
- * We need to avoid excessive windup on filesystem shrinkers
- * due to large numbers of GFP_NOFS allocations causing the
- * shrinkers to return -1 all the time. This results in a large
- * nr being built up so when a shrink that can do some work
- * comes along it empties the entire cache due to nr >>>
- * freeable. This is bad for sustaining a working set in
- * memory.
- *
- * Hence only allow the shrinker to scan the entire cache when
- * a large delta change is calculated directly.
- */
- if (delta < freeable / 4)
- total_scan = min(total_scan, freeable / 2);
+#define SHRINK_BATCH 128
+
+static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
+ struct shrinker *shrinker, int priority)
+{
+ int nid = shrinkctl->nid;
+ int batch_size = shrinker->batch ? shrinker->batch
+ : SHRINK_BATCH;
+ int64_t scan_count;
+ int64_t freeable_objects = 0;
+ int64_t scanned_objects = 0;
+ int64_t next_deferred = 0;
+ int64_t new_dcount;
+ int64_t freed = 0;
+ int64_t deferred_count = 0;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
+ nid = 0;
+
+ shrinkctl->will_defer = false;
+ scan_count = shrink_scan_count(shrinkctl, shrinker, priority,
+ &freeable_objects);
+ if (scan_count == 0 || scan_count == SHRINK_EMPTY)
+ return scan_count;
+
+/*
+ * If kswapd, we take all the deferred work and do it here. We don't let direct
+ * reclaim do this, because then it means some poor sod is going to have to do
+ * somebody else's GFP_NOFS reclaim, and it hides the real amount of reclaim
+ * work from concurrent kswapd operations. Hence we do the work in the wrong
+ * place, at the wrong time, and it's largely unpredictable.
+ *
+ * By doing the deferred work only in kswapd, we can schedule the work according
+ * to the reclaim priority - low priority reclaim will do less deferred work,
+ * hence we'll do more of the deferred work the more desperate we become for
+ * free memory. This avoids the need to specifically avoid deferred work
+ * windup, as low amounts of memory pressure won't excessively trim caches
+ * anymore.
+ */
+ if (current_is_kswapd()) {
+ int64_t deferred_scan;
+
+ deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
+
+ /* we want to scan 5-10% of the deferred work here at minimum */
+ deferred_scan = deferred_count;
+ if (priority)
+ do_div(deferred_scan, priority);
+
+ scan_count += deferred_scan;
+ }

/*
- * Avoid risking looping forever due to too large nr value:
+ * Avoid risking looping forever due to too much deferred work:
* never try to free more than twice the estimate number of
* freeable entries.
*/
- if (total_scan > freeable * 2)
- total_scan = freeable * 2;
+ if (scan_count > freeable_objects * 2)
+ scan_count = freeable_objects * 2;
+

- trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
- freeable, delta, total_scan, priority);
+ trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
+ freeable_objects, scan_count, priority);
+
+ /*
+ * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+ * defer the work to kswapd. kswapd runs under GFP_KERNEL, so it should
+ * never have deferred shrinker work in that context.
+ */
+ if (shrinkctl->will_defer)
+ goto done;

/*
* Normally, we should not scan less than batch_size objects in one
@@ -541,10 +566,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
* scanning at high prio and therefore should try to reclaim as much as
* possible.
*/
- while (total_scan >= batch_size ||
- total_scan >= freeable) {
- unsigned long ret;
- unsigned long nr_to_scan = min(batch_size, total_scan);
+ while (scan_count >= batch_size ||
+ scan_count >= freeable_objects) {
+ int64_t ret;
+ int64_t nr_to_scan = min_t(int64_t, batch_size, scan_count);

shrinkctl->nr_to_scan = nr_to_scan;
shrinkctl->nr_scanned = nr_to_scan;
@@ -554,28 +579,31 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
freed += ret;

count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
- total_scan -= shrinkctl->nr_scanned;
- scanned += shrinkctl->nr_scanned;
+ scan_count -= shrinkctl->nr_scanned;
+ scanned_objects += shrinkctl->nr_scanned;

cond_resched();
}

- if (next_deferred >= scanned)
- next_deferred -= scanned;
- else
- next_deferred = 0;
+done:
+ if (deferred_count)
+ next_deferred = deferred_count - scanned_objects;
+ else if (scan_count > 0)
+ next_deferred = scan_count;
+
/*
* move the unused scan count back into the shrinker in a
* manner that handles concurrent updates. If we exhausted the
* scan, there is no need to do an update.
*/
if (next_deferred > 0)
- new_nr = atomic_long_add_return(next_deferred,
+ new_dcount = atomic64_add_return(next_deferred,
&shrinker->nr_deferred[nid]);
else
- new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+ new_dcount = atomic64_read(&shrinker->nr_deferred[nid]);

- trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+ trace_mm_shrink_slab_end(shrinker, nid, freed, scan_count, new_dcount,
+ deferred_count);
return freed;
}