Re: [PATCH v2 1/1] kvfree_rcu: Release a page cache under memory pressure

From: Uladzislau Rezki
Date: Tue Mar 16 2021 - 16:42:46 EST


> On Wed, Mar 10, 2021 at 09:07:57PM +0100, Uladzislau Rezki (Sony) wrote:
> > From: Zhang Qiang <qiang.zhang@xxxxxxxxxxxxx>
> >
> > Add a drain_page_cache() function to drain a per-cpu page cache.
> > The reason behind of it is a system can run into a low memory
> > condition, in that case a page shrinker can ask for its users
> > to free their caches in order to get extra memory available for
> > other needs in a system.
> >
> > When a system hits such condition, a page cache is drained for
> > all CPUs in a system. Apart of that a page cache work is delayed
> > with 5 seconds interval until a memory pressure disappears.
>
> Does this capture it?
>
It would be good to have a clear interface that signals:

- a low-memory condition has started;
- it is over, the watermarks have been restored.

but I do not see one. Therefore a 5-second back-off has been chosen
to make the cache refilling less aggressive. Suppose 5 seconds is
not enough; in that case the delayed work will attempt to allocate
some pages using less permissive parameters. That means that if we
are still in a low-memory condition, the refilling will probably
fail and the next job will be invoked in another 5 seconds.
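
To illustrate the "less permissive parameters": the refill worker
allocates with a non-insistent GFP mask and simply gives up on
failure, leaving the retry to the delayed work. A minimal sketch of
that loop, following the names used in the patch (the exact GFP mask
below is my assumption):

<snip>
static void fill_page_cache_func(struct work_struct *work)
{
        struct kfree_rcu_cpu *krcp = container_of(work,
                        struct kfree_rcu_cpu, page_cache_work.work);
        struct kvfree_rcu_bulk_data *bnode;
        unsigned long flags;
        bool pushed;
        int i;

        for (i = 0; i < rcu_min_cached_objs; i++) {
                // Non-insistent request: no hard retries, no memory
                // reserves, no warning. Under pressure it fails fast.
                bnode = (struct kvfree_rcu_bulk_data *)
                        __get_free_page(GFP_KERNEL | __GFP_NORETRY |
                                        __GFP_NOMEMALLOC | __GFP_NOWARN);
                if (!bnode)
                        break; // give up; the next attempt comes later

                raw_spin_lock_irqsave(&krcp->lock, flags);
                pushed = put_cached_bnode(krcp, bnode);
                raw_spin_unlock_irqrestore(&krcp->lock, flags);

                if (!pushed) { // the cache is already full
                        free_page((unsigned long) bnode);
                        break;
                }
        }

        atomic_set(&krcp->work_in_progress, 0);
}
<snip>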

> ------------------------------------------------------------------------
>
> Add a drain_page_cache() function that drains the specified per-cpu
> page cache. This function is invoked on each CPU when the system
> enters a low-memory state, that is, when the shrinker invokes
> kfree_rcu_shrink_scan(). Thus, when the system is low on memory,
> kvfree_rcu() starts taking its slow paths.
>
> In addition, the first subsequent attempt to refill the caches is
> delayed for five seconds.
>
> ------------------------------------------------------------------------
>
> A few questions below.
>
> Thanx, Paul
>
> > Co-developed-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > Signed-off-by: Zqiang <qiang.zhang@xxxxxxxxxxxxx>
> > ---
> > kernel/rcu/tree.c | 59 ++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 51 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 2c9cf4df942c..46b8a98ca077 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3163,7 +3163,7 @@ struct kfree_rcu_cpu {
> >         bool initialized;
> >         int count;
> >
> > -       struct work_struct page_cache_work;
> > +       struct delayed_work page_cache_work;
> >         atomic_t work_in_progress;
> >         struct hrtimer hrtimer;
> >
> > @@ -3175,6 +3175,17 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
> >         .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
> > };
> >
> > +// A page shrinker can ask its users to free extra pages
> > +// so they become available for other needs in the system.
> > +// That usually happens under a low-memory condition, in
> > +// which case hold off on page-cache filling for a while.
> > +static unsigned long backoff_page_cache_fill;
> > +
> > +// A 5-second delay. That is long enough to reduce
> > +// interference and racing with the shrinker that
> > +// drains the cache.
> > +#define PAGE_CACHE_FILL_DELAY (5 * HZ)
> > +
> > static __always_inline void
> > debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
> > {
> > @@ -3229,6 +3240,26 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
> >
> > }
> >
> > +static int
> > +drain_page_cache(struct kfree_rcu_cpu *krcp)
> > +{
> > +       unsigned long flags;
> > +       struct llist_node *page_list, *pos, *n;
> > +       int freed = 0;
> > +
> > +       raw_spin_lock_irqsave(&krcp->lock, flags);
> > +       page_list = llist_del_all(&krcp->bkvcache);
> > +       krcp->nr_bkv_objs = 0;
> > +       raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > +
> > +       llist_for_each_safe(pos, n, page_list) {
> > +               free_page((unsigned long)pos);
> > +               freed++;
> > +       }
> > +
> > +       return freed;
> > +}
> > +
> > /*
> >  * This function is invoked in workqueue context after a grace period.
> >  * It frees all the objects queued on ->bhead_free or ->head_free.
> > @@ -3419,7 +3450,7 @@ schedule_page_work_fn(struct hrtimer *t)
> >         struct kfree_rcu_cpu *krcp =
> >                 container_of(t, struct kfree_rcu_cpu, hrtimer);
> >
> > -       queue_work(system_highpri_wq, &krcp->page_cache_work);
> > +       queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
> >         return HRTIMER_NORESTART;
> > }
> >
> > @@ -3428,7 +3459,7 @@ static void fill_page_cache_func(struct work_struct *work)
> >         struct kvfree_rcu_bulk_data *bnode;
> >         struct kfree_rcu_cpu *krcp =
> >                 container_of(work, struct kfree_rcu_cpu,
> > -                       page_cache_work);
> > +                       page_cache_work.work);
> >         unsigned long flags;
> >         bool pushed;
> >         int i;
> > @@ -3457,10 +3488,14 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> > {
> >         if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> >             !atomic_xchg(&krcp->work_in_progress, 1)) {
> > -               hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> > -                       HRTIMER_MODE_REL);
> > -               krcp->hrtimer.function = schedule_page_work_fn;
> > -               hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> > +               if (xchg(&backoff_page_cache_fill, 0UL)) {
>
> How often can run_page_cache_worker() be invoked? I am a bit
> concerned about the possibility of heavy memory contention on the
> backoff_page_cache_fill variable on large systems. Unless there
> is something that sharply bounds the frequency of calls to
> run_page_cache_worker(), something like this would be more scalable:
>
>         if (backoff_page_cache_fill &&
>             xchg(&backoff_page_cache_fill, 0UL)) {
>
It is invoked per CPU. It runs once the cache becomes empty, and the
next invocation happens only after the worker has refilled the cache
and the krcp has run out of it again. I do not consider that high
contention on the backoff_page_cache_fill variable. On my 64-CPU
system, run_page_cache_worker() itself does not consume many CPU
cycles during the test:

Samples: 2K of event 'cycles:k', Event count (approx.): 1372274198
Overhead  Command          Shared Object     Symbol
  27.45%  kworker/0:2-eve  [kernel.vmlinux]  [k] kmem_cache_free_bulk
  14.56%  vmalloc_test/0   [kernel.vmlinux]  [k] kmem_cache_alloc_trace
  11.34%  vmalloc_test/0   [kernel.vmlinux]  [k] kvfree_call_rcu
   7.61%  vmalloc_test/0   [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
   7.60%  vmalloc_test/0   [kernel.vmlinux]  [k] allocate_slab
   5.38%  vmalloc_test/0   [kernel.vmlinux]  [k] check_preemption_disabled
   3.12%  vmalloc_test/0   [kernel.vmlinux]  [k] _raw_spin_lock
   2.85%  vmalloc_test/0   [kernel.vmlinux]  [k] preempt_count_add
   2.64%  vmalloc_test/0   [kernel.vmlinux]  [k] __list_del_entry_valid
   2.53%  vmalloc_test/0   [kernel.vmlinux]  [k] preempt_count_sub
   1.81%  vmalloc_test/0   [kernel.vmlinux]  [k] native_write_msr
   1.05%  kworker/0:2-eve  [kernel.vmlinux]  [k] __slab_free
   0.96%  vmalloc_test/0   [kernel.vmlinux]  [k] asm_sysvec_apic_timer_interrupt
   0.96%  vmalloc_test/0   [kernel.vmlinux]  [k] setup_object_debug.isra.69
   0.76%  kworker/0:2-eve  [kernel.vmlinux]  [k] free_pcppages_bulk
   0.72%  kworker/0:2-eve  [kernel.vmlinux]  [k] put_cpu_partial
   0.72%  vmalloc_test/0   [test_vmalloc]    [k] kvfree_rcu_2_arg_slab_test
   0.52%  kworker/0:2-eve  [kernel.vmlinux]  [k] kfree_rcu_work
   0.52%  vmalloc_test/0   [kernel.vmlinux]  [k] get_page_from_freelist
   0.52%  vmalloc_test/0   [kernel.vmlinux]  [k] run_page_cache_worker

<run_page_cache_worker>
         │ arch_atomic_xchg():
         │   mov    $0x1,%eax
         │ run_page_cache_worker():
         │   push   %rbx
         │ arch_atomic_xchg():
         │   xchg   %eax,0x188(%rdi)
         │ run_page_cache_worker():
  100.00 │   test   %eax,%eax
<run_page_cache_worker>

<snip>
        if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
            !atomic_xchg(&krcp->work_in_progress, 1)) { <-- all the cycles of run_page_cache_worker() are spent here
<snip>

>
> It looks to me like all the CPUs could invoke run_page_cache_worker()
> at the same time. Or am I missing something that throttles calls to
> run_page_cache_worker(), even on systems with hundreds of CPUs?
>
It is per-CPU, thus serialized: each CPU refills only its own cache,
and the work_in_progress flag admits at most one refill worker per
CPU at a time.

>
> Also, if I am reading the code correctly, the unlucky first CPU to
> attempt to refill cache after a shrinker invocation would be delayed
> five seconds (thus invoking the slow path during that time), but other
> CPUs would continue unimpeded. Is this the intent?
>
The backoff_page_cache_fill variable is global and shared among all
CPUs. So only the one CPU that consumes it follows the slow path,
whereas all the rest will refill their caches right away, following
the fast path.

That should be fixed by making it per-CPU as well: the shrinker
should mark each CPU to back off refilling.
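
Something along these lines, where the per-CPU field, its type and
the helper name are my assumptions rather than a final patch:

<snip>
// Assumed per-CPU flag added to struct kfree_rcu_cpu:
//         atomic_t backoff_page_cache_fill;

// Invoked from kfree_rcu_shrink_scan() so that *every* CPU
// backs off, not only the first one to consume the flag.
static void
kfree_rcu_mark_backoff(void)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

                atomic_set(&krcp->backoff_page_cache_fill, 1);
        }
}

static void
run_page_cache_worker(struct kfree_rcu_cpu *krcp)
{
        if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
            !atomic_xchg(&krcp->work_in_progress, 1)) {
                // Test and clear this CPU's own flag.
                if (atomic_xchg(&krcp->backoff_page_cache_fill, 0)) {
                        queue_delayed_work(system_wq,
                                &krcp->page_cache_work,
                                PAGE_CACHE_FILL_DELAY);
                } else {
                        hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
                                HRTIMER_MODE_REL);
                        krcp->hrtimer.function = schedule_page_work_fn;
                        hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
                }
        }
}
<snip>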

>
> If I understand correctly, the point is to avoid the situation where
> memory needed elsewhere is drained and then immediately refilled.
> But the code will do the immediate refill when the rest of the CPUs show
> up, correct?
>
Correct. We do not want to request pages for some period of time,
because they might be needed by other users of the system. We have
fall-back paths, so there is no urgent demand for the cache in our
case.

>
> Might it be better to put a low cap on the per-CPU caches for some
> period of time after the shrinker runs? Maybe allow at most one page
> to be cached for the five seconds following?
>
That we can do!
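
For instance, reusing the per-CPU flag sketched above, the refill
worker could consult it when deciding how many pages to cache. An
untested sketch of what such a cap could look like, assuming the
flag stays set for the whole back-off interval:

<snip>
        // In fill_page_cache_func(): if the shrinker ran recently,
        // cap the cache at a single page for this interval,
        // otherwise refill up to the normal watermark.
        int nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
                        1 : rcu_min_cached_objs;

        for (i = 0; i < nr_pages; i++) {
                // ... allocate and put_cached_bnode() as before ...
        }
<snip>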

> > +                       queue_delayed_work(system_wq,
> > +                               &krcp->page_cache_work, PAGE_CACHE_FILL_DELAY);
> > +               } else {
> > +                       hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +                       krcp->hrtimer.function = schedule_page_work_fn;
> > +                       hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> > +               }
> >         }
> > }
> >
> > @@ -3612,14 +3647,20 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > {
> >         int cpu;
> >         unsigned long count = 0;
> > +       unsigned long flags;
> >
> >         /* Snapshot count of all CPUs */
> >         for_each_possible_cpu(cpu) {
> >                 struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> >
> >                 count += READ_ONCE(krcp->count);
> > +
> > +               raw_spin_lock_irqsave(&krcp->lock, flags);
> > +               count += krcp->nr_bkv_objs;
> > +               raw_spin_unlock_irqrestore(&krcp->lock, flags);
>
> Not a big deal given that this should not be invoked often, but couldn't
> the read from ->nr_bkv_objs be READ_ONCE() without the lock? (This would
> require that updates use WRITE_ONCE() as well.)
>
I was thinking about it. Will re-spin and rework :)
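
Something like this, I guess (untested; it requires turning every
->nr_bkv_objs update into a WRITE_ONCE() as you say):

<snip>
        /* Snapshot count of all CPUs */
        for_each_possible_cpu(cpu) {
                struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

                count += READ_ONCE(krcp->count);
                // Lockless snapshot; pairs with WRITE_ONCE() at each
                // ->nr_bkv_objs update, e.g. in put_cached_bnode(),
                // get_cached_bnode() and drain_page_cache().
                count += READ_ONCE(krcp->nr_bkv_objs);
        }
<snip>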

Thanks!

--
Vlad Rezki