Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high

From: Michal Hocko
Date: Wed Feb 19 2020 - 13:37:37 EST


On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> We have received regression reports from users whose workloads moved
> into containers and subsequently encountered new latencies. For some
> users these were a nuisance, but for some it meant missing their SLA
> response times. We tracked those delays down to cgroup limits, which
> inject direct reclaim stalls into the workload where previously all
> reclaim was handled by kswapd.

I am curious why this is unexpected, given that the high limit is
explicitly documented as a throttling mechanism.

> This patch adds asynchronous reclaim to the memory.high cgroup limit
> while keeping direct reclaim as a fallback. In our testing, this
> eliminated all direct reclaim from the affected workload.

Who is charged for all this work? Unless I am missing something, it
just gets hidden in system activity, and that might hurt isolation. I
do see why moving the work to a different context is desirable, but the
work has to be accounted properly if it is going to become the normal
mode of operation (rather than a rare exception like the existing irq
context handling).

> memory.high has a grace buffer of about 4% between when it becomes
> exceeded and when allocating threads get throttled. We can use the
> same buffer for the async reclaimer to operate in. If the worker
> cannot keep up and the grace buffer is exceeded, allocating threads
> will fall back to direct reclaim before getting throttled.
>
> For irq-context, there's already async memory.high enforcement. Re-use
> that work item for all allocating contexts, but switch it to the
> unbound workqueue so reclaim work doesn't compete with the workload.
> The work item is per cgroup, which means the workqueue infrastructure
> will create at maximum one worker thread per reclaiming cgroup.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
> mm/vmscan.c | 10 +++++++--
> 2 files changed, 54 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cf02e3ef3ed9..bad838d9c2bb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1446,6 +1446,10 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> seq_buf_printf(&s, "pgsteal %lu\n",
> memcg_events(memcg, PGSTEAL_KSWAPD) +
> memcg_events(memcg, PGSTEAL_DIRECT));
> + seq_buf_printf(&s, "pgscan_direct %lu\n",
> + memcg_events(memcg, PGSCAN_DIRECT));
> + seq_buf_printf(&s, "pgsteal_direct %lu\n",
> + memcg_events(memcg, PGSTEAL_DIRECT));
> seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
> memcg_events(memcg, PGACTIVATE));
> seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
> @@ -2235,10 +2239,19 @@ static void reclaim_high(struct mem_cgroup *memcg,
>
> static void high_work_func(struct work_struct *work)
> {
> + unsigned long high, usage;
> struct mem_cgroup *memcg;
>
> memcg = container_of(work, struct mem_cgroup, high_work);
> - reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
> +
> + high = READ_ONCE(memcg->high);
> + usage = page_counter_read(&memcg->memory);
> +
> + if (usage <= high)
> + return;
> +
> + set_worker_desc("cswapd/%llx", cgroup_id(memcg->css.cgroup));
> + reclaim_high(memcg, usage - high, GFP_KERNEL);
> }
>
> /*
> @@ -2304,15 +2317,22 @@ void mem_cgroup_handle_over_high(void)
> unsigned long pflags;
> unsigned long penalty_jiffies, overage;
> unsigned int nr_pages = current->memcg_nr_pages_over_high;
> + bool tried_direct_reclaim = false;
> struct mem_cgroup *memcg;
>
> if (likely(!nr_pages))
> return;
>
> - memcg = get_mem_cgroup_from_mm(current->mm);
> - reclaim_high(memcg, nr_pages, GFP_KERNEL);
> current->memcg_nr_pages_over_high = 0;
>
> + memcg = get_mem_cgroup_from_mm(current->mm);
> + high = READ_ONCE(memcg->high);
> +recheck:
> + usage = page_counter_read(&memcg->memory);
> +
> + if (usage <= high)
> + goto out;
> +
> /*
> * memory.high is breached and reclaim is unable to keep up. Throttle
> * allocators proactively to slow down excessive growth.
> @@ -2325,12 +2345,6 @@ void mem_cgroup_handle_over_high(void)
> * overage amount.
> */
>
> - usage = page_counter_read(&memcg->memory);
> - high = READ_ONCE(memcg->high);
> -
> - if (usage <= high)
> - goto out;
> -
> /*
> * Prevent division by 0 in overage calculation by acting as if it was a
> * threshold of 1 page
> @@ -2369,6 +2383,16 @@ void mem_cgroup_handle_over_high(void)
> if (penalty_jiffies <= HZ / 100)
> goto out;
>
> + /*
> + * It's possible async reclaim just isn't able to keep
> + * up. Before we go to sleep, try direct reclaim.
> + */
> + if (!tried_direct_reclaim) {
> + reclaim_high(memcg, nr_pages, GFP_KERNEL);
> + tried_direct_reclaim = true;
> + goto recheck;
> + }
> +
> /*
> * If we exit early, we're guaranteed to die (since
> * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> @@ -2544,13 +2568,21 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> do {
> if (page_counter_read(&memcg->memory) > memcg->high) {
> + /*
> + * Kick off the async reclaimer, which should
> + * be doing most of the work to avoid latency
> + * in the workload. But also check in on its
> + * progress before resuming to userspace, in
> + * case we need to do direct reclaim, or even
> + * throttle the allocating thread if reclaim
> + * cannot keep up with allocation demand.
> + */
> + queue_work(system_unbound_wq, &memcg->high_work);
> /* Don't bother a random interrupted task */
> - if (in_interrupt()) {
> - schedule_work(&memcg->high_work);
> - break;
> + if (!in_interrupt()) {
> + current->memcg_nr_pages_over_high += batch;
> + set_notify_resume(current);
> }
> - current->memcg_nr_pages_over_high += batch;
> - set_notify_resume(current);
> break;
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 74e8edce83ca..d6085115c7f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1947,7 +1947,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
> reclaim_stat->recent_scanned[file] += nr_taken;
>
> - item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> + if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> + item = PGSCAN_KSWAPD;
> + else
> + item = PGSCAN_DIRECT;
> if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_scanned);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> @@ -1961,7 +1964,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> spin_lock_irq(&pgdat->lru_lock);
>
> - item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> + if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> + item = PGSTEAL_KSWAPD;
> + else
> + item = PGSTEAL_DIRECT;
> if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_reclaimed);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> --
> 2.24.1
>
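
To check my reading of the resulting control flow in
mem_cgroup_handle_over_high(), here is a minimal user-space sketch in C.
It is not kernel code. The names over_high_action, grace_pages and
handle_over_high are made up for illustration, and the grace buffer is
modelled as a flat 4% of the limit, whereas the patch derives it from
the penalty calculation (the penalty_jiffies <= HZ / 100 check).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a task returning to userspace. */
enum over_high_action {
	ACTION_NONE,           /* usage is at or below memory.high */
	ACTION_ASYNC_ONLY,     /* inside the grace buffer, worker handles it */
	ACTION_DIRECT_RECLAIM, /* direct reclaim brought usage back in range */
	ACTION_THROTTLE,       /* direct reclaim was not enough, task sleeps */
};

/* Assumption: the grace buffer is a flat 4% of the high limit. */
static unsigned long grace_pages(unsigned long high)
{
	return high / 25;
}

/*
 * Simplified model of the patched decision order:
 *   1. usage <= high               -> nothing to do
 *   2. overage within grace buffer -> leave it to the async worker
 *   3. otherwise try direct reclaim once and recheck
 *   4. still beyond the buffer     -> throttle
 * direct_reclaimed is the number of pages the single direct reclaim
 * attempt is assumed to free.
 */
static enum over_high_action handle_over_high(unsigned long usage,
					      unsigned long high,
					      unsigned long direct_reclaimed)
{
	bool tried_direct_reclaim = false;

	assert(high > 0);
recheck:
	if (usage <= high)
		return tried_direct_reclaim ? ACTION_DIRECT_RECLAIM
					    : ACTION_NONE;

	if (usage - high <= grace_pages(high))
		return tried_direct_reclaim ? ACTION_DIRECT_RECLAIM
					    : ACTION_ASYNC_ONLY;

	if (!tried_direct_reclaim) {
		/* Never free more pages than are currently charged. */
		if (direct_reclaimed > usage)
			direct_reclaimed = usage;
		usage -= direct_reclaimed;
		tried_direct_reclaim = true;
		goto recheck;
	}

	return ACTION_THROTTLE;
}
```

With high at 1000 pages in this model, usage of 1020 pages stays in the
async-only range, while 1100 pages triggers one direct reclaim attempt
and, if that frees too little, throttling.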

--
Michal Hocko
SUSE Labs