Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory

From: Michal Hocko
Date: Wed Jan 14 2015 - 10:34:34 EST

Next message: Javier Martinez Canillas: "Re: [PATCH RESEND 1/4] ARM: dts: Add power and lid GPIO keys pinctrl for exynos5250-snow"
Previous message: Cyrill Gorcunov: "Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files"
In reply to: Vladimir Davydov: "Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory"
Next in thread: Johannes Weiner: "Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu 08-01-15 23:15:04, Johannes Weiner wrote:
[...]
> @@ -2353,6 +2353,22 @@ done_restock:
> css_get_many(&memcg->css, batch);
> if (batch > nr_pages)
> refill_stock(memcg, batch - nr_pages);
> + /*
> + * If the hierarchy is above the normal consumption range,
> + * make the charging task trim the excess.
> + */
> + do {
> + unsigned long nr_pages = page_counter_read(&memcg->memory);
> + unsigned long high = ACCESS_ONCE(memcg->high);
> +
> + if (nr_pages > high) {
> + mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +
> + try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> + gfp_mask, true);
> + }

As I've said before I am not happy about this. Heavy parallel load
hitting on the limit can generate really high reclaim targets causing
over reclaim and long stalls. Moreover portions of the excess would be
reclaimed multiple times which is not necessary.

I am not entirely happy about reclaiming nr_pages for THP_PAGES already
and this might be much worse, more probable and less predictable.

I believe the target should be capped somehow. nr_pages sounds like a
compromise. It would still throttle the charger and scale much better.

> +
> + } while ((memcg = parent_mem_cgroup(memcg)));
> done:
> return ret;
> }
[...]
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> + seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> + seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> + seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> + seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> + return 0;
> +}

OK, but I really think we need a way of OOM notification for user space
OOM handling as well - including blocking the OOM killer as we have
now. This is not directly related to this patch so it doesn't have to
happen right now, we should just think about the proper interface if
oom_control is consider not suitable.

[...]
> @@ -2322,6 +2325,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> struct lruvec *lruvec;
> int swappiness;
>
> + if (mem_cgroup_low(root, memcg)) {
> + if (!sc->may_thrash)
> + continue;
> + mem_cgroup_events(memcg, MEMCG_LOW, 1);
> + }
> +
> lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> swappiness = mem_cgroup_swappiness(memcg);
>
> @@ -2343,8 +2352,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> mem_cgroup_iter_break(root, memcg);
> break;
> }
> - memcg = mem_cgroup_iter(root, memcg, &reclaim);
> - } while (memcg);
> + } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

I had a similar code but then I could trigger quick priority drop downs
during parallel reclaim with multiple low limited groups. I've tried to
address that by retrying shrink_zone if it hasn't called shrink_lruvec
at all. Still not ideal because it can livelock theoretically, but I
haven't seen that in my testing.

The following is a quick rebase (not tested yet). I can send it as a
separate patch with a full changelog if you prefer that over merging it
into yours but I think we need it in some form:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd4cbc045b79..6cf7d4293976 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2306,6 +2306,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
{
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+ int scanned_memcgs = 0;

do {
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2331,6 +2332,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
mem_cgroup_events(memcg, MEMCG_LOW, 1);
}

+ scanned_memcgs++;
lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);

@@ -2355,6 +2357,14 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

/*
+ * reclaimers are racing and only saw low limited memcgs
+ * so we have to retry the memcg walk.
+ */
+ if (!scanned_memcgs && (global_reclaim(sc) ||
+ !mem_cgroup_low(root, root)))
+ continue;
+
+ /*
* Shrink the slab caches in the same proportion that
* the eligible LRU pages were scanned.
*/
@@ -2380,8 +2390,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
if (sc->nr_reclaimed - nr_reclaimed)
reclaimable = true;

- } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
- sc->nr_scanned - nr_scanned, sc));
+ if (!should_continue_reclaim(zone,
+ sc->nr_reclaimed - nr_reclaimed,
+ sc->nr_scanned - nr_scanned, sc))
+ break;
+ } while (true);

return reclaimable;
}

[...]

Other than that the patch looks OK and I am happy this has moved
forward finally.
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Javier Martinez Canillas: "Re: [PATCH RESEND 1/4] ARM: dts: Add power and lid GPIO keys pinctrl for exynos5250-snow"
Previous message: Cyrill Gorcunov: "Re: [RFC][PATCH] procfs: Add /proc/<pid>/mapped_files"
In reply to: Vladimir Davydov: "Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory"
Next in thread: Johannes Weiner: "Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]