Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
From: Michal Hocko
Date: Tue Jan 20 2015 - 11:31:45 EST
On Tue 20-01-15 10:31:55, Johannes Weiner wrote:
> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below. The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are thus:
>
> - memory.current shows the current consumption of the cgroup and its
> descendants, in bytes.
>
> - memory.low configures the lower end of the cgroup's expected
> memory consumption range. The kernel considers memory below that
> boundary to be a reserve - the minimum that the workload needs in
> order to make forward progress - and generally avoids reclaiming
> it, unless there is an imminent risk of entering an OOM situation.
>
> - memory.high configures the upper end of the cgroup's expected
> memory consumption range. A cgroup whose consumption grows beyond
> this threshold is forced into direct reclaim, to work off the
> excess and to throttle new allocations heavily, but is generally
> allowed to continue and the OOM killer is not invoked.
>
> - memory.max configures the hard maximum amount of memory that the
> cgroup is allowed to consume before the OOM killer is invoked.
>
> - memory.events shows event counters that indicate how often the
> cgroup was reclaimed while below memory.low, how often it was
> forced to reclaim excess beyond memory.high, how often it hit
> memory.max, and how often it entered OOM due to memory.max. This
> allows users to identify configuration problems when observing a
> degradation in workload performance. An overcommitted system will
> have an increased rate of low boundary breaches, whereas increased
> rates of high limit breaches, maximum hits, or even OOM situations
> will indicate internally overcommitted cgroups.
>
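For monitoring, the four counters could be read from userspace roughly
like this. It is only a sketch: it assumes the "<name> <count>" per-line
format that memory_events_show() emits in this patch, and struct
memcg_events and parse_memcg_events() are made-up names, not an existing
API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Counters in the order the patch prints them in memory.events. */
struct memcg_events {
	unsigned long low;
	unsigned long high;
	unsigned long max;
	unsigned long oom;
};

/*
 * Parse the text of a memory.events file, one "<name> <count>" pair
 * per line.  Unknown or malformed lines are skipped.  Returns the
 * number of recognized counters.
 */
static int parse_memcg_events(const char *text, struct memcg_events *ev)
{
	int found = 0;

	assert(text && ev);
	memset(ev, 0, sizeof(*ev));

	while (*text) {
		size_t len = strcspn(text, "\n");
		char line[64], key[32];
		unsigned long val;

		if (len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			if (sscanf(line, "%31s %lu", key, &val) == 2) {
				if (!strcmp(key, "low")) {
					ev->low = val;
					found++;
				} else if (!strcmp(key, "high")) {
					ev->high = val;
					found++;
				} else if (!strcmp(key, "max")) {
					ev->max = val;
					found++;
				} else if (!strcmp(key, "oom")) {
					ev->oom = val;
					found++;
				}
			}
		}

		/* Step over this line and its terminating newline. */
		text += len;
		if (*text == '\n')
			text++;
	}
	return found;
}
```

Sampling this twice and comparing the deltas gives the rates the
changelog talks about: a growing "low" count points at an overcommitted
system, growing "high", "max" or "oom" counts at an overcommitted group.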
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
>
> - The original lower boundary, the soft limit, is defined as a limit
> that is per default unset. As a result, the set of cgroups that
> global reclaim prefers is opt-in, rather than opt-out. The costs
> for optimizing these mostly negative lookups are so high that the
> implementation, despite its enormous size, does not even provide
> the basic desirable behavior. First off, the soft limit has no
> hierarchical meaning. All configured groups are organized in a
> global rbtree and treated like equal peers, regardless of where they
> are located in the hierarchy. This makes subtree delegation
> impossible. Second, the soft limit reclaim pass is so aggressive
> that it not only introduces high allocation latencies into the
> system, but also impacts system performance due to overreclaim, to
> the point where the feature becomes self-defeating.
>
> The memory.low boundary on the other hand is a top-down allocated
> reserve. A cgroup enjoys reclaim protection when it and all its
> ancestors are below their low boundaries, which makes delegation
> of subtrees possible. Secondly, new cgroups have no reserve per
> default and in the common case most cgroups are eligible for the
> preferred reclaim pass. This allows the new low boundary to be
> efficiently implemented with just a minor addition to the generic
> reclaim code, without the need for out-of-band data structures and
> reclaim passes. Because the generic reclaim code considers all
> cgroups except for the ones running low in the preferred first
> reclaim pass, overreclaim of individual groups is eliminated as
> well, resulting in much better overall workload performance.
>
> - The original high boundary, the hard limit, is defined as a strict
> limit that can not budge, even if the OOM killer has to be called.
> But this generally goes against the goal of making the most out of
> the available memory. The memory consumption of workloads varies
> during runtime, and that requires users to overcommit. But doing
> that with a strict upper limit requires either a fairly accurate
> prediction of the working set size or adding slack to the limit.
> Since working set size estimation is hard and error prone, and
> getting it wrong results in OOM kills, most users tend to err on
> the side of a looser limit and end up wasting precious resources.
>
> The memory.high boundary on the other hand can be set much more
> conservatively. When hit, it throttles allocations by forcing
> them into direct reclaim to work off the excess, but it never
> invokes the OOM killer. As a result, a high boundary that is
> chosen too aggressively will not terminate the processes, but
> instead it will lead to gradual performance degradation. The user
> can monitor this and make corrections until the minimal memory
> footprint that still gives acceptable performance is found.
>
> In extreme cases, with many concurrent allocations and a complete
> breakdown of reclaim progress within the group, the high boundary
> can be exceeded. But even then it's mostly better to satisfy the
> allocation from the slack available in other groups or the rest of
> the system than killing the group. Otherwise, memory.max is there
> to limit this type of spillover and ultimately contain buggy or
> even malicious applications.
>
> - The original control file names are unwieldy and inconsistent in
> many different ways. For example, the upper boundary hit count is
> exported in the memory.failcnt file, but an OOM event count has to
> be manually counted by listening to memory.oom_control events, and
> lower boundary / soft limit events have to be counted by first
> setting a threshold for that value and then counting those events.
> Also, usage and limit files encode their units in the filename.
> That makes the filenames very long, even though this is not
> information that a user needs to be reminded of every time they
> type out those names.
>
> To address these naming issues, as well as to signal clearly that
> the new interface carries a new configuration model, the naming
> conventions in it necessarily differ from the old interface.
>
> - The original limit files indicate the state of an unset limit with
> a very high number, and a configured limit can be unset by echoing
> -1 into those files. But that very high number is implementation
> and architecture dependent and not very descriptive. And while -1
> can be understood as an underflow into the highest possible value,
> -2 or -10M etc. do not work, so it's not consistent.
>
> memory.low, memory.high, and memory.max will use the string
> "infinity" to indicate and set the highest possible value.
>
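To illustrate the "infinity" convention from the consumer side, here is a
minimal userspace sketch of parsing and printing such a limit. The names
parse_limit(), format_limit() and LIMIT_UNSET are hypothetical. The
sketch only accepts plain decimal byte counts; the kernel-side parser in
the patch (page_counter_memparse()) may accept more than that.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory sentinel meaning "no limit configured". */
#define LIMIT_UNSET ULLONG_MAX

/*
 * Parse the contents of a limit file: either the literal string
 * "infinity" or a decimal byte count.  Negative numbers, suffixes and
 * stray characters are rejected rather than silently wrapped around.
 */
static int parse_limit(const char *s, unsigned long long *bytes)
{
	char *end;

	assert(s && bytes);

	if (!strcmp(s, "infinity")) {
		*bytes = LIMIT_UNSET;
		return 0;
	}

	/* strtoull() would accept "-1" and leading blanks; we do not. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	*bytes = strtoull(s, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;

	return 0;
}

/* Print a limit the way the new files show it, newline included. */
static int format_limit(unsigned long long bytes, char *buf, size_t len)
{
	if (bytes == LIMIT_UNSET)
		return snprintf(buf, len, "infinity\n");
	return snprintf(buf, len, "%llu\n", bytes);
}
```

The point of the explicit string is visible here: "-1" is an error
instead of a magic value, and an unset limit round-trips as "infinity"
regardless of word size.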
> [akpm@xxxxxxxxxxxxxxxxxxxx: use seq_puts() for basic strings]
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxx>
> Cc: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>
> Cc: Greg Thelen <gthelen@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxx>
> ---
> Documentation/cgroups/unified-hierarchy.txt | 79 ++++++++++
> include/linux/memcontrol.h | 32 ++++
> mm/memcontrol.c | 229 ++++++++++++++++++++++++++--
> mm/vmscan.c | 22 ++-
> 4 files changed, 348 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
> index 4f4563277864..71daa35ec2d9 100644
> --- a/Documentation/cgroups/unified-hierarchy.txt
> +++ b/Documentation/cgroups/unified-hierarchy.txt
> @@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
> - use_hierarchy is on by default and the cgroup file for the flag is
> not created.
>
> +- The original lower boundary, the soft limit, is defined as a limit
> + that is per default unset. As a result, the set of cgroups that
> + global reclaim prefers is opt-in, rather than opt-out. The costs
> + for optimizing these mostly negative lookups are so high that the
> + implementation, despite its enormous size, does not even provide the
> + basic desirable behavior. First off, the soft limit has no
> + hierarchical meaning. All configured groups are organized in a
> + global rbtree and treated like equal peers, regardless where they
> + are located in the hierarchy. This makes subtree delegation
> + impossible. Second, the soft limit reclaim pass is so aggressive
> + that it not just introduces high allocation latencies into the
> + system, but also impacts system performance due to overreclaim, to
> + the point where the feature becomes self-defeating.
> +
> + The memory.low boundary on the other hand is a top-down allocated
> + reserve. A cgroup enjoys reclaim protection when it and all its
> + ancestors are below their low boundaries, which makes delegation of
> + subtrees possible. Secondly, new cgroups have no reserve per
> + default and in the common case most cgroups are eligible for the
> + preferred reclaim pass. This allows the new low boundary to be
> + efficiently implemented with just a minor addition to the generic
> + reclaim code, without the need for out-of-band data structures and
> + reclaim passes. Because the generic reclaim code considers all
> + cgroups except for the ones running low in the preferred first
> + reclaim pass, overreclaim of individual groups is eliminated as
> + well, resulting in much better overall workload performance.
> +
> +- The original high boundary, the hard limit, is defined as a strict
> + limit that can not budge, even if the OOM killer has to be called.
> + But this generally goes against the goal of making the most out of
> + the available memory. The memory consumption of workloads varies
> + during runtime, and that requires users to overcommit. But doing
> + that with a strict upper limit requires either a fairly accurate
> + prediction of the working set size or adding slack to the limit.
> + Since working set size estimation is hard and error prone, and
> + getting it wrong results in OOM kills, most users tend to err on the
> + side of a looser limit and end up wasting precious resources.
> +
> + The memory.high boundary on the other hand can be set much more
> + conservatively. When hit, it throttles allocations by forcing them
> + into direct reclaim to work off the excess, but it never invokes the
> + OOM killer. As a result, a high boundary that is chosen too
> + aggressively will not terminate the processes, but instead it will
> + lead to gradual performance degradation. The user can monitor this
> + and make corrections until the minimal memory footprint that still
> + gives acceptable performance is found.
> +
> + In extreme cases, with many concurrent allocations and a complete
> + breakdown of reclaim progress within the group, the high boundary
> + can be exceeded. But even then it's mostly better to satisfy the
> + allocation from the slack available in other groups or the rest of
> + the system than killing the group. Otherwise, memory.max is there
> + to limit this type of spillover and ultimately contain buggy or even
> + malicious applications.
> +
> +- The original control file names are unwieldy and inconsistent in
> + many different ways. For example, the upper boundary hit count is
> + exported in the memory.failcnt file, but an OOM event count has to
> + be manually counted by listening to memory.oom_control events, and
> + lower boundary / soft limit events have to be counted by first
> + setting a threshold for that value and then counting those events.
> + Also, usage and limit files encode their units in the filename.
> + That makes the filenames very long, even though this is not
> + information that a user needs to be reminded of every time they type
> + out those names.
> +
> + To address these naming issues, as well as to signal clearly that
> + the new interface carries a new configuration model, the naming
> + conventions in it necessarily differ from the old interface.
> +
> +- The original limit files indicate the state of an unset limit with a
> + Very High Number, and a configured limit can be unset by echoing -1
> + into those files. But that very high number is implementation and
> + architecture dependent and not very descriptive. And while -1 can
> + be understood as an underflow into the highest possible value, -2 or
> + -10M etc. do not work, so it's not consistent.
> +
> + memory.low, memory.high, and memory.max will use the string
> + "infinity" to indicate and set the highest possible value.
>
> 5. Planned Changes
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 76f489fad640..72dff5fb0d0c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
> unsigned int generation;
> };
>
> +enum mem_cgroup_events_index {
> + MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */
> + MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */
> + MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */
> + MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */
> + MEM_CGROUP_EVENTS_NSTATS,
> + /* default hierarchy events */
> + MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
> + MEMCG_HIGH,
> + MEMCG_MAX,
> + MEMCG_OOM,
> + MEMCG_NR_EVENTS,
> +};
> +
> #ifdef CONFIG_MEMCG
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> + enum mem_cgroup_events_index idx,
> + unsigned int nr);
> +
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
> +
> int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask, struct mem_cgroup **memcgp);
> void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -175,6 +195,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> #else /* CONFIG_MEMCG */
> struct mem_cgroup;
>
> +static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> + enum mem_cgroup_events_index idx,
> + unsigned int nr)
> +{
> +}
> +
> +static inline bool mem_cgroup_low(struct mem_cgroup *root,
> + struct mem_cgroup *memcg)
> +{
> + return false;
> +}
> +
> static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> gfp_t gfp_mask,
> struct mem_cgroup **memcgp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a3592a756ad9..5730886e3b0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
> "swap",
> };
>
> -enum mem_cgroup_events_index {
> - MEM_CGROUP_EVENTS_PGPGIN, /* # of pages paged in */
> - MEM_CGROUP_EVENTS_PGPGOUT, /* # of pages paged out */
> - MEM_CGROUP_EVENTS_PGFAULT, /* # of page-faults */
> - MEM_CGROUP_EVENTS_PGMAJFAULT, /* # of major page-faults */
> - MEM_CGROUP_EVENTS_NSTATS,
> -};
> -
> static const char * const mem_cgroup_events_names[] = {
> "pgpgin",
> "pgpgout",
> @@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
>
> struct mem_cgroup_stat_cpu {
> long count[MEM_CGROUP_STAT_NSTATS];
> - unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> + unsigned long events[MEMCG_NR_EVENTS];
> unsigned long nr_page_events;
> unsigned long targets[MEM_CGROUP_NTARGETS];
> };
> @@ -284,6 +276,10 @@ struct mem_cgroup {
> struct page_counter memsw;
> struct page_counter kmem;
>
> + /* Normal memory consumption range */
> + unsigned long low;
> + unsigned long high;
> +
> unsigned long soft_limit;
>
> /* vmpressure notifications */
> @@ -2327,6 +2323,8 @@ retry:
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> + mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> +
> nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> gfp_mask, may_swap);
>
> @@ -2368,6 +2366,8 @@ retry:
> if (fatal_signal_pending(current))
> goto bypass;
>
> + mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
> +
> mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
> nomem:
> if (!(gfp_mask & __GFP_NOFAIL))
> @@ -2379,6 +2379,16 @@ done_restock:
> css_get_many(&memcg->css, batch);
> if (batch > nr_pages)
> refill_stock(memcg, batch - nr_pages);
> + /*
> + * If the hierarchy is above the normal consumption range,
> + * make the charging task trim their excess contribution.
> + */
> + do {
> + if (page_counter_read(&memcg->memory) <= memcg->high)
> + continue;
> + mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> + try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> + } while ((memcg = parent_mem_cgroup(memcg)));
> done:
> return ret;
> }
> @@ -4304,7 +4314,7 @@ out_kfree:
> return ret;
> }
>
> -static struct cftype mem_cgroup_files[] = {
> +static struct cftype mem_cgroup_legacy_files[] = {
> {
> .name = "usage_in_bytes",
> .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> @@ -4580,6 +4590,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> if (parent_css == NULL) {
> root_mem_cgroup = memcg;
> page_counter_init(&memcg->memory, NULL);
> + memcg->high = PAGE_COUNTER_MAX;
> memcg->soft_limit = PAGE_COUNTER_MAX;
> page_counter_init(&memcg->memsw, NULL);
> page_counter_init(&memcg->kmem, NULL);
> @@ -4625,6 +4636,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>
> if (parent->use_hierarchy) {
> page_counter_init(&memcg->memory, &parent->memory);
> + memcg->high = PAGE_COUNTER_MAX;
> memcg->soft_limit = PAGE_COUNTER_MAX;
> page_counter_init(&memcg->memsw, &parent->memsw);
> page_counter_init(&memcg->kmem, &parent->kmem);
> @@ -4635,6 +4647,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> */
> } else {
> page_counter_init(&memcg->memory, NULL);
> + memcg->high = PAGE_COUNTER_MAX;
> memcg->soft_limit = PAGE_COUNTER_MAX;
> page_counter_init(&memcg->memsw, NULL);
> page_counter_init(&memcg->kmem, NULL);
> @@ -4710,6 +4723,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
> mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
> mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
> memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> + memcg->low = 0;
> + memcg->high = PAGE_COUNTER_MAX;
> memcg->soft_limit = PAGE_COUNTER_MAX;
> }
>
> @@ -5296,6 +5311,147 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
> mem_cgroup_from_css(root_css)->use_hierarchy = true;
> }
>
> +static u64 memory_current_read(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + return mem_cgroup_usage(mem_cgroup_from_css(css), false);
> +}
> +
> +static int memory_low_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> + unsigned long low = ACCESS_ONCE(memcg->low);
> +
> + if (low == PAGE_COUNTER_MAX)
> + seq_puts(m, "infinity\n");
> + else
> + seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
> +
> + return 0;
> +}
> +
> +static ssize_t memory_low_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long low;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "infinity", &low);
> + if (err)
> + return err;
> +
> + memcg->low = low;
> +
> + return nbytes;
> +}
> +
> +static int memory_high_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> + unsigned long high = ACCESS_ONCE(memcg->high);
> +
> + if (high == PAGE_COUNTER_MAX)
> + seq_puts(m, "infinity\n");
> + else
> + seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
> +
> + return 0;
> +}
> +
> +static ssize_t memory_high_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long high;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "infinity", &high);
> + if (err)
> + return err;
> +
> + memcg->high = high;
> +
> + return nbytes;
> +}
> +
> +static int memory_max_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> + unsigned long max = ACCESS_ONCE(memcg->memory.limit);
> +
> + if (max == PAGE_COUNTER_MAX)
> + seq_puts(m, "infinity\n");
> + else
> + seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
> +
> + return 0;
> +}
> +
> +static ssize_t memory_max_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long max;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "infinity", &max);
> + if (err)
> + return err;
> +
> + err = mem_cgroup_resize_limit(memcg, max);
> + if (err)
> + return err;
> +
> + return nbytes;
> +}
> +
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> + seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> + seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> + seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> + seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> + return 0;
> +}
> +
> +static struct cftype memory_files[] = {
> + {
> + .name = "current",
> + .read_u64 = memory_current_read,
> + },
> + {
> + .name = "low",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_low_show,
> + .write = memory_low_write,
> + },
> + {
> + .name = "high",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_high_show,
> + .write = memory_high_write,
> + },
> + {
> + .name = "max",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_max_show,
> + .write = memory_max_write,
> + },
> + {
> + .name = "events",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_events_show,
> + },
> + { } /* terminate */
> +};
> +
> struct cgroup_subsys memory_cgrp_subsys = {
> .css_alloc = mem_cgroup_css_alloc,
> .css_online = mem_cgroup_css_online,
> @@ -5306,7 +5462,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> .bind = mem_cgroup_bind,
> - .legacy_cftypes = mem_cgroup_files,
> + .dfl_cftypes = memory_files,
> + .legacy_cftypes = mem_cgroup_legacy_files,
> .early_init = 0,
> };
>
> @@ -5341,6 +5498,56 @@ static void __init enable_swap_cgroup(void)
> }
> #endif
>
> +/**
> + * mem_cgroup_events - count memory events against a cgroup
> + * @memcg: the memory cgroup
> + * @idx: the event index
> + * @nr: the number of events to account for
> + */
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> + enum mem_cgroup_events_index idx,
> + unsigned int nr)
> +{
> + this_cpu_add(memcg->stat->events[idx], nr);
> +}
> +
> +/**
> + * mem_cgroup_low - check if memory consumption is below the normal range
> + * @root: the highest ancestor to consider
> + * @memcg: the memory cgroup to check
> + *
> + * Returns %true if memory consumption of @memcg, and that of all
> + * configurable ancestors up to @root, is below the normal range.
> + */
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
> +{
> + if (mem_cgroup_disabled())
> + return false;
> +
> + /*
> + * The toplevel group doesn't have a configurable range, so
> + * it's never low when looked at directly, and it is not
> + * considered an ancestor when assessing the hierarchy.
> + */
> +
> + if (memcg == root_mem_cgroup)
> + return false;
> +
> + if (page_counter_read(&memcg->memory) > memcg->low)
> + return false;
> +
> + while (memcg != root) {
> + memcg = parent_mem_cgroup(memcg);
> +
> + if (memcg == root_mem_cgroup)
> + break;
> +
> + if (page_counter_read(&memcg->memory) > memcg->low)
> + return false;
> + }
> + return true;
> +}
> +
> #ifdef CONFIG_MEMCG_SWAP
> /**
> * mem_cgroup_swapout - transfer a memsw charge to swap
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b89097185f46..f62ec654d4c5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -91,6 +91,9 @@ struct scan_control {
> /* Can pages be swapped as part of reclaim? */
> unsigned int may_swap:1;
>
> + /* Can cgroups be reclaimed below their normal consumption range? */
> + unsigned int may_thrash:1;
> +
> unsigned int hibernation_mode:1;
>
> /* One of the zones is ready for compaction */
> @@ -2333,6 +2336,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> struct lruvec *lruvec;
> int swappiness;
>
> + if (mem_cgroup_low(root, memcg)) {
> + if (!sc->may_thrash)
> + continue;
> + mem_cgroup_events(memcg, MEMCG_LOW, 1);
> + }
> +
> lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> swappiness = mem_cgroup_swappiness(memcg);
> scanned = sc->nr_scanned;
> @@ -2360,8 +2369,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> mem_cgroup_iter_break(root, memcg);
> break;
> }
> - memcg = mem_cgroup_iter(root, memcg, &reclaim);
> - } while (memcg);
> + } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>
> /*
> * Shrink the slab caches in the same proportion that
> @@ -2559,10 +2567,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> struct scan_control *sc)
> {
> + int initial_priority = sc->priority;
> unsigned long total_scanned = 0;
> unsigned long writeback_threshold;
> bool zones_reclaimable;
> -
> +retry:
> delayacct_freepages_start();
>
> if (global_reclaim(sc))
> @@ -2612,6 +2621,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> if (sc->compaction_ready)
> return 1;
>
> + /* Untapped cgroup reserves? Don't OOM, retry. */
> + if (!sc->may_thrash) {
> + sc->priority = initial_priority;
> + sc->may_thrash = 1;
> + goto retry;
> + }
> +
> /* Any of the zones still reclaimable? Don't OOM. */
> if (zones_reclaimable)
> return 1;
> --
> 2.2.0
>
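For anyone trying to follow the hierarchical semantics of
mem_cgroup_low() above, the walk can be modelled in userspace as below.
This is a simplified sketch, not the kernel code: struct group and
group_is_low() are invented for illustration, usage is a plain number
instead of a page counter, and a NULL parent stands in for
root_mem_cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup. */
struct group {
	struct group *parent;	/* NULL for the toplevel group */
	unsigned long usage;	/* pages charged to the group and its children */
	unsigned long low;	/* reserve in pages, 0 means none */
};

/*
 * Returns true if @g and all of its ancestors up to @root are at or
 * below their low boundaries.  The toplevel group has no configurable
 * range, so it is never low itself and is skipped as an ancestor.
 */
static bool group_is_low(const struct group *root, const struct group *g)
{
	assert(root && g);

	if (!g->parent)
		return false;

	if (g->usage > g->low)
		return false;

	while (g != root) {
		g = g->parent;

		if (!g->parent)
			break;

		if (g->usage > g->low)
			return false;
	}
	return true;
}
```

With a tree top -> a -> b, b is only protected while both a and b stay
within their reserves. Once a grows beyond its own low boundary, b
loses protection no matter how small b itself is, which is what makes
delegating a subtree safe.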
--
Michal Hocko
SUSE Labs