Re: [patch] mm: memcontrol: support transparent huge pages under pressure

From: Greg Thelen
Date: Tue Sep 23 2014 - 20:08:44 EST



On Tue, Sep 23 2014, Johannes Weiner wrote:

> On Mon, Sep 22, 2014 at 10:52:50PM -0700, Greg Thelen wrote:
>>
>> On Fri, Sep 19 2014, Johannes Weiner wrote:
>>
>> > In a memcg with even just moderate cache pressure, success rates for
>> > transparent huge page allocations drop to zero, wasting a lot of
>> > effort that the allocator puts into assembling these pages.
>> >
>> > The reason for this is that the memcg reclaim code was never designed
>> > for higher-order charges. It reclaims in small batches until there is
>> > room for at least one page. Huge pages charges only succeed when
>> > these batches add up over a series of huge faults, which is unlikely
>> > under any significant load involving order-0 allocations in the group.
>> >
>> > Remove that loop on the memcg side in favor of passing the actual
>> > reclaim goal to direct reclaim, which is already set up and optimized
>> > to meet higher-order goals efficiently.
>> >
>> > This brings memcg's THP policy in line with the system policy: if the
>> > allocator painstakingly assembles a hugepage, memcg will at least make
>> > an honest effort to charge it. As a result, transparent hugepage
>> > allocation rates amid cache activity are drastically improved:
>> >
>> > vanilla patched
>> > pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
>> > pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
>> > pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
>> > thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
>> > thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)
>> >
>> > [ Note: this may in turn increase memory consumption from internal
>> > fragmentation, which is an inherent risk of transparent hugepages.
>> > Some setups may have to adjust the memcg limits accordingly to
>> > accomodate this - or, if the machine is already packed to capacity,
>> > disable the transparent huge page feature. ]
>>
>> We're using an earlier version of this patch, so I approve of the
>> general direction. But I have some feedback.
>>
>> The memsw aspect of this change seems somewhat separate. Can it be
>> split into a different patch?
>>
>> The memsw aspect of this patch seems to change behavior. Is this
>> intended? If so, a mention of it in the commit log would assuage the
>> reader. I'll explain... Assume a machine with swap enabled and
>> res.limit==memsw.limit, thus memsw_is_minimum is true. My understanding
>> is that memsw.usage represents sum(ram_usage, swap_usage). So when
>> memsw_is_minimum=true, then both swap_usage=0 and
>> memsw.usage==res.usage. In this condition, if res usage is at limit
>> then there's no point in swapping because memsw.usage is already
>> maximal. Prior to this patch I think the kernel did the right thing,
>> but not afterwards.
>>
>> Before this patch:
>> if res.usage == res.limit, try_charge() indirectly calls
>> try_to_free_mem_cgroup_pages(noswap=true)
>>
>> After this patch:
>> if res.usage == res.limit, try_charge() calls
>> try_to_free_mem_cgroup_pages(may_swap=true)
>>
>> Notice the inverted swap-is-allowed value.
>
> For some reason I had myself convinced that this is dead code due to a
> change in callsites a long time ago, but you are right that currently
> try_charge() relies on it, thanks for pointing it out.
>
> However, memsw is always equal to or bigger than the memory limit - so
> instead of keeping a separate state variable to track when memory
> failure implies memsw failure, couldn't we just charge memsw first?
>
> How about the following? But yeah, I'd split this into a separate
> patch now.

Looks good to me. Thanks.

Acked-by: Greg Thelen <gthelen@xxxxxxxxxx>

> ---
> mm/memcontrol.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2def11f1ec1..7c9a8971d0f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2497,16 +2497,17 @@ retry:
> goto done;
>
> size = batch * PAGE_SIZE;
> - if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> - if (!do_swap_account)
> + if (!do_swap_account ||
> + !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> + if (!res_counter_charge(&memcg->res, size, &fail_res))
> goto done_restock;
> - if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> - goto done_restock;
> - res_counter_uncharge(&memcg->res, size);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, size);
> + mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + } else {
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> may_swap = false;
> - } else
> - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + }
>
> if (batch > nr_pages) {
> batch = nr_pages;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/