Re: [patch] memcg: skip scanning active lists based on individual size

From: Johannes Weiner
Date: Mon Sep 05 2011 - 14:25:33 EST


On Thu, Sep 01, 2011 at 03:31:48PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Sep 2011 08:15:40 +0200
> Johannes Weiner <jweiner@xxxxxxxxxx> wrote:
>
> > On Thu, Sep 01, 2011 at 09:09:31AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 31 Aug 2011 19:13:34 +0900
> > > Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Aug 31, 2011 at 6:08 PM, Johannes Weiner <jweiner@xxxxxxxxxx> wrote:
> > > > > Reclaim decides to skip scanning an active list when the corresponding
> > > > > inactive list is above a certain size in comparison, in order to leave
> > > > > the assumed working set alone while there are still enough reclaim
> > > > > candidates around.
> > > > >
> > > > > The memcg implementation of comparing those lists instead reports
> > > > > whether the whole memcg is low on the requested type of inactive
> > > > > pages, considering all nodes and zones.
> > > > >
> > > > > This can lead to an oversized active list not being scanned because of
> > > > > the state of the other lists in the memcg, as well as an active list
> > > > > being scanned while its corresponding inactive list has enough pages.
> > > > >
> > > > > Not only is this wrong, it's also a scalability hazard, because the
> > > > > global memory state over all nodes and zones has to be gathered for
> > > > > each memcg and zone scanned.
> > > > >
> > > > > Make these calculations purely based on the size of the two LRU lists
> > > > > that are actually affected by the outcome of the decision.
> > > > >
> > > > > Signed-off-by: Johannes Weiner <jweiner@xxxxxxxxxx>
> > > > > Cc: Rik van Riel <riel@xxxxxxxxxx>
> > > > > Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> > > > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> > > > > Cc: Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx>
> > > > > Cc: Balbir Singh <bsingharora@xxxxxxxxx>
> > > >
> > > > Reviewed-by: Minchan Kim <minchan.kim@xxxxxxxxx>
> > > >
> > > > I can't understand why memcg is designed to consider all nodes and zones.
> > > > Is it a mistake or on purpose?
> > >
> > > It's on purpose. memcg just takes care of the number of pages.
> >
> > This mechanism isn't about memcg at all, it's an aging decision at a
> > much lower level. Can you tell me how the old implementation is
> > supposed to work?
> >
> The old implementation was supposed to make vmscan see only the memcg and
> ignore zones. memcg doesn't take care of any zones, so it uses
> global numbers rather than per-zone ones.
>
> Assume a system with 2 nodes where the whole memcg's inactive/active ratio
> is unbalanced:
>
>            Node 0    Node 1
> Active      800M       30M
> Inactive    100M      200M
>
> If we judge 'unbalanced' based on zones, Node 1's active list will not rotate
> even if it's not accessed for a while.
> If we judge it based on the total stats, both Node 0 and Node 1
> will be rotated.

But why should we deactivate on Node 1? We have good reasons not to
on the global level, so why should memcgs silently behave differently?

I mostly don't understand it on a semantic level. vmscan needs to
know whether a certain inactive LRU list has enough reclaim candidates
to skip scanning its corresponding active list. The global state is
not useful to find out if a single inactive list has enough pages.
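
To make that concrete, here is a toy userspace sketch, not the actual
vmscan code; the function name, the ratio of 1, and the MB units are
purely illustrative. It applies a per-list check to the two-node
example quoted above, next to the memcg-global version of the same
check:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative only: skip scanning the active list while the
 * inactive list still holds enough reclaim candidates.  A target
 * active:inactive ratio of 1 is assumed for simplicity. */
static bool inactive_list_is_low(unsigned long inactive,
				 unsigned long active,
				 unsigned long inactive_ratio)
{
	return inactive * inactive_ratio < active;
}

int main(void)
{
	/* Per-list decisions, sizes in MB from the example above. */
	printf("Node 0: %s\n", inactive_list_is_low(100, 800, 1) ?
	       "deactivate" : "skip active list");
	printf("Node 1: %s\n", inactive_list_is_low(200, 30, 1) ?
	       "deactivate" : "skip active list");
	/* memcg-global decision: sum of both nodes, 300M vs 830M. */
	printf("global: %s\n", inactive_list_is_low(300, 830, 1) ?
	       "deactivate" : "skip active list");
	return 0;
}

The per-list check leaves Node 1's small active list alone because its
inactive list has plenty of candidates; the global check would
deactivate on both nodes.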

> Hmm, the old one doesn't work as I expected?
>
> But okay, as time goes by, I think Node 1's inactive list will shrink
> and then rotation will happen even with the zone-based check.

Yes, that's how the mechanism is intended to work: with a constant
influx of used-once pages, we don't want to touch the active list.
But when the workload changes and inactive pages get either activated
or all reclaimed, the ratio changes and eventually we fall back to
deactivating pages again.
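
As a toy illustration of those dynamics (arbitrary numbers, not
kernel code): while used-once pages keep flowing in, the inactive
list stays large and the active list is skipped; once the influx
stops and reclaim drains the inactive list, the decision flips:

#include <stdio.h>

int main(void)
{
	unsigned long active = 800, inactive = 900;
	int influx = 1;	/* used-once pages streaming in */

	for (int step = 0; step < 8; step++) {
		if (step == 3)
			influx = 0;	/* the workload changes */
		printf("step %d: inactive=%4lu active=%4lu -> %s\n",
		       step, inactive, active,
		       inactive < active ? "scan active list"
					 : "skip active list");
		/* reclaim takes pages off the inactive list... */
		inactive = inactive > 300 ? inactive - 300 : 0;
		/* ...and the influx of used-once pages refills it. */
		if (influx)
			inactive += 300;
	}
	return 0;
}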

That's reclaim behaviour that has been around for a while, and it
shouldn't make a difference whether your workload is running in
root_mem_cgroup or another memcg.

> > > But, hmm, this change may be good for softlimit and your work.
> >
> > Yes, I noticed those paths showing up in a profile with my patches.
> > Lots of memcgs on a multi-node machine will trigger it too. But it's
> > secondary, my primary reasoning was: this does not make sense at all.
>
> your words always sound too strong to me ;) please be soft.

Sorry, I'll try to be less harsh. Please don't take it personally :)

What I meant was that the computational overhead was not the primary
reason for this patch. A reduction there is very welcome, but the
point is that deciding whether to skip an active list based on the
size of the lists involved seems more correct than deciding based on
the overall state of the memcg, which only by accident shows the same
inactive/active proportion.

It's a correctness fix for existing code, not an optimization or
preparation for future changes.

> > > I'll ack when you add performance numbers in changelog.
> >
> > It's not exactly a performance optimization but I'll happily run some
> > workloads. Do you have suggestions what to test for? I.e. where
> > would you expect regressions?
> >
> Some comparison of the amount of swap-out before/after the change would be good.
>
> Hm. If I do...
> - set up an x86-64 NUMA box (fake NUMA is ok)
> - create a memcg with a 500M limit
> - run a kernel make with make -j 6 (or more)
>
> and see the time of make and the amount of swap-out.

4G RAM, 500M swap on SSD, numa=fake=16, 10 runs of make -j11 in a
500M memcg, standard deviation in parentheses:

          seconds          pswpin                pswpout
vanilla:  175.359 (0.106)  6906.900 (1779.135)    8913.200 (1917.369)
patched:  176.144 (0.243)  8581.500 (1833.432)   10872.400 (2124.104)
