Re: [PATCH v1 0/7] mm/memcontrol: recharge mlocked pages

From: Roman Gushchin
Date: Thu Sep 05 2019 - 19:11:49 EST


On Thu, Sep 05, 2019 at 10:38:16AM +0300, Konstantin Khlebnikov wrote:
> On 04/09/2019 17.37, Michal Hocko wrote:
> > On Wed 04-09-19 16:53:08, Konstantin Khlebnikov wrote:
> > > Currently mlock keeps pages in cgroups where they were accounted.
> > > This way one container could affect another if they share file cache.
> > > Typical case is writing (downloading) file in one container and then
> > > locking in another. After that first container cannot get rid of cache.
> > > Also removed cgroup stays pinned by these mlocked pages.
> > >
> > > This patchset implements recharging pages to cgroup of mlock user.
> > >
> > > There are three cases:
> > > * recharging at first mlock
> > > * recharging at munlock to any remaining mlock
> > > * recharging at 'culling' in reclaimer to any existing mlock
> > >
> > > To keep things simple recharging ignores memory limit. After that memory
> > > usage temporary could be higher than limit but cgroup will reclaim memory
> > > later or trigger oom, which is valid outcome when somebody mlock too much.
> >
> > I assume that this is mlock specific because the pagecache which has the
> > same problem is reclaimable and the problem tends to resolve itself
> > after some time.
> >
> > Anyway, how big of a problem this really is? A lingering memcg is
> > certainly not nice but pages are usually not mlocked for ever. Or is
> > this a way to protect from an hostile actor?
>
> We're using mlock mostly to avoid non-deterministic behaviour in cache.
> For example some of our applications mlock index structures in databases
> to limit count of major faults in worst case.
>
> Surprisingly mlock fixates unwanted effects of non-predictable cache sharing.
>
> So, it seems makes sense to make mlock behaviour simple and completely
> deterministic because this isn't cheap operation and needs careful
> resource planning.
>

Totally agree.

>
>
> On 05/09/2019 02.13, Roman Gushchin wrote:> On Wed, Sep 04, 2019 at 04:53:08PM +0300, Konstantin Khlebnikov wrote:
> >> Currently mlock keeps pages in cgroups where they were accounted.
> >> This way one container could affect another if they share file cache.
> >> Typical case is writing (downloading) file in one container and then
> >> locking in another. After that first container cannot get rid of cache.
> >
> > Yeah, it's a valid problem, and it's not about mlocked pages only,
> > the same thing is true for generic pagecache. The only difference is that
> > in theory memory pressure should fix everything. But in reality
> > pagecache used by the second container can be very hot, so the first
> > once can't really get rid of it.
> > In other words, there is no way to pass a pagecache page between cgroups
> > without evicting it and re-reading from a storage, which is sub-optimal
> > in many cases.
> >
> > We thought about new madvise(), which will uncharge pagecache but set
> > a new page flag, which will mean something like "whoever first starts using
> > the page, should be charged for it". But it never materialized in a patchset.
>
> I've implemented something similar in OpenVZ kernel - "shadow" LRU sets for
> abandoned cache which automatically changes ownership at first activation.
>
> I'm thinking about fadvise() or fcntl() for moving cache into current memory cgroup.
> This should give enough control to solve all our problems.

Idk, it feels a bit fragile: because only one cgroup can own a page, there
should be a strong coordination, otherwise cgroups may just spend non-trivial
amount of cpu time stealing pages back and forth.

I'm not strictly against such fadvise() though, it can be useful in certain
cases. But in general the abandoning semantics makes more sense to me.
If a cgroup doesn't plan to use the page anymore, it marks it in a special way,
so that the next one who want to use it pays the whole price. So it works
exactly as if the page had been evicted, but more efficiently.

Thanks!