Re: [PATCH -mm 1/4] memcg, slab: do not schedule cache destruction when last page goes away

From: Johannes Weiner
Date: Mon Apr 14 2014 - 22:16:35 EST


On Wed, Apr 09, 2014 at 07:02:30PM +0400, Vladimir Davydov wrote:
> After the memcg is offlined, we mark its kmem caches that cannot be
> deleted right now due to pending objects as dead by setting the
> memcg_cache_params::dead flag, so that memcg_release_pages will schedule
> cache destruction (memcg_cache_params::destroy) as soon as the last slab
> of the cache is freed (memcg_cache_params::nr_pages drops to zero).
>
> I guess the idea was to destroy the caches as soon as possible, i.e.
> immediately after freeing the last object. However, it just doesn't work
> that way, because kmem caches always preserve some pages for the sake of
> performance, so that nr_pages never gets to zero unless the cache is
> shrunk explicitly using kmem_cache_shrink. Of course, we could account
> the total number of objects on the cache or check if all the slabs
> allocated for the cache are empty on kmem_cache_free and schedule
> destruction if so, but that would be too costly.
>
> Thus we have a piece of code that works only when we explicitly call
> kmem_cache_shrink, but complicates the whole picture a lot. Moreover,
> it's racy in fact. For instance, kmem_cache_shrink may free the last
> slab and thus schedule cache destruction before it finishes checking
> that the cache is empty, which can lead to use-after-free.
>
> So I propose to remove this async cache destruction from
> memcg_release_pages, and check if the cache is empty explicitly after
> calling kmem_cache_shrink instead. This will simplify things a lot w/o
> introducing any functional changes.
>
> And regarding dead memcg caches (i.e. those that are left hanging around
> after memcg offline for they have objects), I suppose we should reap
> them either periodically or on vmpressure as Glauber suggested
> initially. I'm going to implement this later.

memcg_release_pages() can be called after cgroup destruction, and thus
it *must* ensure that the now-empty cache is destroyed - or we'll leak
it.

There is no excuse to downgrade to periodic reaping when we already
directly hook into the event that makes the cache empty. If slab
needs to hold on to the cache for slightly longer than the final
memcg_release_pages(), then it should grab a refcount to it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/