Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists

From: Vladimir Davydov
Date: Thu Dec 12 2013 - 15:24:34 EST


On 12/12/2013 01:50 PM, Vladimir Davydov wrote:
>>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>>> +{
>>>>> + int err = 0;
>>>>> + int i;
>>>>> + struct mem_cgroup *memcg;
>>>>> +
>>>>> + lru->memcg = NULL;
>>>>> + lru->memcg_old = NULL;
>>>>> +
>>>>> + mutex_lock(&memcg_create_mutex);
>>>>> + if (!memcg_kmem_enabled())
>>>>> + goto out_list_add;
>>>>> +
>>>>> + lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>>> + sizeof(*lru->memcg), GFP_KERNEL);
>>>>> + if (!lru->memcg) {
>>>>> + err = -ENOMEM;
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + for_each_mem_cgroup(memcg) {
>>>>> + int memcg_id;
>>>>> +
>>>>> + memcg_id = memcg_cache_id(memcg);
>>>>> + if (memcg_id < 0)
>>>>> + continue;
>>>>> +
>>>>> + err = list_lru_memcg_alloc(lru, memcg_id);
>>>>> + if (err) {
>>>>> + mem_cgroup_iter_break(NULL, memcg);
>>>>> + goto out_free_lru_memcg;
>>>>> + }
>>>>> + }
>>>>> +out_list_add:
>>>>> + list_add(&lru->list, &all_memcg_lrus);
>>>>> +out:
>>>>> + mutex_unlock(&memcg_create_mutex);
>>>>> + return err;
>>>>> +
>>>>> +out_free_lru_memcg:
>>>>> + for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>>> + list_lru_memcg_free(lru, i);
>>>>> + kfree(lru->memcg);
>>>>> + goto out;
>>>>> +}
>>>> That will probably scale even worse. Think about what happens when we
>>>> try to mount a bunch of filesystems in parallel - they will now
>>>> serialise completely on this memcg_create_mutex instantiating memcg
>>>> lists inside list_lru_init().
>>> Yes, the scalability seems to be the main problem here. I have a couple
>>> of thoughts on how it could be improved. Here they go:
>>>
>>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>>> rw-semaphore, which we would take for modifying list_lru's) and take it
>>> for reading in memcg_list_lru_init() and for writing when we create a
>>> new memcg (memcg_init_all_lrus()).
>>> This would remove the bottleneck from the mount path, but every memcg
>>> creation would still iterate over all LRUs under a memcg mutex. So I
>>> guess it is not an option, isn't it?
>> Right - it's not so much that there is a mutex to protect the init,
>> it's how long it's held that will be the issue. I mean, we don't
>> need to hold the memcg_create_mutex until we've completely
>> initialised the lru structure and are ready to add it to the
>> all_memcg_lrus list, right?
>>
>> i.e. restructing it so that you don't need to hold the mutex until
>> you make the LRU list globally visible would solve the problem just
>> as well. if we can iterate the memcgs lists without holding a lock,
>> then we can init the per-memcg lru lists without holding a lock
>> because nobody will access them through the list_lru structure
>> because it's not yet been published.
>>
>> That keeps the locking simple, and we get scalability because we've
>> reduced the lock's scope to just a few instructures instead of a
>> memcg iteration and a heap of memory allocation....
> Unfortunately that's not that easy as it seems to be :-(
>
> Currently I hold the memcg_create_mutex while initializing per-memcg
> LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
> memcg that is added during initialization.
>
> I mean, let's try to move per-memcg LRUs allocation outside the lock and
> only register the LRU there:
>
> memcg_list_lru_init():
> 1) allocate lru->memcg array
> 2) for_each_kmem_active_memcg(m)
> allocate lru->memcg[m]
> 3) lock memcg_create_mutex
> add lru to all_memcg_lrus_list
> unlock memcg_create_mutex
>
> Then if a new kmem-active memcg is added after step 2 and before step 3,
> it won't see the new lru, because it has not been registered yet, and
> thus won't initialize its list in this lru. As a result, we will end up
> with a partially initialized list_lru. Note that this will happen even
> if the whole memcg initialization proceeds under the memcg_create_mutex.
>
> Provided we could freeze memcg_limited_groups_array_size, it would be
> possible to fix this problem by swapping steps 2 and 3 and making step 2
> initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
> However we still have to hold the memcg_create_mutex during the whole
> kmemcg activation path (memcg_init_all_lrus()).
>
> Let's see if we can get rid of the lock in memcg_init_all_lrus() by
> making the all_memcg_lrus RCU-protected so that we could iterate over
> all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
> something like this:
>
> memcg_init_all_lrus():
> 1) for_each_list_lru_rcu(lru)
> allocate lru->memcg[new_memcg_id]
> 2) mark new_memcg as kmem-active
>
> The problem is that if memcg_list_lru_init(new_lru) starts and completes
> between steps 1 and 2, we will not initialize
> new_lru->memcg[new_memcg_id] neither in memcg_init_all_lrus() nor in
> memcg_list_lru_init().
>
> The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
> have to iterate over all list_lrus while on list_lru creation
> (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
> can't figure out how we can do it w/o holding any mutexes at least while
> calling one of these functions, but I'm keep thinking on it.
>

Seems I got it. We could add a memcg state bit, say "activating",
meaning that a memcg is going to become kmem-active, but it is not yet,
so it should not be accounted to; keep all list_lrus in a rcu-protected
list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as
follows:

memcg_init_all_lrus():
set activating
for_each_list_lru_rcu(lru)
cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node);
set active

memcg_list_lru_init():
add the new_lru to the all_memcg_lrus list
for_each_memcg(memcg):
if memcg is activating or active:
cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node)

At first glance, it looks correct, because:

If we skip an lru while iterating over all_memcg_lrus in
memcg_init_all_lrus(), it means it was created after the "activating"
bit had been set and thus the per-memcg lru will be initialized in
memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this
memg will "see" this lru while iterating over the all_memcg_lrus list,
because the lru must have been created before the "activating" bit set.

Although it is to be elaborated, because I haven't examined the
destruction paths yet, and this doesn't take into account the fact that
memcg_limited_groups_array_size is not a constant (I guess I'll have to
introduce an rw semaphore for it, but it'll be OK, because its updates
are very-very rare), I guess I am on the right way.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/