Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

From: Eric W. Biederman
Date: Wed Nov 15 2017 - 11:30:08 EST


Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:

> On 15.11.2017 09:25, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>
>>> Currently a mutex is used to protect the pernet operations list. It makes
>>> cleanup_net() execute the ->exit methods of the same set of operations
>>> that was used at the time of ->init, even after the net namespace is
>>> unlinked from net_namespace_list.
>>>
>>> But the problem is that we need to synchronize_rcu() after the net is removed
>>> from net_namespace_list():
>>>
>>> Destroy net_ns:
>>> cleanup_net()
>>>   mutex_lock(&net_mutex)
>>>   list_del_rcu(&net->list)
>>>   synchronize_rcu()                          <--- Sleep there for ages
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_exit_list(ops, &net_exit_list)
>>>   list_for_each_entry_reverse(ops, &pernet_list, list)
>>>     ops_free_list(ops, &net_exit_list)
>>>   mutex_unlock(&net_mutex)
>>>
>>> This primitive is not fast, especially on systems with many processors
>>> and/or when preemptible RCU is enabled in the config. So, for the whole time
>>> cleanup_net() is waiting for the RCU grace period, creation of new net
>>> namespaces is not possible; the tasks doing it sleep on the same mutex:
>>>
>>> Create net_ns:
>>> copy_net_ns()
>>>   mutex_lock_killable(&net_mutex)            <--- Sleep there for ages
>>>
>>> The solution is to convert net_mutex into an rw_semaphore. Then
>>> pernet_operations::init/::exit methods, which modify net-related data,
>>> will require only down_read() locking, while down_write() will be used
>>> for changing pernet_list.
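
To illustrate the split being described, here is a rough sketch (illustrative only, not the actual diff; the netns_ops_* helpers and example_pernet_list are invented for illustration):

/* Rough sketch of the proposed locking split; only the rwsem usage matters. */
#include <linux/rwsem.h>
#include <linux/list.h>
#include <net/net_namespace.h>

static DECLARE_RWSEM(net_sem);		/* stands in for net_mutex */
static LIST_HEAD(example_pernet_list);

/* Creation/teardown paths (copy_net_ns, cleanup_net): shared lock, so
 * several namespaces can run their ->init/->exit methods at once. */
static int netns_ops_init_all(struct net *net)
{
	struct pernet_operations *ops;
	int err = down_read_killable(&net_sem);

	if (err)
		return err;
	list_for_each_entry(ops, &example_pernet_list, list) {
		if (!ops->init)
			continue;
		err = ops->init(net);
		if (err)
			break;
	}
	up_read(&net_sem);
	return err;
}

/* (un)register_pernet_*: exclusive lock, as they modify the list itself. */
static void netns_ops_add(struct pernet_operations *ops)
{
	down_write(&net_sem);
	list_add_tail(&ops->list, &example_pernet_list);
	up_write(&net_sem);
}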
>>>
>>> This gives a significant performance increase, as you may see below. The
>>> measurement is of sequential net namespace creation in a loop, in a single
>>> thread, with no other tasks running (single user mode):
>>>
>>> 1) Sequential creation test program:
>>>
>>> #define _GNU_SOURCE
>>> #include <sched.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> 	unsigned nr;
>>>
>>> 	if (argc < 2) {
>>> 		fprintf(stderr, "Provide nr iterations arg\n");
>>> 		return 1;
>>> 	}
>>> 	nr = atoi(argv[1]);
>>> 	while (nr-- > 0) {
>>> 		if (unshare(CLONE_NEWNET)) {
>>> 			perror("Can't unshare");
>>> 			return 1;
>>> 		}
>>> 	}
>>> 	return 0;
>>> }
>>>
>>> Origin, 100000 unshare():
>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>
>>> Patched, 100000 unshare():
>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>
>>> 2) for i in {1..10000}; do unshare -n bash -c exit; done
>>>
>>> Origin:
>>> real 1m24,190s
>>> user 0m6,225s
>>> sys 0m15,132s
>>>
>>> Patched:
>>> real 0m18,235s (4.6 times faster)
>>> user 0m4,544s
>>> sys 0m13,796s
>>>
>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>> from Linus tree (not in net-next yet).
>>
>> Using a rwsem to protect the list of operations makes sense.
>>
>> That should allow removing the single threaded workqueue as well.
>>
>> I am not wild about taking the rwsem down_write in
>> rtnl_link_unregister and net_ns_barrier. I think that works, but it
>> goes from being a mild hack to being a pretty bad hack, and is something
>> else that can kill the parallelism you are seeking to add.
>>
>> There are about 204 instances of struct pernet_operations. That is a
>> lot of code to have carefully audited to ensure it can all run in
>> parallel at once. The existence of the exit_batch method, net_ns_barrier,
>> for_each_net and the taking of net_mutex in rtnl_link_unregister all testify
>> to the fact that there are data structures accessed by multiple network
>> namespaces.
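
To illustrate why that needs care: a typical ->exit_batch walks every namespace being destroyed while holding state shared across all of them. A made-up example, where shared_table_lock and flush_entries_for_net() are purely hypothetical:

#include <linux/mutex.h>
#include <linux/list.h>
#include <net/net_namespace.h>

static DEFINE_MUTEX(shared_table_lock);			/* global, not per-netns */

static void flush_entries_for_net(struct net *net);	/* hypothetical helper */

/* One batch call covers every namespace on net_exit_list while holding
 * state that is shared by all network namespaces on the system. */
static void example_exit_batch(struct list_head *net_exit_list)
{
	struct net *net;

	mutex_lock(&shared_table_lock);
	list_for_each_entry(net, net_exit_list, exit_list)
		flush_entries_for_net(net);
	mutex_unlock(&shared_table_lock);
}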
>>
>> My preference would be to:
>>
>> - Add the net_sem in addition to net_mutex with down_write only held in
>> register and unregister, and maybe net_ns_barrier and
>> rtnl_link_unregister.
>>
>> - Factor struct pernet_ops out of struct pernet_operations, with
>> struct pernet_ops not having the exit_batch method, and with pernet_ops
>> being embedded as an anonymous member of the old struct pernet_operations.
>>
>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>> pernet_ops and don't take net_mutex. Have them order the
>> pernet_list as:
>>
>> pernet_sys
>> pernet_subsys
>> pernet_device
>> pernet_dev
>>
>> With the chunk in the middle taking the net_mutex.
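
To make that concrete, the split might look roughly like the sketch below (names are illustrative; the embedded member is shown named here, since a truly anonymous member of a tagged struct type needs compiler extensions):

/* Sketch of the proposed split; the field set mirrors the existing
 * struct pernet_operations, the names are illustrative. */
struct pernet_ops {
	struct list_head list;
	int (*init)(struct net *net);
	void (*exit)(struct net *net);
	unsigned int *id;
	size_t size;
};

struct pernet_operations {
	struct pernet_ops ops;	/* ideally anonymous, per the plan above */
	void (*exit_batch)(struct list_head *net_exit_list);
};

/* New entry points that never take net_mutex; the combined list order
 * would be: pernet_sys, pernet_subsys, pernet_device, pernet_dev. */
int register_pernet_sys(struct pernet_ops *ops);
void unregister_pernet_sys(struct pernet_ops *ops);
int register_pernet_dev(struct pernet_ops *ops);
void unregister_pernet_dev(struct pernet_ops *ops);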
>
> I think this approach will work. Thanks for the suggestion. Some more
> thoughts on the plan below.
>
> The only difficult thing there will be choosing the right order
> in which to move ops from pernet_subsys to pernet_sys and from pernet_device
> to pernet_dev, one by one.
>
> This is rather easy in the case of tristate drivers, as modules may be loaded
> at any time, and the only important order is the dependencies between them.
> So, it's possible to start from a module that has no dependencies,
> move it to pernet_sys, and then continue with the modules
> that have no more dependencies in pernet_subsys. For pernet_device
> it's vice versa.
>
> In the case of bool drivers, the situation may be worse, because
> the order is not so clear there. Initcalls of the same priority
> (for example, initcalls registered via core_initcall()) may require
> a certain order, driven by link order. I know one example from the
> device mapper code, which lives in drivers/md/Makefile.
> This problem is also solvable, even if such places do not contain
> comments about link order. We just need to respect the Makefile
> order when choosing a new candidate to move.
>
>> I think I would enforce the ordering with a failure to register
>> if a subsystem or a device tried to register out of order.
>>
>> - Disable use of the single threaded workqueue if nothing needs the
>> net_mutex.
>
> We may use per-cpu workqueues in the future. The idea of giving up the
> workqueue doesn't seem good to me, because asynchronous net destruction
> looks very useful.

Per-cpu workqueues are fine, and definitely what I am expecting. If we
are doing this, I want to get us off the single-threaded workqueue that
serializes all of the cleanup. That has huge potential for
simplifying things and reducing maintenance, if running everything in
parallel is actually safe.

I forget how the modern per-cpu workqueues work with respect to sleeps
and locking (I don't remember whether, when a piece of work sleeps for a
long time, we spin up another per-cpu workqueue thread, and thus avoid
priority inversion problems).

If locks between workqueues are not a problem we could start the
transition off of the single-threaded serializing workqueue sooner
rather than later.
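
For concreteness, the kind of change meant here is roughly the following sketch (example_net_ns_init() is made up, and whether WQ_UNBOUND with the default max_active is the right replacement would still need auditing):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *netns_wq;

static int __init example_net_ns_init(void)
{
	/* Current code (net/core/net_namespace.c):
	 *	netns_wq = create_singlethread_workqueue("netns");
	 * so every queued cleanup_net() runs serialized behind the
	 * previous one. */

	/* Sketch: an unbound workqueue lets queued cleanup work items
	 * run concurrently (0 selects the default max_active). */
	netns_wq = alloc_workqueue("netns", WQ_UNBOUND, 0);
	return netns_wq ? 0 : -ENOMEM;
}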

>> - Add a test mode that deliberately spawns threads on multiple
>> processors and deliberately creates multiple network namespaces
>> at the same time.
>>
>> - Add a test mode that deliberately spawns threads on multiple
>> processors and deliberately destroys multiple network namespaces
>> at the same time.
>>
>> - Convert the code to unlocked operation one pernet_operations at a
>> time, being careful with the loopback device, as its order in the list
>> strongly matters.
>>
>> - Finally remove the unnecessary code.
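
Something along the lines of the program below could serve as the creation-side stress test described in the list above (a sketch only; it has to run as root, and since each unshare(CLONE_NEWNET) also drops the reference to the thread's previous namespace, it exercises concurrent teardown as well):

/* Build: gcc -O2 -pthread netns_stress.c -o netns_stress   (run as root) */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ITERATIONS 1000

static void *worker(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;
	int i;

	/* Pin each worker to its own CPU so the namespaces really are
	 * created on multiple processors at the same time. */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (i = 0; i < ITERATIONS; i++) {
		if (unshare(CLONE_NEWNET)) {	/* fresh netns every pass */
			perror("unshare");
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	long i, cpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t *tids = calloc(cpus, sizeof(*tids));

	if (!tids)
		return 1;
	for (i = 0; i < cpus; i++)
		pthread_create(&tids[i], NULL, worker, (void *)i);
	for (i = 0; i < cpus; i++)
		pthread_join(tids[i], NULL);
	return 0;
}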
>>
>>
>> At the end of the day, because all of the operations for one network
>> namespace will run in parallel with all of the operations for another
>> network namespace, all of the sophistication that goes into batching the
>> cleanup of multiple network namespaces can be removed, as different
>> tasks (not sharing a lock) can wait in synchronize_rcu() at the same time
>> without slowing each other down.
>>
>> I think we can remove the batching but I am afraid that will lead us into
>> rtnl_lock contention.
>
> I've looked into this lock. It's used in many places for many reasons.
> It's a little strange that nobody has tried to break it up into several smaller locks.

It gets tricky. At this point getting net_mutex sorted out is enough to
start with. Mostly the rtnl_lock covers the slow path, which tends to keep it
from rising to the top of the priority list.

>> I am definitely not comfortable with changing the locking on so much
>> code without any explanation at all in the commit comments of why it is
>> safe, in all 204 instances, which combined make up most of the well
>> maintained and regularly used part of the networking stack.
>
> David,
>
> could you please check the plan above and say whether 1) it's OK with you,
> and 2) if so, whether you expect all the changes to be made in one big 200+ patch
> set or whether we may go sequentially (50 patches at a time, for example)?

I think I would start with something covering the core networking
pieces so the improvements can be seen.


Eric