Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

From: Kirill Tkhai
Date: Fri Nov 17 2017 - 11:47:33 EST

On 15.11.2017 19:29, Eric W. Biederman wrote:
> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>>> Curently mutex is used to protect pernet operations list. It makes
>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>> which was used on the time of ->init, even after net namespace is
>>>> unlinked from net_namespace_list.
>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>> from net_namespace_list():
>>>> Destroy net_ns:
>>>> cleanup_net()
>>>> mutex_lock(&net_mutex)
>>>> list_del_rcu(&net->list)
>>>> synchronize_rcu() <--- Sleep there for ages
>>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>>> ops_exit_list(ops, &net_exit_list)
>>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>>> ops_free_list(ops, &net_exit_list)
>>>> mutex_unlock(&net_mutex)
>>>> This primitive is not fast, especially on the systems with many processors
>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>> Create net_ns:
>>>> copy_net_ns()
>>>> mutex_lock_killable(&net_mutex) <--- Sleep there for ages
>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>> will require down_read() locking only, while down_write() will be used
>>>> for changing pernet_list.
>>>> This gives signify performance increase, like you may see below. There
>>>> is measured sequential net namespace creation in a cycle, in single
>>>> thread, without other tasks (single user mode):
>>>> 1)int main(int argc, char *argv[])
>>>> {
>>>> unsigned nr;
>>>> if (argc < 2) {
>>>> fprintf(stderr, "Provide nr iterations arg\n");
>>>> return 1;
>>>> }
>>>> nr = atoi(argv[1]);
>>>> while (nr-- > 0) {
>>>> if (unshare(CLONE_NEWNET)) {
>>>> perror("Can't unshare");
>>>> return 1;
>>>> }
>>>> }
>>>> return 0;
>>>> }
>>>> Origin, 100000 unshare():
>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>> Patched, 100000 unshare():
>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>> Origin:
>>>> real 1m24,190s
>>>> user 0m6,225s
>>>> sys 0m15,132s
>>>> Patched:
>>>> real 0m18,235s (4.6 times faster)
>>>> user 0m4,544s
>>>> sys 0m13,796s
>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>>> from Linus tree (not in net-next yet).
>>> Using a rwsem to protect the list of operations makes sense.
>>> That should allow removing the sing
>>> I am not wild about taking a the rwsem down_write in
>>> rtnl_link_unregister, and net_ns_barrier. I think that works but it
>>> goes from being a mild hack to being a pretty bad hack and something
>>> else that can kill the parallelism you are seeking it add.
>>> There are about 204 instances of struct pernet_operations. That is a
>>> lot of code to have carefully audited to ensure it can in parallel all
>>> at once. The existence of the exit_batch method, net_ns_barrier,
>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>> to the fact that there are data structures accessed by multiple network
>>> namespaces.
>>> My preference would be to:
>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>> register and unregister, and maybe net_ns_barrier and
>>> rtnl_link_unregister.
>>> - Factor out struct pernet_ops out of struct pernet_operations. With
>>> struct pernet_ops not having the exit_batch method. With pernet_ops
>>> being embedded an anonymous member of the old struct pernet_operations.
>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>> pernet_ops, that don't take net_mutex. Have them order the
>>> pernet_list as:
>>> pernet_sys
>>> pernet_subsys
>>> pernet_device
>>> pernet_dev
>>> With the chunk in the middle taking the net_mutex.
>> I think this approach will work. Thanks for the suggestion. Some more
>> thoughts to the plan below.
>> The only difficult thing there will be to choose the right order
>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>> to pernet_dev one by one.
>> This is rather easy in case of tristate drivers, as modules may be loaded
>> at any time, and the only important order is dependences between them.
>> So, it's possible to start from a module, who has no dependences,
>> and move it to pernet_sys, and then continue with modules,
>> who have no more dependences in pernet_subsys. For pernet_device
>> it's vise versa.
>> In case of bool drivers, the situation may be worse, because
>> the order is not so clear there. The same priority initcalls
>> (for example, initcalls, registered via core_initcall()) may require
>> the certain order, driven by linking order. I know one example from
>> device mapper code, which lives here: drivers/md/Makefile.
>> This problem is also solvable, even if such places do not contain
>> comments about linking order. It's just need to respect Makefile
>> order, when choosing a new candidate to move.
>>> I think I would enforce the ordering with a failure to register
>>> if a subsystem or a device tried to register out of order.
>>> - Disable use of the single threaded workqueue if nothing needs the
>>> net_mutex.
>> We may use per-cpu worqueues in the future. The idea to refuse using
>> worqueue doesn't seem good for me, because asynchronous net destroying
>> looks very useful.
> per-cpu workqueues are fine, and definitely what I am expecting. If we
> are doing this I want to get us off the single threaded workqueue that
> serializes all of the cleanup. That has a huge potential for
> simplifying things and reducing maintenance if running everything in
> parallel is actually safe.
> I forget how the modern per-cpu workqueues work with respect to sleeps
> and locking (I don't remember if when piece of work sleeps for a long
> time we spin up another thread per-cpu workqueue thread, and thus avoid
> priority inversion problems).
> If locks between workqueues are not a problem we could start the
> transition off of the single-threaded serializing workqueue sooner
> rather than later.
>>> - Add a test mode that deliberartely spawns threads on multiple
>>> processors and deliberately creates multiple network namespaces
>>> at the same time.
>>> - Add a test mode that deliberately spawns threads on multiple
>>> processors and delibearate destrosy multiple network namespaces
>>> at the same time.>
>>> - Convert the code to unlocked operation one pernet_operations to at a
>>> time. Being careful with the loopback device it's order in the list
>>> strongly matters.
>>> - Finally remove the unnecessary code.
>>> At the end of the day because all of the operations for one network
>>> namespace will run in parallel with all of the operations for another
>>> network namespace all of the sophistication that goes into batching the
>>> cleanup of multiple network namespaces can be removed. As different
>>> tasks (not sharing a lock) can wait in syncrhonize_rcu at the same time
>>> without slowing each other down.
>>> I think we can remove the batching but I am afraid that will lead us into
>>> rtnl_lock contention.
>> I've looked into this lock. It used in many places for many reasons.
>> It's a little strange, nobody tried to break it up in several small locks..
> It gets tricky. At this point getting net_mutex is enough to start
> with. Mostly the rtnl_lock covers the slow path which tends to keep it
> from rising to the top of the priority list.
>>> I am definitely not comfortable with changing the locking on so much
>>> code without an explanation at all why it is safe in the commit comments
>>> in all 204 instances. Which combined equal most of the well maintained
>>> and regularly used part of the networking stack.
>> David,
>> could you please check the plan above and say whether 1)it's OK for you,
>> and 2)if so, will you expect all the changes are made in one big 200+ patch set
>> or we may go sequentially(50 patches a time, for example)?
> I think I would start with the something covering the core networking
> pieces so the improvements can be seen.

Since the main problem is mutex held during synchronize_rcu(), the performance
improvements may become seen only after it's completely removed.