Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

From: Eric W. Biederman
Date: Fri Nov 17 2017 - 13:57:56 EST

Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:

> On 15.11.2017 19:29, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>> On 15.11.2017 09:25, Eric W. Biederman wrote:
>>>> Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes:
>>>>> Curently mutex is used to protect pernet operations list. It makes
>>>>> cleanup_net() to execute ->exit methods of the same operations set,
>>>>> which was used on the time of ->init, even after net namespace is
>>>>> unlinked from net_namespace_list.
>>>>> But the problem is it's need to synchronize_rcu() after net is removed
>>>>> from net_namespace_list():
>>>>> Destroy net_ns:
>>>>> cleanup_net()
>>>>> mutex_lock(&net_mutex)
>>>>> list_del_rcu(&net->list)
>>>>> synchronize_rcu() <--- Sleep there for ages
>>>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>> ops_exit_list(ops, &net_exit_list)
>>>>> list_for_each_entry_reverse(ops, &pernet_list, list)
>>>>> ops_free_list(ops, &net_exit_list)
>>>>> mutex_unlock(&net_mutex)
>>>>> This primitive is not fast, especially on the systems with many processors
>>>>> and/or when preemptible RCU is enabled in config. So, all the time, while
>>>>> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
>>>>> is not possible, the tasks, who makes it, are sleeping on the same mutex:
>>>>> Create net_ns:
>>>>> copy_net_ns()
>>>>> mutex_lock_killable(&net_mutex) <--- Sleep there for ages
>>>>> The solution is to convert net_mutex to the rw_semaphore. Then,
>>>>> pernet_operations::init/::exit methods, modifying the net-related data,
>>>>> will require down_read() locking only, while down_write() will be used
>>>>> for changing pernet_list.
>>>>> This gives signify performance increase, like you may see below. There
>>>>> is measured sequential net namespace creation in a cycle, in single
>>>>> thread, without other tasks (single user mode):
>>>>> 1)int main(int argc, char *argv[])
>>>>> {
>>>>> unsigned nr;
>>>>> if (argc < 2) {
>>>>> fprintf(stderr, "Provide nr iterations arg\n");
>>>>> return 1;
>>>>> }
>>>>> nr = atoi(argv[1]);
>>>>> while (nr-- > 0) {
>>>>> if (unshare(CLONE_NEWNET)) {
>>>>> perror("Can't unshare");
>>>>> return 1;
>>>>> }
>>>>> }
>>>>> return 0;
>>>>> }
>>>>> Origin, 100000 unshare():
>>>>> 0.03user 23.14system 1:39.85elapsed 23%CPU
>>>>> Patched, 100000 unshare():
>>>>> 0.03user 67.49system 1:08.34elapsed 98%CPU
>>>>> 2)for i in {1..10000}; do unshare -n bash -c exit; done
>>>>> Origin:
>>>>> real 1m24,190s
>>>>> user 0m6,225s
>>>>> sys 0m15,132s
>>>>> Patched:
>>>>> real 0m18,235s (4.6 times faster)
>>>>> user 0m4,544s
>>>>> sys 0m13,796s
>>>>> This patch requires commit 76f8507f7a64 "locking/rwsem: Add down_read_killable()"
>>>>> from Linus tree (not in net-next yet).
>>>> Using a rwsem to protect the list of operations makes sense.
>>>> That should allow removing the sing
>>>> I am not wild about taking a the rwsem down_write in
>>>> rtnl_link_unregister, and net_ns_barrier. I think that works but it
>>>> goes from being a mild hack to being a pretty bad hack and something
>>>> else that can kill the parallelism you are seeking it add.
>>>> There are about 204 instances of struct pernet_operations. That is a
>>>> lot of code to have carefully audited to ensure it can in parallel all
>>>> at once. The existence of the exit_batch method, net_ns_barrier,
>>>> for_each_net and taking of net_mutex in rtnl_link_unregister all testify
>>>> to the fact that there are data structures accessed by multiple network
>>>> namespaces.
>>>> My preference would be to:
>>>> - Add the net_sem in addition to net_mutex with down_write only held in
>>>> register and unregister, and maybe net_ns_barrier and
>>>> rtnl_link_unregister.
>>>> - Factor out struct pernet_ops out of struct pernet_operations. With
>>>> struct pernet_ops not having the exit_batch method. With pernet_ops
>>>> being embedded an anonymous member of the old struct pernet_operations.
>>>> - Add [un]register_pernet_{sys,dev} functions that take a struct
>>>> pernet_ops, that don't take net_mutex. Have them order the
>>>> pernet_list as:
>>>> pernet_sys
>>>> pernet_subsys
>>>> pernet_device
>>>> pernet_dev
>>>> With the chunk in the middle taking the net_mutex.
>>> I think this approach will work. Thanks for the suggestion. Some more
>>> thoughts to the plan below.
>>> The only difficult thing there will be to choose the right order
>>> to move ops from pernet_subsys to pernet_sys and from pernet_device
>>> to pernet_dev one by one.
>>> This is rather easy in case of tristate drivers, as modules may be loaded
>>> at any time, and the only important order is dependences between them.
>>> So, it's possible to start from a module, who has no dependences,
>>> and move it to pernet_sys, and then continue with modules,
>>> who have no more dependences in pernet_subsys. For pernet_device
>>> it's vise versa.
>>> In case of bool drivers, the situation may be worse, because
>>> the order is not so clear there. The same priority initcalls
>>> (for example, initcalls, registered via core_initcall()) may require
>>> the certain order, driven by linking order. I know one example from
>>> device mapper code, which lives here: drivers/md/Makefile.
>>> This problem is also solvable, even if such places do not contain
>>> comments about linking order. It's just need to respect Makefile
>>> order, when choosing a new candidate to move.
>>>> I think I would enforce the ordering with a failure to register
>>>> if a subsystem or a device tried to register out of order.
>>>> - Disable use of the single threaded workqueue if nothing needs the
>>>> net_mutex.
>>> We may use per-cpu worqueues in the future. The idea to refuse using
>>> worqueue doesn't seem good for me, because asynchronous net destroying
>>> looks very useful.
>> per-cpu workqueues are fine, and definitely what I am expecting. If we
>> are doing this I want to get us off the single threaded workqueue that
>> serializes all of the cleanup. That has a huge potential for
>> simplifying things and reducing maintenance if running everything in
>> parallel is actually safe.
>> I forget how the modern per-cpu workqueues work with respect to sleeps
>> and locking (I don't remember if when piece of work sleeps for a long
>> time we spin up another thread per-cpu workqueue thread, and thus avoid
>> priority inversion problems).
>> If locks between workqueues are not a problem we could start the
>> transition off of the single-threaded serializing workqueue sooner
>> rather than later.
>>>> - Add a test mode that deliberartely spawns threads on multiple
>>>> processors and deliberately creates multiple network namespaces
>>>> at the same time.
>>>> - Add a test mode that deliberately spawns threads on multiple
>>>> processors and delibearate destrosy multiple network namespaces
>>>> at the same time.>
>>>> - Convert the code to unlocked operation one pernet_operations to at a
>>>> time. Being careful with the loopback device it's order in the list
>>>> strongly matters.
>>>> - Finally remove the unnecessary code.
>>>> At the end of the day because all of the operations for one network
>>>> namespace will run in parallel with all of the operations for another
>>>> network namespace all of the sophistication that goes into batching the
>>>> cleanup of multiple network namespaces can be removed. As different
>>>> tasks (not sharing a lock) can wait in syncrhonize_rcu at the same time
>>>> without slowing each other down.
>>>> I think we can remove the batching but I am afraid that will lead us into
>>>> rtnl_lock contention.
>>> I've looked into this lock. It used in many places for many reasons.
>>> It's a little strange, nobody tried to break it up in several small locks..
>> It gets tricky. At this point getting net_mutex is enough to start
>> with. Mostly the rtnl_lock covers the slow path which tends to keep it
>> from rising to the top of the priority list.
>>>> I am definitely not comfortable with changing the locking on so much
>>>> code without an explanation at all why it is safe in the commit comments
>>>> in all 204 instances. Which combined equal most of the well maintained
>>>> and regularly used part of the networking stack.
>>> David,
>>> could you please check the plan above and say whether 1)it's OK for you,
>>> and 2)if so, will you expect all the changes are made in one big 200+ patch set
>>> or we may go sequentially(50 patches a time, for example)?
>> I think I would start with the something covering the core networking
>> pieces so the improvements can be seen.
> Since the main problem is mutex held during synchronize_rcu(), the performance
> improvements may become seen only after it's completely removed.

In a distro kernel build sure. But I think with just a few conversions
some built in network devices the core networking stack and a few ipv4
pieces you should be able to have a minimal but usualbe configuration
that does not need the lock at all.

Further dropping the lock earlier means the code can be actively
exercised running in parallel in case something is missed in review.