Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

From: Andrei Vagin
Date: Tue Nov 14 2017 - 13:38:56 EST


On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
> On 14.11.2017 20:44, Andrei Vagin wrote:
> > On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> >> Curently mutex is used to protect pernet operations list. It makes
> >> cleanup_net() to execute ->exit methods of the same operations set,
> >> which was used on the time of ->init, even after net namespace is
> >> unlinked from net_namespace_list.
> >>
> >> But the problem is it's need to synchronize_rcu() after net is removed
> >> from net_namespace_list():
> >>
> >> Destroy net_ns:
> >> cleanup_net()
> >> mutex_lock(&net_mutex)
> >> list_del_rcu(&net->list)
> >> synchronize_rcu() <--- Sleep there for ages
> >> list_for_each_entry_reverse(ops, &pernet_list, list)
> >> ops_exit_list(ops, &net_exit_list)
> >> list_for_each_entry_reverse(ops, &pernet_list, list)
> >> ops_free_list(ops, &net_exit_list)
> >> mutex_unlock(&net_mutex)
> >>
> >> This primitive is not fast, especially on the systems with many processors
> >> and/or when preemptible RCU is enabled in config. So, all the time, while
> >> cleanup_net() is waiting for RCU grace period, creation of new net namespaces
> >> is not possible, the tasks, who makes it, are sleeping on the same mutex:
> >>
> >> Create net_ns:
> >> copy_net_ns()
> >> mutex_lock_killable(&net_mutex) <--- Sleep there for ages
> >>
> >> The solution is to convert net_mutex to the rw_semaphore. Then,
> >> pernet_operations::init/::exit methods, modifying the net-related data,
> >> will require down_read() locking only, while down_write() will be used
> >> for changing pernet_list.
> >>
> >> This gives signify performance increase, like you may see below. There
> >> is measured sequential net namespace creation in a cycle, in single
> >> thread, without other tasks (single user mode):
> >>
> >> 1)int main(int argc, char *argv[])
> >> {
> >> unsigned nr;
> >> if (argc < 2) {
> >> fprintf(stderr, "Provide nr iterations arg\n");
> >> return 1;
> >> }
> >> nr = atoi(argv[1]);
> >> while (nr-- > 0) {
> >> if (unshare(CLONE_NEWNET)) {
> >> perror("Can't unshare");
> >> return 1;
> >> }
> >> }
> >> return 0;
> >> }
> >>
> >> Origin, 100000 unshare():
> >> 0.03user 23.14system 1:39.85elapsed 23%CPU
> >>
> >> Patched, 100000 unshare():
> >> 0.03user 67.49system 1:08.34elapsed 98%CPU
> >>
> >> 2)for i in {1..10000}; do unshare -n bash -c exit; done
> >
> > Hi Kirill,
> >
> > This mutex has another role. You know that net namespaces are destroyed
> > asynchronously, and the net mutex gurantees that a backlog will be not
> > big. If we have something in backlog, we know that it will be handled
> > before creating a new net ns.
> >
> > As far as I remember net namespaces are created much faster than
> > they are destroyed, so with this changes we can create a really big
> > backlog, can't we?
>
> I don't think limitation is a good goal or a gool for the mutex,
> because it's very easy to create many net namespaces in case of
> the mutex exists. You may open /proc/[pid]/ns/net like a file,
> and net_ns counter will increment. Then, do unshare(), and
> the mutex has no a way to protect against that.

You are right, but with the mutex a user can not support a big backlog
for a long time, it is shrunk to zero periodically. With these changes
he can support a big backlog for a long time.

A big backlog affects other users. If someone creates namespaces, he
probably expects that they will be destroyed for a reasonable time.

But currently someone else can increase a destroy time to a really big
values. This problem was before your patches, but they may do this
problem worse. The question here is: Should we think about this problem
in the context of these patches?


> Anyway, mutex
> can't limit a number of something in general, I've never seen
> a (good) example in kernel.

I'm agree with you here.


>
> As I see, the real limitation happen in inc_net_namespaces(),
> which is decremented after RCU grace period in cleanup_net(),
> and it has not changed.

ucount limits are to big to handle this problem.


>
> > There was a discussion a few month ago:
> > https://lists.onap.org/pipermail/containers/2016-October/037509.html
> >
> >
> >>
> >> Origin:
> >> real 1m24,190s
> >> user 0m6,225s
> >> sys 0m15,132s
> >
> > Here you measure time of creating and destroying net namespaces.
> >
> >>
> >> Patched:
> >> real 0m18,235s (4.6 times faster)
> >> user 0m4,544s
> >> sys 0m13,796s
> >
> > But here you measure time of crearing namespaces and you know nothing
> > when they will be destroyed.
>
> You're right, and I predict, the sum time, spent on cpu, will remain the same,
> but the think is that now creation and destroying may be executed in parallel.