Re: [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure

From: Eric W. Biederman
Date: Sat May 05 2012 - 07:33:38 EST


Mike Galbraith <efault@xxxxxx> writes:

> On Sat, 2012-05-05 at 08:08 +0200, Mike Galbraith wrote:
>
>> egrep 'synchronize|rcu_barrier' /trace
>>
>> vsftpd-7981 [003] .... 577.164997: synchronize_sched <-switch_task_namespaces
>> vsftpd-7981 [003] .... 577.164998: _cond_resched <-synchronize_sched
>> vsftpd-7981 [003] .... 577.164998: wait_rcu_gp <-synchronize_sched
>> vsftpd-7982 [003] .... 577.166583: synchronize_sched <-switch_task_namespaces
>> vsftpd-7982 [003] .... 577.166583: _cond_resched <-synchronize_sched
>
>> vsftpd-7977 [003] .... 577.171519: rcu_barrier_sched <-rcu_barrier
>> vsftpd-7977 [003] .... 577.171519: _rcu_barrier.isra.31 <-rcu_barrier_sched
>> vsftpd-7977 [003] .... 577.171519: mutex_lock <-_rcu_barrier.isra.31
>> vsftpd-7977 [003] .... 577.171520: __init_waitqueue_head <-_rcu_barrier.isra.31
>> vsftpd-7977 [003] .... 577.171520: on_each_cpu <-_rcu_barrier.isra.31
>> vsftpd-7977 [003] d... 577.171532: rcu_barrier_func <-on_each_cpu
>> vsftpd-7977 [003] d... 577.171532: call_rcu_sched <-rcu_barrier_func
>> vsftpd-7977 [003] .... 577.171533: wait_for_completion <-_rcu_barrier.isra.31
>> ksoftirqd/3-16 [003] ..s. 577.171691: rcu_barrier_callback <-__rcu_process_callbacks
>> vsftpd-7977 [003] .... 577.176443: mutex_unlock <-_rcu_barrier.isra.31
> ...
>
> Ok, so CLONE_NEWPID | SIGCHLD + waitpid is a bad idea given extreme
> unmount synchronization, but why does it take four softirqs? Seems this
> could have gone a lot faster.

It is just taking one 4millisecond or 1 jiffy at 250hz which seems
correct operation for rcu_barrier.

To recap for anyone watching. We have:

sys_wait4
do_wait
...
release_task
proc_flush_task
pid_ns_release_proc
kern_unmount
mntput
mntput_no_expire
mntfree
deactivate_super
deactivate_locked_super
rcu_barrier

So each instance of sys_wait4 winds up taking 4ms sad. But that
does explain what it takes so long to reap the zombies we are
synchronous.

The ipc namespace is also going to suffer from this deactivate_super
delay but more likely in exit_namespaces() so the delay should not
be synchronized across a bunch of processes. Aka the wait should
be done before the parent is notified.

I had a nefarious plan to combine the proc mount reference count with
the pid namespace reference count (to break the loop). I will see if I
can reawaken that. If that plan comes to fruition the final put_pid on
the pid namespace should happen in a call_rcu after release_task so
wait4 should not be bottlenecked.

I am still mystified why adding the rest of the namespaces adds so much
of a slowdown. Those task existing should have been parallized before
do_wait..

The rcu_barrier is new as of 2.6.38-rc5 with commit d863b50ab on Feb 10 2011:

commit d863b50ab01333659314c2034890cb76d9fdc3c7
Author: Boaz Harrosh <bharrosh@xxxxxxxxxxx>
Date: Thu Feb 10 15:01:20 2011 -0800

vfs: call rcu_barrier after ->kill_sb()

In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
inode instead of freeing the inode directly. It causes a crash when we
rmmod immediately after we umount the volume[1].

So we need to call rcu_barrier after we kill_sb so that the inode is
freed before we do rmmod. The idea is inspired by Aneesh Kumar.
rcu_barrier will wait for all callbacks to end before preceding. The
original patch was done by Tao Ma, but synchronize_rcu() is not enough
here.

1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2

Tested-by: Tao Ma <boyu.mt@xxxxxxxxxx>
Signed-off-by: Boaz Harrosh <bharrosh@xxxxxxxxxxx>
Cc: Nick Piggin <npiggin@xxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Chris Mason <chris.mason@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>

diff --git a/fs/super.c b/fs/super.c
index 74e149e..7e9dd4c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -177,6 +177,11 @@ void deactivate_locked_super(struct super_block *s)
struct file_system_type *fs = s->s_type;
if (atomic_dec_and_test(&s->s_active)) {
fs->kill_sb(s);
+ /*
+ * We need to call rcu_barrier so all the delayed rcu free
+ * inodes are flushed before we release the fs module.
+ */
+ rcu_barrier();
put_filesystem(fs);
put_super(s);
} else {


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/