Re: [GIT PULL] Detaching mounts on unlink for 3.15

From: Al Viro
Date: Thu Apr 17 2014 - 20:37:45 EST


On Thu, Apr 17, 2014 at 11:12:03PM +0100, Al Viro wrote:

> That's all. And yes, I believe that such series would make sense on its
> own and once it survives beating (see above about docker - that bastard has
> surprised me quite a bit re stressing namespace-related codepaths), I would
> be quite willing to help with getting it merged.

FWIW, the tricky part around auto-close of acct is that we really want to
preserve the following property:

In usual setups, umount(2) will not return until fs has been
shut down.

fput() being async is not a problem - it will be processed before we
return to userland. I agree that we don't need the loop anymore (it's
basically a stack depth reduction measure that was needed with sync
fput() - without "add one more and deal with it when we return" we
would be getting mntput_no_expire -> fput -> mntput -> fs shutdown
back then). But offloading that fput() to workqueue makes it really
possible to have actual fs shutdown happen after umount(2) returns,
without any extra mounts of the same fs, etc. And since that shutdown
*can* take a long time (lots of dirty pages around, slow device or
slow network, etc.), we really might be talking about e.g. umount(8)
being finished before fs shutdown finishes. It's an expected situation
when we have the same thing still mounted elsewhere or lazy-umounted
and busy, but this changes behaviour on setups where we had been
guaranteed that umount -a *will* wait until all filesystems except
root are shut down and root is remounted r/o. So this change really
can cause data loss on reboot(8)/halt(8) on existing boxen...

IOW, workqueue is not the right tool here. OTOH, it looks like we do have
a problem with kernel/acct.c vs. umount; it just requires a race between
auto-closing and acct_process_in_ns(). It's narrow, so it doesn't bite
us all the time, but it's there... Damn, it had been a long time since
I really looked at that code ;-/

Actually, there's another reason why workqueue is bogus - we call
do_acct_process(), same as we do on acct(NULL) (which might or might
not be a good idea), but at least with do that from the context of
real process doing umount(2). Doing that from workqueue is going to
produce a really bogus record...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/