Re: [GIT PULL] Detaching mounts on unlink for 3.15

From: Al Viro
Date: Fri Apr 18 2014 - 21:35:45 EST


On Fri, Apr 18, 2014 at 01:37:26AM +0100, Al Viro wrote:
> FWIW, the tricky part around auto-close of acct is that we really want to
> preserve the following property:
>
> In usual setups, umount(2) will not return until fs has been
> shut down.
>
> fput() being async is not a problem - it will be processed before we
> return to userland. I agree that we don't need the loop anymore (it's
> basically a stack depth reduction measure that was needed with sync
> fput() - without "add one more and deal with it when we return" we
> would be getting mntput_no_expire -> fput -> mntput -> fs shutdown
> back then). But offloading that fput() to workqueue makes it really
> possible to have actual fs shutdown happen after umount(2) returns,
> without any extra mounts of the same fs, etc. And since that shutdown
> *can* take a long time (lots of dirty pages around, slow device or
> slow network, etc.), we really might be talking about e.g. umount(8)
> being finished before fs shutdown finishes. It's an expected situation
> when we have the same thing still mounted elsewhere or lazy-umounted
> and busy, but this changes behaviour on setups where we had been
> guaranteed that umount -a *will* wait until all filesystems except
> root are shut down and root is remounted r/o. So this change really
> can cause data loss on reboot(8)/halt(8) on existing boxen...

My apologies for confusion - I have not looked at your last commit.
I *really* don't like that solution, but it probably does close that
particular problem. Consider that objection withdrawn (modulo "you
have created a bisect hazard here, that part of series needs to be
reordered").

I still don't think that this is the right approach. Hell knows ;-/
Maybe you are right about offloading auto-close to a wq... The whole
mnt_pin/mnt_unpin is bloody ugly, especially with multiple references
held by kernel/acct.c-opened files ;-/ And yes, this horror is originally
mine; the only excuse I have is that previous variant had been even more
kludgy.

_If_ we push it to wq, we really want to stop calling do_process_acct()
on auto-close. With umount(8) it's borderline defensible, but with
workqueue it makes no sense whatsoever. Moreover, it makes no sense
on acct_exit_ns() - it's done when PID 1 of that pidns dies, after having
called acct_process(). What's the point of having accounting in
non-root pidns spew two entries for its init?

Assuming we can stop generating records on auto-close, do we need
the wq thing at all? After all, we could rip the affected acct->file
out, then flush and fput them all right there. The usual logics for fput()
will make sure that they (and all of them will be final) will be
processed before we leave umount(2). And let the last one trigger the
final mntput, now that mnt_pinned is zero...

PS: AFAICT, *BSD do not spew that even on explicit acct(NULL). 4.3 variants
found on PUPS do not, ditto for 4.4BSD-Alpha2 there; in Net2 kern_acct.c
is one of the castration victims, so it doesn't do anything at all and in
current {Free,Net,Open}BSD kernels it also doesn't happen.

Why do we bother with that at all? It had been that way all way back to
2.1.68pre1 when kernel/acct.c had been merged, and it's probably too
late by now to try and ask the author about the reasons...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/