Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

From: John Hubbard
Date: Mon Aug 19 2019 - 20:07:53 EST


On 8/19/19 2:24 AM, Dave Chinner wrote:
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
On Sat 17-08-19 12:26:03, Dave Chinner wrote:
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
On Wed 14-08-19 11:08:49, Ira Weiny wrote:
On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
...
The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_work_add() and
task_work_run().

task_work_add() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
    task_work_run()
      __fput()
        locks_remove_file()
          locks_remove_lease()
            ....

It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases....

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
    get_signal()
      task_work_run()
        __fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that since this is a close and we
can't easily back out, we'd only break the block on a fatal signal....

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
  exit_task_work()
    task_work_run()
      __fput()

and so it's easy to avoid blocking in this case, too.
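
The three task_work paths above boil down to one decision at lease
removal time. A minimal userspace sketch of that decision follows; it is
not kernel code. The struct task_model type, the fput_action enum and
lease_release_action() are hypothetical names made up for illustration,
and the three booleans only stand in for PF_EXITING, signal_pending()
and fatal_signal_pending().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the task state the discussion inspects. */
struct task_model {
    bool exiting;         /* stands in for the PF_EXITING task flag */
    bool signal_pending;  /* stands in for signal_pending() */
    bool fatal_signal;    /* stands in for fatal_signal_pending() */
};

enum fput_action {
    FPUT_MAY_BLOCK,   /* wait for outstanding pins to be released */
    FPUT_DONT_BLOCK   /* task is going away, the lease no longer matters */
};

/*
 * Decide whether removing the lease from __fput() may block.
 * Only task exit and a fatal signal break the block; an ordinary
 * pending signal does not, because a close cannot easily be backed out.
 */
static enum fput_action lease_release_action(const struct task_model *t)
{
    assert(t != NULL);

    if (t->exiting)
        return FPUT_DONT_BLOCK;   /* do_exit() -> exit_task_work() path */
    if (t->fatal_signal)
        return FPUT_DONT_BLOCK;   /* get_signal() path with SIGKILL queued */
    return FPUT_MAY_BLOCK;        /* notify-resume path, or non-fatal signal */
}
```

Whether a non-fatal signal should also break the block is exactly the
judgment call made above; this sketch takes the "fatal only" option.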

Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
memory with FOLL_LONGTERM, and wondering how to make that work here.

These are close to files, in how they're handled, but just different
enough that it's not clear to me how to make them work with this system.



So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this
context. Blocking in an interruptible state will allow fatal signal
delivery to wake us, and then we fall into the
fatal_signal_pending() case if we get a SIGKILL while blocking.

Hence I think blocking in this case would be OK - it indicates an
application bug (releasing a lease before releasing the resources)
but leaves SIGKILL available to administrators to resolve situations
involving buggy applications.

This requires applications to follow the rules: any process
that pins physical resources must have an active reference to a
layout lease, either via a duplicated fd or its own private lease.
If the app doesn't play by the rules, it hangs in close() until it
is killed.
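
To make the ordering rule concrete, here is a toy userspace model of one
lease and the pins taken under it. It is only a sketch of the rule as
stated above, not of any proposed kernel interface: struct lease_model
and the model_*() helpers are hypothetical names, and "would block" is
reported as a boolean instead of actually sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one layout lease and the pins taken under it. */
struct lease_model {
    int lease_refs;  /* fds (original or duplicated) still holding the lease */
    int pins;        /* long-term pins taken while the lease was held */
};

/* Rule: pinning requires an active reference to a layout lease. */
static bool model_pin(struct lease_model *l)
{
    assert(l != NULL);
    if (l->lease_refs == 0)
        return false;
    l->pins++;
    return true;
}

static void model_unpin(struct lease_model *l)
{
    assert(l != NULL && l->pins > 0);
    l->pins--;
}

/*
 * A close() would have to block when it drops the last lease
 * reference while pins are still outstanding.
 */
static bool model_close_would_block(const struct lease_model *l)
{
    assert(l != NULL && l->lease_refs > 0);
    return l->lease_refs == 1 && l->pins > 0;
}

/* Drop one lease reference; only valid when it would not block. */
static void model_close(struct lease_model *l)
{
    assert(!model_close_would_block(l));
    l->lease_refs--;
}
```

A well-behaved application in this model unpins first and closes second,
so model_close_would_block() never returns true for it. The buggy order
(close first) is the case that would hang until SIGKILL.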

+1 for these rules, assuming that we can make them work. They are
easy to explain and intuitive.


thanks,
--
John Hubbard
NVIDIA