[PATCH/RFC 00/19] Support loop-back NFS mounts

From: NeilBrown
Date: Wed Apr 16 2014 - 00:18:03 EST


Loop-back NFS mounts are mounts where the NFS client and the server
run on the same host.

The use-case for this is a high-availability cluster with shared
storage. The shared filesystem is mounted on any one machine and
NFS-mounted on the others.
If the NFS server fails, some other node takes over that service, and
it then has a loop-back NFS mount which needs to keep working.

This patch set addresses the "keep working" part, specifically the
deadlocks and livelocks involved.
Making the fail-over itself deadlock-free is a separate challenge for
another day.

The short description of how this works is:

deadlocks:
- Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
PF_FSTRANS disables __GFP_FS in the same way that PF_MEMALLOC_NOIO
disables __GFP_IO (sketched just after this list).
- Set PF_FSTRANS in nfsd when handling requests related to
memory reclaim, or requests which could block requests related
to memory reclaim.
- Use lockdep to find all consequent deadlocks from some other
thread allocating memory while holding a lock that nfsd might
want.
- Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
as appropriate.
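
To illustrate that first point: the intended effect is that the page
allocator masks the caller's gfp flags according to the task flags,
much as the existing memalloc_noio_flags() helper already does for
PF_MEMALLOC_NOIO. Roughly (this is only a sketch of the idea, and the
helper name here is made up, not the exact code in the patches):

static inline gfp_t memalloc_flags(gfp_t flags)
{
	/* PF_MEMALLOC_NOIO disables __GFP_IO (and with it __GFP_FS) */
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~(__GFP_IO | __GFP_FS);
	/* PF_FSTRANS strips only __GFP_FS, so allocations made while
	 * it is set can never recurse into filesystem reclaim. */
	if (unlikely(current->flags & PF_FSTRANS))
		flags &= ~__GFP_FS;
	return flags;
}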

livelocks:
- Identify throttling during reclaim and bypass it when
PF_LESS_THROTTLE is set (see the sketch below).
- Only set PF_LESS_THROTTLE for nfsd when handling write requests
from the local host.
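
On the VM side that works out as letting a PF_LESS_THROTTLE task skip
the wait in shrink_inactive_list() rather than sleeping there. The
details are in patch 7; the shape of it is something like:

	/* sketch: throttling loop in shrink_inactive_list() */
	while (unlikely(too_many_isolated(zone, file, sc))) {
		/* An nfsd thread writing on behalf of the local host
		 * must not sleep here, or the writeout that would let
		 * reclaim make progress can itself never progress. */
		if (current->flags & PF_LESS_THROTTLE)
			break;
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}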

The last 12 patches address various deadlocks due to locking chains.
11 were found by lockdep, 2 by testing. There is a reasonable chance
that there are more; I just need to exercise more code while
testing....

There is one issue that lockdep reports which I haven't fixed (I've
just hacked the code out for my testing). That issue relates to
freeze_super().
I may not be interpreting the lockdep reports perfectly, but I think
they are basically saying that if I were to freeze a filesystem that
was exported to the local host, then we could end up deadlocking.
This is to be expected. The NFS filesystem would need to be frozen
first. I don't know how to tell lockdep that I know that is a problem
and I don't want to be warned about it. Suggestions welcome.
Until this is addressed I cannot really ask others to test the code
with lockdep enabled.

There are more subsidiary places where I needed to add PF_FSTRANS
than I would have liked. The thought keeps crossing my mind that
maybe we could get rid of __GFP_FS altogether and require that memory
reclaim never ever blocks on a filesystem. Then most of these patches
would go away.

Now that writeback no longer happens from direct reclaim (only from
kswapd), many of the calls from reclaim into the filesystem are gone.
The ->releasepage call is the only one that I *know* causes me
problems, so I'd like to just say that it must never block. I don't
really understand the consequences of that though.
There are a couple of other places where __GFP_FS is used and I'd need
to carefully analyze those. But if someone just said "no, that is
impossible", I could be happy and stick with the current approach....

I've cc:ed Peter Zijlstra and Ingo Molnar only on the lockdep-related
patches, Ming Lei only on the PF_MEMALLOC_NOIO-related patches,
and netdev only on the network-related patches.
There are probably other people I should CC. Apologies if I missed you.
I'll ensure better coverage if the nfs/mm/xfs people are reasonably happy.

Comments, criticisms, etc most welcome.

Thanks,
NeilBrown


---

NeilBrown (19):
Promote current_{set,restore}_flags_nested from xfs to global.
lockdep: lockdep_set_current_reclaim_state should save old value
lockdep: improve scenario messages for RECLAIM_FS errors.
Make effect of PF_FSTRANS to disable __GFP_FS universal.
SUNRPC: track whether a request is coming from a loop-back interface.
nfsd: set PF_FSTRANS for nfsd threads.
nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.
Set PF_FSTRANS while write_cache_pages calls ->writepage
XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.
NET: set PF_FSTRANS while holding sk_lock
FS: set PF_FSTRANS while holding mmap_sem in exec.c
NET: set PF_FSTRANS while holding rtnl_lock
MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.
driver core: set PF_FSTRANS while holding gdp_mutex
nfsd: set PF_FSTRANS when client_mutex is held.
VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.
VFS: set PF_FSTRANS while namespace_sem is held.
nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.
XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks
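
Most of the later patches follow the same pattern, using the helpers
that the first patch promotes out of xfs: save the current flags, set
PF_FSTRANS across the critical section, then restore. Schematically
(some_lock is just a stand-in for whichever lock is involved):

	unsigned int pflags;

	mutex_lock(&some_lock);		/* a lock nfsd might also need */
	current_set_flags_nested(&pflags, PF_FSTRANS);

	/* ... work that may allocate memory; any allocation now behaves
	 * as though __GFP_FS were clear, so it cannot wait on this lock
	 * via filesystem reclaim ... */

	current_restore_flags_nested(&pflags, PF_FSTRANS);
	mutex_unlock(&some_lock);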


drivers/base/core.c | 3 ++
drivers/base/power/runtime.c | 6 ++---
drivers/block/nbd.c | 6 ++---
drivers/md/dm-bufio.c | 6 ++---
drivers/md/dm-ioctl.c | 6 ++---
drivers/mtd/nand/nandsim.c | 28 ++++++---------------
drivers/scsi/iscsi_tcp.c | 6 ++---
drivers/usb/core/hub.c | 6 ++---
fs/dcache.c | 4 ++-
fs/exec.c | 6 +++++
fs/fs-writeback.c | 5 ++--
fs/namespace.c | 4 +++
fs/nfs/file.c | 3 +-
fs/nfsd/nfs4callback.c | 5 ++++
fs/nfsd/nfs4state.c | 3 ++
fs/nfsd/nfssvc.c | 24 ++++++++++++++----
fs/nfsd/vfs.c | 6 +++++
fs/xfs/kmem.h | 2 --
fs/xfs/xfs_aops.c | 7 -----
fs/xfs/xfs_bmap_util.c | 4 +++
fs/xfs/xfs_file.c | 12 +++++++++
fs/xfs/xfs_linux.h | 7 -----
include/linux/lockdep.h | 8 +++---
include/linux/sched.h | 32 +++++++++---------------
include/linux/sunrpc/svc.h | 2 ++
include/linux/sunrpc/svc_xprt.h | 1 +
include/net/sock.h | 1 +
kernel/locking/lockdep.c | 51 ++++++++++++++++++++++++++++-----------
kernel/softirq.c | 6 ++---
mm/migrate.c | 9 +++----
mm/page-writeback.c | 3 ++
mm/page_alloc.c | 18 ++++++++------
mm/percpu.c | 4 +++
mm/slab.c | 2 ++
mm/slob.c | 2 ++
mm/slub.c | 1 +
mm/vmscan.c | 31 +++++++++++++++---------
net/core/dev.c | 6 ++---
net/core/rtnetlink.c | 9 ++++++-
net/core/sock.c | 8 ++++--
net/sunrpc/sched.c | 5 ++--
net/sunrpc/svc.c | 6 +++++
net/sunrpc/svcsock.c | 10 ++++++++
net/sunrpc/xprtrdma/transport.c | 5 ++--
net/sunrpc/xprtsock.c | 17 ++++++++-----
45 files changed, 247 insertions(+), 149 deletions(-)
