Re: 9pfs hangs since 4.7

From: Al Viro
Date: Mon Jan 02 2017 - 11:23:17 EST


On Mon, Jan 02, 2017 at 10:20:35AM +0200, Tuomas Tynkkynen wrote:
> Hi fsdevel,
>
> I tracked this problem down a bit and it seems that it started happening
> after the VFS merge in 4.7-rc1: 7f427d3a6029331304f91ef4d7cf646f054216d2:
>
> Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
>
> Pull parallel filesystem directory handling update from Al Viro.
>
> Al, do you have any ideas about this problem with 9pfs I've been observing
> (quoted below in full)? Many thanks in advance!

d_alloc_parallel() is basically waiting for another lookup on the same
name in the same directory to finish. In earlier kernels it would've
been waiting on ->i_mutex of parent, now - on the waitqueue pointed to
be ->d_wait of in-lookup dentry.

p9_client_rpc ones are more interesting; the question is what are they
waiting for. Might be waiting for reply from server, might be - for
a tag to get freed. We had at least one bug with tag lifetimes, BTW
(see commit a84b69cb6e0a for details) and that kind of crap can end up
with responses from server getting lost on client.

What I'd like to see is a log of 9p traffic in those; to hell with the
payloads, just the type and tag of from each message (octets at offsets
4, 5 and 6) plus otag (octets at offsets 7 and 8) when the type is equal to
108 (message[4] == TFLUSH). That's 3 bytes for most of the messages and
once in a while 2 more. Should be easier to do dumping on the qemu side
of things; in principle, there are tracing mechanisms in there (grep for
trace_v9fs_ in hw/9pfs/virtio-9p.c), but when I had to do that kind of
work, I simply did manual dumping (with write(2), nothing fancier
than that) in handle_9p_output() (out.id and out.tag_le for type and tag
of incoming messages), v9fs_flush() (tag for otag of TFLUSH messages) and
complete_pdu() (id and pdu->tag for type and tag of replies).