Re: 9pfs hangs since 4.7

From: Al Viro
Date: Tue Jan 03 2017 - 20:48:08 EST


On Wed, Jan 04, 2017 at 01:34:50AM +0200, Tuomas Tynkkynen wrote:

> I got these logs from the server & client with 9p tracepoints enabled:
>
> https://gist.githubusercontent.com/dezgeg/02447100b3182167403099fe7de4d941/raw/3772e408ddf586fb662ac9148fc10943529a6b99/dmesg%2520with%25209p%2520trace
> https://gist.githubusercontent.com/dezgeg/e1e0c7f354042e1d9bdf7e9135934a65/raw/3a0e3b4f7a5229fd0be032c6839b578d47a21ce4/qemu.log
>

Lovely...

27349:out 8 0001 TSTATFS, tag 1
27350:out 12 0001 TLOPEN, tag 1
27351:complete 9 0001 RSTATFS, tag 1
27352:complete 13 0001 RLOPEN, tag 1

27677:out 8 0001 TSTATFS, tag 1
27678:out 12 0001 TLOPEN, tag 1
27679:complete 9 0001 RSTATFS, tag 1
27680:complete 13 0001 RLOPEN, tag 1

29667:out 8 0001 TSTATFS, tag 1
29668:out 110 0001 TWALK, tag 1
29669:complete 9 0001 RSTATFS, tag 1
29670:complete 7 0001 RLERROR, tag 1

42317:out 110 0001 TWALK, tag 1
42318:out 8 0001 TSTATFS, tag 1
42319:complete 9 0001 RSTATFS, tag 1
42320:complete 7 0001 RERROR, tag 1

Those are outright protocol violations: tag can be reused only after either
a reply bearing the same tag has arrived *or* TFLUSH for that tag had been
issued and successful (type == RFLUSH) reply bearing the tag of that TFLUSH
has arrived. Your log doesn't contain any TFLUSH (108) at all, so it should've
been plain and simple "no reuse until server sends a reply with matching tag".

Otherwise the the dump looks sane. In the end all issued requests had been
served, so it's not as if the client had been waiting for a free tag or for
a response to be produced by the server.

Interesting... dmesg doesn't seem to contain the beginning of the 9P trace -
only 89858 out of 108818 in the dump, even though it claims to have lost
only 4445 events. The last exchanges in the trace are
> P9_TLOPEN tag 90 err 0
> P9_TLOPEN tag 25 err 0
> P9_TLOPEN tag 15 err 0
> P9_TLOPEN tag 104 err 0
> P9_TLOPEN tag 102 err 0
> P9_TLOPEN tag 91 err 0
> P9_TLOPEN tag 12 err 0
< P9_TREADLINK tag 12

which should've been
complete 13 005a
complete 13 0019
complete 13 000f
complete 13 0068
complete 13 0066
complete 13 005b
out 22 000c

and the only plausible match for that is

108784:complete 13 005a
108798:complete 13 000f
108799:complete 13 0019
108800:complete 13 0068
108801:complete 13 0066
108802:complete 13 005b

but there's a bunch of replies in between that hadn't been logged *and* such
a TREADLINK request isn't seen after
108341:out 22 000c

... and sure enough, there's a whole lot of processes stuck in readlink.

Interesting... Which version of qemu/what arguments are you using there?