Re: 9pfs hangs since 4.7

From: Tuomas Tynkkynen
Date: Wed Jan 04 2017 - 15:07:06 EST


On Wed, 4 Jan 2017 01:47:53 +0000
Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Jan 04, 2017 at 01:34:50AM +0200, Tuomas Tynkkynen wrote:
>
> > I got these logs from the server & client with 9p tracepoints enabled:
> >
> > https://gist.githubusercontent.com/dezgeg/02447100b3182167403099fe7de4d941/raw/3772e408ddf586fb662ac9148fc10943529a6b99/dmesg%2520with%25209p%2520trace
> > https://gist.githubusercontent.com/dezgeg/e1e0c7f354042e1d9bdf7e9135934a65/raw/3a0e3b4f7a5229fd0be032c6839b578d47a21ce4/qemu.log
> >
>
> Lovely...
>
> 27349:out 8 0001 TSTATFS, tag 1
> 27350:out 12 0001 TLOPEN, tag 1
> 27351:complete 9 0001 RSTATFS, tag 1
> 27352:complete 13 0001 RLOPEN, tag 1
>
> 27677:out 8 0001 TSTATFS, tag 1
> 27678:out 12 0001 TLOPEN, tag 1
> 27679:complete 9 0001 RSTATFS, tag 1
> 27680:complete 13 0001 RLOPEN, tag 1
>
> 29667:out 8 0001 TSTATFS, tag 1
> 29668:out 110 0001 TWALK, tag 1
> 29669:complete 9 0001 RSTATFS, tag 1
> 29670:complete 7 0001 RLERROR, tag 1
>
> 42317:out 110 0001 TWALK, tag 1
> 42318:out 8 0001 TSTATFS, tag 1
> 42319:complete 9 0001 RSTATFS, tag 1
> 42320:complete 7 0001 RERROR, tag 1
>
> Those are outright protocol violations: tag can be reused only after either
> a reply bearing the same tag has arrived *or* TFLUSH for that tag had been
> issued and successful (type == RFLUSH) reply bearing the tag of that TFLUSH
> has arrived. Your log doesn't contain any TFLUSH (108) at all, so it should've
> been plain and simple "no reuse until server sends a reply with matching tag".
>

Argh, I had completely forgotten there is another 9p mount involved, so the log
is mixed between the two mounts. Apologies, there is now a new trace attached
with the V9fsState pointer dumped in the mix as well.

> Otherwise the the dump looks sane. In the end all issued requests had been
> served, so it's not as if the client had been waiting for a free tag or for
> a response to be produced by the server.
>
> Interesting... dmesg doesn't seem to contain the beginning of the 9P trace -
> only 89858 out of 108818 in the dump, even though it claims to have lost
> only 4445 events. [...]

Here's logs that should be complete this time:

https://gist.githubusercontent.com/dezgeg/08629d4c8ca79da794bc087e5951e518/raw/a1a82b9bc24e5282c82beb43a9dc91974ffcf75a/9p.qemu.log
https://gist.githubusercontent.com/dezgeg/1d5f1cc0647e336c59989fc62780eb2e/raw/d7623755a895b0441c608ddba366d20bbf47ab0b/9p.dmesg.log

>
> Interesting... Which version of qemu/what arguments are you using there?

This is QEMU 2.7.0, with these on the server side:
-virtfs local,path=/nix/store,security_model=none,mount_tag=store
-virtfs local,path=$TMPDIR/xchg,security_model=none,mount_tag=xchg

and on the client side:

store on /nix/store type 9p (rw,relatime,dirsync,trans=virtio,version=9p2000.L,cache=loose)
xchg on /tmp/xchg type 9p (rw,relatime,dirsync,trans=virtio,version=9p2000.L,cache=loose)

'store' is the one receiving the hammering, 'xchg' I think is mostly sitting idle.