Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie detected)

From: Christian Schoenebeck
Date: Sun Apr 10 2022 - 12:19:44 EST


On Samstag, 9. April 2022 13:16:11 CEST Christian Schoenebeck wrote:
> On Mittwoch, 30. März 2022 14:21:16 CEST Christian Schoenebeck wrote:
> > I made some tests & benchmarks regarding the fs-cache issue of 9p, running
> > different kernel versions and kernel configurations in comparison.
> [...]
> > Case Linux kernel version .config msize cache duration host cpu errors/warnings
> >
> > A) 5.17.0+[2] + msize patches[1] debug 4186112 mmap 20m 40s ~80% none
> > B) 5.17.0+[2] + msize patches[1] debug 4186112 loose 31m 28s ~35% several errors (compilation completed)
> > C) 5.17.0+[2] + msize patches[1] debug 507904 mmap 20m 25s ~84% none
> > D) 5.17.0+[2] + msize patches[1] debug 507904 loose 31m 2s ~33% several errors (compilation completed)
> > E) 5.17.0+[2] debug 512000 mmap 23m 45s ~75% none
> > F) 5.17.0+[2] debug 512000 loose 32m 6s ~31% several errors (compilation completed)
> > G) 5.17.0+[2] release 512000 mmap 23m 18s ~76% none
> > H) 5.17.0+[2] release 512000 loose 32m 33s ~31% several errors (compilation completed)
> > I) 5.17.0+[2] + msize patches[1] release 4186112 mmap 20m 30s ~83% none
> > J) 5.17.0+[2] + msize patches[1] release 4186112 loose 31m 21s ~31% several errors (compilation completed)
> > K) 5.10.84 release 512000 mmap 39m 20s ~80% none
> > L) 5.10.84 release 512000 loose 13m 40s ~55% none
> [...]
> > About the errors: I actually already see errors with cache=loose and recent
> > kernel version just when booting the guest OS. For these tests I chose some
> > sources which allowed me to complete the build to capture some benchmark as
> > well, I got some "soft" errors with those, but the build completed at least.
> > I had other sources OTOH which did not complete though and aborted with
> > certain invalid file descriptor errors, which I obviously could not use for
> > those benchmarks here.
>
> I used git-bisect to identify the commit that broke 9p behaviour, and it is
> indeed this one:
>
> commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> Author: David Howells <dhowells@xxxxxxxxxx>
> Date: Tue Nov 2 08:29:55 2021 +0000
>
> 9p: Convert to using the netfs helper lib to do reads and caching
>
> Convert the 9p filesystem to use the netfs helper lib to handle readpage,
> readahead and write_begin, converting those into a common issue_op for the
> filesystem itself to handle. The netfs helper lib also handles reading
> from fscache if a cache is available, and interleaving reads from both
> sources.

I looked into the errors I get, and as far as I can see it, all misbehaviours
that I see, boil down to "Bad file descriptor" (EBADF) errors being the
originating cause.

The easiest misbehaviours on the guest system I can look into, are errors
with the git client. For instance 'git fetch origin' fails this way:

...
write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
write(3, "\n", 1) = 1
close(3) = 0
access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
lseek(7, -1467, SEEK_CUR) = 859
read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
close(7) = 0
write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
...
error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor

I tried to manually replicate those file access operations on that
.git/logs/refs/remotes/origin/master file in question, and it worked. But when
I look at the strace output above, I see there is a close(3) call just before
the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
which makes me wonder, is this maybe a concurrency issue on file descriptor
management?

Ideas anyone?

Best regards,
Christian Schoenebeck