fuse uring / wake_up on the same core

From: Bernd Schubert
Date: Fri Mar 24 2023 - 15:50:21 EST


Ingo, Peter,

I would like to ask how to wake up a task from a waitq so that it
continues on the same core as the waker. I have tried
__wake_up_sync()/WF_SYNC, but I do not see any effect.

I'm currently working on fuse/uring communication patches; besides the
uring communication there is also a queue per core. Basic bonnie++
benchmarks with a zero file size, to just create/read(0)/delete, show a
~3x IOPS difference between CPU-bound and unbound bonnie++ - i.e. with
these patches it is _not_ the fuse daemon that needs to be bound, but
the application doing I/O to the file system. We basically have

bonnie -> vfs                         (app/vfs)
fuse_req                              (app/fuse.ko)
qid = task_cpu(current)               (app/fuse.ko)
ring(qid) / SQE completion            (app/fuse.ko/uring)
wait_event(req->waitq, ...)           (app/fuse.ko)
    [app wait]
daemon ring / handle CQE              (daemon)
send-back result as SQE               (daemon/uring)
fuse_request_end                      (daemon/uring/fuse.ko)
wake_up() ---> random core            (daemon/uring/fuse.ko)
    [app wakeup/fuse/vfs/syscall return]
bonnie ==> different core


1) bound

[root@imesrv1 ~]# numactl --localalloc --physcpubind=0 bonnie++ -q -x 1
-s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1 20:1:1:20   6229  28 11289  41 12785  24  6615  28  7769  40 10020  25
Latency             411us     824us     816us     298us   10473us     200ms


2) not bound

[root@imesrv1 ~]# bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20
-r 0 -u 0 | bon_csv2txt
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1 20:1:1:20   2064  33  2923  43  4556  28  2061  33  2186  42  4245  30
Latency             850us    3914us    2496us     738us     758us    6469us


With fewer files the difference becomes a bit smaller, but is still very
visible. Besides cache-line bouncing, I'm sure that CPU frequency and
C-states matter - I could tune that in the lab, but in the end I want to
test what users actually run (I recently checked with a large HPC center
- Forschungszentrum Juelich - and their HPC compute nodes are not tuned
up, in order to save energy).
Also, in order to really tune down latencies, I want to add a
struct file_operations::uring_cmd_iopoll thread, which will spin for a
short time and avoid most of the kernel/userspace communication. If
applications (with n-threads < n-cores) then get scheduled on different
cores, different rings will be used, resulting in
n-threads-spinning > n-threads-application


There was already a related thread about fuse before

https://lore.kernel.org/lkml/1638780405-38026-1-git-send-email-quic_pragalla@xxxxxxxxxxx/

With the fuse-uring patches that part is basically solved - the waitq
that thread is about is not used anymore. But as per the above, what
remains is the waitq of the incoming workq (not mentioned in that
thread). As I wrote, I have tried
__wake_up_sync((x), TASK_NORMAL), but it does not make a difference for
me - similar to Miklos' earlier testing. I have also tried struct
completion / swait - neither makes a difference either.
I can see that task_struct has wake_cpu, but there doesn't seem to be a
good interface to set it.

Any ideas?


Thanks,
Bernd