Re: rq lock contention due to commit af7f588d8f73

From: Mathieu Desnoyers
Date: Mon Mar 27 2023 - 15:57:35 EST

Next message: Borislav Petkov: "Re: [PATCH 1/1] x86/acpi: acpi_is_processor_usable() dropping possible cpus"
Previous message: Eric Biggers: "Re: [syzbot] Monthly io-uring report"
In reply to: Mathieu Desnoyers: "Re: rq lock contention due to commit af7f588d8f73"
Next in thread: Aaron Lu: "Re: rq lock contention due to commit af7f588d8f73"

On 2023-03-27 10:04, Aaron Lu wrote:

On Mon, Mar 27, 2023 at 09:20:44AM -0400, Mathieu Desnoyers wrote:

On 2023-03-27 04:05, Aaron Lu wrote:

Hi Mathieu,

I was doing some optimization work[1] on the kernel scheduler using a
database workload, sysbench+postgres. Before submitting my work, I
rebased my patch on top of the latest v6.3-rc kernel to check that
everything still works as expected, and found that the rq lock had become
very heavily contended compared to v6.2-based kernels.

Using the above mentioned workload, before commit af7f588d8f73("sched:
Introduce per-memory-map concurrency ID"), the profile looked like:

7.30% 0.71% [kernel.vmlinux] [k] __schedule
0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

After that commit:

49.01% 0.87% [kernel.vmlinux] [k] __schedule
43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

The above profile was captured with sysbench's nr_threads set to 56; with
more threads, the contention is more severe on that
2-socket/112-core/224-CPU Intel Sapphire Rapids server.

The docker image I used for the optimization work is not available outside,
but I managed to reproduce this problem using only publicly available
components. Here are the steps:
1. docker pull postgres
2. sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
3. Go inside the container:
   sudo docker exec -it $the_just_started_container_id bash
4. Install sysbench inside the container:
   sudo apt update && sudo apt install sysbench
5. Prepare:
   root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
6. Run:
   root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run

Let it warm up for 10-20s, then capture a profile to see the increased
rq lock contention. You may need a machine with at least 56 CPUs to see
this; I didn't try on other machines.
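
A profile like the ones quoted earlier can also be checked with a small
script. The following is a minimal sketch, not part of the original report:
self_overhead is a hypothetical helper that reads perf-report-style lines
shaped like the excerpts above (children %, self %, object, [k], symbol) and
prints the self overhead of one symbol. The perf invocation in the trailing
comment is an assumption about how such a profile could be captured.

```shell
#!/bin/sh
# Print the "self" overhead (second column, without the % sign) for one
# symbol, given perf-report-style lines on stdin shaped like:
#   <children>% <self>% [object] [k] <symbol>
# $1 = symbol name (hypothetical helper, for illustration only)
self_overhead() {
    awk -v sym="$1" '$NF == sym && $2 ~ /%$/ { sub(/%$/, "", $2); print $2 }'
}

# Assumed capture step (needs perf and sufficient privileges), not run here:
#   perf record -a -g -- sleep 10
#   perf report --stdio --children --sort dso,symbol -g none |
#       self_overhead native_queued_spin_lock_slowpath
```

Fed the two "after" lines from the profile above, the helper prints 43.18
for native_queued_spin_lock_slowpath, which makes a before/after comparison
between kernels easy to script.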

Feel free to let me know if you need any other info.

While I set up my dev machine with this reproducer, here are a few
questions to help figure out the context:

I understand that pgsql is a multi-process database. Is it strictly
single-threaded per-process, or does each process have more than
one thread ?

I do not know the details of Postgres, but according to this:
https://wiki.postgresql.org/wiki/FAQ#How_does_PostgreSQL_use_CPU_resources.3F
I think it is single-threaded per-process.

The client, sysbench, is a single multi-threaded process, IIUC.
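
One way to check the per-process thread count directly on Linux, as a
sketch: the "Threads:" field of /proc/<pid>/status reports the number of
threads in a process. thread_count is a hypothetical helper name, and the
loop in the trailing comment assumes the server processes are named
"postgres".

```shell
#!/bin/sh
# Print the number of threads in process $1, read from the "Threads:" field
# of /proc/<pid>/status (Linux only; hypothetical helper for illustration).
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Assumed usage while the benchmark is running (process name is a guess):
#   for pid in $(pgrep -x postgres); do
#       echo "$pid: $(thread_count "$pid") thread(s)"
#   done
```

A count of 1 for every server process would confirm the single-threaded
per-process model, while the sysbench client should report one thread per
--threads value plus any helper threads.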

I understand that your workload is scheduling between threads which
belong to different processes. Are there more heavily active threads
than there are scheduler runqueues (CPUs) on your machine ?

In the reproducer I described above, 56 threads are started on the
client side, and if each client thread is served by one server process,
there would be about 112 tasks. I don't think the client thread and
the server process are active at the same time, but even if they are,
112 is still smaller than the machine's CPU count of 224.
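
The estimate above can be written out as a small check. This is only a
sketch of the arithmetic: total_tasks and fits_on_cpus are hypothetical
helper names, and the one-server-process-per-client-thread ratio is the
assumption stated in the paragraph above.

```shell
#!/bin/sh
# Estimate the task count for $1 client threads, assuming one server
# process per client thread (the assumption made in the text above).
total_tasks() {
    echo $(( $1 * 2 ))
}

# Succeed (exit status 0) if the estimated tasks fit on the CPUs.
# $1 = client threads, $2 = number of CPUs
fits_on_cpus() {
    [ "$(total_tasks "$1")" -le "$2" ]
}

# With the numbers from this thread:
#   total_tasks 56        prints 112
#   fits_on_cpus 56 224   succeeds, so the CPUs are not oversubscribed
```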

When I developed the mm_cid feature, I originally implemented two additional
optimizations:

Additional optimizations can be done if the spin locks added when
context switching between threads belonging to different memory maps end
up being a performance bottleneck. Those are left out of this patch
though. A performance impact would have to be clearly demonstrated to
justify the added complexity.

I suspect that your workload demonstrates the need for at least one of those
optimizations. I just wonder if we are in a purely single-threaded scenario
for each process, or if each process has many threads.

My understanding is: the server side is single threaded and the client
side is multi threaded.

OK.

I've just resuscitated my per-runqueue concurrency ID cache patch from an older
patchset, and posted it as an RFC. So far it has passed one round of rseq
selftests. Can you test it in your environment to see if I'm on the right track ?

https://lore.kernel.org/lkml/20230327195318.137094-1-mathieu.desnoyers@xxxxxxxxxxxx/

Thanks!

Mathieu

Thanks,
Aaron

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
