Re: [PATCH] proc: Avoid a thundering herd of threads freeing proc dentries

From: Junxiao Bi
Date: Thu Jun 25 2020 - 18:11:52 EST


On 6/22/20 5:47 PM, Matthew Wilcox wrote:

On Sun, Jun 21, 2020 at 10:15:39PM -0700, Junxiao Bi wrote:
On 6/20/20 9:27 AM, Matthew Wilcox wrote:
On Fri, Jun 19, 2020 at 05:42:45PM -0500, Eric W. Biederman wrote:
Junxiao Bi <junxiao.bi@xxxxxxxxxx> writes:
Still high lock contention. Collect the following hot path.
A different location this time.

I know of at least exit_signal and exit_notify that take thread wide
locks, and it looks like exit_mm is another. Those don't use the same
locks as flushing proc.


So I think you are simply seeing a result of the thundering herd of
threads shutting down at once. Given that thread shutdown is fundamentally
a slow path there is only so much that can be done.

If you are up for a project to working through this thundering herd I
expect I can help some. It will be a long process of cleaning up
the entire thread exit process with an eye to performance.
Wengang had some tests which produced wall-clock values for this problem,
which I agree is more informative.

I'm not entirely sure what the customer workload is that requires a
highly threaded workload to also shut down quickly. To my mind, an
overall workload is normally composed of highly-threaded tasks that run
for a long time and only shut down rarely (thus performance of shutdown
is not important) and single-threaded tasks that run for a short time.
The real workload is a Java application working in server-agent mode, issue
happened in agent side, all it do is waiting works dispatching from server
and execute. To execute one work, agent will start lots of short live
threads, there could be a lot of threads exit same time if there were a lots
of work to execute, the contention on the exit path caused a high %sys time
which impacted other workload.
How about this for a micro? Executes in about ten seconds on my laptop.
You might need to tweak it a bit to get better timing on a server.

// gcc -pthread -O2 -g -W -Wall
#include <pthread.h>
#include <unistd.h>

void *worker(void *arg)
{
int i = 0;
int *p = arg;

for (;;) {
while (i < 1000 * 1000) {
i += *p;
}
sleep(1);
}
}

int main(int argc, char **argv)
{
pthread_t threads[20][100];

Tuning 100 to 1000 here and the following 2 loops.

Test it on 2-socket server with 104 cpu. Perf is similar on v5.7 and v5.7 with Eric's fix. The spin lock was shifted to spin lock in futex, so the fix didn't help.


ÂÂÂ 46.41%ÂÂÂÂ 0.11%Â perf_testÂÂÂÂÂÂÂ [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
ÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂ --46.30%--entry_SYSCALL_64_after_hwframe
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --46.12%--do_syscall_64
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |--30.47%--__x64_sys_futex
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂ --30.45%--do_futex
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |--18.04%--futex_wait
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |--16.94%--futex_wait_setup
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂ --16.61%--_raw_spin_lock
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --16.30%--native_queued_spin_lock_slowpath
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --0.81%--call_function_interrupt
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | | --0.79%--smp_call_function_interrupt
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | | --0.62%--generic_smp_call_function_single_interrupt
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂ --1.04%--futex_wait_queue_me
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --0.96%--schedule
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --0.94%--__schedule
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | | --0.51%--pick_next_task_fair
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | --12.38%--futex_wake
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |--11.00%--_raw_spin_lock
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂ --10.76%--native_queued_spin_lock_slowpath
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --0.55%--call_function_interrupt
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | | --0.53%--smp_call_function_interrupt
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ | |
|ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ --1.11%--wake_up_q
|ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |
| --1.10%--try_to_wake_up
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |


Result of v5.7

=========

[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.850s
userÂÂÂ 0m14.499s
sysÂÂÂ 0m12.116s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.949s
userÂÂÂ 0m14.285s
sysÂÂÂ 0m18.408s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.885s
userÂÂÂ 0m14.193s
sysÂÂÂ 0m17.888s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.872s
userÂÂÂ 0m14.451s
sysÂÂÂ 0m18.717s
[root@jubi-bm-ol8 upstream]# uname -a
Linux jubi-bm-ol8 5.7.0-1700.20200601.el8uek.base.x86_64 #1 SMP Fri Jun 19 07:41:06 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux


Result of v5.7 with Eric's fix

=================

[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.889s
userÂÂÂ 0m14.215s
sysÂÂÂ 0m16.203s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.872s
userÂÂÂ 0m14.431s
sysÂÂÂ 0m17.737s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.908s
userÂÂÂ 0m14.274s
sysÂÂÂ 0m15.377s
[root@jubi-bm-ol8 upstream]# time ./perf_test

realÂÂÂ 0m4.937s
userÂÂÂ 0m14.632s
sysÂÂÂ 0m16.255s
[root@jubi-bm-ol8 upstream]# uname -a
Linux jubi-bm-ol8 5.7.0-1700.20200601.el8uek.procfix.x86_64 #1 SMP Fri Jun 19 07:42:16 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux

Thanks,

Junxiao.

int i, j, one = 1;

for (i = 0; i < 1000; i++) {
for (j = 0; j < 100; j++)
pthread_create(&threads[i % 20][j], NULL, worker, &one);
if (i < 5)
continue;
for (j = 0; j < 100; j++)
pthread_cancel(threads[(i - 5) %20][j]);
}

return 0;
}