Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression

From: Qi Zheng
Date: Sun May 28 2023 - 22:39:39 EST

Next message: Kirill A. Shutemov: "Re: [patch v3 31/36] x86/apic: Provide cpu_primary_thread mask"
Previous message: Chris Packham: "Re: New kernel warning after updating from LTS 5.15.110 to 5.15.112 (and 5.15.113)"
In reply to: Paul E. McKenney: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Next in thread: Paul E. McKenney: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Paul,

On 2023/5/27 19:14, Paul E. McKenney wrote:

On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:

On 2023/5/24 19:56, Qi Zheng wrote:

On 2023/5/24 19:08, Qi Zheng wrote:

[...]

Well, I just ran the following command and reproduced the result:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) with commit 42c9db3970483:

stress-ng: info: [11023] setting to a 60 second run per stressor
stress-ng: info: [11023] dispatching hogs: 9 ramfs
stress-ng: info: [11023] stressor       bogo ops real time usr
time sys time   bogo ops/s     bogo ops/s
stress-ng: info: [11023]                           (secs)    (secs)
(secs)   (real time) (usr+sys time)
stress-ng: info: [11023] ramfs            774966     60.00
10.18 169.45     12915.89        4314.26
stress-ng: info: [11023] for a 60.00s run time:
stress-ng: info: [11023]    1920.11s available CPU time
stress-ng: info: [11023]      10.18s user time   ( 0.53%)
stress-ng: info: [11023]     169.44s system time ( 8.82%)
stress-ng: info: [11023]     179.62s total time ( 9.35%)
stress-ng: info: [11023] load average: 8.99 2.69 0.93
stress-ng: info: [11023] successful run completed in 60.00s (1 min,
0.00 secs)

2) with commit f95bdb700bc6b:

stress-ng: info: [37676] dispatching hogs: 9 ramfs
stress-ng: info: [37676] stressor       bogo ops real time usr
time sys time   bogo ops/s     bogo ops/s
stress-ng: info: [37676]                           (secs)    (secs)
(secs)   (real time) (usr+sys time)
stress-ng: info: [37676] ramfs            168673     60.00
1.61   39.66      2811.08        4087.47
stress-ng: info: [37676] for a 60.10s run time:
stress-ng: info: [37676]    1923.36s available CPU time
stress-ng: info: [37676]       1.60s user time   ( 0.08%)
stress-ng: info: [37676]      39.66s system time ( 2.06%)
stress-ng: info: [37676]      41.26s total time ( 2.15%)
stress-ng: info: [37676] load average: 7.69 3.63 2.36
stress-ng: info: [37676] successful run completed in 60.10s (1 min,
0.10 secs)

The bogo ops/s (real time) did drop significantly.

And the memory reclaimation was not triggered in the whole process. so
theoretically no one is in the read critical section of shrinker_srcu.

Then I found that some stress-ng-ramfs processes were in
TASK_UNINTERRUPTIBLE state for a long time:

root       42313 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42314 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42315 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42316 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42317 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42318 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42319 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42320 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42321 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42322 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42323 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42324 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42325 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42326 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
stress-ng-ramfs [run]
root       42327 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42328 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42329 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
stress-ng-ramfs [run]
root       42330 7.9 0.0 69592 1556 pts/0    D    19:00   0:02
stress-ng-ramfs [run]

Their call stack is as follows:

cat /proc/42330/stack

[<0>] __synchronize_srcu.part.21+0x83/0xb0
[<0>] unregister_shrinker+0x85/0xb0
[<0>] deactivate_locked_super+0x27/0x70
[<0>] cleanup_mnt+0xb8/0x140
[<0>] task_work_run+0x65/0x90
[<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
[<0>] syscall_exit_to_user_mode+0x1b/0x40
[<0>] do_syscall_64+0x44/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

+ RCU folks, Is this result as expected? I would have thought that
synchronize_srcu() should return quickly if no one is in the read
critical section. :(

In theory, it would indeed be nice if synchronize_srcu() would do that.
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.

So SRCU has to be careful to not check too frequently.

Got it.

However, if SRCU has been idle for some time, normal synchronize_srcu()
will do an immediate check. And this will of course mark SRCU as
non-idle.

With the following changes, ops/s can return to previous levels:

Or just set rcu_expedited to 1:
echo 1 > /sys/kernel/rcu_expedited

This does cause SRCU to be much more aggressive. This can be a good
choice for small systems, but please keep in mind that this affects normal
RCU as well as SRCU. It will cause RCU to also be much more aggressive,
sending IPIs to CPUs that are (or might be) in RCU read-side critical
sections. Depending on your workload, this might or might not be what
you want RCU to be doing. For example, if you are running aggressive
real-time workloads, it most definitely is not what you want.

Yeah, that's not what I want, a shrinker might run for a long time.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index db2ed6e08f67..90f541b07cd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
        debugfs_entry = shrinker_debugfs_remove(shrinker);
        up_write(&shrinker_rwsem);

-       synchronize_srcu(&shrinker_srcu);
+       synchronize_srcu_expedited(&shrinker_srcu);

If shrinkers are unregistered only occasionally, this is an entirely
reasonable change.

        debugfs_remove_recursive(debugfs_entry);

stress-ng: info: [13159] dispatching hogs: 9 ramfs
stress-ng: info: [13159] stressor       bogo ops real time usr time
sys time   bogo ops/s     bogo ops/s
stress-ng: info: [13159]                           (secs)    (secs)
(secs)   (real time) (usr+sys time)
stress-ng: info: [13159] ramfs            710062     60.00      9.63
157.26     11834.18        4254.75
stress-ng: info: [13159] for a 60.00s run time:
stress-ng: info: [13159]    1920.14s available CPU time
stress-ng: info: [13159]       9.62s user time   ( 0.50%)
stress-ng: info: [13159]     157.26s system time ( 8.19%)
stress-ng: info: [13159]     166.88s total time ( 8.69%)
stress-ng: info: [13159] load average: 9.49 4.02 1.65
stress-ng: info: [13159] successful run completed in 60.00s (1 min,
0.00 secs)

Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
one is in the read critical section?

Yes, in theory we could, but this would be a bad thing in practice.
After all, the point of having synchronize_srcu() be separate from
synchronize_srcu_expedited() is to allow uses that are OK with longer
latency avoid consuming too much CPU. In addition, that longer
SRCU grace-period latency allows the next grace period to handle more
synchronize_srcu() and call_srcu() requests. This amortizes the
overhead of that next grace period over a larger number of updates.

However, your use of synchronize_srcu_expedited() does have that effect,
but only for this call point. Which has the advantage of avoiding
burning excessive quantities of CPU for the other 50+ call points.

Thanks for such a detailed explanation.

Now I think we can continue to try to complete the idea[1] from
Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
work, so it doesn't affect on user-visible unregistration speed.

[1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/

Thanks,
Qi

Thanx, Paul

Next message: Kirill A. Shutemov: "Re: [patch v3 31/36] x86/apic: Provide cpu_primary_thread mask"
Previous message: Chris Packham: "Re: New kernel warning after updating from LTS 5.15.110 to 5.15.112 (and 5.15.113)"
In reply to: Paul E. McKenney: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Next in thread: Paul E. McKenney: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]