Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression

From: Paul E. McKenney
Date: Mon May 29 2023 - 08:51:29 EST

Next message: Zhangjin Wu: "[PATCH v2 0/7] nolibc: add generic part1 of prepare for rv32"
Previous message: Christophe JAILLET: "Re: [PATCH v3] hv_netvsc: Allocate rx indirection table size dynamically"
In reply to: Qi Zheng: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Next in thread: Qi Zheng: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
> Hi Paul,
>
> On 2023/5/27 19:14, Paul E. McKenney wrote:
> > On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> > > On 2023/5/24 19:56, Qi Zheng wrote:
> > > > On 2023/5/24 19:08, Qi Zheng wrote:
> > > >
> > > > [...]
> > > >
> > > > >
> > > > > Well, I just ran the following command and reproduced the result:
> > > > >
> > > > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > > > >
> > > > > 1) with commit 42c9db3970483:
> > > > >
> > > > > stress-ng: info: [11023] setting to a 60 second run per stressor
> > > > > stress-ng: info: [11023] dispatching hogs: 9 ramfs
> > > > > stress-ng: info: [11023] stressor       bogo ops real time usr
> > > > > time sys time   bogo ops/s     bogo ops/s
> > > > > stress-ng: info: [11023]                           (secs)    (secs)
> > > > > (secs)   (real time) (usr+sys time)
> > > > > stress-ng: info: [11023] ramfs            774966     60.00
> > > > > 10.18 169.45     12915.89        4314.26
> > > > > stress-ng: info: [11023] for a 60.00s run time:
> > > > > stress-ng: info: [11023]    1920.11s available CPU time
> > > > > stress-ng: info: [11023]      10.18s user time   ( 0.53%)
> > > > > stress-ng: info: [11023]     169.44s system time ( 8.82%)
> > > > > stress-ng: info: [11023]     179.62s total time ( 9.35%)
> > > > > stress-ng: info: [11023] load average: 8.99 2.69 0.93
> > > > > stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> > > > > 0.00 secs)
> > > > >
> > > > > 2) with commit f95bdb700bc6b:
> > > > >
> > > > > stress-ng: info: [37676] dispatching hogs: 9 ramfs
> > > > > stress-ng: info: [37676] stressor       bogo ops real time usr
> > > > > time sys time   bogo ops/s     bogo ops/s
> > > > > stress-ng: info: [37676]                           (secs)    (secs)
> > > > > (secs)   (real time) (usr+sys time)
> > > > > stress-ng: info: [37676] ramfs            168673     60.00
> > > > > 1.61   39.66      2811.08        4087.47
> > > > > stress-ng: info: [37676] for a 60.10s run time:
> > > > > stress-ng: info: [37676]    1923.36s available CPU time
> > > > > stress-ng: info: [37676]       1.60s user time   ( 0.08%)
> > > > > stress-ng: info: [37676]      39.66s system time ( 2.06%)
> > > > > stress-ng: info: [37676]      41.26s total time ( 2.15%)
> > > > > stress-ng: info: [37676] load average: 7.69 3.63 2.36
> > > > > stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> > > > > 0.10 secs)
> > > > >
> > > > > The bogo ops/s (real time) did drop significantly.
> > > > >
> > > > > And the memory reclaimation was not triggered in the whole process. so
> > > > > theoretically no one is in the read critical section of shrinker_srcu.
> > > > >
> > > > > Then I found that some stress-ng-ramfs processes were in
> > > > > TASK_UNINTERRUPTIBLE state for a long time:
> > > > >
> > > > > root       42313 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42314 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42315 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42316 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42317 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42318 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42319 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42320 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42321 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42322 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42323 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42324 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42325 7.8 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42326 0.0 0.0 69592 2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42327 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42328 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42329 7.9 0.0 69592 1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42330 7.9 0.0 69592 1556 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > >
> > > > > Their call stack is as follows:
> > > > >
> > > > > cat /proc/42330/stack
> > > > >
> > > > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > > > [<0>] unregister_shrinker+0x85/0xb0
> > > > > [<0>] deactivate_locked_super+0x27/0x70
> > > > > [<0>] cleanup_mnt+0xb8/0x140
> > > > > [<0>] task_work_run+0x65/0x90
> > > > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > > > [<0>] do_syscall_64+0x44/0x80
> > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > >
> > > > > + RCU folks, Is this result as expected? I would have thought that
> > > > > synchronize_srcu() should return quickly if no one is in the read
> > > > > critical section. :(
> >
> > In theory, it would indeed be nice if synchronize_srcu() would do that.
> > In practice, the act of checking to see if there is anyone in an SRCU
> > read-side critical section is a heavy-weight operation, involving at
> > least one cache miss per CPU along with a number of full memory barriers.
> >
> > So SRCU has to be careful to not check too frequently.
>
> Got it.
>
> >
> > However, if SRCU has been idle for some time, normal synchronize_srcu()
> > will do an immediate check. And this will of course mark SRCU as
> > non-idle.
> >
> > > > With the following changes, ops/s can return to previous levels:
> > >
> > > Or just set rcu_expedited to 1:
> > > echo 1 > /sys/kernel/rcu_expedited
> >
> > This does cause SRCU to be much more aggressive. This can be a good
> > choice for small systems, but please keep in mind that this affects normal
> > RCU as well as SRCU. It will cause RCU to also be much more aggressive,
> > sending IPIs to CPUs that are (or might be) in RCU read-side critical
> > sections. Depending on your workload, this might or might not be what
> > you want RCU to be doing. For example, if you are running aggressive
> > real-time workloads, it most definitely is not what you want.
>
> Yeah, that's not what I want, a shrinker might run for a long time.
>
> >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index db2ed6e08f67..90f541b07cd1 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> > > >         debugfs_entry = shrinker_debugfs_remove(shrinker);
> > > >         up_write(&shrinker_rwsem);
> > > >
> > > > -       synchronize_srcu(&shrinker_srcu);
> > > > +       synchronize_srcu_expedited(&shrinker_srcu);
> >
> > If shrinkers are unregistered only occasionally, this is an entirely
> > reasonable change.
> >
> > > >         debugfs_remove_recursive(debugfs_entry);
> > > >
> > > > stress-ng: info: [13159] dispatching hogs: 9 ramfs
> > > > stress-ng: info: [13159] stressor       bogo ops real time usr time
> > > > sys time   bogo ops/s     bogo ops/s
> > > > stress-ng: info: [13159]                           (secs)    (secs)
> > > > (secs)   (real time) (usr+sys time)
> > > > stress-ng: info: [13159] ramfs            710062     60.00      9.63
> > > > 157.26     11834.18        4254.75
> > > > stress-ng: info: [13159] for a 60.00s run time:
> > > > stress-ng: info: [13159]    1920.14s available CPU time
> > > > stress-ng: info: [13159]       9.62s user time   ( 0.50%)
> > > > stress-ng: info: [13159]     157.26s system time ( 8.19%)
> > > > stress-ng: info: [13159]     166.88s total time ( 8.69%)
> > > > stress-ng: info: [13159] load average: 9.49 4.02 1.65
> > > > stress-ng: info: [13159] successful run completed in 60.00s (1 min,
> > > > 0.00 secs)
> > > >
> > > > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > > > one is in the read critical section?
> >
> > Yes, in theory we could, but this would be a bad thing in practice.
> > After all, the point of having synchronize_srcu() be separate from
> > synchronize_srcu_expedited() is to allow uses that are OK with longer
> > latency avoid consuming too much CPU. In addition, that longer
> > SRCU grace-period latency allows the next grace period to handle more
> > synchronize_srcu() and call_srcu() requests. This amortizes the
> > overhead of that next grace period over a larger number of updates.
> >
> > However, your use of synchronize_srcu_expedited() does have that effect,
> > but only for this call point. Which has the advantage of avoiding
> > burning excessive quantities of CPU for the other 50+ call points.
>
> Thanks for such a detailed explanation.
>
> Now I think we can continue to try to complete the idea[1] from
> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
> work, so it doesn't affect on user-visible unregistration speed.
>
> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/

A blast from the past! ;-)

But yes, moving the long-latency synchronize_srcu() off the user-visible
critical code path can be even better.

Thanx, Paul

Next message: Zhangjin Wu: "[PATCH v2 0/7] nolibc: add generic part1 of prepare for rv32"
Previous message: Christophe JAILLET: "Re: [PATCH v3] hv_netvsc: Allocate rx indirection table size dynamically"
In reply to: Qi Zheng: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Next in thread: Qi Zheng: "Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]