Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression

From: Paul E. McKenney
Date: Sat May 27 2023 - 07:14:18 EST


On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> On 2023/5/24 19:56, Qi Zheng wrote:
> > On 2023/5/24 19:08, Qi Zheng wrote:
> >
> > [...]
> >
> > >
> > > Well, I just ran the following command and reproduced the result:
> > >
> > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > >
> > > 1) with commit 42c9db3970483:
> > >
> > > stress-ng: info:  [11023] setting to a 60 second run per stressor
> > > stress-ng: info:  [11023] dispatching hogs: 9 ramfs
> > > stress-ng: info:  [11023] stressor       bogo ops real time  usr
> > > time sys time   bogo ops/s     bogo ops/s
> > > stress-ng: info:  [11023]                           (secs)    (secs)
> > > (secs)   (real time) (usr+sys time)
> > > stress-ng: info:  [11023] ramfs            774966     60.00    
> > > 10.18 169.45     12915.89        4314.26
> > > stress-ng: info:  [11023] for a 60.00s run time:
> > > stress-ng: info:  [11023]    1920.11s available CPU time
> > > stress-ng: info:  [11023]      10.18s user time   (  0.53%)
> > > stress-ng: info:  [11023]     169.44s system time (  8.82%)
> > > stress-ng: info:  [11023]     179.62s total time  (  9.35%)
> > > stress-ng: info:  [11023] load average: 8.99 2.69 0.93
> > > stress-ng: info:  [11023] successful run completed in 60.00s (1 min,
> > > 0.00 secs)
> > >
> > > 2) with commit f95bdb700bc6b:
> > >
> > > stress-ng: info:  [37676] dispatching hogs: 9 ramfs
> > > stress-ng: info:  [37676] stressor       bogo ops real time  usr
> > > time sys time   bogo ops/s     bogo ops/s
> > > stress-ng: info:  [37676]                           (secs)    (secs)
> > > (secs)   (real time) (usr+sys time)
> > > stress-ng: info:  [37676] ramfs            168673     60.00     
> > > 1.61   39.66      2811.08        4087.47
> > > stress-ng: info:  [37676] for a 60.10s run time:
> > > stress-ng: info:  [37676]    1923.36s available CPU time
> > > stress-ng: info:  [37676]       1.60s user time   (  0.08%)
> > > stress-ng: info:  [37676]      39.66s system time (  2.06%)
> > > stress-ng: info:  [37676]      41.26s total time  (  2.15%)
> > > stress-ng: info:  [37676] load average: 7.69 3.63 2.36
> > > stress-ng: info:  [37676] successful run completed in 60.10s (1 min,
> > > 0.10 secs)
> > >
> > > The bogo ops/s (real time) did drop significantly.
> > >
> > > And the memory reclaimation was not triggered in the whole process. so
> > > theoretically no one is in the read critical section of shrinker_srcu.
> > >
> > > Then I found that some stress-ng-ramfs processes were in
> > > TASK_UNINTERRUPTIBLE state for a long time:
> > >
> > > root       42313  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42314  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42315  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42316  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42317  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42318  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42319  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42320  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42321  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42322  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42323  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42324  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42325  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42326  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > stress-ng-ramfs [run]
> > > root       42327  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42328  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42329  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > > root       42330  7.9  0.0  69592  1556 pts/0    D    19:00   0:02
> > > stress-ng-ramfs [run]
> > >
> > > Their call stack is as follows:
> > >
> > > cat /proc/42330/stack
> > >
> > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > [<0>] unregister_shrinker+0x85/0xb0
> > > [<0>] deactivate_locked_super+0x27/0x70
> > > [<0>] cleanup_mnt+0xb8/0x140
> > > [<0>] task_work_run+0x65/0x90
> > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > [<0>] do_syscall_64+0x44/0x80
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > >
> > > + RCU folks, Is this result as expected? I would have thought that
> > > synchronize_srcu() should return quickly if no one is in the read
> > > critical section. :(

In theory, it would indeed be nice if synchronize_srcu() would do that.
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.

So SRCU has to be careful to not check too frequently.

However, if SRCU has been idle for some time, normal synchronize_srcu()
will do an immediate check. And this will of course mark SRCU as
non-idle.

> > With the following changes, ops/s can return to previous levels:
>
> Or just set rcu_expedited to 1:
> echo 1 > /sys/kernel/rcu_expedited

This does cause SRCU to be much more aggressive. This can be a good
choice for small systems, but please keep in mind that this affects normal
RCU as well as SRCU. It will cause RCU to also be much more aggressive,
sending IPIs to CPUs that are (or might be) in RCU read-side critical
sections. Depending on your workload, this might or might not be what
you want RCU to be doing. For example, if you are running aggressive
real-time workloads, it most definitely is not what you want.

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index db2ed6e08f67..90f541b07cd1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> >         debugfs_entry = shrinker_debugfs_remove(shrinker);
> >         up_write(&shrinker_rwsem);
> >
> > -       synchronize_srcu(&shrinker_srcu);
> > +       synchronize_srcu_expedited(&shrinker_srcu);

If shrinkers are unregistered only occasionally, this is an entirely
reasonable change.

> >         debugfs_remove_recursive(debugfs_entry);
> >
> > stress-ng: info:  [13159] dispatching hogs: 9 ramfs
> > stress-ng: info:  [13159] stressor       bogo ops real time  usr time
> > sys time   bogo ops/s     bogo ops/s
> > stress-ng: info:  [13159]                           (secs)    (secs)
> > (secs)   (real time) (usr+sys time)
> > stress-ng: info:  [13159] ramfs            710062     60.00      9.63
> > 157.26     11834.18        4254.75
> > stress-ng: info:  [13159] for a 60.00s run time:
> > stress-ng: info:  [13159]    1920.14s available CPU time
> > stress-ng: info:  [13159]       9.62s user time   (  0.50%)
> > stress-ng: info:  [13159]     157.26s system time (  8.19%)
> > stress-ng: info:  [13159]     166.88s total time  (  8.69%)
> > stress-ng: info:  [13159] load average: 9.49 4.02 1.65
> > stress-ng: info:  [13159] successful run completed in 60.00s (1 min,
> > 0.00 secs)
> >
> > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > one is in the read critical section?

Yes, in theory we could, but this would be a bad thing in practice.
After all, the point of having synchronize_srcu() be separate from
synchronize_srcu_expedited() is to allow uses that are OK with longer
latency avoid consuming too much CPU. In addition, that longer
SRCU grace-period latency allows the next grace period to handle more
synchronize_srcu() and call_srcu() requests. This amortizes the
overhead of that next grace period over a larger number of updates.

However, your use of synchronize_srcu_expedited() does have that effect,
but only for this call point. Which has the advantage of avoiding
burning excessive quantities of CPU for the other 50+ call points.

Thanx, Paul