RE: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow

From: Shameerali Kolothum Thodi
Date: Mon Jun 13 2022 - 17:15:23 EST




> -----Original Message-----
> From: Paul E. McKenney [mailto:paulmck@xxxxxxxxxx]
> Sent: 13 June 2022 15:59
> To: zhangfei.gao@xxxxxxxxxxx
> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>; Zhangfei Gao
> <zhangfei.gao@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> rcu@xxxxxxxxxxxxxxx; Lai Jiangshan <jiangshanlai@xxxxxxxxx>; Josh Triplett
> <josh@xxxxxxxxxxxxxxxx>; Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx>; Matthew Wilcox <willy@xxxxxxxxxxxxx>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx>;
> mtosatti@xxxxxxxxxx; Auger Eric <eric.auger@xxxxxxxxxx>
> Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
> blocking readers from consuming CPU) cause qemu boot slow
>
> On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@xxxxxxxxxxx
> wrote:
> >
> >
> > On 2022/6/13 下午8:18, Paul E. McKenney wrote:
> > > On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@xxxxxxxxxxx
> wrote:
> > > > Hi, Paul
> > > >
> > > > On 2022/6/13 下午12:16, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Jun 13, 2022 at 11:04:39AM +0800,
> zhangfei.gao@xxxxxxxxxxx wrote:
> > > > > > > Hi, Paul
> > > > > > >
> > > > > > > On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> > > > > > > > On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini
> wrote:
> > > > > > > > > On 6/12/22 18:40, Paul E. McKenney wrote:
> > > > > > > > > > > Do these reserved memory regions really need to be
> allocated separately?
> > > > > > > > > > > (For example, are they really all non-contiguous? If not,
> that is, if
> > > > > > > > > > > there are a lot of contiguous memory regions, could you
> sort the IORT
> > > > > > > > > > > by address and do one ioctl() for each set of contiguous
> memory regions?)
> > > > > > > > > > >
> > > > > > > > > > > Are all of these reserved memory regions set up before init
> is spawned?
> > > > > > > > > > >
> > > > > > > > > > > Are all of these reserved memory regions set up while
> there is only a
> > > > > > > > > > > single vCPU up and running?
> > > > > > > > > > >
> > > > > > > > > > > Is the SRCU grace period really needed in this case? (I
> freely confess
> > > > > > > > > > > to not being all that familiar with KVM.)
> > > > > > > > > > Oh, and there was a similar many-requests problem with
> networking many
> > > > > > > > > > years ago. This was solved by adding a new
> syscall/ioctl()/whatever
> > > > > > > > > > that permitted many requests to be presented to the kernel
> with a single
> > > > > > > > > > system call.
> > > > > > > > > >
> > > > > > > > > > Could a new ioctl() be introduced that requested a large
> number
> > > > > > > > > > of these memory regions in one go so as to make each call to
> > > > > > > > > > synchronize_rcu_expedited() cover a useful fraction of your
> 9000+
> > > > > > > > > > requests? Adding a few of the KVM guys on CC for their
> thoughts.
> > > > > > > > > Unfortunately not. Apart from this specific case, in general
> the calls to
> > > > > > > > > KVM_SET_USER_MEMORY_REGION are triggered by writes to
> I/O registers in the
> > > > > > > > > guest, and those writes then map to a ioctl. Typically the
> guest sets up a
> > > > > > > > > device at a time, and each setup step causes a
> synchronize_srcu()---and
> > > > > > > > > expedited at that.
> > > > > > > > I was afraid of something like that...
> > > > > > > >
> > > > > > > > > KVM has two SRCUs:
> > > > > > > > >
> > > > > > > > > 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it
> has readers
> > > > > > > > > that are very very small, but it needs extremely fast detection
> of grace
> > > > > > > > > periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> > > > > > > > > KVM_SET_GSI_ROUTING", 2014-05-05) which split it off
> kvm->srcu. Readers are
> > > > > > > > > not so frequent.
> > > > > > > > >
> > > > > > > > > 2) kvm->srcu is nastier because there are readers all the time.
> The
> > > > > > > > > read-side critical section are still short-ish, but they need the
> sleepable
> > > > > > > > > part because they access user memory.
> > > > > > > > Which one of these two is in play in this case?
> > > > > > > >
> > > > > > > > > Writers are not frequent per se; the problem is they come in
> very large
> > > > > > > > > bursts when a guest boots. And while the whole boot path
> overall can be
> > > > > > > > > quadratic, O(n) expensive calls to synchronize_srcu() can have
> a larger
> > > > > > > > > impact on runtime than the O(n^2) parts, as demonstrated
> here.
> > > > > > > > >
> > > > > > > > > Therefore, we operated on the assumption that the callers of
> > > > > > > > > synchronized_srcu_expedited were _anyway_ busy running
> CPU-bound guest code
> > > > > > > > > and the desire was to get past the booting phase as fast as
> possible. If
> > > > > > > > > the guest wants to eat host CPU it can "for(;;)" as much as it
> wants;
> > > > > > > > > therefore, as long as expedited GPs didn't eat CPU
> *throughout the whole
> > > > > > > > > system*, a preemptable busy wait in
> synchronize_srcu_expedited() were not
> > > > > > > > > problematic.
> > > > > > > > >
> > > > > > > > > This assumptions did match the SRCU code when kvm->srcu
> and kvm->irq_srcu
> > > > > > > > > were was introduced (respectively in 2009 and 2014). But
> perhaps they do
> > > > > > > > > not hold anymore now that each SRCU is not as independent
> as it used to be
> > > > > > > > > in those years, and instead they use workqueues instead?
> > > > > > > > The problem was not internal to SRCU, but rather due to the fact
> > > > > > > > that kernel live patching (KLP) had problems with the
> CPU-bound tasks
> > > > > > > > resulting from repeated synchronize_rcu_expedited()
> invocations. So I
> > > > > > > > added heuristics to get the occasional sleep in there for KLP's
> benefit.
> > > > > > > > Perhaps these heuristics need to be less aggressive about adding
> sleep.
> > > > > > > >
> > > > > > > > These heuristics have these aspects:
> > > > > > > >
> > > > > > > > 1. The longer readers persist in an expedited SRCU grace period,
> > > > > > > > the longer the wait between successive checks of the reader
> > > > > > > > state. Roughly speaking, we wait as long as the grace period
> > > > > > > > has currently been in effect, capped at ten jiffies.
> > > > > > > >
> > > > > > > > 2. SRCU grace periods have several phases. We reset so that
> each
> > > > > > > > phase starts by not waiting (new phase, new set of readers,
> > > > > > > > so don't penalize this set for the sins of the previous set).
> > > > > > > > But once we get to the point of adding delay, we add the
> > > > > > > > delay based on the beginning of the full grace period.
> > > > > > > >
> > > > > > > > Right now, the checking for grace-period length does not allow
> for the
> > > > > > > > possibility that a grace period might start just before the jiffies
> > > > > > > > counter gets incremented (because I didn't realize that anyone
> cared),
> > > > > > > > so that is one possible thing to change. I can also allow more
> no-delay
> > > > > > > > checks per SRCU grace-period phase.
> > > > > > > >
> > > > > > > > Zhangfei, does something like the patch shown below help?
> > > > > > > >
> > > > > > > > Additional adjustments are likely needed to avoid re-breaking
> KLP,
> > > > > > > > but we have to start somewhere...
> > > > > > > >
> > > > > > > > Thanx, Paul
> > > > > > > >
> > > > > > > > ------------------------------------------------------------------------
> > > > > > > >
> > > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > > > > > index 50ba70f019dea..6a354368ac1d1 100644
> > > > > > > > --- a/kernel/rcu/srcutree.c
> > > > > > > > +++ b/kernel/rcu/srcutree.c
> > > > > > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > > > > > > #define SRCU_INTERVAL 1 // Base delay if no
> expedited GPs pending.
> > > > > > > > #define SRCU_MAX_INTERVAL 10 // Maximum
> incremental delay from slow readers.
> > > > > > > > -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum
> per-GP-phase consecutive no-delay instances.
> > > > > > > > +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum
> per-GP-phase consecutive no-delay instances.
> > > > > > > > #define SRCU_MAX_NODELAY 100 // Maximum
> consecutive no-delay instances.
> > > > > > > > /*
> > > > > > > > @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > > > > > > */
> > > > > > > > static unsigned long srcu_get_delay(struct srcu_struct
> *ssp)
> > > > > > > > {
> > > > > > > > + unsigned long gpstart;
> > > > > > > > + unsigned long j;
> > > > > > > > unsigned long jbase = SRCU_INTERVAL;
> > > > > > > > if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > > > > > jbase = 0;
> > > > > > > > - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > > > > > - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> > > > > > > > + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > > > > > + j = jiffies - 1;
> > > > > > > > + gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > > > > > + if (time_after(j, gpstart))
> > > > > > > > + jbase += j - gpstart;
> > > > > > > > + }
> > > > > > > > if (!jbase) {
> > > > > > > > WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > > > > > if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> SRCU_MAX_NODELAY_PHASE)
> > > > > > > Unfortunately, this patch does not helpful.
> > > > > > >
> > > > > > > Then re-add the debug info.
> > > > > > >
> > > > > > > During the qemu boot
> > > > > > > [  232.997667]  __synchronize_srcu loop=1000
> > > > > > >
> > > > > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > > > > [  361.094501] Call trace:
> > > > > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > > > > [  361.094505]  show_stack+0x20/0x70
> > > > > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > > > > [  361.094509]  dump_stack+0x18/0x34
> > > > > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > > > > [  361.094538]  el0_svc+0x30/0x98
> > > > > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > > > > Huh.
> > > > > >
> > > > > > One possibility is that the "if (!jbase)" block needs to be nested
> > > > > > within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {"
> block.
> > > > I test this diff and NO helpful
> > > >
> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > index 50ba70f019de..36286a4b74e6 100644
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
> srcu_struct *ssp)
> > > >
> > > >  #define SRCU_INTERVAL          1       // Base delay if no
> expedited GPs
> > > > pending.
> > > >  #define SRCU_MAX_INTERVAL      10      // Maximum
> incremental delay from
> > > > slow readers.
> > > > -#define SRCU_MAX_NODELAY_PHASE 1       // Maximum
> per-GP-phase consecutive
> > > > no-delay instances.
> > > > +#define SRCU_MAX_NODELAY_PHASE 3       // Maximum
> per-GP-phase consecutive
> > > > no-delay instances.
> > > >  #define SRCU_MAX_NODELAY       100     // Maximum
> consecutive no-delay
> > > > instances.
> > > >
> > > >  /*
> > > > @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct
> srcu_struct
> > > > *ssp)
> > > >   */
> > > >  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > >  {
> > > > +       unsigned long gpstart;
> > > > +       unsigned long j;
> > > >         unsigned long jbase = SRCU_INTERVAL;
> > > >
> > > >         if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > >                 jbase = 0;
> > > > -       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> > > > -               jbase += jiffies -
> READ_ONCE(ssp->srcu_gp_start);
> > > > -       if (!jbase) {
> > > > -               WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > -               if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > > SRCU_MAX_NODELAY_PHASE)
> > > > -                       jbase = 1;
> > > > +       if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> > > > +               j = jiffies - 1;
> > > > +               gpstart = READ_ONCE(ssp->srcu_gp_start);
> > > > +               if (time_after(j, gpstart))
> > > > +                       jbase += j - gpstart;
> > > > +
> > > > +               if (!jbase) {
> > > >
> +                       WRITE_ONCE(ssp->srcu_n_exp_nodelay,
> > > > READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> > > > +                       if
> (READ_ONCE(ssp->srcu_n_exp_nodelay) >
> > > > SRCU_MAX_NODELAY_PHASE)
> > > > +                               jbase = 1;
> > > > +               }
> > > >         }
> > > That is in fact what I was intending you to test, thank you. As you
> > > say, unfortunately it did not help.
> > >
> > > Could you please test removing the "if (!jbase)" block entirely?
> > Remove "if (!jbase)" block is much faster,
> > not measure clearly, qemu (with debug version efi) boot seems normally.
> >
> > From log timestamp:
> > [  114.624713] __synchronize_srcu loop=1000
> > [  124.157011]  __synchronize_srcu loop=9000
> >
> > Several method: timestamps are different.
> >
> > 5.19-rc1
> > [ 94.271350] __synchronize_srcu loop=1001
> > [ 222.621659] __synchronize_srcu loop=9001
> >
> >
> > With your first diff:
> > [ 232.997667] __synchronize_srcu loop=1000
> > [ 361.094493] __synchronize_srcu loop=9000
> >
> > Remove "if (!jbase)" block
> > [ 114.624713] __synchronize_srcu loop=1000
> > [ 124.157011] __synchronize_srcu loop=9000
> >
> >
> > 5.18 method
> > + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > + return 0;
> > + return SRCU_INTERVAL;
> >
> > [ 74.598480] __synchronize_srcu loop=9000
> > [ 68.938297] __synchronize_srcu loop=1000
>
> Thank you for the information!
>
> What happens if you keep the that "if (!jbase)" block", but set the
> value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?

From the setup I have, this is almost similar to that of the previous logic(without
the "if(!jbase)"). In both cases, I think we are not close to 5.18, but definitely much
better compared to 5.19-rc1.

The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y),

Guest boot time(using 'time'):

5.18-rc4 based ~8sec

5.19-rc1 ~2m43sec

5.19-rc1+fix1 ~19sec

5.19-rc1-fix2 ~19sec

I will wait for Zhangfei to confirm this on his setup, especially the difference
compared to 5.18.

Thanks,
Shameer

> This would be too large for KLP, but my hope is that there is a value
> of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does
> this help at all? ;-)
>
> > > > > And when I run 10,000 consecutive synchronize_rcu_expedited() calls,
> the
> > > > > above change reduces the overhead by more than an order of
> magnitude.
> > > > > Except that the overhead of the series is far less than one second,
> > > > > not the several minutes that you are seeing. So the per-call
> overhead
> > > > > decreases from about 17 microseconds to a bit more than one
> microsecond.
> > > > >
> > > > > I could imagine an extra order of magnitude if you are running
> HZ=100
> > > > > instead of the HZ=1000 that I am running. But that only gets up to a
> > > > > few seconds.
> > > One possible reason for the difference would be if your code has
> > > SRCU readers.
> > >
> > > Could you please tell me the value of CONFIG_HZ on your system?
> > > Also the value of CONFIG_PREEMPTION?
> > I am using arch/arm64/configs/defconfig
> > make defconfig
> > CONFIG_PREEMPTION=y
> > CONFIG_HZ_250=y
>
> Thank you again!
>
> And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might
> depend
> on HZ. And who knows what all else...
>
> Thanx, Paul
>
> > Thanks
> >
> > >
> > > > > > One additional debug is to apply the patch below on top of the one
> you
> > > > apply the patch below?
> > > > > > just now kindly tested, then use whatever debug technique you wish
> to
> > > > > > work out what fraction of the time during that critical interval that
> > > > > > srcu_get_delay() returns non-zero.
> > > > Sorry, I am confused, no patch right?
> > > Apologies, my omission.
> > >
> > > > Just measure srcu_get_delay return to non-zero?
> > > Exactly, please!
> > >
> > > > By the way, the issue should be only related with qemu apci. not related
> > > > with rmr feature
> > > > Test with: https://github.com/qemu/qemu/tree/stable-6.1
> > > >
> > > > Looks it caused by too many kvm_region_add & kvm_region_del if
> acpi=force,
> > > > If no acpi, no print kvm_region_add/del (1000 times print once)
> > > >
> > > > If with acpi=force,
> > > > During qemu boot
> > > > kvm_region_add region_add = 1000
> > > > kvm_region_del region_del = 1000
> > > > kvm_region_add region_add = 2000
> > > > kvm_region_del region_del = 2000
> > > > kvm_region_add region_add = 3000
> > > > kvm_region_del region_del = 3000
> > > > kvm_region_add region_add = 4000
> > > > kvm_region_del region_del = 4000
> > > > kvm_region_add region_add = 5000
> > > > kvm_region_del region_del = 5000
> > > > kvm_region_add region_add = 6000
> > > > kvm_region_del region_del = 6000
> > > >
> > > > kvm_region_add/kvm_region_del ->
> > > > kvm_set_phys_mem->
> > > > kvm_set_user_memory_region->
> > > > kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
> > > >
> > > > [  361.094493]  __synchronize_srcu loop=9000
> > > > [  361.094501] Call trace:
> > > > [  361.094502]  dump_backtrace+0xe4/0xf0
> > > > [  361.094505]  show_stack+0x20/0x70
> > > > [  361.094507]  dump_stack_lvl+0x8c/0xb8
> > > > [  361.094509]  dump_stack+0x18/0x34
> > > > [  361.094511]  __synchronize_srcu+0x120/0x128
> > > > [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> > > > [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> > > > [  361.094519]  kvm_activate_memslot+0x40/0x68
> > > > [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> > > > [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> > > > [  361.094524]  kvm_set_memory_region+0x78/0xb8
> > > > [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> > > > [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> > > > [  361.094530]  invoke_syscall+0x4c/0x110
> > > > [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> > > > [  361.094536]  do_el0_svc+0x34/0xc0
> > > > [  361.094538]  el0_svc+0x30/0x98
> > > > [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> > > > [  361.094544]  el0t_64_sync+0x18c/0x190
> > > > [  363.942817]  kvm_set_memory_region loop=6000
> > > Good to know, thank you!
> > >
> > > Thanx, Paul
> >