-----Original Message-----
From: Paul E. McKenney [mailto:paulmck@xxxxxxxxxx]
Sent: 13 June 2022 15:59
To: zhangfei.gao@xxxxxxxxxxx
Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>; Zhangfei Gao
<zhangfei.gao@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
rcu@xxxxxxxxxxxxxxx; Lai Jiangshan <jiangshanlai@xxxxxxxxx>; Josh Triplett
<josh@xxxxxxxxxxxxxxxx>; Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx>; Matthew Wilcox <willy@xxxxxxxxxxxxx>;
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx>;
mtosatti@xxxxxxxxxx; Auger Eric <eric.auger@xxxxxxxxxx>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and
blocking readers from consuming CPU) cause qemu boot slow
On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@xxxxxxxxxxx
wrote:
wrote:
On 2022/6/13 下午8:18, Paul E. McKenney wrote:
On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@xxxxxxxxxxx
zhangfei.gao@xxxxxxxxxxx wrote:Hi, Paul
On 2022/6/13 下午12:16, Paul E. McKenney wrote:
On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
On Mon, Jun 13, 2022 at 11:04:39AM +0800,
wrote:Hi, Paul
On 2022/6/13 上午2:49, Paul E. McKenney wrote:
On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini
allocated separately?On 6/12/22 18:40, Paul E. McKenney wrote:
Do these reserved memory regions really need to be
that is, if(For example, are they really all non-contiguous? If not,
sort the IORTthere are a lot of contiguous memory regions, could you
memory regions?)by address and do one ioctl() for each set of contiguous
is spawned?
Are all of these reserved memory regions set up before init
there is only a
Are all of these reserved memory regions set up while
freely confesssingle vCPU up and running?
Is the SRCU grace period really needed in this case? (I
networking manyto not being all that familiar with KVM.)Oh, and there was a similar many-requests problem with
syscall/ioctl()/whateveryears ago. This was solved by adding a new
with a singlethat permitted many requests to be presented to the kernel
numbersystem call.
Could a new ioctl() be introduced that requested a large
9000+of these memory regions in one go so as to make each call to
synchronize_rcu_expedited() cover a useful fraction of your
thoughts.requests? Adding a few of the KVM guys on CC for their
the calls toUnfortunately not. Apart from this specific case, in general
I/O registers in theKVM_SET_USER_MEMORY_REGION are triggered by writes to
guest sets up aguest, and those writes then map to a ioctl. Typically the
synchronize_srcu()---anddevice at a time, and each setup step causes a
has readersexpedited at that.I was afraid of something like that...
KVM has two SRCUs:
1) kvm->irq_srcu is hardly relying on the "sleepable" part; it
of gracethat are very very small, but it needs extremely fast detection
kvm->srcu. Readers areperiods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
KVM_SET_GSI_ROUTING", 2014-05-05) which split it off
Thenot so frequent.
2) kvm->srcu is nastier because there are readers all the time.
sleepableread-side critical section are still short-ish, but they need the
very largepart because they access user memory.Which one of these two is in play in this case?
Writers are not frequent per se; the problem is they come in
overall can bebursts when a guest boots. And while the whole boot path
a largerquadratic, O(n) expensive calls to synchronize_srcu() can have
here.impact on runtime than the O(n^2) parts, as demonstrated
CPU-bound guest code
Therefore, we operated on the assumption that the callers of
synchronized_srcu_expedited were _anyway_ busy running
possible. Ifand the desire was to get past the booting phase as fast as
wants;the guest wants to eat host CPU it can "for(;;)" as much as it
*throughout the wholetherefore, as long as expedited GPs didn't eat CPU
synchronize_srcu_expedited() were notsystem*, a preemptable busy wait in
and kvm->irq_srcuproblematic.
This assumptions did match the SRCU code when kvm->srcu
perhaps they dowere was introduced (respectively in 2009 and 2014). But
as it used to benot hold anymore now that each SRCU is not as independent
CPU-bound tasksin those years, and instead they use workqueues instead?The problem was not internal to SRCU, but rather due to the fact
that kernel live patching (KLP) had problems with the
invocations. So Iresulting from repeated synchronize_rcu_expedited()
benefit.added heuristics to get the occasional sleep in there for KLP's
sleep.Perhaps these heuristics need to be less aggressive about adding
each
These heuristics have these aspects:
1. The longer readers persist in an expedited SRCU grace period,
the longer the wait between successive checks of the reader
state. Roughly speaking, we wait as long as the grace period
has currently been in effect, capped at ten jiffies.
2. SRCU grace periods have several phases. We reset so that
for thephase starts by not waiting (new phase, new set of readers,
so don't penalize this set for the sins of the previous set).
But once we get to the point of adding delay, we add the
delay based on the beginning of the full grace period.
Right now, the checking for grace-period length does not allow
cared),possibility that a grace period might start just before the jiffies
counter gets incremented (because I didn't realize that anyone
no-delayso that is one possible thing to change. I can also allow more
KLP,checks per SRCU grace-period phase.
Zhangfei, does something like the patch shown below help?
Additional adjustments are likely needed to avoid re-breaking
srcu_struct *ssp)but we have to start somewhere...
Thanx, Paul
------------------------------------------------------------------------
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019dea..6a354368ac1d1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
expedited GPs pending.#define SRCU_INTERVAL 1 // Base delay if no
incremental delay from slow readers.#define SRCU_MAX_INTERVAL 10 // Maximum
per-GP-phase consecutive no-delay instances.-#define SRCU_MAX_NODELAY_PHASE 1 // Maximum
per-GP-phase consecutive no-delay instances.+#define SRCU_MAX_NODELAY_PHASE 3 // Maximum
consecutive no-delay instances.#define SRCU_MAX_NODELAY 100 // Maximum
srcu_struct *ssp)/*
@@ -522,12 +522,18 @@ static bool srcu_readers_active(struct
*ssp)*/
static unsigned long srcu_get_delay(struct srcu_struct
READ_ONCE(ssp->srcu_gp_seq_needed_exp))){
+ unsigned long gpstart;
+ unsigned long j;
unsigned long jbase = SRCU_INTERVAL;
if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);jbase = 0;
- if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
- jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
+ if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+ j = jiffies - 1;
+ gpstart = READ_ONCE(ssp->srcu_gp_start);
+ if (time_after(j, gpstart))
+ jbase += j - gpstart;
+ }
if (!jbase) {
WRITE_ONCE(ssp->srcu_n_exp_nodelay,
SRCU_MAX_NODELAY_PHASE)if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
block.Unfortunately, this patch does not helpful.Huh.
Then re-add the debug info.
During the qemu boot
[ 232.997667] __synchronize_srcu loop=1000
[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
One possibility is that the "if (!jbase)" block needs to be nested
within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {"
srcu_struct *ssp)I test this diff and NO helpful
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct
expedited GPs
#define SRCU_INTERVAL 1 // Base delay if no
incremental delay frompending.
#define SRCU_MAX_INTERVAL 10 // Maximum
per-GP-phase consecutiveslow readers.
-#define SRCU_MAX_NODELAY_PHASE 1 // Maximum
per-GP-phase consecutiveno-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3 // Maximum
consecutive no-delayno-delay instances.
#define SRCU_MAX_NODELAY 100 // Maximum
srcu_structinstances.
/*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct
READ_ONCE(ssp->srcu_gp_start);*ssp)
*/
static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
+ unsigned long gpstart;
+ unsigned long j;
unsigned long jbase = SRCU_INTERVAL;
if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
jbase = 0;
- if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
- jbase += jiffies -
+ WRITE_ONCE(ssp->srcu_n_exp_nodelay,- if (!jbase) {
- WRITE_ONCE(ssp->srcu_n_exp_nodelay,
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
- if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
SRCU_MAX_NODELAY_PHASE)
- jbase = 1;
+ if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+ j = jiffies - 1;
+ gpstart = READ_ONCE(ssp->srcu_gp_start);
+ if (time_after(j, gpstart))
+ jbase += j - gpstart;
+
+ if (!jbase) {
(READ_ONCE(ssp->srcu_n_exp_nodelay) >READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+ if
READ_ONCE(ssp->srcu_gp_seq_needed_exp)))Remove "if (!jbase)" block is much faster,SRCU_MAX_NODELAY_PHASE)That is in fact what I was intending you to test, thank you. As you
+ jbase = 1;
+ }
}
say, unfortunately it did not help.
Could you please test removing the "if (!jbase)" block entirely?
not measure clearly, qemu (with debug version efi) boot seems normally.
From log timestamp:
[ 114.624713] __synchronize_srcu loop=1000
[ 124.157011] __synchronize_srcu loop=9000
Several method: timestamps are different.
5.19-rc1
[ 94.271350] __synchronize_srcu loop=1001
[ 222.621659] __synchronize_srcu loop=9001
With your first diff:
[ 232.997667] __synchronize_srcu loop=1000
[ 361.094493] __synchronize_srcu loop=9000
Remove "if (!jbase)" block
[ 114.624713] __synchronize_srcu loop=1000
[ 124.157011] __synchronize_srcu loop=9000
5.18 method
+ if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
+ return 0;
+ return SRCU_INTERVAL;
[ 74.598480] __synchronize_srcu loop=9000
[ 68.938297] __synchronize_srcu loop=1000
Thank you for the information!
What happens if you keep the that "if (!jbase)" block", but set the
value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?
From the setup I have, this is almost similar to that of the previous logic(without
the "if(!jbase)"). In both cases, I think we are not close to 5.18, but definitely much
better compared to 5.19-rc1.
The numbers from my test setup(CONFIG_HZ_250, CONFIG_PREEMPTION=y),
Guest boot time(using 'time'):
5.18-rc4 based ~8sec
5.19-rc1 ~2m43sec
5.19-rc1+fix1 ~19sec
5.19-rc1-fix2 ~19sec
I will wait for Zhangfei to confirm this on his setup, especially the difference
compared to 5.18.
Thanks,
Shameer
This would be too large for KLP, but my hope is that there is a value
of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does
this help at all? ;-)
theAnd when I run 10,000 consecutive synchronize_rcu_expedited() calls,
magnitude.above change reduces the overhead by more than an order of
overheadExcept that the overhead of the series is far less than one second,
not the several minutes that you are seeing. So the per-call
microsecond.decreases from about 17 microseconds to a bit more than one
HZ=100
I could imagine an extra order of magnitude if you are running
I am using arch/arm64/configs/defconfigOne possible reason for the difference would be if your code hasinstead of the HZ=1000 that I am running. But that only gets up to a
few seconds.
SRCU readers.
Could you please tell me the value of CONFIG_HZ on your system?
Also the value of CONFIG_PREEMPTION?
make defconfig
CONFIG_PREEMPTION=y
CONFIG_HZ_250=y
Thank you again!
And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might
depend
on HZ. And who knows what all else...
Thanx, Paul
Thanksyou
One additional debug is to apply the patch below on top of the one
toapply the patch below?
just now kindly tested, then use whatever debug technique you wish
acpi=force,Apologies, my omission.Sorry, I am confused, no patch right?work out what fraction of the time during that critical interval that
srcu_get_delay() returns non-zero.
Just measure srcu_get_delay return to non-zero?Exactly, please!
By the way, the issue should be only related with qemu apci. not related
with rmr feature
Test with: https://github.com/qemu/qemu/tree/stable-6.1
Looks it caused by too many kvm_region_add & kvm_region_del if
If no acpi, no print kvm_region_add/del (1000 times print once)Good to know, thank you!
If with acpi=force,
During qemu boot
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000
kvm_region_add/kvm_region_del ->
kvm_set_phys_mem->
kvm_set_user_memory_region->
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
Thanx, Paul