Re: [PATCH v3] x86/sgx: Fix RCU Tasks stalls in EPC sanitization loop

From: Huang, Kai

Date: Tue Jun 23 2026 - 07:34:06 EST


(Reminder: you forgot to Cc linux-sgx@xxxxxxxxxxxxxxx.)

On Tue, 2026-06-23 at 11:20 +0800, Miao, Jun wrote:
> Large EPC configurations can spend a significant amount of time
> sanitizing EPC pages during SGX initialization. The sanitization
> loop invokes cond_resched() while processing pages, but if the
> scheduler does not request rescheduling, 
>

[...]

> the CPU may remain in the
> kernel for an extended period without reporting a quiescent state to
> RCU Tasks.

Maybe: "the task may never report a quiescent state to RCU-Tasks"?

>
> cond_resched() only schedules when rescheduling is needed and does
> not guarantee a quiescent state for RCU Tasks. Replace it with
> cond_resched_rcu_qs(), which explicitly reports an RCU Tasks
> quiescent state even when no context switch occurs.

"cond_resched() doesn't guarantee a quiescent state for RCU-Tasks" doesn't
necessarily mean there's a problem. There is a bunch of kernel code that calls
cond_resched() in a loop, and we are fine with it.

I think you need to add "the BPF LSM subsystem can invoke
synchronize_rcu_tasks() at kernel boot time" and "ksgxd() may never be
rescheduled while sanitizing all EPC pages" to the changelog to justify
the change.

Could you bring back some context from your v1 and refine it together with the
above two paragraphs?
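
To illustrate the distinction the changelog should spell out, here is a
minimal userspace model. It is not kernel code: every model_* name is a
hypothetical stand-in, and the only thing it mirrors is that
cond_resched_tasks_rcu_qs() reports an RCU-Tasks quiescent state first and
then calls cond_resched(), whereas cond_resched() alone does nothing when no
reschedule is pending.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, NOT kernel code. */
struct model_task {
	bool need_resched;	/* models the scheduler asking for a reschedule */
	unsigned long qs_count;	/* quiescent states seen by the modelled RCU-Tasks */
};

/* Models cond_resched(): yields only if a reschedule is pending. */
static bool model_cond_resched(struct model_task *t)
{
	if (!t->need_resched)
		return false;
	t->need_resched = false;
	t->qs_count++;		/* a voluntary context switch is a quiescent state */
	return true;
}

/* Models cond_resched_tasks_rcu_qs(): report a quiescent state, then cond_resched(). */
static void model_cond_resched_tasks_rcu_qs(struct model_task *t)
{
	t->qs_count++;		/* reported even when no context switch happens */
	model_cond_resched(t);
}

/* Models a sanitization loop in which the scheduler never requests a reschedule. */
static unsigned long model_sanitize(size_t nr_pages, bool use_rcu_qs)
{
	struct model_task t = { .need_resched = false, .qs_count = 0 };

	for (size_t i = 0; i < nr_pages; i++) {
		/* per-page sanitization work would go here */
		if (use_rcu_qs)
			model_cond_resched_tasks_rcu_qs(&t);
		else
			model_cond_resched(&t);
	}
	return t.qs_count;
}
```

In this model, model_sanitize(n, false) never records a quiescent state, so
anything waiting in the equivalent of synchronize_rcu_tasks() would wait for
the whole loop, while model_sanitize(n, true) records one per page.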

>
> Kai suggested that, this is a common problem at the scheduler and RCU layer,
> but not specific to SGX. 
>

You already added my Suggested-by (thanks), which is good enough, and you don't
need to mention it again here.

> More detail please see:
> bde6c3aa9930 ("rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops").
> cee439398933 ("rcu: Rename cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs()")

I am not sure you need this either. To me, just mentioning that
cond_resched_tasks_rcu_qs() can do the job is good enough.

>
> Without this patch, instead, virtual machines (VMs) experience a long OS boot times:

We can make it shorter (given you have already mentioned the problem):

As a result, a VM may take a long time to boot:

>
> [ 4.110549] systemd[1]: Detected architecture x86-64.
> [ 4.115279] systemd[1]: Hostname set to <i2bp1g0g0m0i8406er0g1zX2>.
> [ 4.115554] systemd[1]: Installed transient /etc/machine-id file.
> [ 14.262158] rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 10087 jiffies old.
> [ 14.374158] rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 40199 jiffies old.
> [ 134.806157] rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631 jiffies old.
> [ 248.086158] INFO: task systemd:1 blocked for more than 122 seconds.
> [ 248.086491] Not tainted 6.8.0-90-generic #91-Ubuntu
> [ 248.086739] 'echo 0 > /proc/sys/kernel/hung_task_timeout_secs' disables this message.
> [ 248.086993] task:systemd state:D stack:0 pid:1 tpid:1 ppid:0 flags:0x00000002
> [ 248.087274] Call Trace:
> [ 248.087434] <TASK>
> [ 248.087557] __schedule+0x27c/0x6b0
> [ 248.087770] schedule+0x33/0x110
> [ 248.087939] schedule_timeout+0x157/0x170
> [ 248.088120] wait_for_completion+0x88/0x150
> [ 248.088304] __wait_rcu_gp+0x17e/0x190
> [ 248.088481] synchronize_rcu_tasks_generic+0x64/0x60
> [ 248.088672] ? __pfx_call_rcu_tasks+0x10/0x10
> [ 248.088858] ? __pfx_wakeme_after_rcu+0x10/0x10
> [ 248.089047] synchronize_rcu_tasks+0x15/0x20
> [ 248.089260] register_ftrace_direct+0x31f/0x350
> [ 248.089445] ? __pfx_bpf_lsm_file_open+0x10/0x10
> [ 248.089629] bpf_trampoline_update+0x469/0x650
> [ 248.089814] ? 0xffffffffffffffff
> [ 248.089988] ? 0xffffffffffffffff
> [ 248.090153] __bpf_trampoline_link_prog+0x10d/0x330
> [ 248.090339] bpf_trampoline_link_prog+0x33/0x60
> [ 248.090518] bpf_tracing_prog_attach+0x3c5/0x5f0
> [ 248.090699] link_create+0x1a5/0x280
> [ 248.090886] ? security_bpf+0x3c/0x70
> [ 248.091101] __sys_bpf+0x4ae/0x10
> [ 248.091312] __x64_sys_bpf+0x1a/0x30
> [ 248.091477] x64_sys_call+0x199/0x250
> [ 248.091647] do_syscall_64+0x7f/0x180
> [ 248.091818] ? arch_exit_to_user_mode_prepare.isa.0+0x1a/0x60
> [ 248.092022] ? irqentry_exit_to_user_mode+0x38/0x1e0
> [ 248.092246] ? irqentry_exit+0x43/0x50
> [ 248.092401] entry_SYSCALL_64_after_hwframe+0x78/0x80
> [ 248.092590] RIP: 0033:0x7b53e592728d
> [ 248.092756] RSP: 002b:00007ffdaa9d696 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> [ 248.092856] RAX: ffffffffffffffda RBX: 00007ffdaa9d696 RCX: 00007b53e592728d
> [ 248.092956] RDX: 0000000000000000 RSI: 00007ffdaa9d696 RDI: 0000000000000001
> [ 248.093056] RBP: 00007ffdaa9d696 R08: 00007b53e5a03a8 R09: 00007ffdaa9d696
> [ 248.093156] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 248.093256] R13: 0000000000000000 R14: 00005d81ed2cfd0 R15: 00005d81ed2b7ec0
> [ 248.093406] </TASK>

This is too long. You need to trim it down to contain only the relevant info.
E.g., I guess the below should be good enough?

rcu_tasks_wait_gp: rcu_tasks grace period number 1 (since boot) is 130631
jiffies old.
INFO: task systemd:1 blocked for more than 122 seconds.
task:systemd state:D stack:0 pid:1 tpid:1 ppid:0 flags:0x00000002
Call Trace:
...
schedule_timeout+0x157/0x170
wait_for_completion+0x88/0x150
__wait_rcu_gp+0x17e/0x190
synchronize_rcu_tasks_generic+0x64/0x60
...
synchronize_rcu_tasks+0x15/0x20
register_ftrace_direct+0x31f/0x350
...
bpf_trampoline_link_prog+0x33/0x60
bpf_tracing_prog_attach+0x3c5/0x5f0
...

>
> After this patch test Results:
> Before fixed: boot time ~50s (with rcu_tasks grace period stall)
> After fixed: boot time ~10.7s (systemd-analyze: 724ms kernel + 1.575s initrd + 8.481s userspace = 10.782s)

It's weird to mention "Before fixed: ..." under the heading "After this patch
test Results:".

Maybe just:

Tests showed that using cond_resched_tasks_rcu_qs() reduced the boot time from
~50s to ~10.7s (...).