Re: [PATCH] kvm/x86: Avoid async PF to end RCU read-side critical section early in PREEMPT=n kernel

From: Paolo Bonzini
Date: Tue Oct 03 2017 - 10:11:18 EST


I'd prefer a slight change in subject and topic:

------- 8< --------
Subject: [PATCH] kvm/x86: Avoid async PF preempting the kernel incorrectly

Currently, in a PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
schedule() to reschedule in some cases. This could result in
accidentally ending the current RCU read-side critical section early,
causing random memory corruption in the guest, or otherwise preempting
the currently running task between preempt_disable() and
preempt_enable().

The difficulty in handling this well is that we cannot tell whether an
async PF was delivered in a preempt-disabled section or an RCU read-side
critical section for PREEMPT_COUNT=n, since preempt_disable()/preempt_enable()
and rcu_read_lock()/rcu_read_unlock() are both no-ops in that case.

To cure this, we treat any async PF interrupting a kernel context as one
that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
the schedule() path in that case.

To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
so that we know whether it's called from a context interrupting the
kernel, and the parameter is set properly in all the callsites.

Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
------- 8< --------
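
For anyone following the thread, the choice the patch makes for n.halted
can be modeled in plain userspace C. This is only a sketch: struct apf_ctx
and apf_must_halt() are hypothetical names that do not exist in the kernel,
and the fields merely stand in for is_idle_task(current),
IS_ENABLED(CONFIG_PREEMPT_COUNT), preempt_count(), rcu_preempt_depth() and
the new interrupt_kernel argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct apf_ctx {
	bool idle_task;         /* stands in for is_idle_task(current) */
	bool preempt_count_on;  /* stands in for IS_ENABLED(CONFIG_PREEMPT_COUNT) */
	int  preempt_count;     /* stands in for preempt_count() */
	int  rcu_depth;         /* stands in for rcu_preempt_depth() */
	bool interrupt_kernel;  /* stands in for the new second parameter */
};

/* Returns true when the waiter must halt instead of taking the schedule() path. */
bool apf_must_halt(const struct apf_ctx *c)
{
	if (c->idle_task)
		return true;

	/* With preempt counting, the counters say whether we are atomic. */
	if (c->preempt_count_on)
		return c->preempt_count > 1 || c->rcu_depth != 0;

	/* Without it, any fault that interrupted the kernel is treated as atomic. */
	return c->interrupt_kernel;
}
```

In this model, a fault taken from user mode on a PREEMPT_COUNT=n kernel may
still schedule, while one that interrupted kernel code always halts.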

I'll let Radim pick up the patch.

Paolo

On 03/10/2017 15:36, Boqun Feng wrote:
> Currently, in a PREEMPT=n kernel, kvm_async_pf_task_wait() could call
> schedule() to reschedule in some cases, which could result in
> accidentally ending the current RCU read-side critical section early.
> This could end up causing random memory corruption in the guest.
>
> The difficulty in handling this well is that we don't know whether an
> async PF was delivered in an RCU read-side critical section in a
> PREEMPT_COUNT=n kernel, since rcu_read_lock()/rcu_read_unlock() are just
> no-ops in that case.
>
> To cure this, we treat any async PF interrupting a kernel context as one
> delivered in an RCU read-side critical section, and we don't allow
> kvm_async_pf_task_wait() to choose the schedule() path in that case for
> a PREEMPT_COUNT=n kernel, because that would introduce involuntary
> context switches and break the assumptions RCU needs to work properly.
>
> To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> so that we know whether it's called from a context interrupting the
> kernel, and we set that parameter properly in all the callsites.
>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
> ---
> arch/x86/include/asm/kvm_para.h | 4 ++--
> arch/x86/kernel/kvm.c | 14 ++++++++++----
> arch/x86/kvm/mmu.c | 2 +-
> 3 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..59ad3d132353 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> bool kvm_para_available(void);
> unsigned int kvm_arch_para_features(void);
> void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> void kvm_async_pf_task_wake(u32 token);
> u32 kvm_read_and_reset_pf_reason(void);
> extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>
> #else /* CONFIG_KVM_GUEST */
> #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, I) do {} while(0)
> #define kvm_async_pf_task_wake(T) do {} while(0)
>
> static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index e675704fa6f7..8bb9594d0761 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,11 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> return NULL;
> }
>
> -void kvm_async_pf_task_wait(u32 token)
> +/*
> + * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> + * (other than user space)?
> + */
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> {
> u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,8 +144,10 @@ void kvm_async_pf_task_wait(u32 token)
>
> n.token = token;
> n.cpu = smp_processor_id();
> - n.halted = is_idle_task(current) || preempt_count() > 1 ||
> - rcu_preempt_depth();
> + n.halted = is_idle_task(current) ||
> + (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> + ? preempt_count() > 1 || rcu_preempt_depth()
> + : interrupt_kernel);
> init_swait_queue_head(&n.wq);
> hlist_add_head(&n.link, &b->list);
> raw_spin_unlock(&b->lock);
> @@ -269,7 +275,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> case KVM_PV_REASON_PAGE_NOT_PRESENT:
> /* page is swapped out by the host. */
> prev_state = exception_enter();
> - kvm_async_pf_task_wait((u32)read_cr2());
> + kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
> exception_exit(prev_state);
> break;
> case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> case KVM_PV_REASON_PAGE_NOT_PRESENT:
> vcpu->arch.apf.host_apf_reason = 0;
> local_irq_disable();
> - kvm_async_pf_task_wait(fault_address);
> + kvm_async_pf_task_wait(fault_address, 0);
> local_irq_enable();
> break;
> case KVM_PV_REASON_PAGE_READY:
>