Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

From: gengdongjiu
Date: Fri Dec 14 2018 - 19:06:52 EST

> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@xxxxxxx> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > When user space do memory recovery, it will check whether KVM and
> > > guest support the error recovery, only when both of them support,
> > > user space will do the error recovery. This patch exports this
> > > capability of KVM to user space.
> >
> > I can understand user-space only wanting to do the work if host and
> > guest support the feature. But 'error recovery' isn't a KVM feature,
> > its a Linux kernel feature.
> >
> > KVM will send its user-space a SIGBUS with MCEERR code whenever it's
> > trying to map a page at stage2 and the kernel-mm code refuses because it's poisoned.
> > (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with
> Does userspace need to care about that? Presumably if the host kernel wasn't built with that support then it will simply never deliver any
> memory failure events to QEMU, which is fine.
> The point I was trying to make in the email Dongjiu references
> is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure notifications", and so the code path which handles the SIGBUS must do
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to
> fill in.

Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.

To James: let me explain a bit more. As Peter said, QEMU needs to check, via the guest ACPI tables, whether the guest CPU is a type that can handle the error. Let us look at the x86 QEMU logic:
1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the MCE_CAP_DEF flags; MCG_SER_P among them means the guest CPU model is expected to support RAS error recovery. [1]
2. When the vCPU is initialized, QEMU checks whether the host kernel supports this feature [2]. Only when both the host kernel and the default env->mcg_cap value advertise the feature does QEMU set up the vCPU to support RAS error recovery. [3]
So I added the capability "KVM_CAP_ARM_MEMORY_ERROR_RECOVERY" to let QEMU check whether host/KVM support RAS error detection and recovery; only when they do will QEMU perform error recovery for guest memory.

[1] cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                    (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3] env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
    ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);
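The two-step negotiation in [1]-[3] can be sketched as plain logic: a capability bit survives only if both the built-in default and the host kernel's reported mask contain it. This is only an illustrative sketch, not the real QEMU code; MCG_SER_P and MCG_CAP_BANKS_MASK use the real x86 bit values, but the helper names are mine:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the negotiation described in steps 1-2:
 * the vCPU keeps a capability bit only if both the built-in default
 * (env->mcg_cap) and the host kernel's reported mask (from
 * kvm_get_mce_cap_supported()) contain it. MCG_SER_P and
 * MCG_CAP_BANKS_MASK use the real x86 values; the helper names
 * are hypothetical. */

#define MCG_SER_P           (1ULL << 24)  /* software error recovery supported */
#define MCG_CAP_BANKS_MASK  0xffULL      /* bank-count field, always kept */

/* mirrors QEMU's "env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK" */
static uint64_t negotiate_mcg_cap(uint64_t requested, uint64_t host_supported)
{
    return requested & (host_supported | MCG_CAP_BANKS_MASK);
}

static bool recovery_enabled(uint64_t mcg_cap)
{
    return (mcg_cap & MCG_SER_P) != 0;
}
```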

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm code uses
> to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm code
> rejects it with -EHWPOISON. This is the same as the architectures
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
James, I completely understand your comments above. But KVM also delivers the SIGBUS, which means KVM supports RAS error recovery for guest memory, so maybe we need to tell user space about this capability.
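For reference, user space receives this delivery as a SIGBUS whose siginfo carries the MCEERR code. A minimal sketch of the handler side (the helper name is mine; a real VMM such as QEMU would translate si_addr into a guest physical address and inject a guest error notification, e.g. via the ACPI tables the board code set up):

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Sketch of the user-space side of the SIGBUS-with-MCEERR delivery
 * described above. BUS_MCEERR_AR means action required (the poisoned
 * page was consumed); BUS_MCEERR_AO means action optional. The
 * helper name install_sigbus_handler() is hypothetical. */

static volatile sig_atomic_t got_sigbus;
static void *poisoned_addr;

static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    got_sigbus = 1;
    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO)
        poisoned_addr = si->si_addr;  /* failing host virtual address */
}

static int install_sigbus_handler(void)
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;  /* so the handler sees si_code/si_addr */
    return sigaction(SIGBUS, &sa, NULL);
}
```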

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
> support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
> systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built with
> CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
James, regarding your comments above [4]: if you think we should not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", what do you prefer we check instead?
In the x86 KVM code, a hardcoded value tells user space that host/KVM support RAS error software recovery. If KVM does not check "cpus_have_const_cap(ARM64_HAS_RAS_EXTN)", we have to use a hardcoded value as x86 does.

u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
...
	case KVM_X86_GET_MCE_CAP_SUPPORTED:
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
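On the user-space side, a capability like the proposed KVM_CAP_ARM_MEMORY_ERROR_RECOVERY would be probed with the standard KVM_CHECK_EXTENSION ioctl rather than a dedicated dev ioctl as x86 does. A sketch, assuming that path; the KVM_CHECK_EXTENSION number is the real one, but the capability number below is only a placeholder (the real value comes from the patch) and the wrapper name is mine:

```c
#include <sys/ioctl.h>

/* Sketch: how QEMU could probe the proposed capability via
 * KVM_CHECK_EXTENSION on the /dev/kvm fd. The ioctl number is the
 * real KVM one; the capability number is a placeholder, NOT the
 * value assigned by the patch. The wrapper name is hypothetical. */

#ifndef KVM_CHECK_EXTENSION
#define KVMIO 0xAE
#define KVM_CHECK_EXTENSION _IO(KVMIO, 0x03)
#endif

/* placeholder only -- the real number comes from the proposed patch */
#define KVM_CAP_ARM_MEMORY_ERROR_RECOVERY_PLACEHOLDER 0

/* Returns nonzero if the capability is supported, 0 otherwise
 * (including when the ioctl itself fails). */
static int kvm_has_cap(int kvm_fd, unsigned long cap)
{
    int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);
    return r > 0 ? r : 0;
}
```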

> thanks
> -- PMM