Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

From: Peter Maydell
Date: Thu Jan 10 2019 - 08:26:04 EST


On Thu, 10 Jan 2019 at 12:09, gengdongjiu <gengdongjiu@xxxxxxxxxx> wrote:
> Peter, I summarize James's main idea, James think QEMU does not needs
> to check *something* if Qemu support firmware-first.
> What do we do for your comments?

Unless I'm missing something, the code in your most recent patchset
attempts to update an ACPI table when it gets the SIGBUS from the
host kernel without doing anything to check whether it has ever
created the ACPI table (and set up the QEMU global variable that
tells the code where it is in the guest memory) in the first place.
I don't see how that can work.

> >> I think one question here which it would be good to answer is:
> >> if we are modelling a guest and we haven't specifically provided
> >> it an ACPI table to tell it about memory errors, what do we do
> >> when we get a sigbus from the host? We have basically two choices:
> >> (1) send the guest an SError (aka asynchronous external abort)
> >> anyway (with no further info about what the memory error is)
> >
> > For an AR signal an external abort is valid. Its up to the implementation
> > whether these are synchronous or asynchronous. Qemu can only take a signal for
> > something that was synchronous, so you can choose between the two.
> > Synchronous external abort is marginally better as an unaware OS knows its
> > affects this thread, and may be able to kill it.
> > SError with an imp-def ESR is indistinguishable from 'part of the soc fell out',
> > and should always result in a panic().
> >
> >
> >> (2) just stop QEMU (as we would for a memory error in QEMU's
> >> own memory)
> >
> > This is also valid. A machine may take external-abort to EL3 and then
> > reboot/crash/burn.

We should decide which of these we want to do, and have a comment
explaining what we're doing. If I'm reading your current patchset
correctly, it does neither -- if it can't record the fault in the
ACPI table it just ignores it without either stopping QEMU or
delivering an SError.

I think I favour option (2) here.

thanks
-- PMM