Re: [PATCH v3] x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic

From: Vitaly Kuznetsov
Date: Fri Dec 02 2016 - 03:40:07 EST


Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:

> Vitaly,
>
> On Thu, 1 Dec 2016, Vitaly Kuznetsov wrote:
>
>> There is a feature in Hyper-V (Debug-VM --InjectNonMaskableInterrupt) which
>> injects NMI to the guest. Prior to WS2016 the NMI is injected to all CPUs
>> of the guest and WS2016 injects it to CPU0 only. When unknown_nmi_panic is
>> enabled and we'd like to do kdump we need to perform some minimal cleanup
>> so the kdump kernel will be able to initialize VMBus devices, this cleanup
>> includes sending CHANNELMSG_UNLOAD to the host waiting for
>> CHANNELMSG_UNLOAD_RESPONSE to arrive. WS2012R2 always sends the response
>> to the CPU which was used to send CHANNELMSG_REQUESTOFFERS on VMBus module
>> load and not to the CPU which is sending CHANNELMSG_UNLOAD. As we can't do
>> any cross-CPU work reliably on crash we have vmbus_wait_for_unload()
>> function which tries to read CHANNELMSG_UNLOAD_RESPONSE on all CPUs message
>> pages and this sometimes works. It was discovered that in case the host
>> wants to send more than one message to a secondary CPU (not the CPU running
>> vmbus_wait_for_unload()) we're unable to get it as after reading the first
>> message we're supposed to do EOMing by doing wrmsrl(HV_X64_MSR_EOM, 0) but
>> this is per-CPU. I have a feeling that this was working some time ago when
>> I implemented vmbus_wait_for_unload(), the host was re-trying to deliver a
>> message even without wrmsrl() but apparently this doesn't work any more.
>> Unfortunately there is not that much we can do when all CPUs get NMI as
>> all but the first one are getting blocked with interrupts disabled. What we
>> can do is limit processing unknown interrupts to the first CPU which gets
>> it in case we're about to crash.
>
> This is completely unreadable and I really tried hard to make sense of it.
>
> Please structure it in a way that people who are not familiar with the
> inner workings of hyperv can at least understand the problem you are trying
> to solve and the concept of the solution w/o needing to figure out what all
> the acronyms and constants actually mean.
>
> Also visual structuring in paragraphs helps readability a lot.
>

Got it, I'll do my best to make it readable.

> AFAICT this tries to deal with different problems of different hypervisor
> versions, but even that is unclear as you talk about version WS2016,
> versions prior to WS2016 and then about WS2012R2 in particular.
>
> Another issue I have with this is:
>
> ".... I have a feeling that this was working ...."
>
> Changes like this are not about feelings. We want to have changes based on
> facts.
>

The thing is that Hyper-V is proprietary software which gets updates, and
I don't remember which particular updates were installed when I was
implementing vmbus_wait_for_unload(). As far as I remember, it was
always working on WS2012R2. Now I observe different behavior ...
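
To make the concept of the solution easier to follow, here is a minimal
userspace sketch of "only the first CPU that gets the unknown NMI goes on
to the crash path". It is an illustration only, not the code from the
patch: it uses C11 atomics instead of the kernel's atomic API, and the
names nmi_owner, NMI_OWNER_NONE and claim_unknown_nmi() are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel: no CPU has claimed the unknown NMI yet. */
#define NMI_OWNER_NONE (-1)

/* Hypothetical global: ID of the first CPU that took the unknown NMI. */
static atomic_int nmi_owner = NMI_OWNER_NONE;

/*
 * Return true if 'cpu' may continue processing the unknown NMI.
 * Exactly one CPU wins the compare-and-exchange; every other CPU
 * gets false and should return from its handler without doing any
 * crash-related work.
 */
static bool claim_unknown_nmi(int cpu)
{
	int expected = NMI_OWNER_NONE;

	/* Succeeds only for the caller that still sees NMI_OWNER_NONE. */
	if (atomic_compare_exchange_strong(&nmi_owner, &expected, cpu))
		return true;

	/* On failure 'expected' holds the current owner; allow re-entry. */
	return expected == cpu;
}
```

An NMI handler built on this would call claim_unknown_nmi() with its own
CPU number first and bail out early when it returns false, so the winning
CPU is not left waiting on CPUs that are stuck with interrupts disabled.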

--
Vitaly