Re: [PATCH v3 2/5] x86/apic: Fix restoring boot irq mode in reboot and kexec/kdump

From: Eric W. Biederman
Date: Tue Feb 13 2018 - 12:41:08 EST


Dou Liyang <douly.fnst@xxxxxxxxxxxxxx> writes:

> Hi Baoquan,
>
> At 02/12/2018 11:08 AM, Eric W. Biederman wrote:
>> Baoquan He <bhe@xxxxxxxxxx> writes:
>>
>>> This is a regression fix.
>>>
>>> Before, to fix erratum AVR31, commit 522e66464467 ("x86/apic: Disable
>>> I/O APIC before shutdown of the local APIC") moved lapic_shutdown()
>>> calling after disable_IO_APIC(). This introdued a regression. The
>>> root cause is that disable_IO_APIC() not only clears IO_APIC, also
>>> restore boot irq mode by setting LAPIC/APIC/IMCR, lapic_shutdown()
>>> after disable_IO_APIC() will disable LAPIC and ruin the possible
>>> virtual wire mode setting which the code has been trying to do all
>>> along.
>>>
>>> The consequence is, in KVM guest kernel always prints warning as below
>>> during kexec/kdump kernel boots up. That happened in setup_local_APIC()
>>> since 'do { xxx } while (queued && max_loops > 0)' loop does not function
>
> I am not sure another thing here
>
> AFAIK, according to the order of SMP machine shutdown, other CPUs will
> be stopped firstly, then the last CPU disable its local apic.
>
> --machine_shutdown
> |----......
> |----stop_other_cpus()
> |----local_shutdown()
>
> So, the pending interrupts exist only in BSP and only be ACKed by
> BSP. is it right?(I validated this by print the value of APIC_IRR/ISR of
> all CPUs, found only BSP had the non-zero value).

We don't know. In the kexec on panic case we try and shutdown the other
cpus but we have a timeout because we might fail.

Further you have to be careful with the concept of boot cpu. In the
normal kexec case shutdown on the boot cpu and leave it running. In the
kexec on panic case we shutdown on an arbitrary cpu.

> If it is right, We will do not need check the pending interrupt for each
> cpus.

It is also cheap if there are no pending interrupts as there is nothing
to do.

> BTW, the pending interrupt check code is mixed with the local
> APIC setup code, that looks messy. How about extracting the code which
> related to the crash interrupt check, put it into a new function and
> only invoked it when the CPU is BSP?

Moving it into it's own function makes sense. Let's not taint the
concept with ``crash''. We don't know that the only way this will
ever happen is from kexec on panic. We only know it was easy to
trigger the condition from kexec on panic.

There are a lot of cases I can think of that interrupts might fire while
interrupts are disabled because the kernel is booting. A normal kexec
is also possible.

Throw in MSI interrupts and transitioning from one state to another in
non-legacy apic mode and we might be more likely to get some irqs in
a pending state.

Your patch makes me nervous as it is not just code motion but a much
more substantial change.


As much as I agree that we need to fix the regression in the apic
shutdown code that is causing problems. What we really need to do
is to completely remove the apic shutdown code from the kexec on panic
code path. Over the long term that will provide us with a much more
robust path to getting crash dumps and the like.


Eric


> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 3fc259b4dd2d..0350528b320d 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1408,6 +1408,57 @@ static void lapic_setup_esr(void)
> oldvalue, value);
> }
>
> +
> +static void clear_crash_pending_intr(void)
> +{
> + long long max_loops = cpu_khz ? cpu_khz : 1000000;
> + unsigned long long tsc = 0, ntsc;
> + unsigned int value, queued;
> + int i, j, acked = 0;
> +
> + if (boot_cpu_has(X86_FEATURE_TSC))
> + tsc = rdtsc();
> + /*
> + * After a crash, we no longer service the interrupts and a pending
> + * interrupt from previous kernel might still have ISR bit set.
> + *
> + * Most probably by now CPU has serviced that pending interrupt and
> + * it might not have done the ack_APIC_irq() because it thought,
> + * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> + * does not clear the ISR bit and cpu thinks it has already serivced
> + * the interrupt. Hence a vector might get locked. It was noticed
> + * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> + */
> + do {
> + queued = 0;
> + for (i = APIC_ISR_NR - 1; i >= 0; i--)
> + queued |= apic_read(APIC_IRR + i*0x10);
> +
> + for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> + value = apic_read(APIC_ISR + i*0x10);
> + for (j = 31; j >= 0; j--) {
> + if (value & (1<<j)) {
> + ack_APIC_irq();
> + acked++;
> + }
> + }
> + }
> + if (acked > 256) {
> + printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> + acked);
> + break;
> + }
> + if (queued) {
> + if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> + ntsc = rdtsc();
> + max_loops = (cpu_khz << 10) - (ntsc - tsc);
> + } else
> + max_loops--;
> + }
> + } while (queued && max_loops > 0);
> + WARN_ON(max_loops <= 0);
> +}
> +
> /**
> * setup_local_APIC - setup the local APIC
> *
> @@ -1417,13 +1468,7 @@ static void lapic_setup_esr(void)
> void setup_local_APIC(void)
> {
> int cpu = smp_processor_id();
> - unsigned int value, queued;
> - int i, j, acked = 0;
> - unsigned long long tsc = 0, ntsc;
> - long long max_loops = cpu_khz ? cpu_khz : 1000000;
> -
> - if (boot_cpu_has(X86_FEATURE_TSC))
> - tsc = rdtsc();
> - tsc = rdtsc();
> + unsigned int value;
>
> if (disable_apic) {
> disable_ioapic_support();
> @@ -1475,45 +1520,8 @@ void setup_local_APIC(void)
> value &= ~APIC_TPRI_MASK;
> apic_write(APIC_TASKPRI, value);
>
> - /*
> - * After a crash, we no longer service the interrupts and a pending
> - * interrupt from previous kernel might still have ISR bit set.
> - *
> - * Most probably by now CPU has serviced that pending interrupt and
> - * it might not have done the ack_APIC_irq() because it thought,
> - * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
> - * does not clear the ISR bit and cpu thinks it has already serivced
> - * the interrupt. Hence a vector might get locked. It was noticed
> - * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
> - */
> - do {
> - queued = 0;
> - for (i = APIC_ISR_NR - 1; i >= 0; i--)
> - queued |= apic_read(APIC_IRR + i*0x10);
> -
> - for (i = APIC_ISR_NR - 1; i >= 0; i--) {
> - value = apic_read(APIC_ISR + i*0x10);
> - for (j = 31; j >= 0; j--) {
> - if (value & (1<<j)) {
> - ack_APIC_irq();
> - acked++;
> - }
> - }
> - }
> - if (acked > 256) {
> - printk(KERN_ERR "LAPIC pending interrupts after %d
> EOI\n",
> - acked);
> - break;
> - }
> - if (queued) {
> - if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
> - ntsc = rdtsc();
> - max_loops = (cpu_khz << 10) - (ntsc - tsc);
> - } else
> - max_loops--;
> - }
> - } while (queued && max_loops > 0);
> - WARN_ON(max_loops <= 0);
> + if (!cpu)
> + clear_crash_pending_intr();
>
> /*
> * Now that we are all set up, enable the APIC
>
>
> Thanks,
> dou.