Re: [PATCH] arch/x86: Fix kdump on x86 with physically hotadded CPUs

From: Prarit Bhargava
Date: Tue Oct 04 2016 - 12:01:46 EST

On 10/04/2016 10:38 AM, Thomas Gleixner wrote:
> On Tue, 4 Oct 2016, Prarit Bhargava wrote:
>> On 10/04/2016 06:58 AM, Thomas Gleixner wrote:
>>> While it is the right thing to initialize the package map in that case, it
>>> still papers over a robustness issue in the uncore code, which needs to be
>>> fixed first.
>> I will include a separate patch with an error check for pkg == 0xffff in the
>> uncore code.
> 0xffff? That won't help. The id returned is -1 if the entry is not
> initialized. And aside of that just patching that particular place is not
> helping as the uncore code and also rapl is relying on the package map
> being populated.

Yes, I noticed that after I started digging into it this morning. Not only what
you pointed out but there's init that occurs in the uncore code that would have
to be undone.

There is a similar issue in the rapl code, but that code inadvertently protects
itself with for loops that end up never running (and that's why the rapl code
doesn't panic).

> So we need a sanity check in the initialization code which prevents any of
> this being executed.

Ok, should this be done only for logical_proc_id or for logical_proc_id,
phys_proc_id, and cpu_core_id? What do you think of adding that to the end of
smp_init_package_map() or smp_store_cpu_info()?

>>>> + if (!num_processors) {
>>>> + pr_warn("CPU 0 not enumerated in mptable or ACPI MADT\n");
>>>> + num_processors = 1;
>>> And in this case we end up with the same problem, right?
>> It occurs to me that I over thought this: I was thinking that there might exist
>> a pre-ACPI (or at least a system without an MADT) x86 system that wold boot such
>> that num_processors = 0. But in that case, the cpu should be listed in the
>> mptables so the above should not happen. I'll change that to a BUG().
> No. That's the wrong thing to do. Think SMP kernel on UP machines ...

Sorry Thomas, but my history with real UP hardware is limited. I think you
might be saying I should call generic_processor_info(0, apic_version[0]) to
populate cpu 0 but I'm not sure.