Re: [PATCH] xen/pci: try to reserve MCFG areas earlier

From: Igor Druzhinin
Date: Mon Sep 09 2019 - 17:48:46 EST


On 09/09/2019 20:19, Boris Ostrovsky wrote:
> On 9/8/19 7:37 PM, Igor Druzhinin wrote:
>> On 09/09/2019 00:30, Boris Ostrovsky wrote:
>>> On 9/8/19 5:11 PM, Igor Druzhinin wrote:
>>>> On 08/09/2019 19:28, Boris Ostrovsky wrote:
>>>>> On 9/6/19 7:00 PM, Igor Druzhinin wrote:
>>>>>> On 06/09/2019 23:30, Boris Ostrovsky wrote:
>>>>>>> Where is MCFG parsed? pci_arch_init()?
>>>>>> It happens twice:
>>>>>> 1) first time early one in pci_arch_init() that is arch_initcall - that
>>>>>> time pci_mmcfg_list will be freed immediately there because MCFG area is
>>>>>> not reserved in E820;
>>>>>> 2) second time late one in acpi_init() which is subsystem_initcall right
>>>>>> before where PCI enumeration starts - this time ACPI tables will be
>>>>>> checked for a reserved resource and pci_mmcfg_list will be finally
>>>>>> populated.
>>>>>>
>>>>>> The problem is that on a system that doesn't have MCFG area reserved in
>>>>>> E820 pci_mmcfg_list is empty before acpi_init() and our PCI hooks are
>>>>>> called in the same place. So MCFG is still not in use by Xen at this
>>>>>> point since we haven't reached our xen_mcfg_late().
>>>>> Would it be possible for us to parse MCFG ourselves in pci_xen_init()? I
>>>>> realize that we'd be doing this twice (or maybe even three times since
>>>>> apparently both pci_arch_init() and acpi_init() do it).
>>>>>
>>>> I don't think it makes sense:
>>>> a) it needs to be done after ACPI is initialized since we need to parse
>>>> it to figure out the exact reserved region - that's why it's currently
>>>> done in acpi_init() (see commit message for the reasons why)
>>> Hmm... We should be able to parse ACPI tables by the time
>>> pci_arch_init() is called. In fact, if you look at
>>> pci_mmcfg_early_init() you will see that it does just that.
>>>
>> The point is not to parse MCFG after acpi_init but to parse DSDT for
>> reserved resource which could be done only after ACPI initialization.
>
> OK, I think I understand now what you are trying to do --- you are
> essentially trying to account for the range inserted by
> setup_mcfg_map(), right?
>

Actually, it is pci_mmcfg_late_init(), called out of acpi_init(), where
MCFG areas are properly sized. setup_mcfg_map() is mostly for bus
hotplug, where the MCFG area is discovered by evaluating the _CBA
method; for cold-plugged buses it just confirms that the MCFG area is
already registered, because such buses are required to be in the MCFG
table at boot time.
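
To make the ordering concrete, here is a toy userspace model in C. It is
not kernel code: the toy_* names, the two boolean platform flags and the
three-phase enum are assumptions made up for illustration, and the real
checks in the early and late MMCONFIG init paths are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical boot phases, in the order the initcall levels run. */
enum toy_phase {
    TOY_ARCH_INITCALL,    /* early pass, out of pci_arch_init() */
    TOY_SUBSYS_INITCALL,  /* late pass, out of acpi_init() */
    TOY_LATE_INITCALL
};

/* Hypothetical description of how firmware reserves the MCFG area. */
struct toy_platform {
    bool mcfg_in_e820;      /* area is marked reserved in E820 */
    bool mcfg_in_acpi_res;  /* area is a reserved resource in the DSDT */
};

/* Stands in for the number of entries on pci_mmcfg_list. */
struct toy_state {
    size_t mmcfg_regions;
};

/* Early pass: regions survive only if E820 reserves them. */
static void toy_early_pass(const struct toy_platform *p,
                           struct toy_state *s, size_t found)
{
    s->mmcfg_regions = p->mcfg_in_e820 ? found : 0;
}

/* Late pass: ACPI is up, so a DSDT reservation is accepted as well. */
static void toy_late_pass(const struct toy_platform *p,
                          struct toy_state *s, size_t found)
{
    if (p->mcfg_in_e820 || p->mcfg_in_acpi_res)
        s->mmcfg_regions = found;
}

/* Number of usable regions once the given phase has completed. */
static size_t toy_regions_after(const struct toy_platform *p,
                                enum toy_phase phase, size_t found)
{
    struct toy_state s = { 0 };

    assert(p != NULL);
    toy_early_pass(p, &s, found);
    if (phase >= TOY_SUBSYS_INITCALL)
        toy_late_pass(p, &s, found);
    return s.mmcfg_regions;
}
```

With mcfg_in_e820 false and mcfg_in_acpi_res true, toy_regions_after()
gives 0 for TOY_ARCH_INITCALL and the full count from
TOY_SUBSYS_INITCALL on, which is the window discussed in this thread.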

> The other question I have is why you think it's worth keeping
> xen_mcfg_late() as a late initcall. How could MCFG info be updated
> between acpi_init() and late_initcalls being run? I'd think it can only
> happen when a new device is hotplugged.
>

It was a precaution against setup_mcfg_map() calls that might add new
areas that are not in the MCFG table but for some reason have a _CBA
method. That is obviously a "firmware is broken" scenario, so I don't
feel strongly about keeping it. I'm happy to remove it in v2 if you
prefer.
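
As a rough sketch of what that precaution amounts to, the toy C model
below treats the late initcall as a second sweep over the region list.
Everything in it (toy_list, toy_add_region(), toy_report_pending(), the
fixed-size array) is hypothetical and only illustrates the idea; it does
not reflect how xen_mcfg_late() is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_REGIONS 8

/* Hypothetical record for one MCFG area. */
struct toy_region {
    unsigned long long base;
    bool reported;  /* already handed to the hypervisor */
};

struct toy_list {
    struct toy_region r[TOY_MAX_REGIONS];
    size_t n;
};

/* Adds a region; returns false if it is a duplicate or the list is full. */
static bool toy_add_region(struct toy_list *l, unsigned long long base)
{
    assert(l != NULL);
    for (size_t i = 0; i < l->n; i++)
        if (l->r[i].base == base)
            return false;
    if (l->n == TOY_MAX_REGIONS)
        return false;
    l->r[l->n].base = base;
    l->r[l->n].reported = false;
    l->n++;
    return true;
}

/* Marks every unreported region as reported; returns how many were new. */
static size_t toy_report_pending(struct toy_list *l)
{
    size_t newly = 0;

    assert(l != NULL);
    for (size_t i = 0; i < l->n; i++) {
        if (!l->r[i].reported) {
            l->r[i].reported = true;
            newly++;
        }
    }
    return newly;
}
```

In this model the second sweep reports something only if a region was
added after the first one, which on sane firmware should not happen for
cold-plugged buses.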

Igor