Re: [RFC PATCH] xen/x86: Increase xen_e820_map to E820_X_MAX possible entries

From: Juergen Gross
Date: Tue Nov 15 2016 - 03:42:58 EST


On 15/11/16 09:01, Jan Beulich wrote:
>>>> On 15.11.16 at 08:36, <JGross@xxxxxxxx> wrote:
>> On 15/11/16 08:15, Jan Beulich wrote:
>>>>>> On 15.11.16 at 07:33, <JGross@xxxxxxxx> wrote:
>>>> On 15/11/16 01:11, Alex Thorlton wrote:
>>>>> Hey everyone,
>>>>>
>>>>> We're having problems with large systems hitting a BUG in
>>>>> xen_memory_setup, due to extra e820 entries created in the
>>>>> XENMEM_machine_memory_map callback. The change in the patch gets things
>>>>> working, but Boris and I wanted to get opinions on whether or not this
>>>>> is the appropriate/entire solution, which is why I've sent it as an RFC
>>>>> for now.
>>>>>
>>>>> Boris pointed out to me that E820_X_MAX is only large when CONFIG_EFI=y,
>>>>> which is a detail worth discussig. He proposed possibly adding
>>>>> CONFIG_XEN to the conditions under which we set E820_X_MAX to a larger
>>>>> value than E820MAX, since the Xen e820 table isn't bound by the
>>>>> zero-page memory limitations.
>>>>>
>>>>> I do *slightly* question the use of E820_X_MAX here, only from a
>>>>> cosmetic prospective, as I believe this macro is intended to describe
>>>>> the maximum size of the extended e820 table, which, AFAIK, is not used
>>>>> by the Xen HV. That being said, there isn't exactly a "more
>>>>> appropriate" macro/variable to use, so this may not really be an issue.
>>>>>
>>>>> Any input on the patch, or the questions I've raised above is greatly
>>>>> appreciated!
>>>>
>>>> While I think extending the e820 table is the right thing to do I'm
>>>> questioning the assumptions here.
>>>>
>>>> Looking briefly through the Xen hypervisor sources I think it isn't
>>>> yet ready for such large machines: the hypervisor's e820 map seems to
>>>> be still limited to 128 e820 entries. Jan, did I overlook an EFI
>>>> specific path extending this limitation?
>>>
>>> No, you didn't. I do question the correlation with "large machines"
>>> here though: The issue isn't with large machines afaict, but with
>>> ones having very many entries (i.e. heavily fragmented).
>>
>> Fact is: non-EFI machines are limited to 128 entries.
>>
>> One reason for fragmentation is NUMA - which tends to be present and
>> especially adding many entries only with lots of nodes - on large
>> machines.
>
> Yes. However, at present Xen only supports up to 64 nodes, and
> with one entry per node (assuming there are holes between nodes,
> but contiguous memory within them) that's still unlikely to overflow
> the table. Yet I'm not meaning to discard the fact that it's been
> known for a long time that since Linux needed a larger table, Xen
> would need to follow suit sooner or later.
>
>>>> In case I'm right the Xen hypervisor should be prepared for a larger
>>>> e820 map, but this won't help alone as there would still be additional
>>>> entries for the IOAPICs created.
>>>>
>>>> So I think we need something like:
>>>>
>>>> #define E820_XEN_MAX (E820_X_MAX + MAX_IO_APICS)
>>>>
>>>> and use this for sizing xen_e820_map[].
>>>
>>> I would say that if any change gets done here, there shouldn't be
>>> any static upper limit at all. That could even be viewed as in line
>>> with recent e820.c changes moving to dynamic allocations. In
>>> particular I don't see why MAX_IO_APICS would need adding in
>>> here, but not other (current and future) factors determining the
>>> (pseudo) E820 map Xen presents to Dom0.
>>
>> The hypervisor interface of XENMEM_machine_memory_map requires to
>> specify the size of the e820 map where the hypervisor can supply
>> entries. The alternative would be try and error: start with a small
>> e820 map and in case of error increasing the size until success. I
>> don't like this approach. Especially as dynamic memory allocations
>> are not possible at this stage (using RESERVE_BRK() isn't any better
>> than a static __initdata array IMO).
>
> Well, I think as a first step we should extend
> XENMEM_{,machine_}memory_map to the model used elsewhere
> where passing in a NULL handle (and perhaps count being zero) is
> a request for the number of needed entries. Granted this doesn't
> help with Linux'es way of consuming the output, but it at least
> allows for dynamic sizing. And Linux would then be free to prepare
> a static array or have a RESERVE_BRK() as it sees fit.

This still needs the definition of an upper limit, as RESERVE_BRK()
is a compile time feature.

For a fully dynamical solution we'd need a way to get a partial
E820 map from the hypervisor (e.g. first 128 entries) in order to
be able to setup at least some memory and later get the rest of
the memory map using some dynamically allocated memory.

>> Which other current factors do you see determining the number of
>> e820 entries presented to Dom0?
>
> Just look at the implementation: The dependency is on iomem_caps,
> and hence any code removing pages from there could affect the
> number of entries. Which means the number of IOMMUs also has an
> effect here. And further things could get added at any time, as we
> happen to find them.

Its easier to see with that information. :-)


Juergen