Re: [RFC PATCH] xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
From: Juergen Gross
Date: Tue Nov 15 2016 - 02:36:57 EST
On 15/11/16 08:15, Jan Beulich wrote:
>>>> On 15.11.16 at 07:33, <JGross@xxxxxxxx> wrote:
>> On 15/11/16 01:11, Alex Thorlton wrote:
>>> Hey everyone,
>>>
>>> We're having problems with large systems hitting a BUG in
>>> xen_memory_setup, due to extra e820 entries created in the
>>> XENMEM_machine_memory_map callback. The change in the patch gets things
>>> working, but Boris and I wanted to get opinions on whether or not this
>>> is the appropriate/entire solution, which is why I've sent it as an RFC
>>> for now.
>>>
>>> Boris pointed out to me that E820_X_MAX is only large when CONFIG_EFI=y,
>>> which is a detail worth discussig. He proposed possibly adding
>>> CONFIG_XEN to the conditions under which we set E820_X_MAX to a larger
>>> value than E820MAX, since the Xen e820 table isn't bound by the
>>> zero-page memory limitations.
>>>
>>> I do *slightly* question the use of E820_X_MAX here, only from a
>>> cosmetic prospective, as I believe this macro is intended to describe
>>> the maximum size of the extended e820 table, which, AFAIK, is not used
>>> by the Xen HV. That being said, there isn't exactly a "more
>>> appropriate" macro/variable to use, so this may not really be an issue.
>>>
>>> Any input on the patch, or the questions I've raised above is greatly
>>> appreciated!
>>
>> While I think extending the e820 table is the right thing to do I'm
>> questioning the assumptions here.
>>
>> Looking briefly through the Xen hypervisor sources I think it isn't
>> yet ready for such large machines: the hypervisor's e820 map seems to
>> be still limited to 128 e820 entries. Jan, did I overlook an EFI
>> specific path extending this limitation?
>
> No, you didn't. I do question the correlation with "large machines"
> here though: The issue isn't with large machines afaict, but with
> ones having very many entries (i.e. heavily fragmented).
Fact is: non-EFI machines are limited to 128 entries.
One reason for fragmentation is NUMA - which tends to be present and
especially adding many entries only with lots of nodes - on large
machines.
So while you are of course right that the problem isn't the size of
the machine, but the memory fragmentation, the chances to show up
are much higher on large machines.
So I'd rephrase:
Looking briefly through the Xen hypervisor sources I think it isn't yet
ready for machines with many e820 entries due to memory fragmentation
to be found e.g. on very large NUMA machines with lots of nodes: ...
>> In case I'm right the Xen hypervisor should be prepared for a larger
>> e820 map, but this won't help alone as there would still be additional
>> entries for the IOAPICs created.
>>
>> So I think we need something like:
>>
>> #define E820_XEN_MAX (E820_X_MAX + MAX_IO_APICS)
>>
>> and use this for sizing xen_e820_map[].
>
> I would say that if any change gets done here, there shouldn't be
> any static upper limit at all. That could even be viewed as in line
> with recent e820.c changes moving to dynamic allocations. In
> particular I don't see why MAX_IO_APICS would need adding in
> here, but not other (current and future) factors determining the
> (pseudo) E820 map Xen presents to Dom0.
The hypervisor interface of XENMEM_machine_memory_map requires to
specify the size of the e820 map where the hypervisor can supply
entries. The alternative would be try and error: start with a small
e820 map and in case of error increasing the size until success. I
don't like this approach. Especially as dynamic memory allocations
are not possible at this stage (using RESERVE_BRK() isn't any better
than a static __initdata array IMO).
Which other current factors do you see determining the number of
e820 entries presented to Dom0?
Juergen