Re: [PATCH v2 0/5] Add movablecore_map boot option

From: Jiang Liu
Date: Thu Nov 29 2012 - 10:48:11 EST


On 11/29/2012 06:38 PM, Yasuaki Ishimatsu wrote:
> Hi Tony,
>
> 2012/11/29 6:34, Luck, Tony wrote:
>>> 1. use firmware information
>>> According to ACPI spec 5.0, SRAT table has memory affinity structure
>>> and the structure has a Hot Pluggable field. See "5.2.16.2 Memory
>>> Affinity Structure". If we use the information, we might be able to
>>> specify movable memory by firmware. For example, if the Hot Pluggable
>>> field is enabled, Linux sets the memory as movable memory.
>>>
>>> 2. use boot option
>>> This is our proposal. New boot option can specify memory range to use
>>> as movable memory.
>>
>> Isn't this just moving the work to the user? To pick good values for the
>
> Yes.
>
>> movable areas, they need to know how the memory lines up across
>> node boundaries ... because they need to make sure to allow some
>> non-movable memory allocations on each node so that the kernel can
>> take advantage of node locality.
>
> There is no problem.
> Linux already has two boot options, kernelcore= and movablecore=.
> So if we use them, non-movable memory is divided evenly across the nodes.
>
> But there is no way to specify a node used as movable currently. So
> we proposed the new boot option.
>
>> So the user would have to read at least the SRAT table, and perhaps
>> more, to figure out what to provide as arguments.
>>
>
>> Since this is going to be used on a dynamic system where nodes might
>> be added and removed - the right values for these arguments might
>> change from one boot to the next. So even if the user gets them right
>> on day 1, a month later when a new node has been added, or a broken
>> node removed the values would be stale.
>
> I don't think so. Even if we hot add/remove node, the memory range of
> each memory device is not changed. So we don't need to change the boot
> option.

Hi Yasuaki,
The addresses assigned to each memory device may change when the
hardware configuration changes.
In my experience with some hotplug-capable Xeon and Itanium systems, a
typical algorithm adopted by the BIOS to support memory hotplug is:
1) For backward compatibility, the BIOS assigns contiguous addresses to the
memory devices present at boot time. In other words, there are no holes in
the memory address space except the hole just below 4G, which is reserved
for MMIO and other arch-specific usage.
2) To support memory hotplug, the BIOS reserves enough memory address ranges
at the high end.

Let's take a typical four-socket system as an example. Say we have four
sockets, S0-S3, and each socket supports at most two memory devices (M0-M1).
Each memory device supports at most 128G of memory. At boot, every memory
slot is populated with a 4GB device. The address assignment then looks
like:
0-2G: S0.M0
2-4G: MMIO
4-8G: S0.M1
8-12G: S1.M0
12-16G: S1.M1
16-20G: S2.M0
20-24G: S2.M1
24-28G: S3.M0
28-32G: S3.M1
32-34G: S0.M0 (memory recovered from the MMIO hole)
1024-1152G: reserved for S0.M0
1152-1280G: reserved for S0.M1
1280-1408G: reserved for S1.M0
1408-1536G: reserved for S1.M1
1536-1664G: reserved for S2.M0
1664-1792G: reserved for S2.M1
1792-1920G: reserved for S3.M0
1920-2048G: reserved for S3.M1
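
As a rough illustration of step 1), the boot-time part of the table can be
reproduced with a small model. This is only a sketch under my own
assumptions, not what any particular BIOS does: devices are packed
back to back, and whatever would land in the 2-4G MMIO hole is remapped
above the last device. The name boot_layout() is made up for this example.

```python
# All addresses and sizes are in GiB.
MMIO_START, MMIO_END = 2, 4  # hole just below 4G, as in the example


def boot_layout(devices):
    """Return [(start, end, name)] for devices present at boot.

    devices is a list of (name, size) in socket/slot order. This is a
    simplified model (an assumption): addresses are handed out
    contiguously, and memory displaced by the MMIO hole is remapped
    to the top of the populated range.
    """
    layout = []
    remapped = []  # (name, size) of memory displaced by the hole
    cursor = 0
    for name, size in devices:
        start, end = cursor, cursor + size
        cursor = end
        if start < MMIO_START:
            # Part of the device that sits below the hole.
            layout.append((start, min(end, MMIO_START), name))
        lost = min(end, MMIO_END) - max(start, MMIO_START)
        if lost > 0:
            # Part of the device that collides with the hole.
            remapped.append((name, lost))
        if end > MMIO_END:
            # Part of the device that sits above the hole.
            layout.append((max(start, MMIO_END), end, name))
    for name, lost in remapped:
        layout.append((cursor, cursor + lost, name))
        cursor += lost
    return layout


# Four sockets, two 4G devices each, as in the example.
FULL = [(f"S{s}.M{m}", 4) for s in range(4) for m in range(2)]

for start, end, name in boot_layout(FULL):
    print(f"{start}-{end}G: {name}")
```

The 1024G and above reservations are not modelled here, since they do not
depend on what is populated at boot.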

If we hot-remove S2.M0 and add back a bigger memory device with 8G of memory,
it will be assigned a new memory address range, 1536-1544G.
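
The arithmetic behind that range can be sketched as follows. The 1024G
base, the 128G slot size and the slot ordering are taken from the example
table above; hot_add_range() is a hypothetical helper, and a real BIOS may
choose its reservations differently.

```python
# All addresses and sizes are in GiB; values come from the example table.
RESERVED_BASE = 1024   # start of the range reserved for hot-added memory
SLOT_SIZE = 128        # maximum capacity of one memory device
SLOTS_PER_SOCKET = 2   # M0 and M1


def hot_add_range(socket, slot, size):
    """Return (start, end) assigned to a hot-added device (assumed model)."""
    if size > SLOT_SIZE:
        raise ValueError("device is larger than its reserved range")
    index = socket * SLOTS_PER_SOCKET + slot
    start = RESERVED_BASE + index * SLOT_SIZE
    return (start, start + size)


# An 8G device hot-added as S2.M0.
print(hot_add_range(2, 0, 8))
```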

Based on the above algorithm, suppose we configure 16-24G (S2.M0 and S2.M1) as
movable memory. Then:
1) Memory on S3 will be configured as movable if S2 isn't present at boot time (the
same effect as "movable_node" in the discussion at https://lkml.org/lkml/2012/11/27/154).
2) S2.M0 will be configured as non-movable and S3.M0 will be configured as movable
if S1.M0 isn't present at boot.
3) And what happens if S1.M0 is replaced with an 8GB memory device?
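
These cases can be checked with a toy model. It is a sketch under the
assumption that the BIOS packs boot-time devices back to back; it ignores
the MMIO remap, which only affects S0.M0 and the very top of memory in
these configurations. The helper names are invented for this example.

```python
# All addresses and sizes are in GiB.

def flat_layout(devices):
    """Pack devices back to back: [(start, end, name)] (assumed model)."""
    layout, cursor = [], 0
    for name, size in devices:
        layout.append((cursor, cursor + size, name))
        cursor += size
    return layout


def devices_in_range(devices, lo, hi):
    """Names of devices whose addresses overlap [lo, hi)."""
    return [name for start, end, name in flat_layout(devices)
            if start < hi and end > lo]


MOVABLE = (16, 24)  # the fixed range given on the kernel command line

FULL = [(f"S{s}.M{m}", 4) for s in range(4) for m in range(2)]
WITHOUT_S2 = [d for d in FULL if not d[0].startswith("S2.")]
WITHOUT_S1M0 = [d for d in FULL if d[0] != "S1.M0"]
BIGGER_S1M0 = [(n, 8 if n == "S1.M0" else size) for n, size in FULL]

for label, config in [("all present", FULL),
                      ("S2 absent", WITHOUT_S2),
                      ("S1.M0 absent", WITHOUT_S1M0),
                      ("S1.M0 is 8G", BIGGER_S1M0)]:
    print(label, "->", devices_in_range(config, *MOVABLE))
```

In this model the same 16-24G range lands on a different pair of devices in
each configuration, and with an 8G S1.M0 it covers S1.M1 and S2.M0.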

To summarize, a kernel parameter that configures movable memory for hotplug will
easily become invalid when the hardware configuration changes, and that may confuse
administrators. I still think the most reliable way is to determine movable memory
for hotplug by parsing the hardware configuration information provided by the BIOS.

Regards!
Gerry
