Re: [PATCH] mm: don't rely on system state to detect hot-plug operations

From: Laurent Dufour
Date: Thu Sep 10 2020 - 17:50:19 EST


On 10/09/2020 at 14:00, David Hildenbrand wrote:
On 10.09.20 13:35, Laurent Dufour wrote:
On 10/09/2020 at 13:12, Michal Hocko wrote:
On Thu 10-09-20 09:51:39, Laurent Dufour wrote:
On 10/09/2020 at 09:23, Michal Hocko wrote:
On Wed 09-09-20 18:07:15, Laurent Dufour wrote:
On 09/09/2020 at 12:59, Michal Hocko wrote:
On Wed 09-09-20 11:21:58, Laurent Dufour wrote:
[...]
For point a, using the enum allows register_mem_sect_under_node() to know
whether the link operation is due to a hotplug operation or done at boot
time.
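(Just to illustrate: the enum itself already exists in include/linux/mmzone.h;
the new part would only be handing a value of it down to the node code, along
these rough lines:)

/* include/linux/mmzone.h -- already present today */
enum memmap_context {
        MEMMAP_EARLY,           /* memory registered while setting up boot memory */
        MEMMAP_HOTPLUG,         /* memory added by a hot-plug operation */
};

register_mem_sect_under_node() would receive one of these values from its
caller instead of trying to derive the same information from system_state.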

Yes, but let me repeat. We have a mess here and different paths check
for the very same condition in different ways. We need to unify those.

What are you suggesting to unify these checks: an MP_* enum as
suggested by David, or something else?

We do have system_state checks spread over different places. I would use
this one and wrap it behind a helper. Or have I missed any reason why
that wouldn't work for this case?
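(Something like this, purely as an illustration; the helper name is made up:)

/* hypothetical helper wrapping the scattered system_state checks */
static inline bool memory_init_at_boot(void)
{
        return system_state < SYSTEM_RUNNING;
}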

That would not work in this case because memory can be hot-added in the
SYSTEM_SCHEDULING system state, and the regular memory is registered in
that system state too. So the system state is not enough to discriminate
between the two.

If that is really the case, all other places need a fix as well.
Btw. could you be more specific about memory hotplug during early boot?
How does that happen? I am only aware of https://lkml.kernel.org/r/20200818110046.6664-1-osalvador@xxxxxxx
and that doesn't happen as early as SYSTEM_SCHEDULING.

That point has been raised by David; quoting him here:

IIRC, ACPI can hotadd memory while SCHEDULING, this patch would break that.

Ccing Oscar, I think he mentioned recently that this is the case with ACPI.

Oscar said that he needs to investigate that further.

On my side, I can't get these ACPI "early" hot-plug operations to happen, so I
can't check that.

If it is clear that ACPI memory hotplug doesn't happen at SYSTEM_SCHEDULING,
the patch I proposed initially is enough to fix the issue.


Booting a qemu guest with 4 coldplugged DIMMs gives me:

:/root# dmesg | grep link_mem
[ 0.302247] link_mem_sections() during 1
[ 0.445086] link_mem_sections() during 1
[ 0.445766] link_mem_sections() during 1
[ 0.446749] link_mem_sections() during 1
[ 0.447746] link_mem_sections() during 1
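
(For reference, that output is from a quick local debug print of system_state
in link_mem_sections(), roughly the line below; it is not an upstream message.
The value 1 is SYSTEM_SCHEDULING.)

        /* local debug instrumentation only */
        pr_info("link_mem_sections() during %d\n", system_state);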

So AFAICS everything happens during SYSTEM_SCHEDULING: boot memory and
ACPI (cold)plug.

To make forward progress with this, relying on the system_state is
obviously not sufficient.

1. We have to fix this instance and the one directly in
get_nid_for_pfn() by passing in the context; see the rough sketch after
this list. (I once had a patch to clean that up, so we wouldn't have two
state checks, but it got lost somewhere.)

2. The "system_state < SYSTEM_RUNNING" check in
register_memory_resource() is correct. Actual memory hotplug after boot
is not impacted. (I remember we discussed this exact behavior back then)

3. build_all_zonelists() should work as expected, called from
start_kernel() before sched_init().
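
Regarding 1., this is the relevant part of get_nid_for_pfn() today, followed
by a rough sketch of what passing the context in could look like (the sketched
signature is purely illustrative):

/* drivers/base/node.c: the current check, guessing from system_state */
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
        if (system_state < SYSTEM_RUNNING)
                return early_pfn_to_nid(pfn);
#endif
        return pfn_to_nid(pfn);

/* sketch: let the caller say whether this is early init or hot-plug */
static int get_nid_for_pfn(unsigned long pfn, enum memmap_context context)
{
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
        if (context == MEMMAP_EARLY)
                return early_pfn_to_nid(pfn);
#endif
        return pfn_to_nid(pfn);
}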

I'm a bit confused now.
Since the hotplug operation happens at SYSTEM_SCHEDULING, just like the
regular memory registration, would it be enough to add a parameter to
register_mem_sect_under_node() (reworking the memmap_context enum)?
That way the check would be based not on the system state but on the
calling path.
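
Roughly something like this (a very rough sketch; names and the exact plumbing
are open for discussion):

/* sketch: the two call paths state the context explicitly */
int link_mem_sections(int nid, unsigned long start_pfn,
                      unsigned long end_pfn,
                      enum memmap_context context);

/* boot path, e.g. from register_one_node(): */
link_mem_sections(nid, start_pfn, end_pfn, MEMMAP_EARLY);

/* hot-plug path, from add_memory_resource(): */
link_mem_sections(nid, start_pfn, end_pfn, MEMMAP_HOTPLUG);

register_mem_sect_under_node() would then get the context from
link_mem_sections() and would not look at system_state at all.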