Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

From: Alexander Duyck
Date: Mon Oct 08 2018 - 18:07:45 EST

Next message: Renato Lui Geh: "[PATCH v3] staging: iio: adt7316: fix parenthesis alignment"
Previous message: Fabio Estevam: "Re: [PATCH v2] staging: iio: adt7316: fix parenthesis alignment"
In reply to: Dan Williams: "Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap"
Next in thread: Alexander Duyck: "Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/8/2018 3:00 PM, Dan Williams wrote:

On Mon, Oct 8, 2018 at 2:48 PM Alexander Duyck
<alexander.h.duyck@xxxxxxxxxxxxxxx> wrote:

On 10/8/2018 2:01 PM, Dan Williams wrote:

On Tue, Sep 25, 2018 at 1:29 PM Alexander Duyck
<alexander.h.duyck@xxxxxxxxxxxxxxx> wrote:

The ZONE_DEVICE pages were being initialized in two locations. One was with
the memory_hotplug lock held and another was outside of that lock. The
problem with this is that it was nearly doubling the memory initialization
time. Instead of doing this twice, once while holding a global lock and
once without, I am opting to defer the initialization to the one outside of
the lock. This allows us to avoid serializing the overhead for memory init
and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way. One wasn't initializing hmm_data, and the other was initializing it to
a poison value. Since this is something that is exposed to the driver in
the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

Reviewed-by: Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>
Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx>
---

v4: Moved moved memmap_init_zone_device to below memmmap_init_zone to avoid
merge conflicts with other changes in the kernel.
v5: No change

This patch appears to cause a regression in the "create.sh" unit test
in the ndctl test suite.

So all you had to do is run the create.sh script to see the issue? I
just want to confirm there isn't any additional information needed
before I try chasing this down.

From the ndctl source tree run:

make -j TESTS="create.sh" check

...the readme has some more setup instructions:
https://github.com/pmem/ndctl/blob/master/README.md

0day has sometimes run this test suite automatically, but we need to
get that more robust because setting up this environment is a bit of a
hoop to jump through with the need to setup the nfit_test module.

I tried to reproduce on -next with:

2302f5ee215e mm: defer ZONE_DEVICE page initialization to the point
where we init pgmap

...but -next does not even boot for me at that commit.

What version of -next? There are a couple of patches probably needed
depending on which version you are trying to boot.

Today's -next, but backed up to that above commit. I was also seeing
CONFIG_DEBUG_LIST spamming the logs, and a crash in the crypto layer.

Here is a warning signature that proceeds a hang with this patch
applied against v4.19-rc6:

percpu ref (blk_queue_usage_counter_release) <= 0 (-1530626) after
switching to atomic
WARNING: CPU: 24 PID: 7346 at lib/percpu-refcount.c:155
percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
CPU: 24 PID: 7346 Comm: modprobe Tainted: G OE 4.19.0-rc6+ #2458
[..]
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x1f7/0x200
[..]
Call Trace:
<IRQ>
? percpu_ref_reinit+0x140/0x140
rcu_process_callbacks+0x273/0x880
__do_softirq+0xd2/0x428
irq_exit+0xf6/0x100
smp_apic_timer_interrupt+0xa2/0x220
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:lock_acquire+0xb8/0x1a0
[..]
? __put_page+0x55/0x150
? __put_page+0x55/0x150
__put_page+0x83/0x150
? __put_page+0x55/0x150
devm_memremap_pages_release+0x194/0x250
release_nodes+0x17c/0x2c0
device_release_driver_internal+0x1a2/0x250
driver_detach+0x3a/0x70
bus_remove_driver+0x58/0xd0
__x64_sys_delete_module+0x13f/0x200
? trace_hardirqs_off_thunk+0x1a/0x1c
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe

So it looks like we are tearing down memory when this is triggered. Do
we know if this is at the end of the test or if this is running in
parallel with anything?

Should not be running in parallel with anything this test is
performing a series of namespace setup and teardown events.

Wait, where did the call to "percpu_ref_get()" go? I think that's the bug.

Actually I think you are probably right. Do you want to get that or should I. Should be a quick patch since you could probably just add a call to percpu_ref_get_many to hold a reference for each page in the range of device pages before calling memmap_init_zone_device.

- Alex

Next message: Renato Lui Geh: "[PATCH v3] staging: iio: adt7316: fix parenthesis alignment"
Previous message: Fabio Estevam: "Re: [PATCH v2] staging: iio: adt7316: fix parenthesis alignment"
In reply to: Dan Williams: "Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap"
Next in thread: Alexander Duyck: "Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]