Re: [PATCH v11 0/3] remain and optimize memblock_next_valid_pfn on arm and arm64

From: Jia He
Date: Tue Jun 11 2019 - 21:10:58 EST


Hi Hanjun

On 2019/6/11 23:18, Hanjun Guo wrote:
Hello Ard,

Thanks for the reply, please see my comments inline.

On 2019/6/10 21:16, Ard Biesheuvel wrote:
On Sat, 8 Jun 2019 at 06:22, Hanjun Guo <guohanjun@xxxxxxxxxx> wrote:
Hi Ard, Will,

This week we were trying to debug an issue of time consuming in mem_init(),
and leading to this similar solution form Jia He, so I would like to bring this
thread back, please see my detail test result below.

On 2018/9/7 22:44, Will Deacon wrote:
On Thu, Sep 06, 2018 at 01:24:22PM +0200, Ard Biesheuvel wrote:
On 22 August 2018 at 05:07, Jia He <hejianet@xxxxxxxxx> wrote:
Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(). But it causes
possible panic bug. So Daniel Vacek reverted it later.

But as suggested by Daniel Vacek, it is fine to using memblock to skip
gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.

More from what Daniel said:
"On arm and arm64, memblock is used by default. But generic version of
pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
not always return the next valid one but skips more resulting in some
valid frames to be skipped (as if they were invalid). And that's why
kernel was eventually crashing on some !arm machines."

About the performance consideration:
As said by James in b92df1de5,
"I have tested this patch on a virtual model of a Samurai CPU with a
sparse memory map. The kernel boot time drops from 109 to 62 seconds."
Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.

Besides we can remain memblock_next_valid_pfn, there is still some room
for improvement. After this set, I can see the time overhead of memmap_init
is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
memory, pagesize 64k). I believe arm server will benefit more if memory is
larger than TBs

OK so we can summarize the benefits of this series as follows:
- boot time on a virtual model of a Samurai CPU drops from 109 to 62 seconds
- boot time on a QDF2400 arm64 server with 96 GB of RAM drops by ~15
*milliseconds*

Google was not very helpful in figuring out what a Samurai CPU is and
why we should care about the boot time of Linux running on a virtual
model of it, and the 15 ms speedup is not that compelling either.
Testing this patch set on top of Kunpeng 920 based ARM64 server, with
384G memory in total, we got the time consuming below

without this patch set with this patch set
mem_init() 13310ms 1415ms

So we got about 8x speedup on this machine, which is very impressive.

Yes, this is impressive. But does it matter in the grand scheme of
things?
It matters for this machine, because it's for storage and there is
a watchdog and the time consuming triggers the watchdog.

How much time does this system take to arrive at this point
from power on?
Sorry, I don't have such data, as the arch timer is not initialized
and I didn't see the time stamp at this point, but I read the cycles
from arch timer before and after the time consuming function to get
how much time consumed.

The time consuming is related the memory DIMM size and where to locate those
memory DIMMs in the slots. In above case, we are using 16G memory DIMM.
We also tested 1T memory with 64G size for each memory DIMM on another ARM64
machine, the time consuming reduced from 20s to 2s (I think it's related to
firmware implementations).

I agree that this optimization looks good in isolation, but the fact
that you spotted a bug justifies my skepticism at the time. On the
other hand, now that we have several independent reports (from you,
but also from the Renesas folks) that the speedup is worthwhile for
real world use cases, I think it does make sense to revisit it.
Thank you very much for taking care of this :)

So what I would like to see is the patch set being proposed again,
with the new data points added for documentation. Also, the commit
logs need to crystal clear about how the meaning of PFN validity
differs between ARM and other architectures, and why the assumptions
that the optimization is based on are guaranteed to hold.
I think Jia He no longer works for HXT, if don't mind, I can repost
this patch set with Jia He's authority unchanged.
Ok, I don't mind that, thanks for your followup :)

---
Cheers,
Justin (Jia He)