Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page

From: Xu Lu
Date: Fri Dec 06 2024 - 08:42:52 EST

Next message: Parker Newman: "Re: [PATCH v1 1/1] net: stmmac: dwmac-tegra: Read iommu stream id from device tree"
Previous message: Vincent Mailhol: "Re: [PATCH 02/10] compiler.h: add is_const() as a replacement of __is_constexpr()"
In reply to: David Hildenbrand: "Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page"
Next in thread: Pedro Falcato: "Re: [External] Re: [RFC PATCH v2 00/21] riscv: Introduce 64K base page"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi David,

On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 06.12.24 03:00, Zi Yan wrote:
> > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> >
> >> This patch series attempts to break through the limitation of MMU and
> >> supports larger base page on RISC-V, which only supports 4K page size
> >> now. The key idea is to always manage and allocate memory at a
> >> granularity of 64K and use SVNAPOT to accelerate address translation.
> >> This is the second version and the detailed introduction can be found
> >> in [1].
> >>
> >> Changes from v1:
> >> - Rebase on v6.12.
> >>
> >> - Adjust the page table entry shift to reduce page table memory usage.
> >> For example, in SV39, the traditional va behaves as:
> >>
> >> ----------------------------------------------
> >> | pgd index | pmd index | pte index | offset |
> >> ----------------------------------------------
> >> | 38 30 | 29 21 | 20 12 | 11 0 |
> >> ----------------------------------------------
> >>
> >> When we choose 64K as basic software page, va now behaves as:
> >>
> >> ----------------------------------------------
> >> | pgd index | pmd index | pte index | offset |
> >> ----------------------------------------------
> >> | 38 34 | 33 25 | 24 16 | 15 0 |
> >> ----------------------------------------------
> >>
> >> - Fix some bugs in v1.
> >>
> >> Thanks in advance for comments.
> >>
> >> [1] https://lwn.net/Articles/952722/
> >
> > This looks very interesting. Can you cc me and linux-mm@xxxxxxxxx
> > in the future? Thanks.
> >
> > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig PTE
> > should have similar effect of RISC-V’s SVNAPOT, right?
>
> What is the real benefit over 4k + large folios/mTHP?
>
> 64K comes with the problem of internal fragmentation: for example, a
> page table that only occupies 4k of memory suddenly consumes 64K; quite
> a downside.

The original idea comes from the performance benefits we achieved on
the ARM 64K kernel. We ran several real-world applications on the ARM
Ampere Altra platform and found that their performance on the 64K page
kernel is significantly higher than on the 4K page kernel:
For Redis, throughput increased by 250% and latency decreased by 70%.
For MySQL, throughput increased by 16.9% and latency decreased by
14.5%.
For our own NewSQL database, throughput increased by 16.5% and latency
decreased by 13.8%.

We have also compared the 64K kernel against 4K + large folios/mTHP on
ARM Neoverse-N2. The results show a considerable performance
improvement on the 64K kernel for both speccpu and lmbench, even when
the 4K kernel enables THP and ARM64_CONTPTE:
For the speccpu benchmark, the 64K kernel without any huge page
optimization still achieves a 4.17% higher score than the 4K kernel
with both transparent huge pages and the CONTPTE optimization.
For lmbench, compared with the 4K kernel with transparent huge pages
and the CONTPTE optimization, the 64K kernel achieves 75.98% lower
memory mapping latency (16MB), 84.34% higher mmap read open2close
bandwidth (16MB), and 10.71% lower random load latency (16MB).
Interestingly, kernels with transparent huge page support sometimes
perform worse, for both 4K and 64K (for example, in the mmap read
bandwidth benchmark). We assume this is due to the overhead of
combining and collapsing huge pages.
Also, if you check the full results, you will find that the larger the
memory size used for testing, the better the 64K kernel usually
performs relative to the 4K kernel, unless the memory size lies in a
range where the 4K kernel can apply 2MB huge pages while the 64K kernel
can't.
In summary, for performance-sensitive applications that require higher
bandwidth and lower latency, 4K pages with huge pages may not always be
the best choice, and 64K pages can achieve better results.
The test environment and results are attached.
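
To make the remark about 2MB huge pages concrete, here is a rough
sketch of the arithmetic. It is not taken from the test setup. It
assumes arm64-style page tables in which each table fills exactly one
base page with 8-byte entries; the helper names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8-byte page table entries, one table per base page. */
#define PTE_SIZE 8ULL

/* Number of entries that fit in one page-table page. */
static uint64_t ptes_per_table(uint64_t base_page_size)
{
	assert(base_page_size % PTE_SIZE == 0);
	return base_page_size / PTE_SIZE;
}

/* Memory covered by a single PMD-level block (huge page) entry. */
static uint64_t pmd_block_size(uint64_t base_page_size)
{
	return base_page_size * ptes_per_table(base_page_size);
}
```

Under these assumptions a PMD-level block covers 2MB with 4K base
pages but 512MB with 64K base pages, so a 16MB test buffer could be
PMD-mapped only on the 4K kernel.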

As RISC-V has no native 64K MMU support, we introduce a software
implementation and accelerate it via Svnapot. Of course, there will be
some extra overhead compared with a native 64K MMU. Thus, we are also
trying to persuade the RISC-V community to add an extension for native
64K MMU support [1]. Please join us if you are interested.

[1] https://lists.riscv.org/g/tech-privileged/topic/query_about_risc_v_s_support/108641509
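
For readers unfamiliar with Svnapot, the following is a minimal sketch
of how a 64K NAPOT mapping is encoded in a 4K-level PTE. It is not the
code from this series. It assumes the encoding from the Svnapot
extension (N bit at PTE bit 63, ppn[3:0] = 0b1000 for a 64K region, PPN
field in bits 53:10); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PPN_SHIFT   10                 /* PPN field starts at bit 10 */
#define PTE_PPN_MASK    ((1ULL << 44) - 1) /* PPN occupies bits 53:10 */
#define PTE_NAPOT       (1ULL << 63)       /* Svnapot N bit */
#define NAPOT_64K_MARK  0x8ULL             /* ppn[3:0] = 0b1000 means 64K */
#define NAPOT_64K_MASK  0xFULL             /* low 4 PPN bits carry the size */

/*
 * Build the PTE value for a 64K NAPOT mapping. pfn_4k is the physical
 * frame number in 4K units and must be 64K-aligned. All 16 PTE slots
 * covering the 64K range are expected to hold this same value.
 */
static uint64_t make_napot_64k_pte(uint64_t pfn_4k, uint64_t prot)
{
	assert((pfn_4k & NAPOT_64K_MASK) == 0);
	uint64_t ppn = pfn_4k | NAPOT_64K_MARK;
	return PTE_NAPOT | (ppn << PTE_PPN_SHIFT) | prot;
}

/* Recover the 64K-aligned frame number (in 4K units) from a NAPOT PTE. */
static uint64_t napot_64k_pte_to_pfn(uint64_t pte)
{
	uint64_t ppn = (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
	return ppn & ~NAPOT_64K_MASK;
}
```

The point of the encoding is that the hardware may cache the whole 64K
range as a single TLB entry, while the page table itself still uses the
4K format.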

Best Regards,

Xu Lu

>
> --
> Cheers,
>
> David / dhildenb
>

Attachment: ARM 64K Result.xlsx
Description: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
