Re: [PATCH 2/2] arm64/mm: Enable color zero pages

From: Gavin Shan
Date: Tue Sep 22 2020 - 08:39:53 EST


Hi Anshuman,

On 9/21/20 10:40 PM, Anshuman Khandual wrote:
On 09/21/2020 08:26 AM, Gavin Shan wrote:
On 9/17/20 8:22 PM, Robin Murphy wrote:
On 2020-09-17 04:35, Gavin Shan wrote:
On 9/16/20 6:28 PM, Will Deacon wrote:
On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
This enables color zero pages by allocating contiguous page frames
for them. The number of pages is determined by the L1 dCache
(or iCache) size, which is probed from the hardware.

    * Add cache_total_size() to return the L1 dCache (or iCache) size

    * Implement setup_zero_pages(), which is called after the page
      allocator begins to work, to allocate the contiguous pages
      needed by the color zero pages.

    * Rework ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.

Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
  arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
  arch/arm64/include/asm/pgtable.h |  9 ++++++--
  arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
  arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
  arch/arm64/mm/mmu.c              |  7 -------
  5 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a4d1b5f771f6..420e9dde2c51 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -39,6 +39,27 @@
  #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
  #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
+#define CSSELR_TND_SHIFT    4
+#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
+#define CSSELR_LEVEL_SHIFT    1
+#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
+#define CSSELR_IND_SHIFT    0
+#define CSSELR_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
+
+#define CCSIDR_64_LS_SHIFT    0
+#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
+#define CCSIDR_64_ASSOC_SHIFT    3
+#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
+#define CCSIDR_64_SET_SHIFT    32
+#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
+
+#define CCSIDR_32_LS_SHIFT    0
+#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
+#define CCSIDR_32_ASSOC_SHIFT    3
+#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
+#define CCSIDR_32_SET_SHIFT    13
+#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)


[...]
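For context, the patch's own ZERO_PAGE() rework is elided above. Colored
zero page selection on s390 looks roughly like the sketch below; the
names here follow s390's existing implementation and are illustrative,
so the arm64 version may differ in detail:

extern unsigned long empty_zero_page;	/* base of the contiguous zeroed area */
extern unsigned long zero_page_mask;	/* size of that area, minus one */

/*
 * Pick the zero page whose offset in the area (and hence cache color)
 * matches the low bits of the faulting virtual address.
 */
#define ZERO_PAGE(vaddr) \
	(virt_to_page((void *)(empty_zero_page + \
			       (((unsigned long)(vaddr)) & zero_page_mask))))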

Ok. If this was proposed before, is the link to that patchset still
available? :)

When I searched for "my_zero_pfn" in the upstream kernel, I found that
DAX uses the zero page to fill holes in a file, in dax_load_hole().
mmap() on /proc/kcore can use the zero page as well.
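For reference, the generic helper that search turns up keys off
__HAVE_COLOR_ZERO_PAGE, roughly as below (paraphrased from memory;
see include/linux/pgtable.h for the exact form):

#ifdef __HAVE_COLOR_ZERO_PAGE
/* Colored case: the zero pfn depends on the faulting address. */
#define my_zero_pfn(addr)	page_to_pfn(ZERO_PAGE(addr))
#else
static inline unsigned long my_zero_pfn(unsigned long addr)
{
	extern unsigned long zero_pfn;	/* the single shared zero page */

	return zero_pfn;
}
#endif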

But how often will those mapped areas be used afterwards, either in DAX
or /proc/kcore? The minimal adoption of this feature so far (i.e. only
s390 and mips) might have to do with how frequently real-world workloads
read from mapped areas backed by zero pages.


I don't think /proc/kcore is used frequently in the real world, though it
depends on the workload. DAX is supported by multiple filesystems (ext2/
ext4/xfs). Taking ext4 as an example, allocated extents (blocks) that
haven't been written with data yet are read back from the zero page. I
guess this is intended to avoid exposing data written by a previous user,
for safety reasons. This would be a common case and depends heavily on
read performance from the zero page. Besides, holes (blocks) in ext4 are
also backed by zero pages, and that happens frequently.

However, I'm not a filesystem guy. I checked the code and understood it
as above, but I might be completely wrong here.
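As a quick userspace illustration of hole reads being serviced from
zero pages (plain C, nothing DAX-specific; on a DAX mount the mapping
would come from dax_load_hole()):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("sparse.dat", O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
		return 1;

	/* Extend the file without writing data: the one page is a hole. */
	if (ftruncate(fd, 4096))
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Every byte reads back as zero, served from a zero page. */
	printf("first byte: %d\n", p[0]);

	munmap(p, 4096);
	close(fd);
	unlink("sparse.dat");
	return 0;
}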

Yeah, I couldn't work out from the corresponding commit logs why this
feature was enabled on s390/mips; nothing helpful is provided there. I
guess some specific s390/MIPS CPUs have a large L1 cache, spanning
multiple pages per way. In that case, multiple (color) zero pages can
avoid cache line collisions when reading these pages. However, I'm not
sure about arm64. The CPU I experimented on has an 8-way, 64-set, 32KB
L1 dCache and iCache, meaning 4KB of L1 cache per way.
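To spell out the arithmetic: that geometry implies 64-byte lines
(32KB / (8 ways * 64 sets)), so the way size equals the 4KB page size
and there is only a single color to exploit. A throwaway helper makes
this concrete (illustrative only, not from the patch):

#include <stdio.h>

/*
 * Distinct page colors in a VIPT cache:
 * way size = line size * number of sets; colors = way size / page size.
 */
static unsigned int cache_colors(unsigned int line_size,
				 unsigned int nr_sets,
				 unsigned int page_size)
{
	unsigned int way_size = line_size * nr_sets;

	return way_size > page_size ? way_size / page_size : 1;
}

int main(void)
{
	/* 8-way, 64-set, 64-byte lines => 32KB cache, 4KB per way. */
	printf("colors: %u\n", cache_colors(64, 64, 4096));	/* prints 1 */
	return 0;
}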

This feature is cheap to enable, as only a few extra pages are needed,
and it doesn't seem harmful as far as I can tell.

[...]

Thanks,
Gavin