Re: [RFC PATCH 4/7] riscv: Implement sv48 support

From: Anup Patel
Date: Wed Apr 08 2020 - 01:06:51 EST


On Wed, Apr 8, 2020 at 10:09 AM Alex Ghiti <alex@xxxxxxxx> wrote:
>
> Hi Anup,
>
> On 4/7/20 1:56 AM, Anup Patel wrote:
> > On Tue, Apr 7, 2020 at 10:44 AM Alex Ghiti <alex@xxxxxxxx> wrote:
> >>
> >>
> >> On 4/3/20 11:53 AM, Palmer Dabbelt wrote:
> >>> On Sun, 22 Mar 2020 04:00:25 PDT (-0700), alex@xxxxxxxx wrote:
> >>>> Adding a 4th page table level lets a 64-bit kernel address 2^48 bytes
> >>>> of virtual address space: in practice, that offers roughly 160TB of
> >>>> virtual address space to userspace and allows up to 64TB of physical
> >>>> memory.
> >>>>
> >>>> By default, the kernel will try to boot with a 4-level page table. If the
> >>>> underlying hardware does not support it, we will automatically fall back
> >>>> to a standard 3-level page table by folding the new PUD level into the
> >>>> PGDIR level.
> >>>>
> >>>> Early page table preparation happens too early in the boot process to
> >>>> use any device-tree entry, so to detect HW capabilities at runtime we
> >>>> use the SATP feature that ignores writes with an unsupported mode. The
> >>>> mode currently used by the kernel is then made available through cpuinfo.
> >>>
> >>> Ya, I think that's the right way to go about this. There's no reason to
> >>> rely on duplicate DT mechanisms for things the ISA defines for us.
> >>>
> >>>>
> >>>> Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
> >>>> ---
> >>>> arch/riscv/Kconfig | 6 +-
> >>>> arch/riscv/include/asm/csr.h | 3 +-
> >>>> arch/riscv/include/asm/fixmap.h | 1 +
> >>>> arch/riscv/include/asm/page.h | 15 +++-
> >>>> arch/riscv/include/asm/pgalloc.h | 36 ++++++++
> >>>> arch/riscv/include/asm/pgtable-64.h | 98 ++++++++++++++++++++-
> >>>> arch/riscv/include/asm/pgtable.h | 5 +-
> >>>> arch/riscv/kernel/head.S | 37 ++++++--
> >>>> arch/riscv/mm/context.c | 4 +-
> >>>> arch/riscv/mm/init.c | 128 +++++++++++++++++++++++++---
> >>>> 10 files changed, 302 insertions(+), 31 deletions(-)
> >>>>
> >>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >>>> index a475c78e66bc..79560e94cc7c 100644
> >>>> --- a/arch/riscv/Kconfig
> >>>> +++ b/arch/riscv/Kconfig
> >>>> @@ -66,6 +66,7 @@ config RISCV
> >>>> select ARCH_HAS_GCOV_PROFILE_ALL
> >>>> select HAVE_COPY_THREAD_TLS
> >>>> select HAVE_ARCH_KASAN if MMU && 64BIT
> >>>> + select RELOCATABLE if 64BIT
> >>>>
> >>>> config ARCH_MMAP_RND_BITS_MIN
> >>>> default 18 if 64BIT
> >>>> @@ -104,7 +105,7 @@ config PAGE_OFFSET
> >>>> default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> >>>> default 0x80000000 if 64BIT && !MMU
> >>>> default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> >>>> - default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
> >>>> + default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
> >>>>
> >>>> config ARCH_FLATMEM_ENABLE
> >>>> def_bool y
> >>>> @@ -148,8 +149,11 @@ config GENERIC_HWEIGHT
> >>>> config FIX_EARLYCON_MEM
> >>>> def_bool MMU
> >>>>
> >>>> +# On a 64BIT relocatable kernel, the 4-level page table is folded at runtime
> >>>> +# into a 3-level page table when sv48 is not supported.
> >>>> config PGTABLE_LEVELS
> >>>> int
> >>>> + default 4 if 64BIT && RELOCATABLE
> >>>> default 3 if 64BIT
> >>>> default 2
> >>>
> >>> I assume this means you're relying on relocation to move the kernel around
> >>> independently of PAGE_OFFSET in order to fold in the missing page table
> >>> level?
> >>
> >> Yes, relocation is needed to fall back to 3-level and move PAGE_OFFSET
> >> accordingly.
> >>
> >>> That seems reasonable, but it does impose a performance penalty, as
> >>> relocatable kernels necessitate slower generated code. Additionally,
> >>> there will likely be a performance penalty due to the extra memory access
> >>> on TLB misses, which is unnecessary for workloads that don't need the
> >>> longer VA width on machines that support it.
> >>
> >> Sorry, I had no time to answer your previous mail regarding performance:
> >> I have no numbers. But the only penalty this patchset adds to a
> >> 3-level page table is the check in page table management functions to
> >> know whether 4-level is activated. And, as you said, there is the extra
> >> cost of a relocatable kernel, which I had ignored since it is necessary
> >> anyway.
> >
> > I guess we don't need relocation if we can avoid page table folding by
> > detecting Sv48 mode very early in setup_vm(). Is there any other place
> > where relocation would be required?
>
> Folding the 4th level is only part of the problem: we also have to
> dynamically change the virtual address of the kernel. How can we achieve
> that without relocations?
>
> KASLR also uses relocations, see Zong's recent patchset.

Good to know that relocation is not just for page table folding.

Thanks,
Anup

>
> Thanks,
>
> Alex
>
> >
> > If we can avoid relocation entirely, it will certainly help performance.
> >
> > Regards,
> > Anup
> >
> >>
> >>>
> >>> I think the best bet here would be to have a Kconfig option for the
> >>> number of page table levels (which could be MAXPHYSMEM or a second
> >>> partially free parameter) and then another boolean argument along the
> >>> lines of "also support machines with smaller VA widths". It seems best to
> >>> turn on the largest VA width and support for folding by default, as I
> >>> assume that's what distros would do.
> >>
> >> I'm not a big fan of a new Kconfig option to allow people to have a
> >> 3-level page table, because that implies maintaining a new kernel. Even
> >> for us, having to compile 2 kernels each time we change something in mm
> >> code will be painful.
> >>
> >> I have just reviewed Zong's KASLR patchset: he needs to parse the dtb to
> >> find the reserved regions in order not to overwrite one of them when
> >> copying the kernel to its new destination. After that, he loops back
> >> to setup_vm to re-create the mapping to the new kernel.
> >> If that's the way we take for KASLR, we can follow the same path here:
> >> boot with 4-level by default, go to check what is wanted in the device
> >> tree and if it is 3-level, loop back to setup_vm.
> >>
> >>>
> >>> I didn't really look closely at the rest of this, but it generally smells OK.
> >>> The diff will need to be somewhat different for the next version, anyway :)
> >>>
> >>> Thanks for doing this!
> >>>
> >>>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> >>>> index 435b65532e29..3828d55af85e 100644
> >>>> --- a/arch/riscv/include/asm/csr.h
> >>>> +++ b/arch/riscv/include/asm/csr.h
> >>>> @@ -40,11 +40,10 @@
> >>>> #ifndef CONFIG_64BIT
> >>>> #define SATP_PPN _AC(0x003FFFFF, UL)
> >>>> #define SATP_MODE_32 _AC(0x80000000, UL)
> >>>> -#define SATP_MODE SATP_MODE_32
> >>>> #else
> >>>> #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> >>>> #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> >>>> -#define SATP_MODE SATP_MODE_39
> >>>> +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> >>>> #endif
> >>>>
> >>>> /* Exception cause high bit - is an interrupt if set */
> >>>> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> >>>> index 42d2c42f3cc9..26e7799c5675 100644
> >>>> --- a/arch/riscv/include/asm/fixmap.h
> >>>> +++ b/arch/riscv/include/asm/fixmap.h
> >>>> @@ -27,6 +27,7 @@ enum fixed_addresses {
> >>>> FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
> >>>> FIX_PTE,
> >>>> FIX_PMD,
> >>>> + FIX_PUD,
> >>>> FIX_EARLYCON_MEM_BASE,
> >>>> __end_of_fixed_addresses
> >>>> };
> >>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >>>> index 691f2f9ded2f..f1a26a0690ef 100644
> >>>> --- a/arch/riscv/include/asm/page.h
> >>>> +++ b/arch/riscv/include/asm/page.h
> >>>> @@ -32,11 +32,19 @@
> >>>> * physical memory (aligned on a page boundary).
> >>>> */
> >>>> #ifdef CONFIG_RELOCATABLE
> >>>> -extern unsigned long kernel_virt_addr;
> >>>> #define PAGE_OFFSET kernel_virt_addr
> >>>> +
> >>>> +#ifdef CONFIG_64BIT
> >>>> +/*
> >>>> + * By default, the CONFIG_PAGE_OFFSET value corresponds to the SV48 address
> >>>> + * space, so define the PAGE_OFFSET value for SV39.
> >>>> + */
> >>>> +#define PAGE_OFFSET_L3 0xffffffe000000000
> >>>> +#define PAGE_OFFSET_L4 _AC(CONFIG_PAGE_OFFSET, UL)
> >>>> +#endif /* CONFIG_64BIT */
> >>>> #else
> >>>> #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> >>>> -#endif
> >>>> +#endif /* CONFIG_RELOCATABLE */
> >>>>
> >>>> #define KERN_VIRT_SIZE -PAGE_OFFSET
> >>>>
> >>>> @@ -104,6 +112,9 @@ extern unsigned long pfn_base;
> >>>>
> >>>> extern unsigned long max_low_pfn;
> >>>> extern unsigned long min_low_pfn;
> >>>> +#ifdef CONFIG_RELOCATABLE
> >>>> +extern unsigned long kernel_virt_addr;
> >>>> +#endif
> >>>>
> >>>> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
> >>>> #define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
> >>>> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> >>>> index 3f601ee8233f..540eaa5a8658 100644
> >>>> --- a/arch/riscv/include/asm/pgalloc.h
> >>>> +++ b/arch/riscv/include/asm/pgalloc.h
> >>>> @@ -36,6 +36,42 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
> >>>>
> >>>> set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>> }
> >>>> +
> >>>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> >>>> +{
> >>>> + if (pgtable_l4_enabled) {
> >>>> + unsigned long pfn = virt_to_pfn(pud);
> >>>> +
> >>>> + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>> + }
> >>>> +}
> >>>> +
> >>>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> >>>> + pud_t *pud)
> >>>> +{
> >>>> + if (pgtable_l4_enabled) {
> >>>> + unsigned long pfn = virt_to_pfn(pud);
> >>>> +
> >>>> + set_p4d_safe(p4d,
> >>>> + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> >>>> + }
> >>>> +}
> >>>> +
> >>>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return (pud_t *)__get_free_page(
> >>>> + GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
> >>>> + return NULL;
> >>>> +}
> >>>> +
> >>>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + free_page((unsigned long)pud);
> >>>> +}
> >>>> +
> >>>> +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> >>>> #endif /* __PAGETABLE_PMD_FOLDED */
> >>>>
> >>>> #define pmd_pgtable(pmd) pmd_page(pmd)
> >>>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> >>>> index b15f70a1fdfa..cc4ffbe778f3 100644
> >>>> --- a/arch/riscv/include/asm/pgtable-64.h
> >>>> +++ b/arch/riscv/include/asm/pgtable-64.h
> >>>> @@ -8,16 +8,32 @@
> >>>>
> >>>> #include <linux/const.h>
> >>>>
> >>>> -#define PGDIR_SHIFT 30
> >>>> +extern bool pgtable_l4_enabled;
> >>>> +
> >>>> +#define PGDIR_SHIFT (pgtable_l4_enabled ? 39 : 30)
> >>>> /* Size of region mapped by a page global directory */
> >>>> #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> >>>> #define PGDIR_MASK (~(PGDIR_SIZE - 1))
> >>>>
> >>>> +/* pud is folded into pgd in case of 3-level page table */
> >>>> +#define PUD_SHIFT 30
> >>>> +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> >>>> +#define PUD_MASK (~(PUD_SIZE - 1))
> >>>> +
> >>>> #define PMD_SHIFT 21
> >>>> /* Size of region mapped by a page middle directory */
> >>>> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> >>>> #define PMD_MASK (~(PMD_SIZE - 1))
> >>>>
> >>>> +/* Page Upper Directory entry */
> >>>> +typedef struct {
> >>>> + unsigned long pud;
> >>>> +} pud_t;
> >>>> +
> >>>> +#define pud_val(x) ((x).pud)
> >>>> +#define __pud(x) ((pud_t) { (x) })
> >>>> +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> >>>> +
> >>>> /* Page Middle Directory entry */
> >>>> typedef struct {
> >>>> unsigned long pmd;
> >>>> @@ -25,7 +41,6 @@ typedef struct {
> >>>>
> >>>> #define pmd_val(x) ((x).pmd)
> >>>> #define __pmd(x) ((pmd_t) { (x) })
> >>>> -
> >>>> #define PTRS_PER_PMD (PAGE_SIZE / sizeof(pmd_t))
> >>>>
> >>>> static inline int pud_present(pud_t pud)
> >>>> @@ -60,6 +75,16 @@ static inline void pud_clear(pud_t *pudp)
> >>>> set_pud(pudp, __pud(0));
> >>>> }
> >>>>
> >>>> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> >>>> +{
> >>>> + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> >>>> +}
> >>>> +
> >>>> +static inline unsigned long _pud_pfn(pud_t pud)
> >>>> +{
> >>>> + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> >>>> +}
> >>>> +
> >>>> static inline unsigned long pud_page_vaddr(pud_t pud)
> >>>> {
> >>>> return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >>>> @@ -70,6 +95,15 @@ static inline struct page *pud_page(pud_t pud)
> >>>> return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> >>>> }
> >>>>
> >>>> +#define mm_pud_folded mm_pud_folded
> >>>> +static inline bool mm_pud_folded(struct mm_struct *mm)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return false;
> >>>> +
> >>>> + return true;
> >>>> +}
> >>>> +
> >>>> #define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> >>>>
> >>>> static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
> >>>> @@ -90,4 +124,64 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> >>>> #define pmd_ERROR(e) \
> >>>> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
> >>>>
> >>>> +#define pud_ERROR(e) \
> >>>> + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> >>>> +
> >>>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + *p4dp = p4d;
> >>>> + else
> >>>> + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_none(p4d_t p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return (p4d_val(p4d) == 0);
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_present(p4d_t p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return (p4d_val(p4d) & _PAGE_PRESENT);
> >>>> +
> >>>> + return 1;
> >>>> +}
> >>>> +
> >>>> +static inline int p4d_bad(p4d_t p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return !p4d_present(p4d);
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +static inline void p4d_clear(p4d_t *p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + set_p4d(p4d, __p4d(0));
> >>>> +}
> >>>> +
> >>>> +static inline unsigned long p4d_page_vaddr(p4d_t p4d)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return (unsigned long)pfn_to_virt(
> >>>> + p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> >>>> +
> >>>> + return pud_page_vaddr((pud_t) { p4d_val(p4d) });
> >>>> +}
> >>>> +
> >>>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> >>>> +
> >>>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> >>>> +{
> >>>> + if (pgtable_l4_enabled)
> >>>> + return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
> >>>> +
> >>>> + return (pud_t *)p4d;
> >>>> +}
> >>>> +
> >>>> #endif /* _ASM_RISCV_PGTABLE_64_H */
> >>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >>>> index dce401eed1d3..06361db3f486 100644
> >>>> --- a/arch/riscv/include/asm/pgtable.h
> >>>> +++ b/arch/riscv/include/asm/pgtable.h
> >>>> @@ -13,8 +13,7 @@
> >>>>
> >>>> #ifndef __ASSEMBLY__
> >>>>
> >>>> -/* Page Upper Directory not used in RISC-V */
> >>>> -#include <asm-generic/pgtable-nopud.h>
> >>>> +#include <asm-generic/pgtable-nop4d.h>
> >>>> #include <asm/page.h>
> >>>> #include <asm/tlbflush.h>
> >>>> #include <linux/mm_types.h>
> >>>> @@ -27,7 +26,7 @@
> >>>>
> >>>> #ifdef CONFIG_MMU
> >>>> #ifdef CONFIG_64BIT
> >>>> -#define VA_BITS 39
> >>>> +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> >>>> #define PA_BITS 56
> >>>> #else
> >>>> #define VA_BITS 32
> >>>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >>>> index 1c2fbefb8786..22617bd7477f 100644
> >>>> --- a/arch/riscv/kernel/head.S
> >>>> +++ b/arch/riscv/kernel/head.S
> >>>> @@ -113,6 +113,8 @@ clear_bss_done:
> >>>> call setup_vm
> >>>> #ifdef CONFIG_MMU
> >>>> la a0, early_pg_dir
> >>>> + la a1, satp_mode
> >>>> + REG_L a1, (a1)
> >>>> call relocate
> >>>> #endif /* CONFIG_MMU */
> >>>>
> >>>> @@ -131,24 +133,28 @@ clear_bss_done:
> >>>> #ifdef CONFIG_MMU
> >>>> relocate:
> >>>> #ifdef CONFIG_RELOCATABLE
> >>>> - /* Relocate return address */
> >>>> - la a1, kernel_virt_addr
> >>>> - REG_L a1, 0(a1)
> >>>> + /*
> >>>> + * Relocate return address but save it in case 4-level page table is
> >>>> + * not supported.
> >>>> + */
> >>>> + mv s1, ra
> >>>> + la a3, kernel_virt_addr
> >>>> + REG_L a3, 0(a3)
> >>>> #else
> >>>> - li a1, PAGE_OFFSET
> >>>> + li a3, PAGE_OFFSET
> >>>> #endif
> >>>> la a2, _start
> >>>> - sub a1, a1, a2
> >>>> - add ra, ra, a1
> >>>> + sub a3, a3, a2
> >>>> + add ra, ra, a3
> >>>>
> >>>> /* Point stvec to virtual address of instruction after satp write */
> >>>> la a2, 1f
> >>>> - add a2, a2, a1
> >>>> + add a2, a2, a3
> >>>> csrw CSR_TVEC, a2
> >>>>
> >>>> + /* First try with a 4-level page table */
> >>>> /* Compute satp for kernel page tables, but don't load it yet */
> >>>> srl a2, a0, PAGE_SHIFT
> >>>> - li a1, SATP_MODE
> >>>> or a2, a2, a1
> >>>>
> >>>> /*
> >>>> @@ -162,6 +168,19 @@ relocate:
> >>>> or a0, a0, a1
> >>>> sfence.vma
> >>>> csrw CSR_SATP, a0
> >>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >>>> + /*
> >>>> + * If we fall through here, that means the HW does not support SV48.
> >>>> + * We need a 3-level page table, so simply fold pud into the pgd level
> >>>> + * and finally jump back to relocate with 3-level parameters.
> >>>> + */
> >>>> + call setup_vm_fold_pud
> >>>> +
> >>>> + la a0, early_pg_dir
> >>>> + li a1, SATP_MODE_39
> >>>> + mv ra, s1
> >>>> + tail relocate
> >>>> +#endif
> >>>> .align 2
> >>>> 1:
> >>>> /* Set trap vector to spin forever to help debug */
> >>>> @@ -213,6 +232,8 @@ relocate:
> >>>> #ifdef CONFIG_MMU
> >>>> /* Enable virtual memory and relocate to virtual address */
> >>>> la a0, swapper_pg_dir
> >>>> + la a1, satp_mode
> >>>> + REG_L a1, (a1)
> >>>> call relocate
> >>>> #endif
> >>>>
> >>>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> >>>> index 613ec81a8979..152b423c02ea 100644
> >>>> --- a/arch/riscv/mm/context.c
> >>>> +++ b/arch/riscv/mm/context.c
> >>>> @@ -9,6 +9,8 @@
> >>>> #include <asm/cacheflush.h>
> >>>> #include <asm/mmu_context.h>
> >>>>
> >>>> +extern uint64_t satp_mode;
> >>>> +
> >>>> /*
> >>>> * When necessary, performs a deferred icache flush for the given MM context,
> >>>> * on the local CPU. RISC-V has no direct mechanism for instruction cache
> >>>> @@ -59,7 +61,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>>> cpumask_set_cpu(cpu, mm_cpumask(next));
> >>>>
> >>>> #ifdef CONFIG_MMU
> >>>> - csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
> >>>> + csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
> >>>> local_flush_tlb_all();
> >>>> #endif
> >>>>
> >>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >>>> index 18bbb426848e..ad96667d2ab6 100644
> >>>> --- a/arch/riscv/mm/init.c
> >>>> +++ b/arch/riscv/mm/init.c
> >>>> @@ -24,6 +24,17 @@
> >>>>
> >>>> #include "../kernel/head.h"
> >>>>
> >>>> +#ifdef CONFIG_64BIT
> >>>> +uint64_t satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
> >>>> + SATP_MODE_39 : SATP_MODE_48;
> >>>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
> >>>> +#else
> >>>> +uint64_t satp_mode = SATP_MODE_32;
> >>>> +bool pgtable_l4_enabled = false;
> >>>> +#endif
> >>>> +EXPORT_SYMBOL(pgtable_l4_enabled);
> >>>> +EXPORT_SYMBOL(satp_mode);
> >>>> +
> >>>> unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >>>> __page_aligned_bss;
> >>>> EXPORT_SYMBOL(empty_zero_page);
> >>>> @@ -245,9 +256,12 @@ static void __init create_pte_mapping(pte_t *ptep,
> >>>>
> >>>> #ifndef __PAGETABLE_PMD_FOLDED
> >>>>
> >>>> +pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>>> pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >>>> +pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> >>>> pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> >>>> pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> >>>> +pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> >>>>
> >>>> static pmd_t *__init get_pmd_virt(phys_addr_t pa)
> >>>> {
> >>>> @@ -264,7 +278,8 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
> >>>> if (mmu_enabled)
> >>>> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>>>
> >>>> - BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >>>> + /* Only one PMD is available for early mapping */
> >>>> + BUG_ON((va - PAGE_OFFSET) >> PUD_SHIFT);
> >>>>
> >>>> return (uintptr_t)early_pmd;
> >>>> }
> >>>> @@ -296,19 +311,70 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> >>>> create_pte_mapping(ptep, va, pa, sz, prot);
> >>>> }
> >>>>
> >>>> -#define pgd_next_t pmd_t
> >>>> -#define alloc_pgd_next(__va) alloc_pmd(__va)
> >>>> -#define get_pgd_next_virt(__pa) get_pmd_virt(__pa)
> >>>> +static pud_t *__init get_pud_virt(phys_addr_t pa)
> >>>> +{
> >>>> + if (mmu_enabled) {
> >>>> + clear_fixmap(FIX_PUD);
> >>>> + return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> >>>> + } else {
> >>>> + return (pud_t *)((uintptr_t)pa);
> >>>> + }
> >>>> +}
> >>>> +
> >>>> +static phys_addr_t __init alloc_pud(uintptr_t va)
> >>>> +{
> >>>> + if (mmu_enabled)
> >>>> + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> >>>> +
> >>>> + /* Only one PUD is available for early mapping */
> >>>> + BUG_ON((va - PAGE_OFFSET) >> PGDIR_SHIFT);
> >>>> +
> >>>> + return (uintptr_t)early_pud;
> >>>> +}
> >>>> +
> >>>> +static void __init create_pud_mapping(pud_t *pudp,
> >>>> + uintptr_t va, phys_addr_t pa,
> >>>> + phys_addr_t sz, pgprot_t prot)
> >>>> +{
> >>>> + pmd_t *nextp;
> >>>> + phys_addr_t next_phys;
> >>>> + uintptr_t pud_index = pud_index(va);
> >>>> +
> >>>> + if (sz == PUD_SIZE) {
> >>>> + if (pud_val(pudp[pud_index]) == 0)
> >>>> + pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> >>>> + return;
> >>>> + }
> >>>> +
> >>>> + if (pud_val(pudp[pud_index]) == 0) {
> >>>> + next_phys = alloc_pmd(va);
> >>>> + pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> >>>> + nextp = get_pmd_virt(next_phys);
> >>>> + memset(nextp, 0, PAGE_SIZE);
> >>>> + } else {
> >>>> + next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> >>>> + nextp = get_pmd_virt(next_phys);
> >>>> + }
> >>>> +
> >>>> + create_pmd_mapping(nextp, va, pa, sz, prot);
> >>>> +}
> >>>> +
> >>>> +#define pgd_next_t pud_t
> >>>> +#define alloc_pgd_next(__va) alloc_pud(__va)
> >>>> +#define get_pgd_next_virt(__pa) get_pud_virt(__pa)
> >>>> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> >>>> - create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> -#define fixmap_pgd_next fixmap_pmd
> >>>> + create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> +#define fixmap_pgd_next (pgtable_l4_enabled ? \
> >>>> + (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> >>>> +#define trampoline_pgd_next (pgtable_l4_enabled ? \
> >>>> + (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> >>>> #else
> >>>> #define pgd_next_t pte_t
> >>>> #define alloc_pgd_next(__va) alloc_pte(__va)
> >>>> #define get_pgd_next_virt(__pa) get_pte_virt(__pa)
> >>>> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> >>>> create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> >>>> -#define fixmap_pgd_next fixmap_pte
> >>>> +#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
> >>>> #endif
> >>>>
> >>>> static void __init create_pgd_mapping(pgd_t *pgdp,
> >>>> @@ -319,6 +385,13 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
> >>>> phys_addr_t next_phys;
> >>>> uintptr_t pgd_index = pgd_index(va);
> >>>>
> >>>> +#ifndef __PAGETABLE_PMD_FOLDED
> >>>> + if (!pgtable_l4_enabled) {
> >>>> + create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
> >>>> + return;
> >>>> + }
> >>>> +#endif
> >>>> +
> >>>> if (sz == PGDIR_SIZE) {
> >>>> if (pgd_val(pgdp[pgd_index]) == 0)
> >>>> pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> >>>> @@ -449,15 +522,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>>
> >>>> /* Setup early PGD for fixmap */
> >>>> create_pgd_mapping(early_pg_dir, FIXADDR_START,
> >>>> - (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>> + fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>>
> >>>> #ifndef __PAGETABLE_PMD_FOLDED
> >>>> - /* Setup fixmap PMD */
> >>>> + /* Setup fixmap PUD and PMD */
> >>>> + if (pgtable_l4_enabled)
> >>>> + create_pud_mapping(fixmap_pud, FIXADDR_START,
> >>>> + (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> >>>> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >>>> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >>>> +
> >>>> /* Setup trampoline PGD and PMD */
> >>>> create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >>>> - (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >>>> + trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >>>> + if (pgtable_l4_enabled)
> >>>> + create_pud_mapping(trampoline_pud, PAGE_OFFSET,
> >>>> + (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> >>>> create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >>>> load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >>>> #else
> >>>> @@ -490,6 +570,29 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>> dtb_early_pa = dtb_pa;
> >>>> }
> >>>>
> >>>> +#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_64BIT)
> >>>> +/*
> >>>> + * This function is called only if the current kernel is 64bit and the HW
> >>>> + * does not support sv48.
> >>>> + */
> >>>> +asmlinkage __init void setup_vm_fold_pud(void)
> >>>> +{
> >>>> + pgtable_l4_enabled = false;
> >>>> + kernel_virt_addr = PAGE_OFFSET_L3;
> >>>> + satp_mode = SATP_MODE_39;
> >>>> +
> >>>> + /*
> >>>> + * PTE/PMD levels do not need to be cleared as they are common between
> >>>> + * 3- and 4-level page tables: the 30 least significant bits
> >>>> + * (2 * 9 + 12) are common.
> >>>> + */
> >>>> + memset(trampoline_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >>>> + memset(early_pg_dir, 0, sizeof(pgd_t) * PTRS_PER_PGD);
> >>>> +
> >>>> + setup_vm(dtb_early_pa);
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>> static void __init setup_vm_final(void)
> >>>> {
> >>>> uintptr_t va, map_size;
> >>>> @@ -525,12 +628,13 @@ static void __init setup_vm_final(void)
> >>>> }
> >>>> }
> >>>>
> >>>> - /* Clear fixmap PTE and PMD mappings */
> >>>> + /* Clear fixmap page table mappings */
> >>>> clear_fixmap(FIX_PTE);
> >>>> clear_fixmap(FIX_PMD);
> >>>> + clear_fixmap(FIX_PUD);
> >>>>
> >>>> /* Move to swapper page table */
> >>>> - csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> >>>> + csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> >>>> local_flush_tlb_all();
> >>>> }
> >>>> #else
> >>
> >> Alex