Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

From: Palmer Dabbelt
Date: Wed Jul 22 2020 - 15:52:44 EST


On Wed, 22 Jul 2020 02:43:50 PDT (-0700), Arnd Bergmann wrote:
On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:

On Tue, 21 Jul 2020 11:36:10 PDT (-0700), alex@xxxxxxxx wrote:
> Let's try to make progress here: I add linux-mm in CC to get feedback on
> this patch as it blocks sv48 support too.

Sorry for being slow here. I haven't replied because I hadn't really fleshed
out the design yet, but just so everyone's on the same page my problems with
this are:

* We waste vmalloc space on 32-bit systems, where there isn't a lot of it.

There is actually ongoing work to make 32-bit Arm kernels move
vmlinux into the vmalloc space, as part of the move to avoid highmem.

Overall, a 32-bit system would waste about 0.1% of its virtual address space
by having the kernel be located in both the linear map and the vmalloc area.
It's not zero, but not that bad either. With the typical split of 3072 MB user,
768MB linear and 256MB vmalloc, it's also around 1.5% of the available
vmalloc area (assuming a 4MB vmlinux in a typical 32-bit kernel), but the
boundaries can be changed arbitrarily if needed.
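
To make the arithmetic above easy to re-check, here is a minimal sketch in C. It only restates the figures quoted in this thread (3072MB user, 768MB linear, 256MB vmalloc, a 4MB or 10MiB vmlinux); the struct and helper names are hypothetical and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Illustrative layout taken from the discussion, not from kernel headers. */
struct va_split {
	uint64_t user;    /* user address space */
	uint64_t linear;  /* lowmem linear map */
	uint64_t vmalloc; /* vmalloc area, which would also hold the kernel */
};

/* Total virtual address space covered by the three regions. */
static uint64_t split_total(const struct va_split *s)
{
	return s->user + s->linear + s->vmalloc;
}

/*
 * Share of `whole` taken by `part`, in hundredths of a percent,
 * rounded down (so 156 means roughly 1.56%).
 */
static uint64_t share_bp(uint64_t part, uint64_t whole)
{
	assert(whole != 0);
	return part * 10000ULL / whole;
}
```

With the 3072/768/256 split, a 4MB image comes out at about 0.1% of the 4GB address space and about 1.5% of vmalloc; a 10MiB image would be closer to 3.9% of vmalloc.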

OK, I guess maybe it's not so bad. Our 32-bit defconfig is 10MiB, but I
wouldn't really put much weight behind that number as it's just a 64-bit
defconfig built for 32-bit. We don't have any 32-bit hardware anyway, so if
this becomes an issue later I guess we can just deal with it then.

The eventual goal is to have a split of 3840MB for either user or linear map
plus 256MB for vmalloc, including the kernel. Switching between linear
and user has a noticeable runtime overhead, but it relaxes both the limits
for user memory and lowmem, and it provides a somewhat stronger
address space isolation.

Ya, I think we decided not to do that, at least for now. I guess the right
answer there will depend on what 32-bit systems look like, and since we don't
have any I'm inclined to just stick to the fast option.

Another potential idea would be to completely randomize the physical
addresses underneath the kernel by using a random permutation of the
pages in the kernel image. This adds even more overhead (virt_to_phys
may need to call vmalloc_to_page or similar) and may cause problems
with DMA into kernel .data across page boundaries,

* Sort out how to maintain a linear map as the canonical hole moves around
between the VA widths without adding a bunch of overhead to virt2phys and
friends. This is probably going to be the trickiest part, but I think if we
just change the page table code to essentially lie about VAs when an sv39
system runs an sv48+sv39 kernel we could make it work -- there'd be some
logical complexity involved, but it would remain fast.

I assume you can't use the trick that x86 has where all kernel addresses
are at the top of the 64-bit address space and user addresses are at the
bottom, regardless of the size of the page tables?

They have the load in their mapping functions; as far as I can tell that's
required to do this sort of thing. We do as well to handle some of the
implicit boot stuff for now, but I was assuming that we'd want to get rid of
that for performance reasons. That said, maybe it just doesn't matter?
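
For anyone following along, here is a rough sketch in C of what "the load in the mapping functions" means. It assumes, purely for illustration, that the linear map starts at the bottom of the upper canonical half and that the offset is picked once at boot; all names are hypothetical and none of this is the actual riscv or x86 code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical runtime offset between linear-map VAs and PAs. Because it
 * is a variable rather than a compile-time constant, every translation
 * below has to load it from memory first.
 */
static uint64_t sketch_va_pa_offset;

/*
 * First address above the canonical hole for a given VA width:
 * sv39 -> 0xffffffc000000000, sv48 -> 0xffff800000000000.
 */
static uint64_t upper_half_start(unsigned int va_bits)
{
	assert(va_bits >= 2 && va_bits <= 64);
	return ~0ULL << (va_bits - 1);
}

/* Assumption: the linear map begins exactly at upper_half_start(). */
static void sketch_linear_map_init(unsigned int va_bits, uint64_t ram_base)
{
	sketch_va_pa_offset = upper_half_start(va_bits) - ram_base;
}

/* One load plus one subtract. */
static uint64_t sketch_virt_to_phys(uint64_t va)
{
	return va - sketch_va_pa_offset;
}

/* One load plus one add. */
static uint64_t sketch_phys_to_virt(uint64_t pa)
{
	return pa + sketch_va_pa_offset;
}
```

If the offset were a constant instead, the compiler could fold it into the instruction stream and the load would go away, which is the performance question here.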