Re: [PATCH] riscv: Define TASK_SIZE_MAX for __access_ok()

From: Arnd Bergmann
Date: Mon Mar 25 2024 - 10:49:41 EST


On Mon, Mar 25, 2024, at 08:25, Alexandre Ghiti wrote:
> On 24/03/2024 23:05, Arnd Bergmann wrote:
>> On Tue, Mar 19, 2024, at 17:51, Alexandre Ghiti wrote:
>>>
>>> The use of alternatives allows to return right away if the buffer is
>>> beyond the usable user address space, and it's not just "slightly
>>> faster" for some cases (a very large buffer with only a few bytes being
>>> beyond the limit or someone could fault-in all the user pages and fail
>>> very late...etc). access_ok() is here to guarantee that such situations
>>> don't happen, so actually it makes more sense to use an alternative to
>>> avoid that.
>> The access_ok() function really wants a compile-time constant
>> value for TASK_SIZE_MAX so it can do constant folding for
>> repeated calls inside of one function, so for configurations
>> with a boot-time selected TASK_SIZE_64 it's already not ideal,
>> with or without alternatives.
>>
>> If I read the current code correctly, riscv doesn't even
>> have a way to build with a compile-time selected
>> VA_BITS/PGDIR_SIZE, which is probably a better place to
>> start optimizing, since this rarely needs to be selected
>> dynamically.
>
>
> Indeed, we do not support compile-time fixed VA_BITS! We could, but that
> would only be used for custom kernels. I don't think distro kernels will
> ever (?) propose 3 different kernels for sv39, sv48 and sv57 because the
> cost of dynamically choosing the address space width is not big enough
> to me (and the burden of maintaining 3 different kernels is).
>
> Let me know if I'm wrong, I'd be happy to work on that.

My feeling is that in most cases, users are better off with a
compile-time default, given that the addressable memory has
a factor of 512 between each step. With sv39, I think you are
limited to having all RAM in the first 128GB of physical
address space, and each process is limited to 256GB of virtual
address space, but this already covers pretty much anything
you want to do on small systems that run a custom kernel.

On most desktop/server/cloud distros, hardwiring sv48 is
probably sufficient if all general purpose machines support
this, and it should be enough even for commercial databases
that micro-optimize 100TB datasets through a permanent mmap(),
as well as most NUMA systems with discontiguous memory.
This adds a little cost over hardcoded sv39, but is still
faster than a boot-time sv39/sv48 config that most users
will not be aware of.

Once enterprise distros certify systems beyond a few dozen
TB of RAM, they probably need to enable boot-time detection
and can start shipping a kernel with boot-time detected page
table sizes. Until then, I think the few users with gigantic
systems will probably be fine running a custom sv57 kernel.

Arnd