Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

From: David Hildenbrand
Date: Mon Nov 22 2021 - 16:56:34 EST


On 22.11.21 21:44, Jens Axboe wrote:
> On 11/22/21 1:08 PM, David Hildenbrand wrote:
>> On 22.11.21 20:53, Jens Axboe wrote:
>>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>>> Forgive me for jumping in to an already overburdened thread. But can
>>>>> someone pushing back on this clearly explain the issue with applying
>>>>> this patch?
>>>>
>>>> It will allow unprivileged users to easily and even "accidentally"
>>>> allocate more unmovable memory than they should in some environments. Such
>>>> limits exist for a reason. And there are ways for admins/distros to
>>>> tweak these limits if they know what they are doing.
>>>
>>> But that's entirely the point, the cases where this change is needed are
>>> already screwed by a distro and the user is the administrator. This is
>>> _exactly_ the case where things should just work out of the box. If
>>> you're managing farms of servers, yeah you have competent administration
>>> and you can be expected to tweak settings to get the best experience and
>>> performance, but the kernel should provide a sane default. 64K isn't a
>>> sane default.
>>
>> 0.1% of RAM isn't either.
>
> No default is perfect, but 0.1% will solve 99% of the problem. And most
> likely solve 100% of the problems for the important case, which is where
> you want things to Just Work on your distro without doing any
> administration. If you're aiming for perfection, it doesn't exist.

... and my Fedora is already at 16 MiB *sigh*.
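
For anyone who wants to see what their own distro ships, here is a minimal
sketch that reads the current memlock limit. It assumes a Linux system with
Python's standard `resource` module; the helper names `memlock_limits` and
`format_limit` are made up for illustration.

```python
import resource


def format_limit(value):
    """Render an rlimit value in KiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


soft, hard = memlock_limits()
print(f"RLIMIT_MEMLOCK soft={format_limit(soft)} hard={format_limit(hard)}")
```

The same values are what `ulimit -l` reports in a shell (in KiB).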

And I'm not aiming for perfection, I'm aiming for as few
FOLL_LONGTERM users as possible ;)

>
>>>> This is not a step in the right direction. This is all just trying to
>>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>>> unprivileged users.
>>>>
>>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>>> memlock limit in io_uring altogether, for example, by using mmu
>>>> notifiers. But I'm no expert on the io_uring code.
>>>
>>> You can't use mmu notifiers without impacting the fast path. This isn't
>>> just about io_uring, there are other users of memlock right now (like
>>> bpf) which just makes it even worse.
>>
>> 1) Do we have a performance evaluation? Did someone try and come up with
>> a conclusion how bad it would be?
>
> I honestly don't remember the details, I took a look at it about a year
> ago due to some unrelated reasons. These days it just pertains to
> registered buffers, so it's less of an issue than back then when it
> dealt with the rings as well. Hence might be feasible, I'm certainly not
> against anyone looking into it. Easy enough to review and test for
> performance concerns.

That at least sounds promising.

>
>> 2) Could we provide an mmu variant to ordinary users that's just good
>> enough but maybe not as fast as what we have today? And limit
>> FOLL_LONGTERM to special, privileged users?
>
> If it's not as fast, then it's most likely not good enough though...

There is always a compromise of course.

See, FOLL_LONGTERM is *the worst* kind of memory allocation thingy you
could possibly do to your MM subsystem. It's absolutely the worst thing
you can do to swap and compaction.

I really don't want random feature X to be next and say "well, io_uring
uses it, so I can just use it for max performance and we'll adjust the
memlock limit, who cares!".

>
>> 3) Just because there are other memlock users is not an excuse. For
>> example, VFIO/VDPA have to use it for a reason, because there is no way
>> not to use FOLL_LONGTERM.
>
> It's not an excuse, the statement merely means that the problem is
> _worse_ as there are other memlock users.

Yes, and it will keep getting worse every time we introduce more
FOLL_LONGTERM users that really shouldn't be FOLL_LONGTERM users unless
really required. Again, VFIO/VDPA/RDMA are prime examples, because the
HW forces us to do it. And these are privileged features either way.

>
>>>
>>> We should just make this 0.1% of RAM (max(0.1% ram, 64KB)) or something
>>> like what was suggested, if that will help move things forward. IMHO the
>>> 32MB machine is mostly a theoretical case, but whatever.
>>
>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>
>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>> system completely and deeply mess up the MM. Oh my.
>
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.
>

Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
the worst case.
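
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes the proposal means "0.1% of RAM, with 64 KiB as a floor" and that
every user pins right up to the limit; the function names are hypothetical
and nothing here reflects actual kernel code.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def proposed_memlock_limit(ram_bytes, floor_bytes=64 * KIB):
    """Per-user limit under the assumed rule: 0.1% of RAM, never below the floor."""
    return max(ram_bytes // 1000, floor_bytes)


def worst_case_pinned_fraction(ram_bytes, users):
    """Fraction of RAM pinned if `users` distinct users each max out their limit."""
    pinned = users * proposed_memlock_limit(ram_bytes)
    return min(1.0, pinned / ram_bytes)


for ram in (32 * MIB, 8 * GIB, 64 * GIB):
    limit = proposed_memlock_limit(ram)
    print(f"RAM {ram // MIB:>6} MiB -> per-user limit {limit / MIB:.2f} MiB, "
          f"100 users pin {worst_case_pinned_fraction(ram, 100):.1%}")
```

On the 32 MiB machine the 64 KiB floor wins, so the per-user share there is
larger than 0.1%; everywhere else the worst case scales linearly with the
number of users.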

--
Thanks,

David / dhildenb