Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting

From: Jens Axboe

Date: Sat Jan 24 2026 - 13:45:01 EST


On 1/24/26 8:55 AM, Jens Axboe wrote:
> On 1/24/26 8:14 AM, Jens Axboe wrote:
>>>> ________________________________________________________
>>>> Executed in 2.81 secs fish external
>>>> usr time 0.71 secs 497.00 micros 0.71 secs
>>>> sys time 19.57 secs 183.00 micros 19.57 secs
>>>>
>>>> which isn't insane. Obviously also needs conditional rescheduling in the
>>>> page loops, as those can take a loooong time for large amounts of
>>>> memory.
>>>
>>> 2.8 sec sounds like a lot as well, makes me wonder which part of
>>> that is mm, but mm should scale fine-ish. Surely there will be
>>> contention on page refcounts but at least the table walk is
>>> lockless in the best case scenario and otherwise seems to be read
>>> protected by an rw lock.
>>
>> Well a lot of that is also just faulting in the memory on clear, test
>> case should probably be modified to do its own timing. And iterating
>> page arrays is a huge part of it too. There's no real contention in that
>> 2.8 seconds.
>
> I checked and the faulting part is 2.0s of that runtime. On a re-run:
>
> axboe@r7625 ~> time ./ppage 32 32
> register 32 GB, num threads 32
> clear msec 2011
>
> ________________________________________________________
> Executed in 3.13 secs fish external
> usr time 0.78 secs 193.00 micros 0.78 secs
> sys time 27.46 secs 271.00 micros 27.46 secs
>
> Or just a single thread:
>
> axboe@r7625 ~> time ./ppage 32 1
> register 32 GB, num threads 1
> clear msec 2081
>
> ________________________________________________________
> Executed in 2.29 secs fish external
> usr time 0.58 secs 750.00 micros 0.58 secs
> sys time 1.71 secs 0.00 micros 1.71 secs
>
> axboe@r7625 ~ [1]> time ./ppage 64 1
> register 64 GB, num threads 1
> clear msec 5380
>
> ________________________________________________________
> Executed in 6.24 secs fish external
> usr time 1.42 secs 328.00 micros 1.42 secs
> sys time 4.82 secs 375.00 micros 4.82 secs

Pondering this some more... We only need the page as the key, as far as
I can tell. The memory is always accounted to ctx->user anyway, and each
struct page address is the same across mm's. So unless I'm missing
something, which is of course quite possible, per-ctx accounting should
be just fine. This will obviously account each ring registration
separately, but that's what we're doing now anyway. If we want
per-user_struct accounting that only accounts each unique page once,
then we'd simply need to move the xarray to struct user_struct. At
least to me, the important part here is that we need to keep the page
pinned until all refs to it have dropped.

Running with multiple threads in this test case is also pretty futile,
as most of them will just run into contention in:

io_register_rsrc_update
__io_register_rsrc_update
io_sqe_buffer_register
io_pin_pages
gup_fast_fallback
__gup_longterm_locked
__get_user_pages
handle_mm_fault
follow_page_pte

which is where basically all of the time is spent on the thread side
when multiple threads are doing this at the same time. This is really
why cloning exists: register the buffers once in the parent and clone
the registration between threads.

With all that said, here's the test patch I've run just now: