Re: [PATCH v2] io_uring/rsrc: fix RLIMIT_MEMLOCK bypass by removing cross-buffer accounting

From: Pavel Begunkov

Date: Sat Jan 24 2026 - 06:04:50 EST


On 1/23/26 16:52, Jens Axboe wrote:
On 1/23/26 8:04 AM, Jens Axboe wrote:
On 1/23/26 7:50 AM, Jens Axboe wrote:
On 1/23/26 7:26 AM, Pavel Begunkov wrote:
On 1/22/26 21:51, Pavel Begunkov wrote:
...
I already briefly touched on that earlier; it's for sure not going to be
of any practical concern.

Modest 16 GB can give 1M entries. Assuming 50-100ns per entry for the
xarray business, that's 50-100ms. It's all serialised, so multiply by
the number of CPUs/threads, e.g. 10-100, and that's 0.5-10s. Account
for sky-high spinlock contention and it jumps again, and there can be
more memory / CPUs / NUMA nodes. Not saying that it's worse than the
current O(n^2); I have a test program that borderline hangs the
system.
...
Should've tried 32x32 as well; that ends up going deep into "this sucks"
territory:

git

good luck

FWIW, the current code scales perfectly with CPUs, so just 1 thread
should be enough for testing.

git + user_struct

axboe@r7625 ~> time ./ppage 32 32
register 32 GB, num threads 32

________________________________________________________
Executed in 16.34 secs fish external

That's about as close to the calculations above as it could be; the
estimate was for 100 threads x 16GB rather than 32x32, but that should
only differ by a factor of ~1.5. Without anchoring to this particular
number, the problem is that the wall-clock runtime for the accounting
depends linearly on the number of threads, so this 16 sec is what
seemed concerning.

   usr time    0.54 secs  497.00 micros    0.54 secs
   sys time  451.94 secs   55.00 micros  451.94 secs

...
and the crazier cases:

I don't think it's even crazy, thinking of databases with lots
of caches they want to read into / write from. 100GB+
shouldn't be surprising.

axboe@r7625 ~> time ./ppage 32 32
register 32 GB, num threads 32

________________________________________________________
Executed in 2.81 secs fish external
   usr time   0.71 secs  497.00 micros   0.71 secs
   sys time  19.57 secs  183.00 micros  19.57 secs

which isn't insane. Obviously also needs conditional rescheduling in the
page loops, as those can take a loooong time for large amounts of
memory.
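The conditional rescheduling mentioned here is the usual cond_resched()
pattern in long page-walking loops. A hedged, kernel-side sketch; the
function name, arguments, and batching interval are illustrative, not
the actual io_uring/rsrc code:

```c
/* Sketch only: the loop shape is illustrative, not the real code. */
static void io_buffer_unpin_pages(struct page **pages,
				  unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		unpin_user_page(pages[i]);
		/* Large registrations can pin millions of pages; drop in a
		 * voluntary preemption point every so often so the loop
		 * doesn't hog the CPU for seconds. */
		if (!(i % 1024))
			cond_resched();
	}
}
```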

2.8 sec sounds like a lot as well; it makes me wonder which part of
that is mm, but mm should scale fine-ish. Surely there will be
contention on page refcounts, but at least the page table walk is
lockless in the best-case scenario and otherwise seems to be
read-protected by an rw lock.

--
Pavel Begunkov