Jens Axboe <axboe@xxxxxxxxx> writes:
On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
much, definitely doesn't spare us from locking, and the overhead
definitely wasn't satisfactory for requests before.

There are no locks in the fast path of slub, as far as I know. It has a
per-cpu cache that is refilled once empty, quite similar to the fastpath
of this cache. I imagine the performance hit in slub comes from the
barriers and atomic operations?
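For context, the fastpath of this io_uring-side cache is essentially a
singly linked freelist manipulated only under ->uring_lock, so alloc and
free are a pointer pop/push with no atomics and no IRQ masking. A minimal
sketch of the pattern (the names below are illustrative, not the exact
io_uring API):

struct cache_entry {
	struct cache_entry *next;
};

struct alloc_cache {
	struct cache_entry *head;	/* protected by ->uring_lock */
	unsigned int nr_cached;
};

/* Fast-path alloc: caller must hold ->uring_lock. */
static inline void *cache_get(struct alloc_cache *cache)
{
	struct cache_entry *entry = cache->head;

	if (!entry)
		return NULL;	/* fall back to kmem_cache_alloc() */
	cache->head = entry->next;
	cache->nr_cached--;
	return entry;
}

/* Fast-path free: caller must hold ->uring_lock. */
static inline void cache_put(struct alloc_cache *cache, void *obj)
{
	struct cache_entry *entry = obj;

	entry->next = cache->head;
	cache->head = entry;
	cache->nr_cached++;
}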
Yeah, I mean all kinds of synchronisation. And I don't think
that's the main offender here, the test is single threaded without
contention and the system was mostly idle.
kmem_cache works fine for most hot paths of the kernel.
It doesn't for io_uring. There are caches for the net side and now
in the block layer as well. I wouldn't say it necessarily halves
performance but definitely takes a share of CPU.
Right. My point is that all these caches (block, io_uring) duplicate
what the slab cache is meant to do. Since slab became a bottleneck, I'm
looking at how to improve the situation on their side, to see if we can
drop the caching here and in block/.
That would certainly be a worthy goal, and I do agree that these caches
are (largely) working around deficiencies. One important point that you
may miss is that most of this caching gets its performance both from
avoiding atomics in slub and from being able to guarantee that both
alloc and free happen from process context. The block IRQ bits are a bit
different, but apart from that it holds elsewhere. Caching that even has
to disable IRQs locally generally doesn't beat slub by much; the big
wins are the cases where we know free+alloc is done in process context.
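To make the distinction concrete: a cache whose free side can run in IRQ
context has to mask interrupts around every list operation, which is
exactly the cost that erodes the win over slub. A rough sketch of such an
IRQ-safe put, building on the freelist sketch above (illustrative only):

/* IRQ-safe put: required if objects can be freed from IRQ context. */
static void cache_put_irqsafe(struct alloc_cache *cache, void *obj)
{
	struct cache_entry *entry = obj;
	unsigned long flags;

	/* The extra cost that process-context-only caches avoid. */
	local_irq_save(flags);
	entry->next = cache->head;
	cache->head = entry;
	cache->nr_cached++;
	local_irq_restore(flags);
}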
Yes, I noticed that. I was thinking of exposing a flag at kmem_cache
creation time to tell slab the user promises not to use it in IRQ
context, so it doesn't need to worry about nested invocation in the
allocation/free path. Then, for those caches, have a
kmem_cache_alloc_locked variant, where the synchronization is maintained
by the caller (i.e. by ->uring_lock here), so it can manipulate the
cache without atomics.
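A rough sketch of what the proposed interface could look like from the
caller's side; SLAB_NO_IRQ is a hypothetical name for the creation-time
flag, and kmem_cache_alloc_locked() is the proposed variant described
above, not an existing slab API:

static struct kmem_cache *req_cache;

static int req_cache_init(void)
{
	/* Hypothetical flag: promise this cache is never used from IRQ context. */
	req_cache = kmem_cache_create("io_kiocb", sizeof(struct io_kiocb),
				      0, SLAB_NO_IRQ, NULL);
	return req_cache ? 0 : -ENOMEM;
}

static struct io_kiocb *req_alloc(struct io_ring_ctx *ctx)
{
	lockdep_assert_held(&ctx->uring_lock);
	/*
	 * Proposed variant: no atomics or IRQ masking; the caller's lock
	 * (->uring_lock here) provides all the synchronization.
	 */
	return kmem_cache_alloc_locked(req_cache, GFP_KERNEL);
}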
I was looking at your implementation of the block cache for inspiration
and saw how you kept a second list for IRQ. I'm thinking about how to
fit a similar change inside slub. But for now, I want to get the simpler
case, which is all io_uring needs.
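The two-list idea, as in the block layer's bio allocation cache, keeps
the common process-context path cheap: frees coming from hard-IRQ
context go onto a separate list, which is only occasionally spliced back
into the main list with interrupts disabled. A simplified sketch of the
scheme (names are illustrative, not the exact block-layer code):

struct two_list_cache {
	struct cache_entry *free_list;		/* process context only */
	struct cache_entry *free_list_irq;	/* filled by frees from IRQ context */
	unsigned int nr;
	unsigned int nr_irq;
};

/* Occasionally drain IRQ-freed objects back into the cheap list. */
static void cache_splice_irq_list(struct two_list_cache *cache)
{
	unsigned long flags;

	local_irq_save(flags);
	while (cache->free_list_irq) {
		struct cache_entry *entry = cache->free_list_irq;

		cache->free_list_irq = entry->next;
		entry->next = cache->free_list;
		cache->free_list = entry;
		cache->nr++;
		cache->nr_irq--;
	}
	local_irq_restore(flags);
}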
I'll try to get a prototype together before LSF/MM and see if I can get
the MM folks' input there.