Re: [PATCH net-next v3 01/18] net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)

From: Jakub Kicinski
Date: Thu Jun 22 2023 - 16:28:42 EST


On Thu, 22 Jun 2023 20:40:43 +0100 David Howells wrote:
> > How did that happen? I thought MSG_SPLICE_PAGES comes from former
> > sendpage users and sendpage can't operate on slab pages.
>
> Some of my patches, take the siw one for example, now aggregate all the bits
> that make up a message into a single sendmsg() call, including any protocol
> header and trailer in the same bio_vec[] as the payload where before it would
> have to do, say, sendmsg+sendpage+sendpage+...+sendpage+sendmsg.

Maybe it's just me but I'd prefer to keep the clear rule that splice
operates on pages not slab objects. SIW is the software / fake
implementation of RDMA, right? You couldn't have picked a less
important user :(

Paolo indicated that he'll take a look tomorrow, we'll see what he
thinks.

> I'm trying to make it so that I make the minimum number of sendmsg calls
> (ie. 1 where possible) and the loop that processes the data is inside of that.

The in-kernel users can be fixed to not use slab, and user space can't
feed us slab objects.

> This offers the opportunity, at least in the future, to append slab data to an
> already-existing private fragment in the skbuff.

Maybe we can get Eric to comment. The ability to identify "frag type"
seems cool indeed, but I haven't thought about using it to attach
slab objects.

> > The locking is to local_bh_disable(). Does the milliont^w new frag
> > allocator have any additional benefits?
>
> It is shareable because it does locking. Multiple sockets of multiple
> protocols can share the pages it has reserved. It drops the lock around calls
> to the page allocator so that GFP_KERNEL/GFP_NOFS can be used with it.
>
> Without this, the page fragment allocator would need to be per-socket, I
> think, or be done further up the stack where the higher level drivers would
> have to have a fragment bucket per whatever unit they use to deal with the
> lack of locking.

There's also the per task frag which can be used under normal conditions
(sk_use_task_frag).

> Doing it here makes cleanup simpler since I just transfer my ref on the
> fragment to the skbuff frag list and it will automatically be cleaned up with
> the skbuff.
>
> Willy suggested that I just allocate a page for each thing I want to copy, but
> I would rather not do that for, say, an 8-byte bit of protocol data.

TBH my intuition would also be get a full page and let the callers who
care about performance fix themselves. Assuming we want to let slab
objects in in the first place.