Re: [PATCH] net: add per device sg_max_frags for skb

From: Hannes Frederic Sowa
Date: Wed Jan 13 2016 - 10:07:22 EST


On 13.01.2016 15:19, Eric Dumazet wrote:
1) There is no arch with 1K page sizes. Most certainly, if we had
MAX_SKB_FRAGS=65, some assumptions in the stack would fail.

2) The TCP stack has coalescing support. write(2) or sendmsg(2) should
append data into the last skb in the write queue, and still use 32 KB
frags.
You get pathological skbs when using sendpage(), or when one thread
writes data into _multiple_ TCP sockets, since the TCP stack uses
a per-thread 32 KB reserve, see:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5640f7685831e088fe6c2e1f863a6805962f8e81

3) As I said, implementing a limit in the TCP stack is not enough. Your
patch therefore adds complexity for all users, but is not a general
solution.

GRO, tun devices and many other paths can still cook 'big skbs'.

You need to implement a proper fallback, possibly using
ndo_features_check(), or directly from your ndo_start_xmit().

4) We currently have a very dumb way to fall back: forcing a linearize
call, which is likely to fail if memory is fragmented and the skb is big.

You could instead provide a smart helper that tries to reduce the
number of frags in an skb by choosing adjacent frags and
re-allocating/merging them.

By choosing, I mean trying to pick the smallest ones to minimize the
copy cost, to get an skb with X fewer frags (X=1 in your case?).

I know, for example, that bnx2x could benefit from such a helper, as
it has a 13-frag limit (see bnx2x_pkt_req_lin(), called from the bnx2x
ndo_start_xmit()).

As I proposed, we could limit the maximum number of frags globally (or
per netns). I think this would be okay, and it could be the best
alternative to installing slow paths that could be hit quite
constantly.

Otherwise, fallbacks like the ones Eric proposed are needed. I do not
see any other choice.

Thanks,
Hannes