Re: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory
From: Abel Wu
Date: Tue Oct 03 2023 - 08:49:21 EST
On 9/24/23 3:28 PM, Shakeel Butt wrote:
On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
[...]
On second thought, it is still unclear to me where memcg pressure
should fit into socket memory allocation; the current design lacks a
convincing rationale. I think the above hunk helps, but not much.
I wonder if we should take option (3) first. Thoughts?
Let's take a step further. Let's decouple the memcg accounting and the
global skmem accounting. __sk_mem_raise_allocated is already very hard
to reason about. There are a couple of heuristics in it which may or
may not apply to both accounting infrastructures.
Let's explicitly document which heuristics allow an allocation to
succeed forcefully, i.e. irrespective of pressure or being over the
limit, for both accounting infrastructures. I think decoupling them
would make the flow of the code very clear.
I couldn't agree more.
There are three heuristics. All of them were first introduced in
linux-2.4.0-test7pre1 for TCP only, and then moved to the socket core
in linux-2.6.8-rc1 without functional change.
1. Minimum buffer size, even under pressure.
This is required by RFC 7323 (TCP Extensions for High Performance) to
make features like the Window Scale option work as expected, and the
allocation should succeed under global pressure per the definition of
tcp_{r,w}mem. IMHO, for the same reason, it should also succeed under
memcg pressure; otherwise workloads might suffer a performance drop
due to a network bottleneck.
The allocation must not succeed if it exceeds either the global or
the memcg hard limit; otherwise a DoS attack could be mounted by
spawning lots of sockets that are below the minimum buffer size.
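To make the proposed rule for heuristic 1 concrete, here is a minimal
userspace sketch. It is not kernel code: struct acct_domain and
min_buffer_forces_success() are hypothetical names, and "memcg == NULL"
is assumed to mean the socket is not charged to any memcg. It shows
that the minimum-buffer guarantee ignores pressure in both domains but
never overrides either hard limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one accounting domain (global skmem or a memcg). */
struct acct_domain {
	long usage;       /* pages charged, including this request */
	long hard_limit;  /* must never be exceeded, whatever the heuristic */
};

/*
 * Heuristic 1 as proposed: a socket below its minimum buffer size
 * (tcp_{r,w}mem[0]) is forced to succeed under global OR memcg pressure,
 * but never once either domain is over its hard limit.
 */
static bool min_buffer_forces_success(const struct acct_domain *global,
				      const struct acct_domain *memcg,
				      long sk_buf_bytes, long min_buf_bytes)
{
	assert(min_buf_bytes >= 0);

	/* Hard limits are absolute, or many tiny sockets could DoS the box. */
	if (global->usage > global->hard_limit)
		return false;
	if (memcg != NULL && memcg->usage > memcg->hard_limit)
		return false;

	/* Below the minimum buffer: succeed regardless of pressure. */
	return sk_buf_bytes < min_buf_bytes;
}
```

Pressure does not appear in the sketch at all, which is the point of
the proposal: for this heuristic only the hard limits matter.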
2. Allow allocation for a socket whose usage is below the system
average.
Since 'average' is within the scope of global accounting, this one
only makes sense under global memory pressure. In fact it predates
cgroups, hence doesn't take memcg into consideration.
On the other hand, the intention of throttling under memcg pressure is
to relieve the memcg from heavy reclaim pressure, and this heuristic
does not help with that. There also seems to be no reason to let the
allocation succeed when the global or memcg hard limit is exceeded.
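A minimal sketch of heuristic 2 as described above, again as a
userspace model rather than kernel code. The names are hypothetical,
and the definition of "average" is an assumption made for illustration:
the socket's usage is compared against its per-socket share of the
global hard limit (usage * nr_sockets < hard_limit). The in-kernel
formula may differ. The sketch only forces success under global
pressure, never under memcg pressure, and never over a hard limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one accounting domain. */
struct acct_domain {
	long usage;
	long hard_limit;
	bool under_pressure;
};

static bool below_average_forces_success(const struct acct_domain *global,
					 const struct acct_domain *memcg,
					 long sk_usage, long nr_sockets)
{
	assert(nr_sockets > 0);

	/* No reason to exceed either hard limit. */
	if (global->usage > global->hard_limit)
		return false;
	if (memcg != NULL && memcg->usage > memcg->hard_limit)
		return false;

	/*
	 * Under memcg pressure the goal is to relieve reclaim in that memcg;
	 * a system-wide average says nothing about it, so do not force.
	 */
	if (memcg != NULL && memcg->under_pressure)
		return false;

	/* 'Average' is a global notion: only meaningful under global pressure. */
	if (!global->under_pressure)
		return false;

	/* Assumed fair share: sk_usage < hard_limit / nr_sockets, no division. */
	return sk_usage * nr_sockets < global->hard_limit;
}
```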
3. Socket is over its sndbuf.
TBH I don't see its point.
Let's discuss which heuristic applies to which accounting infra and
under which state (under pressure or over limit).
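One way to write down the heuristic/infra/state matrix being discussed
is as a lookup table, which could later map onto separate global and
memcg code paths. This is a hypothetical sketch that only records the
positions stated above; heuristic 3 is marked UNDECIDED because it is
still open, and none of these identifiers exist in the kernel.

```c
#include <assert.h>

enum heuristic { H_MIN_BUFFER, H_BELOW_AVERAGE, H_OVER_SNDBUF, NR_HEURISTICS };
enum infra { INFRA_GLOBAL, INFRA_MEMCG, NR_INFRAS };
enum acct_state { ST_UNDER_PRESSURE, ST_OVER_HARD_LIMIT, NR_STATES };
enum verdict { NO_FORCE, FORCE, UNDECIDED };

/*
 * May heuristic h force an allocation to succeed for accounting infra i
 * in state s? Each row is { under pressure, over hard limit }.
 */
static const enum verdict policy[NR_HEURISTICS][NR_INFRAS][NR_STATES] = {
	/* 1: minimum buffer: force under either pressure, never over limit */
	[H_MIN_BUFFER] = {
		[INFRA_GLOBAL] = { FORCE, NO_FORCE },
		[INFRA_MEMCG]  = { FORCE, NO_FORCE },
	},
	/* 2: below average: global pressure only */
	[H_BELOW_AVERAGE] = {
		[INFRA_GLOBAL] = { FORCE, NO_FORCE },
		[INFRA_MEMCG]  = { NO_FORCE, NO_FORCE },
	},
	/* 3: over sndbuf: still under discussion */
	[H_OVER_SNDBUF] = {
		[INFRA_GLOBAL] = { UNDECIDED, UNDECIDED },
		[INFRA_MEMCG]  = { UNDECIDED, UNDECIDED },
	},
};

static enum verdict lookup_policy(enum heuristic h, enum infra i,
				  enum acct_state s)
{
	assert(h < NR_HEURISTICS && i < NR_INFRAS && s < NR_STATES);
	return policy[h][i][s];
}
```

Keeping the matrix explicit makes it easy to see which cells are still
open once the discussion settles.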
I will follow your suggestion and post a patch to explicitly document
the behaviors once things are clear.
Thanks,
Abel