Re: [PATCH] net: add big honking pfmemalloc OOM warning

From: Juha-Matti Tilli
Date: Wed Apr 10 2019 - 11:01:32 EST

On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> If NFS sessions hang, then there is a bug to eventually root cause and fix.
> Just telling the user : Increase the limit is the same thing than admitting :
> Our limit system or TCP or NFS stacks are broken and unable to
> recover, so lets disable the limit system and work around a more
> serious bug.
> Maybe the bug is in a NIC driver, please share more details before
> adding yet another noisy signal in syslog
> SNMP counters are per netns, and more useful in the modern computing
> era, where a host is shared by many different containers.

Any idea where the bug might be?

It can't be in NFS, because I have observed the issue to be a TCP
level issue. NFS would be working just fine if TCP worked, but the
underlying TCP connection is not working fine, unless we bump up
vm.min_free_kbytes.

It could be in ixgbe, because the incoming SKB gets pfmemalloc pages
for some reason, and that happens repeatedly for a duration of 5-10
minutes for every single retransmit, until the condition clears. Ping
is working just fine at the time the NFS connection is stuck. I think
these 63-queue NICs use a different queue for ping than they use for the
TCP NFS connection. I think there is some code in ixgbe for not
reusing pfmemalloc pages, but it seems every packet nevertheless gets
a pfmemalloc page in the queue that is used for TCP NFS. Might the
cause be that if ixgbe gets the pages in large bunches, it gets
multiple pfmemalloc pages at a time and then every packet is dropped
until all the pfmemalloc pages run out (not being reused)?

It could also be in the default value of vm.min_free_kbytes, but I'm
not experienced enough in Linux kernel internals to adjust the complex
calculations. Just saying that 90 MB sounds ridiculously low on a 256
GB NUMA machine.
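
To illustrate why the default ends up so small relative to RAM, here is a
rough sketch of the heuristic as it is commonly described for kernels of
this era: the integer square root of (lowmem in KB * 16), clamped to the
range 128..65536 KB. The function name, the clamp bounds as parameters, and
the use of total RAM as a stand-in for lowmem are assumptions made for
illustration; this is not the kernel's code. The formula alone gives 64 MB
on this machine, so the roughly 90 MB observed presumably includes some
other adjustment (possibly the transparent hugepage setup raising the
value; that is a guess, not something verified here).

```python
import math


def default_min_free_kbytes(lowmem_kbytes: int, lo: int = 128, hi: int = 65536) -> int:
    """Approximate the default vm.min_free_kbytes (hypothetical helper).

    Assumes the sqrt(lowmem_kbytes * 16) heuristic with a clamp to
    [lo, hi] KB; both the formula and the bounds are assumptions here.
    """
    value = math.isqrt(lowmem_kbytes * 16)
    return max(lo, min(hi, value))


if __name__ == "__main__":
    # 256 GB expressed in KB, used as a stand-in for the lowmem figure.
    ram_kbytes = 256 * 1024 * 1024
    kb = default_min_free_kbytes(ram_kbytes)
    print(f"{kb} KB (~{kb // 1024} MB)")
```

Because the value grows with the square root of memory and is then capped,
a 256 GB machine gets the same default as a 16x smaller machine would get
from the cap alone.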

Are you of the opinion that Intel as the developer of ixgbe should be informed?

Anyway, I posted more details to the mailing lists about a week ago,
search for "NFS hang, sk_drops increasing, segs_in not, pfmemalloc
suspected" in the mailing lists, or click this direct link:

The current situation is that we've been running the production system
for 2 weeks with a bumped-up vm.min_free_kbytes and no NFS hangs. Before
the bump, we had approximately one hang per day, so without it we would
have expected approximately 14 NFS hangs over those 2 weeks.
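
As a back-of-the-envelope check on how unlikely a hang-free fortnight
would be by chance, the sketch below assumes the hangs arrive as a Poisson
process at the observed rate of about one per day. The Poisson model and
the function name are assumptions for illustration only; real hangs may
well be correlated with load.

```python
import math


def prob_no_events(rate_per_day: float, days: float) -> float:
    """Probability of zero events in `days` under a Poisson model (assumed)."""
    return math.exp(-rate_per_day * days)


if __name__ == "__main__":
    # About one hang per day before the change, 14 days observed afterwards.
    p = prob_no_events(1.0, 14)
    print(f"P(no hangs in 14 days by chance) = {p:.2e}")
```

Under that assumption the chance of seeing no hangs in 14 days without any
change is below one in a million, so the clean two weeks are hard to
explain as luck.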

To me, this OOM condition seems to be global, so I see no clear benefit
in having it per-netns. Or is vm.min_free_kbytes a per-container
tunable?
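
For anyone who wants to watch the existing counter while reproducing this,
here is a minimal sketch that reads the TcpExt PFMemallocDrop value from
/proc/net/netstat. It assumes the usual layout of that file (a header line
of names followed by a line of values, per protocol prefix); the helper
names are made up for illustration. Note that the file reflects the network
namespace of the reading process.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"Prefix.Name": value}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[f"{prefix.strip()}.{name}"] = int(num)
    return counters


def pfmemalloc_drops(path: str = "/proc/net/netstat"):
    """Return the PFMemallocDrop counter, or None if it is not present."""
    with open(path) as f:
        return parse_netstat(f.read()).get("TcpExt.PFMemallocDrop")


if __name__ == "__main__":
    print("TcpExt.PFMemallocDrop =", pfmemalloc_drops())
```

Sampling this value before and during a hang would show whether the drops
line up with the stuck connection.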

