Re: [PATCH] net: add big honking pfmemalloc OOM warning

From: Juha-Matti Tilli
Date: Wed Apr 10 2019 - 11:01:32 EST


On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> If NFS sessions hang, then there is a bug to eventually root cause and fix.
>
> Just telling the user : Increase the limit is the same thing than admitting :
>
> Our limit system or TCP or NFS stacks are broken and unable to
> recover, so lets disable the limit system and work around a more
> serious bug.
>
> Maybe the bug is in a NIC driver, please share more details before
> adding yet another noisy signal in syslog
>
> SNMP counters are per netns, and more useful in the modern computing
> era, where a host is shared by many different containers.

Any idea where the bug might be?

It can't be in NFS, because I have observed the problem at the TCP
level. NFS would work just fine if TCP worked, but the underlying TCP
connection does not work properly unless we bump up
vm.min_free_kbytes.

It could be in ixgbe, because the incoming SKB gets pfmemalloc pages
for some reason, and that happens repeatedly for a duration of 5-10
minutes for every single retransmit, until the condition clears. Ping
is working just fine at the time the NFS connection is stuck. I think
these 63-queue NICs use a different queue for ping than they use for
the TCP NFS connection. I think there is some code in ixgbe for not
reusing pfmemalloc pages, but it seems every packet nevertheless gets
a pfmemalloc page in the queue that is used for TCP NFS. Might the
cause be that if ixgbe gets the pages in large bunches, it gets
multiple pfmemalloc pages at a time and then every packet is dropped
until all the pfmemalloc pages run out (not being reused)?

It could also be in the default value of vm.min_free_kbytes, but I'm
not experienced enough in Linux kernel internals to adjust the complex
calculations. Just saying that 90 MB sounds ridiculously low on a 256
GB NUMA machine.
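
To put the 90 MB figure in proportion, here is a rough sketch of the
arithmetic. It assumes the boot-time heuristic is approximately
sqrt(lowmem_kbytes * 16), clamped to the range 128..65536 kB, which is
my reading of mm/page_alloc.c in kernels of this era and should be
checked against the running kernel. The function names and the cap
parameter are made up for illustration. Under that assumption the
heuristic alone would give 64 MB on this machine, so the observed 90 MB
presumably includes a later adjustment made elsewhere in the kernel.

```python
import math


def default_min_free_kbytes(lowmem_kbytes, cap_kbytes=65536):
    """Approximate the assumed default: sqrt(lowmem_kbytes * 16), clamped.

    The formula and the 128..65536 kB clamp are assumptions about the
    kernel heuristic, not values read from a live system.
    """
    value = math.isqrt(lowmem_kbytes * 16)
    return max(128, min(value, cap_kbytes))


def reserve_fraction(min_free_kbytes, total_kbytes):
    """Return the reserve as a fraction of total memory."""
    return min_free_kbytes / total_kbytes


total_kb = 256 * 1024 * 1024          # 256 GB expressed in kB
observed_kb = 90 * 1024               # the 90 MB mentioned above

print("heuristic default (kB):", default_min_free_kbytes(total_kb))
print(f"observed reserve share: {reserve_fraction(observed_kb, total_kb):.4%}")
```

Either way the reserve is on the order of 0.03% of RAM. On a live
system the current value can be read from
/proc/sys/vm/min_free_kbytes.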

Are you of the opinion that Intel as the developer of ixgbe should be informed?

Anyway, I posted more details to the mailing lists about a week ago.
Search for "NFS hang, sk_drops increasing, segs_in not, pfmemalloc
suspected", or use this direct link:
https://lkml.org/lkml/2019/4/3/682

The current situation is that we've been running the production system
for 2 weeks with a bumped-up vm.min_free_kbytes and have seen no NFS
hangs. Before the bump, we had approximately one hang per day, so
without it we would have expected approximately 14 NFS hangs over the
same 2 weeks.
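
As a sanity check on how strong that evidence is, here is a small
sketch. It assumes (purely for illustration) that hangs arrive
independently at a constant rate of one per day, i.e. a Poisson
process; the real hangs may well be bursty or load-dependent, so this
is only a rough estimate. The function name is hypothetical.

```python
import math


def prob_zero_events(rate_per_day, days):
    """P(no events in `days`) under an assumed Poisson arrival model."""
    expected = rate_per_day * days
    return math.exp(-expected)


rate = 1.0    # approximately one hang per day before the bump
days = 14     # two weeks of production use after the bump

p = prob_zero_events(rate, days)
print(f"expected hangs: {rate * days:.0f}")
print(f"P(zero hangs by chance): {p:.2e}")
```

Under that assumption, two hang-free weeks by pure luck would have a
probability below one in a million, so the bump very likely made the
difference.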

To me, this OOM condition seems to be global, so having it per-netns
offers no clear benefit in my opinion. Or is vm.min_free_kbytes
tunable per container?
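
For anyone who wants to watch the per-netns SNMP counter mentioned
above, here is a minimal sketch. It assumes the counter appears as
PFMemallocDrop on the TcpExt lines of /proc/net/netstat, and that the
file uses its usual layout of alternating header and value lines with
a shared prefix. The helper names and the sample text are made up for
illustration.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {name: value}}.

    The file is assumed to consist of line pairs: a header line of
    counter names followed by a line of values, both starting with the
    same prefix such as "TcpExt:".
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    tables = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        if header[0] != values[0]:
            continue  # not a matching header/value pair, skip it
        prefix = header[0].rstrip(":")
        tables[prefix] = dict(zip(header[1:], (int(v) for v in values[1:])))
    return tables


def pfmemalloc_drops(text):
    """Return the TcpExt PFMemallocDrop value, or None if it is absent."""
    return parse_netstat(text).get("TcpExt", {}).get("PFMemallocDrop")


# Made-up sample in the assumed format, not output from a real host.
SAMPLE = (
    "TcpExt: SyncookiesSent PFMemallocDrop\n"
    "TcpExt: 0 42\n"
    "IpExt: InOctets\n"
    "IpExt: 1000\n"
)

print(pfmemalloc_drops(SAMPLE))
```

On a live system the same function could be fed the contents of
/proc/net/netstat, read from inside the network namespace of interest,
and sampled periodically to see whether the counter climbs while a
connection is stuck.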

BR, Juha-Matti