Re: Steam is broken on new kernels

From: Pierre-Loup A. Griffais
Date: Fri Jun 21 2019 - 21:03:36 EST




On 6/21/19 5:19 PM, Eric Dumazet wrote:
On Fri, Jun 21, 2019 at 7:54 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

Eric is talking about this patch, I think:

https://patchwork.ozlabs.org/patch/1120222/


That is correct.

I am about to take a flight from Boston to Paris, so I cannot really
follow discussions/tests for the next several hours.

I built the tip of linux-5.1.y and reproduced the issue while trying to log out and back into Steam; it exhibited this symptom as well:

pgriffais@pgriffais:~$ nstat -az | grep -i wqueue
TcpExtTCPWqueueTooBig 31 0.0

I applied Eric's patch to the tip of the branch and ran that kernel; the bug didn't occur through several logout / login cycles, so things look good at first glance. I'll keep running that kernel and report back if anything crops up, but I believe we're good, apart from getting distros to ship this additional fix.

Thanks,
- Pierre-Loup


Thanks.

I guess I'll ask people on the github thread to test that too.

Linus

On Fri, Jun 21, 2019 at 3:38 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:

Please look at my recent patch.
Sorry I am travelling....

On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

What specific commit caused the breakage?

Both on reddit and on github there seems to be confusion about whether
it's a problem or not. Some people have it working with the exact same
kernel that breaks for others.

And then some people seem to say it works intermittently for them,
which seems to indicate a timing issue.

Looking at the SACK patches (assuming it's one of them), I'd suspect
the "tcp: tcp_fragment() should apply sane memory limits".

Eric, that one does

        if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
                NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
                return -ENOMEM;
        }

but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
same size as sk_sndbuf. So if there is some fragmentation, and we add
more skb's to it, that would seem to trigger fairly easily.
Particularly since this is all in "truesize" units, which can be a lot
bigger than the packets themselves.
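
To make the arithmetic in that concern concrete, here is a minimal sketch that lifts only the comparison out of the quoted check. The helper name wqueue_too_big() and the byte values in the note below are made up for illustration; they are not kernel code and not measurements from an affected system.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function): evaluates the same
 * comparison as the quoted check, using plain ints in place of
 * sk->sk_wmem_queued and sk->sk_sndbuf. A true result corresponds to
 * the path that bumps TCPWqueueTooBig and returns -ENOMEM.
 */
bool wqueue_too_big(int wmem_queued, int sndbuf)
{
	/* Half of the queued bytes is compared against the send buffer. */
	return (wmem_queued >> 1) > sndbuf;
}
```

With an assumed sndbuf of 65536, wqueue_too_big(131072, 65536) is false, while wqueue_too_big(131074, 65536) is true. In this model, once the queued total sits near twice sk_sndbuf, a couple of extra bytes of truesize accounting are enough to flip the result.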

I don't know the code, so I may be out to lunch and barking up
completely the wrong tree, but that particular check does seem like it
might trigger much more easily than I think the code _intended_ it to
trigger?

Pierre-Loup - do you guys have a test-case inside of Valve? Or is this
purely "we see some people with problems"?

Linus