Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken

From: Dmitry Safonov
Date: Fri Dec 06 2024 - 15:35:47 EST


On Fri, 6 Dec 2024 at 15:15, Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov <0x7f454c46@xxxxxxxxx> wrote:
> >
[..]
> > > @@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
> > >
> > > err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
> > > if (err < 0) {
> > > - WARN_ON(err == -EMSGSIZE);
> > > nlmsg_free(rep);
> > > + if (err == -EMSGSIZE) {
> > > + attr_size <<= 1;
> > > + if (attr_size + NLMSG_HDRLEN <=
> > > SKB_WITH_OVERHEAD(32768)) {
> > > + cond_resched();
> > > + goto retry;
> > > + }
> > > + }
> > > goto out;
> > > }
> > > err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
> >
> > To my personal taste on larger than 327 md5 keys scale, I'd prefer to
> > see "dump may be inconsistent, retry if you need consistency" than
> > -EMSGSIZE fail, yet userspace potentially may use the errno as a
> > "retry" signal.
> >
>
> I do not yet understand this point. I will let you send a patch for
> further discussion.

Let me explain my view. It's based on two points:
(a) TCP-MD5/AO-diag interfaces are mostly used for
debugging/investigating/monitoring by tools alike ss. Without a
side-synchronisation, they can't be used by BGP or other tools/tests
to make decisions as the socket is controlled by another process and
the resulting dump may be incomplete, inconsistent or outdated.
(b) The current default of optmem_max limit (128Kb) allows to allocate
on a socket 655 TCP-AO keys and even more TCP-MD5 keys. Some of
Arista's customers (I'd guess the same for other BGP users) have 1000
peers (for MD5 it's one key per peer on a listen socket, for AO might
be higher).

I think the situation that's being addressed here is a race and
potentially it's rare to hit (I have to run a reproducer in a loop to
hit it). That's why in my view a re-try jump is too big of a hammer.
And failing with -EMSGSIZE on 327+ keys scale sounds slighly worse
than just marking the resulting dump as inconsistent and letting the
user decide if he wants to re-run the dump or if the dump is "good
enough" to get a sense of the situation. I would even say "the dump is
inconsistent" has its value as a signal that the keys on a socket
change right now, which may be useful.

Regarding the patch, my attempt was in this thread:
https://lore.kernel.org/all/20241113-tcp-md5-diag-prep-v2-1-00a2a7feb1fa@xxxxxxxxx/

However, I should note that I'm fine with either of the approaches
(userspace to retry on EMSGSIZE or to get an inconsistent dump and
decide what to do with that). I'm somewhat looking forward to
switching to problems (1)/(3)/(4) from the cover-letter and adding
TCP-AO-diag, rather than being stuck arguing about what's the best
solution for quite a rare race :-)

Thanks,
Dmitry