Re: [PATCH net V3 1/2] veth: enable dev_watchdog for detecting stalled TXQs

From: Jesper Dangaard Brouer

Date: Wed Nov 12 2025 - 16:58:28 EST

On 08/11/2025 02.54, Jakub Kicinski wrote:

On Fri, 7 Nov 2025 14:42:58 +0100 Jesper Dangaard Brouer wrote:

I think this belongs in net-next.. Fail safe is not really a bug fix.
I'm slightly worried we're missing a corner case and will cause
timeouts to get printed for someone's config.

This is a recovery fix. If the race condition fix isn't 100% then this
patch will allow veth to recover. Thus, to me it makes sense to group
these two patches together.

I'm more worried that we we're missing a corner case that we cannot
recover from. Than triggering timeouts to get printed, for a config
where NAPI consumer veth_poll() takes more that 5 seconds to run (budget
max 64 packets this needs to consume packets at a rate less than 12.8
pps). It might be good to get some warnings if the system is operating
this slow.

Also remember this is not the default config that most people use.
The code is only activated if attaching a qdisc to veth, which isn't
default. Plus, NAPI mode need to be activated, where in normal NAPI mode
the producer and consumer usually runs on the same CPU, which makes it
impossible to overflow the ptr_ring. The veth backpressure is primarily
needed when running with threaded-NAPI, where it is natural that
producer and consumer runs on different CPUs. In our production setup
the consumer is always slower than the producer (as the product inside
the namespace have installed too many nftables rules).

I understand all of this, but IMO the fix is in patch 2.
This is a resiliency improvement, not a fix.

As maintainer you have the final say, so I send a [V4]. Notice that
doing it this way will cause a merge conflict once net and net-next gets
merged.

[V4] https://lore.kernel.org/all/176295319819.307447.6162285688886096284.stgit@firesoul/

--Jesper