Re: [PATCH net] net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()

From: Simon Horman
Date: Thu Oct 24 2024 - 09:20:23 EST


On Wed, Oct 23, 2024 at 01:05:41PM +0300, Vladimir Oltean wrote:
> This command:
>
> $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
> Error: block dev insert failed: -EBUSY.
>
> fails because user space requests the same block index to be set for
> both ingress and egress.
>
> [ side note, I don't think it even failed prior to commit 913b47d3424e
> ("net/sched: Introduce tc block netdev tracking infra"), because this
> is a command from an old set of notes of mine which used to work, but
> alas, I did not scientifically bisect this ]
>
> The problem is not that it fails, but rather, that the second time
> around, it fails differently (and irrecoverably):
>
> $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
> Error: dsa_core: Flow block cb is busy.
>
> [ another note: the extack is added by me for illustration purposes.
> the context of the problem is that clsact_init() obtains the same
> &q->ingress_block pointer as &q->egress_block, and since we call
> tcf_block_get_ext() on both of them, "dev" will be added to the
> block->ports xarray twice, thus failing the operation: once through
> the ingress block pointer, and once again through the egress block
> pointer. the problem itself is that when xa_insert() fails, we have
> emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the
> offload never sees a corresponding FLOW_BLOCK_UNBIND. ]
>
> Even correcting the bad user input, we still cannot recover:
>
> $ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact
> Error: dsa_core: Flow block cb is busy.
>
> Basically the only way to recover is to reboot the system, or unbind and
> rebind the net device driver.
>
> To fix the bug, we need to fill in the correct error teardown path, which
> was missed during code movement, and call tcf_block_offload_unbind()
> when xa_insert() fails.
>
> [ last note, fundamentally I blame the label naming convention in
> tcf_block_get_ext() for the bug. The labels should be named after what
> they do, not after the error path that jumps to them. This way, it is
> obvious that two labels pointing to the same code mean something is
> wrong, and checking the code correctness at the goto site is also
> easier ]

Yes, a textbook case of why that practice is discouraged.
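
For illustration only, here is a minimal userspace sketch of the unwind
pattern described above. It is not the kernel code: struct fake_block,
offload_bind(), offload_unbind(), ports_insert() and block_get() are
hypothetical stand-ins for the offload bind/unbind and the block->ports
xarray insert, and the labels are named after what they do.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state touched by the real code path. */
struct fake_block {
	bool offload_bound;	/* true between a BIND and its matching UNBIND */
	bool port_present;	/* true while "dev" is already in the ports table */
};

/* Models FLOW_BLOCK_BIND: refuses a second bind, like "cb is busy". */
static int offload_bind(struct fake_block *b)
{
	if (b->offload_bound)
		return -EBUSY;
	b->offload_bound = true;
	return 0;
}

/* Models FLOW_BLOCK_UNBIND. */
static void offload_unbind(struct fake_block *b)
{
	b->offload_bound = false;
}

/* Models the xa_insert() semantics: fails if the entry already exists. */
static int ports_insert(struct fake_block *b)
{
	if (b->port_present)
		return -EBUSY;
	b->port_present = true;
	return 0;
}

static int block_get(struct fake_block *b)
{
	int err;

	err = offload_bind(b);
	if (err)
		goto out;

	err = ports_insert(b);
	if (err)
		goto err_offload_unbind;	/* undo the bind done above */

	return 0;

err_offload_unbind:
	offload_unbind(b);
out:
	return err;
}
```

In this sketch, calling block_get() on a fake_block whose port_present is
already set returns -EBUSY and leaves offload_bound false, so a later call
with the conflict removed succeeds. Had the insert failure jumped straight
to "out", offload_bound would stay set and every retry would fail at the
bind step, which is the shape of the irrecoverable failure in the report.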

> Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
> Signed-off-by: Vladimir Oltean <vladimir.oltean@xxxxxxx>

Reviewed-by: Simon Horman <horms@xxxxxxxxxx>