[PATCH net] net: enetc: avoid buffer leaks on xdp_do_redirect() failure

From: Vladimir Oltean
Date: Mon Dec 12 2022 - 19:22:02 EST


Before enetc_clean_rx_ring_xdp() calls xdp_do_redirect(), each software
BD in the RX ring between index orig_i and i can have one of 2 refcount
values on its page.

We are the owner of the current buffer that is being processed, so the
refcount will be at least 1.

If the current owner of the buffer at the diametrically opposed index
in the RX ring (i.o.w, the other half of this page) has not yet called
kfree(), this page's refcount could even be 2.

enetc_page_reusable() in enetc_flip_rx_buff() tests for the page
refcount against 1, and [ if it's 2 ] does not attempt to reuse it.

But if enetc_flip_rx_buff() is put after the xdp_do_redirect() call,
the page refcount can have one of 3 values. It can also be 0, if there
is no owner of the other page half, and xdp_do_redirect() for this
buffer ran so far that it triggered a flush of the devmap/cpumap bulk
queue, and the consumers of those bulk queues also freed the buffer,
all by the time xdp_do_redirect() returns the execution back to enetc.

This is the reason why enetc_flip_rx_buff() is called before
xdp_do_redirect(), but there is a big flaw with that reasoning:
enetc_flip_rx_buff() will set rx_swbd->page = NULL on both sides of the
enetc_page_reusable() branch, and if xdp_do_redirect() returns an error,
we call enetc_xdp_free(), which does not deal gracefully with that.

In fact, what happens is quite special. The page refcounts start as 1.
enetc_flip_rx_buff() figures they're reusable, transfers these
rx_swbd->page pointers to a different rx_swbd in enetc_reuse_page(), and
bumps the refcount to 2. When xdp_do_redirect() later returns an error,
we call the no-op enetc_xdp_free(), but we still haven't lost the
reference to that page. A copy of it is still at rx_ring->next_to_alloc,
but that has refcount 2 (and there are no concurrent owners of it in
flight, to drop the refcount). What really kills the system is when
we'll flip the rx_swbd->page the second time around. With an updated
refcount of 2, the page will not be reusable and we'll really leak it.
Then enetc_new_page() will have to allocate more pages, which will then
eventually leak again on further errors from xdp_do_redirect().

The problem, summarized, is that we zeroize rx_swbd->page before we're
completely done with it, and this makes it impossible for the error path
to do something with it.

Since the packet is potentially multi-buffer and therefore the
rx_swbd->page is potentially an array, manual passing of the old
pointers between enetc_flip_rx_buff() and enetc_xdp_free() is a bit
difficult.

For the sake of going with a simple solution, we accept the possibility
of racing with xdp_do_redirect(), and we move the flip procedure to
execute only on the redirect success path. By racing, I mean that the
page may be deemed as not reusable by enetc (having a refcount of 0),
but there will be no leak in that case, either.

Once we accept that, we have something better to do with buffers on
XDP_REDIRECT failure. Since we haven't performed half-page flipping yet,
we won't, either (and this way, we can avoid enetc_xdp_free()
completely, which gives the entire page to the slab allocator).
Instead, we'll call enetc_xdp_drop(), which will recycle this half of
the buffer back to the RX ring.

Fixes: 9d2b68cc108d ("net: enetc: add support for XDP_REDIRECT")
Suggested-by: Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx>
Signed-off-by: Vladimir Oltean <vladimir.oltean@xxxxxxx>
---
drivers/net/ethernet/freescale/enetc/enetc.c | 35 +++++---------------
1 file changed, 8 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index 8671591cb750..3a79ead5219a 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -1489,23 +1489,6 @@ static void enetc_xdp_drop(struct enetc_bdr *rx_ring, int rx_ring_first,
rx_ring->stats.xdp_drops++;
}

-static void enetc_xdp_free(struct enetc_bdr *rx_ring, int rx_ring_first,
- int rx_ring_last)
-{
- while (rx_ring_first != rx_ring_last) {
- struct enetc_rx_swbd *rx_swbd = &rx_ring->rx_swbd[rx_ring_first];
-
- if (rx_swbd->page) {
- dma_unmap_page(rx_ring->dev, rx_swbd->dma, PAGE_SIZE,
- rx_swbd->dir);
- __free_page(rx_swbd->page);
- rx_swbd->page = NULL;
- }
- enetc_bdr_idx_inc(rx_ring, &rx_ring_first);
- }
- rx_ring->stats.xdp_redirect_failures++;
-}
-
static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
struct napi_struct *napi, int work_limit,
struct bpf_prog *prog)
@@ -1527,8 +1510,8 @@ static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
int orig_i, orig_cleaned_cnt;
struct xdp_buff xdp_buff;
struct sk_buff *skb;
- int tmp_orig_i, err;
u32 bd_status;
+ int err;

rxbd = enetc_rxbd(rx_ring, i);
bd_status = le32_to_cpu(rxbd->r.lstatus);
@@ -1615,18 +1598,16 @@ static int enetc_clean_rx_ring_xdp(struct enetc_bdr *rx_ring,
break;
}

- tmp_orig_i = orig_i;
-
- while (orig_i != i) {
- enetc_flip_rx_buff(rx_ring,
- &rx_ring->rx_swbd[orig_i]);
- enetc_bdr_idx_inc(rx_ring, &orig_i);
- }
-
err = xdp_do_redirect(rx_ring->ndev, &xdp_buff, prog);
if (unlikely(err)) {
- enetc_xdp_free(rx_ring, tmp_orig_i, i);
+ enetc_xdp_drop(rx_ring, orig_i, i);
+ rx_ring->stats.xdp_redirect_failures++;
} else {
+ while (orig_i != i) {
+ enetc_flip_rx_buff(rx_ring,
+ &rx_ring->rx_swbd[orig_i]);
+ enetc_bdr_idx_inc(rx_ring, &orig_i);
+ }
xdp_redirect_frm_cnt++;
rx_ring->stats.xdp_redirect++;
}
--
2.34.1