Re: [PATCH net-next] tuntap: Fallback to automq on TUNSETSTEERINGEBPF prog negative return

From: Jason Wang
Date: Mon Sep 23 2019 - 01:16:16 EST



On 2019/9/23 äå11:18, Matt Cover wrote:
On Sun, Sep 22, 2019 at 7:34 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

On 2019/9/23 äå9:15, Matt Cover wrote:
On Sun, Sep 22, 2019 at 5:51 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
On 2019/9/23 äå6:30, Matt Cover wrote:
On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote:
On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote:
Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal
to fallback to tun_automq_select_queue() for tx queue selection.

Compilation of this exact patch was tested.

For functional testing 3 additional printk()s were added.

Functional testing results (on 2 txq tap device):

[Fri Sep 20 18:33:27 2019] ========== tun no prog ==========
[Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
[Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
[Fri Sep 20 18:33:27 2019] ========== tun prog -1 ==========
[Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1'
[Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
[Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
[Fri Sep 20 18:33:27 2019] ========== tun prog 0 ==========
[Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0'
[Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
[Fri Sep 20 18:33:27 2019] ========== tun prog 1 ==========
[Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1'
[Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1'
[Fri Sep 20 18:33:27 2019] ========== tun prog 2 ==========
[Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2'
[Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'

Signed-off-by: Matthew Cover <matthew.cover@xxxxxxxxxxxxx>
Could you add a bit more motivation data here?
Thank you for these questions Michael.

I'll plan on adding the below information to the
commit message and submitting a v2 of this patch
when net-next reopens. In the meantime, it would
be very helpful to know if these answers address
some of your concerns.

1. why is this a good idea
This change allows TUNSETSTEERINGEBPF progs to
do any of the following.
1. implement queue selection for a subset of
traffic (e.g. special queue selection logic
for ipv4, but return negative and use the
default automq logic for ipv6)
2. determine there isn't sufficient information
to do proper queue selection; return
negative and use the default automq logic
for the unknown
3. implement a noop prog (e.g. do
bpf_trace_printk() then return negative and
use the default automq logic for everything)

2. how do we know existing userspace does not rely on existing behaviour
Prior to this change a negative return from a
TUNSETSTEERINGEBPF prog would have been cast
into a u16 and traversed netdev_cap_txqueue().

In most cases netdev_cap_txqueue() would have
found this value to exceed real_num_tx_queues
and queue_index would be updated to 0.

It is possible that a TUNSETSTEERINGEBPF prog
return a negative value which when cast into a
u16 results in a positive queue_index less than
real_num_tx_queues. For example, on x86_64, a
return value of -65535 results in a queue_index
of 1; which is a valid queue for any multiqueue
device.

It seems unlikely, however as stated above is
unfortunately possible, that existing
TUNSETSTEERINGEBPF programs would choose to
return a negative value rather than return the
positive value which holds the same meaning.

It seems more likely that future
TUNSETSTEERINGEBPF programs would leverage a
negative return and potentially be loaded into
a kernel with the old behavior.
OK if we are returning a special
value, shouldn't we limit it? How about a special
value with this meaning?
If we are changing an ABI let's at least make it
extensible.

A special value with this meaning sounds
good to me. I'll plan on adding a define
set to -1 to cause the fallback to automq.
Can it really return -1?

I see:

static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
struct sk_buff *skb)
...


The way I was initially viewing the old
behavior was that returning negative was
undefined; it happened to have the
outcomes I walked through, but not
necessarily by design.
Having such fallback may bring extra troubles, it requires the eBPF
program know the existence of the behavior which is not a part of kernel
ABI actually. And then some eBPF program may start to rely on that which
is pretty dangerous. Note, one important consideration is to have
macvtap support where does not have any stuffs like automq.

Thanks

How about we call this TUN_SSE_ABORT
instead of TUN_SSE_DO_AUTOMQ?

TUN_SSE_ABORT could be documented as
falling back to the default queue
selection method in either space
(presumably macvtap has some queue
selection method when there is no prog).

This looks like a more complex API, we don't want userspace to differ
macvtap from tap too much.

Thanks

This is barely more complex and provides
similar to what is done in many places.
For xdp, an XDP_PASS enacts what the
kernel would do if there was no bpf prog.
For tc cls in da mode, TC_ACT_OK enacts
what the kernel would do if there was
no bpf prog. For xt_bpf, false enacts
what the kernel would do if there was
no bpf prog (as long as negation
isn't in play in the rule, I believe).


I think this is simply because you can't implement e.g XDP_PASS/TC_ACT_OK through eBPF itself which is not the case of steering prog here.



I know that this is somewhat of an
oversimplification and that each of
these also means something else in
the respective hookpoint, but I standby
seeing value in this change.

macvtap must have some default (i.e the
action which it takes when no prog is
loaded), even if that is just use queue
0. We can provide the same TUN_SSE_ABORT
in userspace which does the same thing;
enacts the default when returned. Any
differences left between tap and macvtap
would be in what the default is, not in
these changes. And that difference already
exists today.


I think it's better to safe to just drop the packet instead of trying to workaround it.

Thanks



In order to keep the new behavior
extensible, how should we state that a
negative return other than -1 is
undefined and therefore subject to
change. Is something like this
sufficient?

Documentation/networking/tc-actions-env-rules.txt

Additionally, what should the new
behavior implement when a negative other
than -1 is returned? I would like to have
it do the same thing as -1 for now, but
with the understanding that this behavior
is undefined. Does this sound reasonable?

3. why doesn't userspace need a way to figure out whether it runs on a kernel with and
without this patch
There may be some value in exposing this fact
to the ebpf prog loader. What is the standard
practice here, a define?
We'll need something at runtime - people move binaries between kernels
without rebuilding then. An ioctl is one option.
A sysfs attribute is another, an ethtool flag yet another.
A combination of these is possible.

And if we are doing this anyway, maybe let userspace select
the new behaviour? This way we can stay compatible with old
userspace...

Understood. I'll look into adding an
ioctl to activate the new behavior. And
perhaps a method of checking which is
behavior is currently active (in case we
ever want to change the default, say
after some suitably long transition
period).

thanks,
MST

---
drivers/net/tun.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index aab0be4..173d159 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
return txq;
}

-static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
{
struct tun_prog *prog;
u32 numqueues;
- u16 ret = 0;
+ int ret = -1;

numqueues = READ_ONCE(tun->numqueues);
if (!numqueues)
return 0;

+ rcu_read_lock();
prog = rcu_dereference(tun->steering_prog);
if (prog)
ret = bpf_prog_run_clear_cb(prog->prog, skb);
+ rcu_read_unlock();

- return ret % numqueues;
+ if (ret >= 0)
+ ret %= numqueues;
+
+ return ret;
}

static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
struct net_device *sb_dev)
{
struct tun_struct *tun = netdev_priv(dev);
- u16 ret;
+ int ret;

- rcu_read_lock();
- if (rcu_dereference(tun->steering_prog))
- ret = tun_ebpf_select_queue(tun, skb);
- else
+ ret = tun_ebpf_select_queue(tun, skb);
+ if (ret < 0)
ret = tun_automq_select_queue(tun, skb);
- rcu_read_unlock();

return ret;
}
--
1.8.3.1