Re: [PATCH net-next] tuntap: Fallback to automq on TUNSETSTEERINGEBPF prog negative return

From: Matt Cover
Date: Mon Sep 23 2019 - 12:31:51 EST


On Sun, Sep 22, 2019 at 10:16 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
>
> On 2019/9/23 äå11:18, Matt Cover wrote:
> > On Sun, Sep 22, 2019 at 7:34 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >>
> >> On 2019/9/23 äå9:15, Matt Cover wrote:
> >>> On Sun, Sep 22, 2019 at 5:51 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >>>> On 2019/9/23 äå6:30, Matt Cover wrote:
> >>>>> On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>>>>> On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote:
> >>>>>>> On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >>>>>>>> On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote:
> >>>>>>>>> Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal
> >>>>>>>>> to fallback to tun_automq_select_queue() for tx queue selection.
> >>>>>>>>>
> >>>>>>>>> Compilation of this exact patch was tested.
> >>>>>>>>>
> >>>>>>>>> For functional testing 3 additional printk()s were added.
> >>>>>>>>>
> >>>>>>>>> Functional testing results (on 2 txq tap device):
> >>>>>>>>>
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun no prog ==========
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog -1 ==========
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 0 ==========
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 1 ==========
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] ========== tun prog 2 ==========
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2'
> >>>>>>>>> [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Matthew Cover <matthew.cover@xxxxxxxxxxxxx>
> >>>>>>>> Could you add a bit more motivation data here?
> >>>>>>> Thank you for these questions Michael.
> >>>>>>>
> >>>>>>> I'll plan on adding the below information to the
> >>>>>>> commit message and submitting a v2 of this patch
> >>>>>>> when net-next reopens. In the meantime, it would
> >>>>>>> be very helpful to know if these answers address
> >>>>>>> some of your concerns.
> >>>>>>>
> >>>>>>>> 1. why is this a good idea
> >>>>>>> This change allows TUNSETSTEERINGEBPF progs to
> >>>>>>> do any of the following.
> >>>>>>> 1. implement queue selection for a subset of
> >>>>>>> traffic (e.g. special queue selection logic
> >>>>>>> for ipv4, but return negative and use the
> >>>>>>> default automq logic for ipv6)
> >>>>>>> 2. determine there isn't sufficient information
> >>>>>>> to do proper queue selection; return
> >>>>>>> negative and use the default automq logic
> >>>>>>> for the unknown
> >>>>>>> 3. implement a noop prog (e.g. do
> >>>>>>> bpf_trace_printk() then return negative and
> >>>>>>> use the default automq logic for everything)
> >>>>>>>
> >>>>>>>> 2. how do we know existing userspace does not rely on existing behaviour
> >>>>>>> Prior to this change a negative return from a
> >>>>>>> TUNSETSTEERINGEBPF prog would have been cast
> >>>>>>> into a u16 and traversed netdev_cap_txqueue().
> >>>>>>>
> >>>>>>> In most cases netdev_cap_txqueue() would have
> >>>>>>> found this value to exceed real_num_tx_queues
> >>>>>>> and queue_index would be updated to 0.
> >>>>>>>
> >>>>>>> It is possible that a TUNSETSTEERINGEBPF prog
> >>>>>>> return a negative value which when cast into a
> >>>>>>> u16 results in a positive queue_index less than
> >>>>>>> real_num_tx_queues. For example, on x86_64, a
> >>>>>>> return value of -65535 results in a queue_index
> >>>>>>> of 1; which is a valid queue for any multiqueue
> >>>>>>> device.
> >>>>>>>
> >>>>>>> It seems unlikely, however as stated above is
> >>>>>>> unfortunately possible, that existing
> >>>>>>> TUNSETSTEERINGEBPF programs would choose to
> >>>>>>> return a negative value rather than return the
> >>>>>>> positive value which holds the same meaning.
> >>>>>>>
> >>>>>>> It seems more likely that future
> >>>>>>> TUNSETSTEERINGEBPF programs would leverage a
> >>>>>>> negative return and potentially be loaded into
> >>>>>>> a kernel with the old behavior.
> >>>>>> OK if we are returning a special
> >>>>>> value, shouldn't we limit it? How about a special
> >>>>>> value with this meaning?
> >>>>>> If we are changing an ABI let's at least make it
> >>>>>> extensible.
> >>>>>>
> >>>>> A special value with this meaning sounds
> >>>>> good to me. I'll plan on adding a define
> >>>>> set to -1 to cause the fallback to automq.
> >>>> Can it really return -1?
> >>>>
> >>>> I see:
> >>>>
> >>>> static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
> >>>> struct sk_buff *skb)
> >>>> ...
> >>>>
> >>>>
> >>>>> The way I was initially viewing the old
> >>>>> behavior was that returning negative was
> >>>>> undefined; it happened to have the
> >>>>> outcomes I walked through, but not
> >>>>> necessarily by design.
> >>>> Having such fallback may bring extra troubles, it requires the eBPF
> >>>> program know the existence of the behavior which is not a part of kernel
> >>>> ABI actually. And then some eBPF program may start to rely on that which
> >>>> is pretty dangerous. Note, one important consideration is to have
> >>>> macvtap support where does not have any stuffs like automq.
> >>>>
> >>>> Thanks
> >>>>
> >>> How about we call this TUN_SSE_ABORT
> >>> instead of TUN_SSE_DO_AUTOMQ?
> >>>
> >>> TUN_SSE_ABORT could be documented as
> >>> falling back to the default queue
> >>> selection method in either space
> >>> (presumably macvtap has some queue
> >>> selection method when there is no prog).
> >>
> >> This looks like a more complex API, we don't want userspace to differ
> >> macvtap from tap too much.
> >>
> >> Thanks
> >>
> > This is barely more complex and provides
> > similar to what is done in many places.
> > For xdp, an XDP_PASS enacts what the
> > kernel would do if there was no bpf prog.
> > For tc cls in da mode, TC_ACT_OK enacts
> > what the kernel would do if there was
> > no bpf prog. For xt_bpf, false enacts
> > what the kernel would do if there was
> > no bpf prog (as long as negation
> > isn't in play in the rule, I believe).
>
>
> I think this is simply because you can't implement e.g
> XDP_PASS/TC_ACT_OK through eBPF itself which is not the case of steering
> prog here.
>
>
> >
> > I know that this is somewhat of an
> > oversimplification and that each of
> > these also means something else in
> > the respective hookpoint, but I standby
> > seeing value in this change.
> >
> > macvtap must have some default (i.e the
> > action which it takes when no prog is
> > loaded), even if that is just use queue
> > 0. We can provide the same TUN_SSE_ABORT
> > in userspace which does the same thing;
> > enacts the default when returned. Any
> > differences left between tap and macvtap
> > would be in what the default is, not in
> > these changes. And that difference already
> > exists today.
>
>
> I think it's better to safe to just drop the packet instead of trying to
> workaround it.
>

This patch aside, dropping the packet here
seems like the wrong choice. Loading a
prog at this hookpoint "configures"
steering. The action of configuring
steering should not result in dropped
packets.

Suboptimal delivery is generally preferable
to no delivery. Leaving the behavior as-is
(i.e. relying on netdev_cap_txqueue()) or
making any return which doesn't fit in a
u16 simply use queue 0 would be highly
preferable to dropping the packet.

> Thanks
>
>
> >
> >>>>> In order to keep the new behavior
> >>>>> extensible, how should we state that a
> >>>>> negative return other than -1 is
> >>>>> undefined and therefore subject to
> >>>>> change. Is something like this
> >>>>> sufficient?
> >>>>>
> >>>>> Documentation/networking/tc-actions-env-rules.txt
> >>>>>
> >>>>> Additionally, what should the new
> >>>>> behavior implement when a negative other
> >>>>> than -1 is returned? I would like to have
> >>>>> it do the same thing as -1 for now, but
> >>>>> with the understanding that this behavior
> >>>>> is undefined. Does this sound reasonable?
> >>>>>
> >>>>>>>> 3. why doesn't userspace need a way to figure out whether it runs on a kernel with and
> >>>>>>>> without this patch
> >>>>>>> There may be some value in exposing this fact
> >>>>>>> to the ebpf prog loader. What is the standard
> >>>>>>> practice here, a define?
> >>>>>> We'll need something at runtime - people move binaries between kernels
> >>>>>> without rebuilding then. An ioctl is one option.
> >>>>>> A sysfs attribute is another, an ethtool flag yet another.
> >>>>>> A combination of these is possible.
> >>>>>>
> >>>>>> And if we are doing this anyway, maybe let userspace select
> >>>>>> the new behaviour? This way we can stay compatible with old
> >>>>>> userspace...
> >>>>>>
> >>>>> Understood. I'll look into adding an
> >>>>> ioctl to activate the new behavior. And
> >>>>> perhaps a method of checking which is
> >>>>> behavior is currently active (in case we
> >>>>> ever want to change the default, say
> >>>>> after some suitably long transition
> >>>>> period).
> >>>>>
> >>>>>>>> thanks,
> >>>>>>>> MST
> >>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>> drivers/net/tun.c | 20 +++++++++++---------
> >>>>>>>>> 1 file changed, 11 insertions(+), 9 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>>>>>> index aab0be4..173d159 100644
> >>>>>>>>> --- a/drivers/net/tun.c
> >>>>>>>>> +++ b/drivers/net/tun.c
> >>>>>>>>> @@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> >>>>>>>>> return txq;
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> >>>>>>>>> +static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> >>>>>>>>> {
> >>>>>>>>> struct tun_prog *prog;
> >>>>>>>>> u32 numqueues;
> >>>>>>>>> - u16 ret = 0;
> >>>>>>>>> + int ret = -1;
> >>>>>>>>>
> >>>>>>>>> numqueues = READ_ONCE(tun->numqueues);
> >>>>>>>>> if (!numqueues)
> >>>>>>>>> return 0;
> >>>>>>>>>
> >>>>>>>>> + rcu_read_lock();
> >>>>>>>>> prog = rcu_dereference(tun->steering_prog);
> >>>>>>>>> if (prog)
> >>>>>>>>> ret = bpf_prog_run_clear_cb(prog->prog, skb);
> >>>>>>>>> + rcu_read_unlock();
> >>>>>>>>>
> >>>>>>>>> - return ret % numqueues;
> >>>>>>>>> + if (ret >= 0)
> >>>>>>>>> + ret %= numqueues;
> >>>>>>>>> +
> >>>>>>>>> + return ret;
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
> >>>>>>>>> struct net_device *sb_dev)
> >>>>>>>>> {
> >>>>>>>>> struct tun_struct *tun = netdev_priv(dev);
> >>>>>>>>> - u16 ret;
> >>>>>>>>> + int ret;
> >>>>>>>>>
> >>>>>>>>> - rcu_read_lock();
> >>>>>>>>> - if (rcu_dereference(tun->steering_prog))
> >>>>>>>>> - ret = tun_ebpf_select_queue(tun, skb);
> >>>>>>>>> - else
> >>>>>>>>> + ret = tun_ebpf_select_queue(tun, skb);
> >>>>>>>>> + if (ret < 0)
> >>>>>>>>> ret = tun_automq_select_queue(tun, skb);
> >>>>>>>>> - rcu_read_unlock();
> >>>>>>>>>
> >>>>>>>>> return ret;
> >>>>>>>>> }
> >>>>>>>>> --
> >>>>>>>>> 1.8.3.1