Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

From: Akihiko Odaki
Date: Mon Oct 09 2023 - 04:57:25 EST


On 2023/10/09 17:04, Willem de Bruijn wrote:
On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2023/10/09 5:08, Willem de Bruijn wrote:
On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2023/10/09 4:07, Willem de Bruijn wrote:
On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes to the kernel in order to overcome
thse challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but it makes
little sense to allow to implement different hashing algorithms with
eBPF since the hash value reported by virtio-net is strictly defined by
the specification.

The hash value already stored in sk_buff is not used and computed
independently since it may have been computed in a way not conformant
with the specification.

Signed-off-by: Akihiko Odaki <akihiko.odaki@xxxxxxxxxx>
---

+static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
+ .max_indirection_table_length =
+ TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
+
+ .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+};

No need to have explicit capabilities exchange like this? Tun either
supports all or none.

tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.

It is because the flow dissector does not support IPv6 extensions. The
specification is also vague, and does not tell how many TLVs should be
consumed at most when interpreting destination option header so I chose
to avoid adding code for these hash types to the flow dissector. I doubt
anyone will complain about it since nobody complains for Linux.

I'm also adding this so that we can extend it later.
max_indirection_table_length may grow for systems with 128+ CPUs, or
types may have other bits for new protocols in the future.


case TUNSETSTEERINGEBPF:
- ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+ bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+ if (IS_ERR(bpf_ret))
+ ret = PTR_ERR(bpf_ret);
+ else if (bpf_ret)
+ tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;

Don't make one feature disable another.

TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
functions. If one is enabled the other call should fail, with EBUSY
for instance.

+ case TUNSETVNETHASH:
+ len = sizeof(vnet_hash);
+ if (copy_from_user(&vnet_hash, argp, len)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
+ (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
+ !tun_is_little_endian(tun))) ||
+ vnet_hash.indirection_table_mask >=
+ TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
+ ret = -EINVAL;
+ break;
+ }
+
+ argp = (u8 __user *)argp + len;
+ len = (vnet_hash.indirection_table_mask + 1) * 2;
+ if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ argp = (u8 __user *)argp + len;
+ len = virtio_net_hash_key_length(vnet_hash.types);
+
+ if (copy_from_user(vnet_hash_key, argp, len)) {
+ ret = -EFAULT;
+ break;
+ }

Probably easier and less error-prone to define a fixed size control
struct with the max indirection table size.

I made its size variable because the indirection table and key may grow
in the future as I wrote above.


Btw: please trim the CC: list considerably on future patches.

I'll do so in the next version with the TUNSETSTEERINGEBPF change you
proposed.

To be clear: please don't just resubmit with that one change.

The skb and cb issues are quite fundamental issues that need to be resolved.

I'd like to understand why adjusting the existing BPF feature for this
exact purpose cannot be amended to return the key it produced.

eBPF steering program is not designed for this particular problem in my
understanding. It was introduced to derive hash values with an
understanding of application-specific semantics of packets instead of
generic IP/TCP/UDP semantics.

This problem is rather different in terms that the hash derivation is
strictly defined by virtio-net. I don't think it makes sense to
introduce the complexity of BPF when you always run the same code.

It can utilize the existing flow dissector and also make it easier to
use for the userspace by implementing this in the kernel.

Ok. There does appear to be overlap in functionality. But it might be
easier to deploy to just have standard Toeplitz available without
having to compile and load an eBPF program.

As for the sk_buff and cb[] changes. The first is really not needed.
sk_buff simply would not scale if every edge case needs a few bits.

An alternative is to move the bit to cb[] and clear it for every code paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.

I think we can put the bit in sk_buff for now. We can implement the alternative when we are short of bits.


For the control block, generally it is not safe to use that across
layers. In this case, between qdisc enqueue of a given device and
ndo_start_xmit of that device, I suppose it is. Though uncommon. I
wonder if there is any precedent.

That's one of the reasons modifying qdisc_skb_cb instead of creating another struct embedding it as bpf_skb_data_end and tc_skb_cb do. This clarifies that it is qdisc's responsibility to keep these data intact.


The data will have to be stored in the skb somewhere. A simpler option
is just skb->hash? This code would use skb_get_hash, if it would
always produce a Toeplitz hash, anyway.

We still need tun_vnet_hash_report to identify hash type.

skb_get_hash() uses Siphash instead of Toeplitz, and the configuration of Toeplitz relies on tun's internal state. It is still possible to store a hash calculated with tun's own logic to skb->hash, but someone may later overwrite it with __skb_get_hash() though it's unlikely.