Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

From: Akihiko Odaki
Date: Sat Mar 29 2025 - 05:15:58 EST

Next message: Mukesh Kumar Savaliya: "Re: [PATCH v4] i2c: npcm: Add clock toggle recovery"
Previous message: Michael Ellerman: "Re: [PATCH v2] powerpc/boot: Fix build with gcc 15"
In reply to: Jason Wang: "Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2025/03/24 13:40, Jason Wang wrote:

On Fri, Mar 21, 2025 at 1:57 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/21 10:13, Jason Wang wrote:

On Thu, Mar 20, 2025 at 1:33 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/20 10:31, Jason Wang wrote:

On Wed, Mar 19, 2025 at 1:29 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/19 9:58, Jason Wang wrote:

On Tue, Mar 18, 2025 at 6:10 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/18 9:15, Jason Wang wrote:

On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/17 10:12, Jason Wang wrote:

On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/12 11:35, Jason Wang wrote:

On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/11 9:38, Jason Wang wrote:

On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

On 2025/03/10 12:55, Jason Wang wrote:

On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote:

Hash reporting
==============

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

RSS
===

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering program.

Introduce the code to perform RSS to the kernel in order to overcome
thse challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because extending the current mechanism of eBPF steering
program as is because it relies on legacy context rewriting, and
introducing kfunc-based eBPF will result in non-UAPI dependency while
the other relevant virtualization APIs such as KVM and vhost_net are
UAPIs.

Signed-off-by: Akihiko Odaki <akihiko.odaki@xxxxxxxxxx>
Tested-by: Lei Yang <leiyang@xxxxxxxxxx>
---

[...]

What's more, we've already had virito-net uAPI. Why not simply reusing them?

See the above.

RSS and hash reporting must share
this parameter when both are enabled at the same time; otherwise RSS may
compute hash values that are not suited for hash reporting.

Is this mandated by the spec? If yes, we can add a check. If not,
userspace risk themselves as a mis-configuration which we don't need
to bother.

Yes, it is mandated. 5.1.6.4.3 Hash calculation for incoming packets says:
> A device attempts to calculate a per-packet hash in the following
> cases:
>
> - The feature VIRTIO_NET_F_RSS was negotiated. The device uses the
> hash to determine the receive virtqueue to place incoming packets.
> - The feature VIRTIO_NET_F_HASH_REPORT was negotiated. The device
> reports the hash value and the hash type with the packet.
>
> If the feature VIRTIO_NET_F_RSS was negotiated:
>
> - The device uses hash_types of the virtio_net_rss_config structure
> as ’Enabled hash types’ bitmask.
> - The device uses a key as defined in hash_key_data and
hash_key_length of the virtio_net_rss_config structure (see
> 5.1.6.5.7.1).
>
> If the feature VIRTIO_NET_F_RSS was not negotiated:
>
> - The device uses hash_types of the virtio_net_hash_config structure
> as ’Enabled hash types’ bitmask.
> - The device uses a key as defined in hash_key_data and
> hash_key_length of the virtio_net_hash_config structure (see
> .1.6.5.6.4).

So when both VIRTIO_NET_F_RSS and VIRTIO_NET_F_HASH_REPORT are
negotiated, virtio_net_rss_config not only controls RSS but also the
reported hash values and types. They cannot be divergent.

Note that spec use different commands for hash_report and rss.

TUNSETVNETHASH is different from these commands in terms that it also
negotiates VIRTIO_NET_F_HASH_REPORT and VIRTIO_NET_F_RSS.

There Are different "issues" here:

1) Whether or not we need to use a unified API for negotiating RSS and
HASH_REPORT features
2) Whether or not we need to sue a unified API for setting RSS and
HASH_REPORT configuration

What I want to say is point 2. But what you raise is point 1.

For simplicity, it looks to me like it's a call for having separated
ioctls for feature negotiation (for example via TUNSETIFF). You may
argue that either RSS or HASH_REPORT requires configurations, we can
just follow what spec defines or not (e.g what happens if
RSS/HASH_REPORT were negotiated but no configurations were set).

Unfortunately TUNSETIFF does not fit in this use case. The flags set
with TUNSETIFF are fixed, but the guest can request a different feature
set anytime by resetting the device.

TUNSETIFF, enables the device to be able to handle RSS/HASREPORT.
TUNSETHASH/RSS. dealing with RSS/HASH command from userspace.

We also needs to be able to disable them at runtime so that we can
handle resets.

Via TUNSETHASH/RSS? I think it should have a way to accept parameters
that disable RSS or hash report.

That's what this patch implements. TUNSETVNETHASH accepts parameters to
choose what features to be enabled.

This is the way we used to do for multi queue and vnet header.
TUNSETIFF requires CAP_NET_ADMIN, this could be an extra safe guard
for unprivileged userspace.

I intend to allow using this feature without privilege. A VMM is usually
unprivileged and requiring a privilege to configure tuntap is too
prohibitive.

For safety, tun is not allowed to be created by unprivileged users.
And it's not to configure the tuntap dynamically, it's about telling
the function that tuntap can have (not necessarily enabled though) .

I don't think we need another barrier for the new functions. Once an
unprivileged user get a file descriptor of tuntap from a privileged
user, they are free to enable RSS and/or hash reporting.

Only if such a feature is allowed by the privileged user.

I don't see a reason not to allow the feature to unprivileged users. It
only complicates the setup.

For safety, e.g reduce the chance for unprivileged user to explore
part of the kernel codes.

It indeed reduces the attack surface, but it's fine without the
reduction I guess? It's not a feature so complicated;

I don't know how to define complicated things here but simplicity
doesn't necessarily mean safety.

I saw there were
complicated changes like namespaces and io_uring that caused controversy
when exposing them to unprivilged users, but this feature is not like
them, I suppose.

We limit feature setting through tun_set_iff in the past. Instead of
trying to argue if RSS is safe to be enabled without TUNSETIFF,
following what has been used in the past is always simpler and easier.

> >> In the virtio-net specification, it is not defined what would
happen if

these features are negotiated but the VIRTIO_NET_CTRL_MQ_RSS_CONFIG or
VIRTIO_NET_CTRL_MQ_HASH_CONFIG commands are not sent. There is no such
ambiguity with TUNSETVNETHASH.

So I don't see advantages of unifying hash reports and rss into a
single ioctl. Let's just follow what has been done in the spec that
uses separated commands. Tuntap is not a good place to debate whether
those commands could be unified or not. We need to move it to the spec
but assuming spec has been done, it might be too late or too few
advantages for having another design.

It makes sense for the spec to reuse the generic feature negotiation
mechanism, but the situation is different for tuntap; we cannot use
TUNSETIFF and need to define another. Then why don't we exploit this
opportunity to have an interface with well-defined semantics?

That's perfectly fine, but it needs to be done in virtio-net's uAPI
not tun's. What's more, if you think two commands are not
well-defined, let's fix that in the virtio spec first.

The virtio
spec does its best as an interface between the host and guest and tuntap
does its best as an UAPI.

See above, let's fix the uAPI first. We don't want DPDK to use tun's
uAPI for RSS

virtio-net's UAPI is for the virtio spec which has a capable generic
feature negotiation mechanism. tuntap needs its own feature negotiation
and it's nothing to do with virtio-net's UAPI.

Well, I don't mean the part of the feature negotiation. I mean the
part for rss and hash report configuration.

The feature negotiation still matters when deciding the granularity of
ioctls. We need one ioctl for a feature negotiation, and to avoid having
an intermediate state,

I don't understand this. For example, driver can choose to

1) negotiate RSS
2) do something else.
3) configure RSS

Spec doesn't require those two to be configured at the same time, so
"intermediate state" is allowed.

The spec doesn't define what should happen in the intermediate state either.

Yes but my point is that in the uAPI layer we don't need to care about
the intermediate state. It can just work as other features, e.g having
a default state after feature negotiation is more than enough. This is
the way we deal with other features like vnet header etc.

> >>

For a hardware implementation I think it's fine whatever the
implementation defines as the intermediate state. But for the UAPI, it's
better avoiding having such a definition to keep the interface minimal
and maximize the UAPI stability.

Well, even if you think there's an issue:

1) I don't see how we can avoid the intermediate state consider guest
have such state
2) We need to "fix" virtio spec and virito-net first, tuntap is not
the right place to workaround virtio specific issues

Let me summarize my points that support having one ioctl to negotiate
features and configuration:

The virtio spec has a generic feature negotiation mechanism and reusing
it resulted in having an intermediate state between the feature
negotiation and configuration. There is nothing wrong about that so we
don't need to "fix" the virtio spec.

Good to know that.

tuntap can also perform feature negotitaion with TUNSETIFF, but
TUNSETIFF have a few problems:

TUNSETIFF is not feature negotiation, it's about device or queue
provisioning as well as the features. From the view of the virtio, it
is used to provision the device_features. For example, qemu only calls
TUNSETIFF when it tries to open the tap fd.

> >>

1. It requires a privilege. One can argue that it reduces the attack
surface and it indeed does, but it's fine without the reduction I guess?
It's not a feature so complicated; I saw there were complicated changes
like namespaces and io_uring that caused controversy when exposing them
to unprivilged users, but this feature is not like them.

I'm not asking to invent something new, but just reuse the security
stuff that has been already used for more than a decade. It would be
always easier to relax the check instead of enforce the check which
may break uAPI. I can imagine it would not take a lot of codes to
achieve this.

2. It cannot change the enabled feature set at runtime. The virtio spec
allows changing it by resetting.

RSS is not the first feature of those requirements. TUN has
implemented various virtio specific features in the past.

So we need to design a set of new ioctls for both feature negotiation
and configuration. When doing so, eliminating the intermediate state is
a good principle to determine the optimal size of ioctls.

As discussed, having a default state after TUNSETIFF is more than
enough. That is how a multi queue/vnet header works:

1) for multiqueue, when IFF_MULTIQUEUE is set, starting with 1 queue
2) for vnet header, vnet header will be zero unless TUNSETVETHDR is called

I don't see how RSS makes anything different.

For intermediate states, with your proposal, it still requires the
userspace to assume a default state when doing TUNSETVETRSS etc.

>>

In theory, it is possible to have small ioctls that set only one scalar
value or even one bit, but that doesn't make sense. This principle helps
determine the optimal size of ioctls; it minimizes the complexity of
both the userspace and the kernel.

Well the complexity is not measured by the number of ioctls or
structures. I basically meant:

1) IF_RSS to provision the device with the RSS features, this could be
fetched from TUNGETIFF

It requires changes for libvirt, qemu-bridge-helper, and potentially other VMMs and DPDK. I would like to avoid such chores if the only reason to do so is the presence of prior examples.

The features available via TUNSETIFF is fetched with TUNGETFEATURES, not TUNGETIFF.

2) Having a default state implemented in TUN that complies with the spec

If TUNSETIFF is only for device_feature, a natural choice will be to initialize driver_feature with VIRTIO_NET_F_HASH_REPORT and VIRTIO_NET_F_RSS unset.

3) TUNSETVET/GETHASH to send and receive RSS configuration

If TUNSETIFF is only for device_feature, we need two other ioctls:
- one to set driver_feature and the RSS/hash reporting configuration
- one that tells supported hash types

The former is implemented with TUNSETVNETHASH. The latter is implemented with TUNGETVNETHASHCAP.

the ioctl should also do the configuration. Hence
that one ioctl should do all of the feature negotiation and configuration.

The structures for two commands have unused or redundant fields and a
flexible array in the middle of the structure, but they are ABIs so we
can't change it.

DPDK is another reason to define tuntap's own UAPIs. They don't care
unused or redundant fields and a flexible array in middle that are
present in the virtio spec. It will also not want to deal with the
requirement of little endian. Constructing struct virtio_net_rss_config
is an extra burden for DPDK.

I meant for vhost-user implementation in DPDK, it needs to use
virtio-net uAPI not tuntap's for example.

The vhost-user implementation will use tuntap's UAPIs for its ethernet
device backend.

That sounds pretty weird, vhost-user has nothing related to tuntap.

My expression in the last email was weird. More precisely, the ethernet
backend of tuntap will use the UAPIs, and the vhost-user will use the
ethernet backend in turn.

I don't understand what "ethernet backend" means here.

It is a driver that serves the Ethernet Device API, which is agnostic on
application and driver. The Ethernet Device API, including RSS
configuration is documented at:
https://doc.dpdk.org/api/rte__ethdev_8h.html

The Ethernet API are not bound to the virtio spec since they are not
specific to the vhost application or the tuntap driver. Hence they
operate in native endian and do not have extra fields, and tuntap's
structures are more suited to the ethernet backend than the virtio ones.

vhost-user is the device implementation not an ethernet driver. Why
did it use tuntap's uAPI and do the useless endian conversion twice?

Here the ethernet backend refers to the code that interacts with tuntap instead of vhost-user. Please note that I wrote "the vhost-user will use the ethernet backend" earlier.

It uses the generic interface of ethernet device so for
RSS it will use functions like rte_eth_dev_rss_hash_update() for
example. tuntap's UAPIs are more suited to implement these interfaces as
they operate in native endian and don't have extra fields.

Nope, for example it needs to use le for virtio_net_hdr if a modern
device is used. But it needs a "native" endian according to the guest
endian via TUNSETVNETLE/BE. We don't have a choice as virtio-net hdr
support in tuntap is much earlier than modern devices.

Let's don't do the same thing (native endian) for tuntap as RSS
depends on modern, so we know it must be le.

virtio_net_hdr is the data path while the current discussion is about
the control path. All configuration knobs of tuntap operates in the
native endian.

Because they are not directly related to virtio specification. We
don't want to duplicate virtio-net with our own version every time E.g
once RSSv2 or aRFS were implemented. Or I would even introduce a
single uAPI to transport possible cvq commands then we can avoid
inventing new ioctls that just transport cvq commands.

So I think we should stick to the little endian for the data path while
we should stick to the native endian for the control path to maximize
the consistency.

I don't see a reason to differ datapath from control path. Virtio-net
uAPI has been reused by tuntap for more than a decade.

tuntap's control path all operate in the native endian.

It's just a description of the current status, people can easily say
tuntap's data path all operate in the native endian before the support
of version 1.0.

They never used
the endian of the data path in the control path.

Once virtio uAPI can be reused, we need to do that.

It discourages the usage of virtio UAPI; when the userspace wants native endian, why will we want to force using virtio UAPI, which requires little endian?

DPDk applications other than vhost-user also matter; they do not care
what virtio does at all.

> >>

On the other hand, Constructing tuntap-specific structures is not that
complicated for VMMs.

Not complicated but redundant.

A VMM will need to inspect struct
virtio_net_rss_config anyway to handle migration and check its size so
it can store the values it inspected to struct tun_vnet_hash and struct
tun_vnet_hash_rss and pass them to the kernel.

I don't see how rss and hash reports differ from what we have now.
Those inspections must be done anyhow for compatibility for example
the check of offloading features. Such steps could not be eliminated
no matter how we design the uAPI.

I explained the difference between the virtio and tuntap UAPIs, not
between RSS and hash reporting.

See above.

The overall userspace implementation will be simpler by having
structures specifically tailored for the communication between the
userspace and kernel.

This is exactly how a good uAPI should behave. If uAPI in virtio-net
can't do this, I don't understand why uAPI in tuntap can solve it.

The UAPI in virtio-net cannot do it because it's already fixed and it
also needs to perform endian conversion for the VM use case. tuntap
doesn't have these restrictions.

Same here.

I don't think there is an advantage to split ioctls to follow the spec
after all. It makes sense if we can pass-through virtio commands to
tuntap, but it is not possible as ioctl operation codes are different
from virtio commands.

I don't see a connection with the operation code. For example, we can
add new uAPIs in virtio-net which could be something like:

struct virtio_net_rss_config_header {
__le32 hash_types;
__le16 indirection_table_mask;
__le16 unclassified_queue;
__le16 indirection_table[];
}

struct virtio_net_rss_config_tailer {
__le16 max_tx_vq;
u8 hash_key_length;
u8 hash_key_data[];
}

These two are used by TUNSETVNETRSS. And simply reuse the
virtio_net_hash_config for TUNSETVETHASH.

> > With this, we can tweak the virtio-net driver with this new uAPI. Then

tap* can reuse this.

I implemented a UAPI and driver change accordingly:
https://lore.kernel.org/r/20250318-virtio-v1-0-344caf336ddd@xxxxxxxxxx

This is a nice improvement for the driver, but I still don't think it is
suited for the UAPI of tuntap.

Any reason for this? It should work like virtio_net_hdr.

The requirements of extra fields and
little endian cannot be removed from the virtio spec but they are
irrelevant for tuntap.

I don't understand this part. What fields are "extra" and need to be
removed from the spec?

All fields not included in struct tun_vnet_hash and struct
tun_vnet_hash_rss. Namely, for struct virtio_net_hash_config:
- reserved
- hash_key_length
- hash_key_data

For struct virtio_net_rss_config:
- max_tx_vq
- hash_key_length

See my above reply, and I basically meant

TUNSETVETHASH accept struct virtio_net_hash_config;
TUNSETVETRSS accept struct virtio_net_rss_config_hdr + struct
virtio_net_rss_config_trailer;

That still bring the extra fields I noted in the last email.

I don't know how to define "extra" here. Let's summarize here:

Method A:

1) virtio specification use separate commands for has_report and rss
2) hash_port ans rss doesn't depend on each other
3) reuse virtio-net uAPI

Method B:

1) trying to define and remove the "extra" fields in tuntap, and
redefine it in TUNTAP

It would always be much easier to start from simply reusing the
virtio-net uAPI. Method B makes both the implementation and reviewing
harder, as we need to

1) revisit the design of the virtio spec, this needs to be done in the
virtio community not here
2) audit the difference between virtio spec and TUN/TAP, that's why we
have a very long discussion here

For example, the root cause of why you think the max_tx_vq is "extra" is:

1) The spec defines VIRTIO_NET_F_RSS and VIRTIO_NET_F_MQ as independent features
2) Your code tries to re-use IFF_MULTI_QUEUE for both VIRTIO_NET_F_RSS
and VIRTIO_NET_F_MQ, this would have a lot of implications, e.g
automatic steering might be applied when only RSS is negotiated etc

The correct way to implement this is:

1) Introduce IFF_RSS and only set it during TUNSETIFF when device only
offers RSS

Please see the summary of "my points that support having one ioctl to
negotiate features and configuration" I wrote the above.

2) reuse virtio-net uAPI and accept max_tx_vq and use that to change
the queue(or queue paris) if necessary

I don't think it's possible; we need file descriptors associated with
queues, which is something you cannot express with the virtio-net
structures.

So:

1) Provisioning queues were still done via TUNSETIFF
2) We just need to hook max_tx_vq (the helpers were already there) to
the helpers to enable and disable a queue instead of depending on the
TUNSETQUEUE

1) is sufficient and 2) is redundant. We cannot provision queues according to max_tx_vq so the only way to remove this redundancy is not to have the field in tuntap's UAPI in the first place.

Regards,
Akihiko Odaki

Regards,
Akihiko Odaki

Thanks

Next message: Mukesh Kumar Savaliya: "Re: [PATCH v4] i2c: npcm: Add clock toggle recovery"
Previous message: Michael Ellerman: "Re: [PATCH v2] powerpc/boot: Fix build with gcc 15"
In reply to: Jason Wang: "Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]