Re: [PATCH net-next v2] net: openvswitch: decouple flow_table from ovs_mutex
From: Adrián Moreno
Date: Wed Apr 15 2026 - 02:17:11 EST
On Mon, Apr 13, 2026 at 10:39:37AM +0200, Paolo Abeni wrote:
> On 4/7/26 2:04 PM, Adrian Moreno wrote:
> > Currently the entire ovs module is write-protected using the global
> > ovs_mutex. While this simple approach works fine for control-plane
> > operations (such as vport configurations), requiring the global mutex
> > for flow modifications can be problematic.
> >
> > During periods of high control-plane operations, e.g: netdevs (vports)
> > coming and going, RTNL can suffer contention. This contention is easily
> > transferred to the ovs_mutex as RTNL nests inside ovs_mutex. Flow
> > modifications, however, are done as part of packet processing and having
> > them wait for RTNL pressure to go away can lead to packet drops.
> >
> > This patch decouples flow_table modifications from ovs_mutex by means of
> > the following:
> >
> > 1 - Make flow_table an rcu-protected pointer inside the datapath.
> > This allows both objects to be protected independently while reducing the
> > amount of changes required in "flow_table.c".
> >
> > 2 - Create a new mutex inside the flow_table that protects it from
> > concurrent modifications.
> > Putting the mutex inside flow_table makes it easier to consume for
> > functions inside flow_table.c that do not currently take pointers to the
> > datapath.
> > Some function signatures need to be changed to accept flow_table so that
> > lockdep checks can be performed.
> >
> > 3 - Create a reference count to temporarily extend rcu protection from
> > the datapath to the flow_table.
> > In order to use the flow_table without locking ovs_mutex, the flow_table
> > pointer must be first dereferenced within an rcu-protected region.
> > Next, the table->mutex needs to be locked to protect it from
> > concurrent writes but mutexes must not be locked inside an rcu-protected
> > region, so the rcu-protected region must be left at which point the
> > datapath can be concurrently freed.
> > To extend the protection beyond the rcu region, a reference count is used.
> > One reference is held by the datapath, the other is temporarily
> > increased during flow modifications. For example:
> >
> > Datapath deletion:
> >
> > ovs_lock();
> > table = rcu_dereference_protected(dp->table, ...);
> > rcu_assign_pointer(dp->table, NULL);
> > ovs_flow_tbl_put(table);
> > ovs_unlock();
> >
> > Flow modification:
> >
> > rcu_read_lock();
> > dp = get_dp(...);
> > table = rcu_dereference(dp->table);
> > ovs_flow_tbl_get(table);
> > rcu_read_unlock();
> >
> > mutex_lock(&table->lock);
> > /* Perform modifications on the flow_table */
> > mutex_unlock(&table->lock);
> > ovs_flow_tbl_put(table);
> >
> > Signed-off-by: Adrian Moreno <amorenoz@xxxxxxxxxx>
> > ---
> > v2: Fix argument in ovs_flow_tbl_put (sparse)
> > Remove rcu checks in ovs_dp_masks_rebalance
> > ---
> > net/openvswitch/datapath.c | 285 ++++++++++++++++++++++++-----------
> > net/openvswitch/datapath.h | 2 +-
> > net/openvswitch/flow.c | 13 +-
> > net/openvswitch/flow.h | 9 +-
> > net/openvswitch/flow_table.c | 180 ++++++++++++++--------
> > net/openvswitch/flow_table.h | 51 ++++++-
> > 6 files changed, 380 insertions(+), 160 deletions(-)
>
> This is too big for a single patch. The changelog above already suggests
> a way of splitting the change. At least the RCU-ification addition
> should be straight forward in a separate patch, which in turn should be
> easily reviewable.
I agree. I'll try to split it in the next version.
>
> > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > index e209099218b4..9c234993520c 100644
> > --- a/net/openvswitch/datapath.c
> > +++ b/net/openvswitch/datapath.c
> > @@ -88,13 +88,17 @@ static void ovs_notify(struct genl_family *family,
> > * DOC: Locking:
> > *
> > * All writes e.g. Writes to device state (add/remove datapath, port, set
> > - * operations on vports, etc.), Writes to other state (flow table
> > - * modifications, set miscellaneous datapath parameters, etc.) are protected
> > - * by ovs_lock.
> > + * operations on vports, etc.) and writes to other datapath parameters
> > + * are protected by ovs_lock.
> > + *
> > + * Writes to the flow table are NOT protected by ovs_lock. Instead, a per-table
> > + * mutex and reference count are used (see comment above "struct flow_table"
> > + * definition). On some few occasions, the per-flow table mutex is nested
> > + * inside ovs_mutex.
> > *
> > * Reads are protected by RCU.
> > *
> > - * There are a few special cases (mostly stats) that have their own
> > + * There are a few other special cases (mostly stats) that have their own
> > * synchronization but they nest under all of above and don't interact with
> > * each other.
> > *
> > @@ -166,7 +170,6 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
> > {
> > struct datapath *dp = container_of(rcu, struct datapath, rcu);
> >
> > - ovs_flow_tbl_destroy(&dp->table);
> > free_percpu(dp->stats_percpu);
> > kfree(dp->ports);
> > ovs_meters_exit(dp);
> > @@ -247,6 +250,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> > struct ovs_pcpu_storage *ovs_pcpu = this_cpu_ptr(ovs_pcpu_storage);
> > const struct vport *p = OVS_CB(skb)->input_vport;
> > struct datapath *dp = p->dp;
> > + struct flow_table *table;
> > struct sw_flow *flow;
> > struct sw_flow_actions *sf_acts;
> > struct dp_stats_percpu *stats;
> > @@ -257,9 +261,16 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> > int error;
> >
> > stats = this_cpu_ptr(dp->stats_percpu);
> > + table = rcu_dereference(dp->table);
> > + if (!table) {
> > + net_dbg_ratelimited("ovs: no flow table on datapath %s\n",
> > + ovs_dp_name(dp));
> > + kfree_skb(skb);
> > + return;
> > + }
> >
> > /* Look up flow. */
> > - flow = ovs_flow_tbl_lookup_stats(&dp->table, key, skb_get_hash(skb),
> > + flow = ovs_flow_tbl_lookup_stats(table, key, skb_get_hash(skb),
> > &n_mask_hit, &n_cache_hit);
> > if (unlikely(!flow)) {
> > struct dp_upcall_info upcall;
> > @@ -752,12 +763,16 @@ static struct genl_family dp_packet_genl_family __ro_after_init = {
> > static void get_dp_stats(const struct datapath *dp, struct ovs_dp_stats *stats,
> > struct ovs_dp_megaflow_stats *mega_stats)
> > {
> > + struct flow_table *table = ovsl_dereference(dp->table);
>
> Should be rcu_dereference_ovs_tbl() ?
>
This function is only called while holding ovs_lock() (at least for now;
we might reconsider that based on your other comments below), so I kept
the stricter assertion.
> > int i;
> >
> > memset(mega_stats, 0, sizeof(*mega_stats));
> >
> > - stats->n_flows = ovs_flow_tbl_count(&dp->table);
> > - mega_stats->n_masks = ovs_flow_tbl_num_masks(&dp->table);
> > + if (table) {
> > + stats->n_flows = ovs_flow_tbl_count(table);
>
> As noted by Aaron, READ_ONCE() is now needed when accessing
> table->count. And WRITE_ONCE when writing it
>
Yep.
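Something like this untested sketch: READ_ONCE() on the lockless reader
side, and WRITE_ONCE() on the writer side (under table->lock), to avoid
load/store tearing on table->count:

	static u32 ovs_flow_tbl_count(const struct flow_table *table)
	{
		/* Lockless readers: avoid load tearing. */
		return READ_ONCE(table->count);
	}

	/* Writer side, called with table->lock held. */
	WRITE_ONCE(table->count, table->count + 1);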
> > + mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
>
> Sashiko says:
>
> ---
> get_dp_stats() accesses table->mask_array via ovs_flow_tbl_num_masks()
> while holding only ovs_mutex. Since this patch decouples flow table updates
> by moving them under table->lock, ovs_flow_cmd_new() can execute
> concurrently and trigger a reallocation of the mask array, freeing the old
> one via call_rcu().
> Because get_dp_stats() does not hold rcu_read_lock(), the thread can be
> preempted (as ovs_mutex is sleepable) and the RCU grace period might expire
> before the count is read. Can this lead to a use-after-free?
> ---
>
I guess it is possible for all of that to happen between the dereference
of table->mask_array and the read of "ma->count".
Maybe we could take rcu_read_lock() around the call to
"ovs_flow_tbl_num_masks", or use RCU protection for the entire set of
flow-table-related statistics?
I think this would allow us to demote "rcu_dereference_ovs_tbl" to only
consider the table lock and RCU, and use it to dereference "dp->table".
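Roughly (untested), the flow-table part of get_dp_stats() would then
become something like:

	/* Hold rcu_read_lock() around the flow-table stats so the mask
	 * array cannot be freed under us even if we are preempted while
	 * holding the (sleepable) ovs_mutex.
	 */
	rcu_read_lock();
	table = rcu_dereference(dp->table);
	if (table) {
		stats->n_flows = ovs_flow_tbl_count(table);
		mega_stats->n_masks = ovs_flow_tbl_num_masks(table);
	}
	rcu_read_unlock();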
> Note that it also spotted pre-existing issues, please have a look:
>
> https://sashiko.dev/#/patchset/20260407120418.356718-1-amorenoz%40redhat.com
>
Interesting. I'll look deeper into this, thanks!
> [...]
> > @@ -71,15 +93,40 @@ struct flow_table {
> >
> > extern struct kmem_cache *flow_stats_cache;
> >
> > +#ifdef CONFIG_LOCKDEP
> > +int lockdep_ovs_tbl_is_held(const struct flow_table *table);
> > +#else
> > +static inline int lockdep_ovs_tbl_is_held(const struct flow_table *table)
> > +{
> > + (void)table;
>
> You can use the __always_unused annotation.
>
Thanks, will do.
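i.e., the !CONFIG_LOCKDEP stub would become:

	static inline int
	lockdep_ovs_tbl_is_held(const struct flow_table *table __always_unused)
	{
		return 1;
	}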
> > + return 1;
> > +}
> > +#endif
> > +
> > +#define ASSERT_OVS_TBL(tbl) WARN_ON(!lockdep_ovs_tbl_is_held(tbl))
> > +
> > +/* Lock-protected update-allowed dereferences.*/
> > +#define ovs_tbl_dereference(p, tbl) \
> > + rcu_dereference_protected(p, lockdep_ovs_tbl_is_held(tbl))
> > +
> > +/* Read dereferences can be protected by either RCU, table lock or ovs_mutex. */
>
> Is this access schema really safe? I understand tables can be
> written/deleted under the table lock only. If so this should ignore the
> OVS mutex status.
>
Even if it were safe (which it might not be), it's confusing and makes
it difficult to understand the overall locking strategy.
I'll revisit all uses carefully, but I think the only use case for
holding just the ovs_mutex is reading datapath statistics (such as the
ovs_dp_stats we are discussing above). I will try to make those cases
use RCU, which might make them safer and everything more robust.
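Concretely (untested), the read-side dereference could then drop the
ovs_mutex condition and check only RCU and the table lock:

	/* Read dereferences protected by either RCU or the table lock. */
	#define rcu_dereference_ovs_tbl(p, tbl) \
		rcu_dereference_check(p, lockdep_ovs_tbl_is_held(tbl))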
Thank you!
Adrián
> /P
>