Re: [PATCH net] net/ipv6: Fix the RT cache flush via sysctl using a previous delay

From: Petr Pavlu
Date: Fri May 31 2024 - 04:53:30 EST


[Added back netdev@xxxxxxxxxxxxxxx and linux-kernel@xxxxxxxxxxxxxxx,
which seem to have been dropped by accident.]

On 5/30/24 17:59, Kuifeng Lee wrote:
> On Wed, May 29, 2024 at 6:53 AM Petr Pavlu <petr.pavlu@xxxxxxxx> wrote:
>>
>> The net.ipv6.route.flush system parameter takes a value which specifies
>> a delay used during the flush operation for aging exception routes. The
>> written value is however not used in the currently requested flush and
>> instead utilized only in the next one.
>>
>> A problem is that ipv6_sysctl_rtcache_flush() first reads the old value
>> of net->ipv6.sysctl.flush_delay into a local delay variable and then
>> calls proc_dointvec() which actually updates the sysctl based on the
>> provided input.
>
> If the problem we are trying to fix is using the old value, should we move
> the line reading the value to a place after updating it instead of a
> local copy of
> the whole ctl_table?

Just moving the read of net->ipv6.sysctl.flush_delay after the
proc_dointvec() call was actually my initial implementation. I then
opted for the proposed version because it also saves the memory
otherwise used to store net->ipv6.sysctl.flush_delay, a value that is
never needed once the flush operation has run.

Another minor aspect is that these sysctl writes are not serialized. Two
invocations of ipv6_sysctl_rtcache_flush() could in theory run at the
same time. Both could then execute proc_dointvec() first, and whichever
of them stores last determines the value left in
net->ipv6.sysctl.flush_delay. Both runs then return to
ipv6_sysctl_rtcache_flush(), read the stored value and execute
fib6_run_gc(). One of them therefore calls this function with a value
different from the one it was actually given on input. With a purely
local variable, each write is independent and fib6_run_gc() is executed
with the right input delay.
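
To make the interleaving above concrete, here is a minimal userspace
sketch. It is not kernel code: shared_flush_delay stands in for
net->ipv6.sysctl.flush_delay, and the helper names (store_shared(),
load_shared(), parse_local(), simulate_*()) are hypothetical. The
unlucky ordering is replayed sequentially instead of with real threads,
so the outcome is deterministic.

```c
/* Hypothetical stand-in for net->ipv6.sysctl.flush_delay. */
static int shared_flush_delay;

/* Shared variant, step 1: store the parsed input (proc_dointvec() role). */
static void store_shared(int input)
{
	shared_flush_delay = input;
}

/* Shared variant, step 2: read the stored value back as the GC delay. */
static int load_shared(void)
{
	return shared_flush_delay;
}

/* Local variant: the parsed value never leaves the caller's stack. */
static int parse_local(int input)
{
	int delay = input;

	return delay;
}

/* Replay the bad ordering: both writers store before either one reads. */
void simulate_shared(int in_a, int in_b, int *used_a, int *used_b)
{
	store_shared(in_a);		/* writer A stores */
	store_shared(in_b);		/* writer B stores, last store wins */
	*used_a = load_shared();	/* A reads B's value */
	*used_b = load_shared();
}

/* Same two writers, but each one keeps its own copy of the input. */
void simulate_local(int in_a, int in_b, int *used_a, int *used_b)
{
	*used_a = parse_local(in_a);
	*used_b = parse_local(in_b);
}
```

With inputs 1 and 5, simulate_shared() hands 5 to both writers, while
simulate_local() hands each writer the value it was given.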

The cost of making a copy of ctl_table is a few instructions and this
isn't on any hot path. The same pattern is used, for example, in
net/ipv6/addrconf.c, function addrconf_sysctl_forward().
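
For readers unfamiliar with the pattern, the following is a simplified
userspace sketch of it. struct mini_ctl_table, mini_dointvec() and
mini_flush_handler() are hypothetical stand-ins for struct ctl_table,
proc_dointvec() and the sysctl handler; they only assume that the
parser stores its result through the table's data pointer.

```c
#include <stdlib.h>

/* Hypothetical, much reduced stand-in for the kernel's struct ctl_table. */
struct mini_ctl_table {
	const char *procname;
	void *data;	/* where the parsed value is stored */
	void *extra1;	/* handler-private pointer */
};

/* Stand-in for proc_dointvec() on write: parse an int into table->data. */
static int mini_dointvec(const struct mini_ctl_table *table, const char *buffer)
{
	char *end;
	long val = strtol(buffer, &end, 10);

	if (end == buffer)
		return -1;	/* no digits consumed */
	*(int *)table->data = (int)val;
	return 0;
}

/*
 * Handler using the local-copy pattern: the registered table is left
 * untouched and only the stack copy points at the local delay variable.
 */
int mini_flush_handler(const struct mini_ctl_table *ctl, const char *buffer,
		       int *delay_used)
{
	struct mini_ctl_table lctl = *ctl;	/* struct copy, a few instructions */
	int delay;
	int ret;

	lctl.data = &delay;
	ret = mini_dointvec(&lctl, buffer);
	if (ret)
		return ret;

	*delay_used = delay;	/* the real handler would pass this to the GC */
	return 0;
}
```

The registered table needs no .data pointer at all in this scheme,
which is why the patch can drop the flush_delay field.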

So overall, the proposed version looked marginally better to me than
just moving the read of net->ipv6.sysctl.flush_delay later in
ipv6_sysctl_rtcache_flush().

Thanks,
Petr

>
>>
>> Fix the problem by removing net->ipv6.sysctl.flush_delay because the
>> value is never actually used after the flush operation and instead use
>> a temporary ctl_table in ipv6_sysctl_rtcache_flush() pointing directly
>> to the local delay variable.
>>
>> Fixes: 4990509f19e8 ("[NETNS][IPV6]: Make sysctls route per namespace.")
>> Signed-off-by: Petr Pavlu <petr.pavlu@xxxxxxxx>
>> ---
>>
>> Note that when testing this fix, I noticed that an aging exception route
>> (created via ICMP redirect) was not getting removed when triggering the
>> flush operation unless the associated fib6_info was an expiring route.
>> It looks like the logic introduced in 5eb902b8e719 ("net/ipv6: Remove expired
>> routes with a separated list of routes.") otherwise missed registering
>> the fib6_info with the GC. That is potentially a separate issue, just
>> adding it here in case someone decides to test this patch and possibly
>> run into this problem too.
>>
>> include/net/netns/ipv6.h | 1 -
>> net/ipv6/route.c | 13 ++++++-------
>> 2 files changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
>> index 5f2cfd84570a..2ed7659013a4 100644
>> --- a/include/net/netns/ipv6.h
>> +++ b/include/net/netns/ipv6.h
>> @@ -20,7 +20,6 @@ struct netns_sysctl_ipv6 {
>> struct ctl_table_header *frags_hdr;
>> struct ctl_table_header *xfrm6_hdr;
>> #endif
>> - int flush_delay;
>> int ip6_rt_max_size;
>> int ip6_rt_gc_min_interval;
>> int ip6_rt_gc_timeout;
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index bbc2a0dd9314..f07f050003c3 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -6335,15 +6335,17 @@ static int rt6_stats_seq_show(struct seq_file *seq, void *v)
>> static int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int write,
>> void *buffer, size_t *lenp, loff_t *ppos)
>> {
>> - struct net *net;
>> + struct net *net = ctl->extra1;
>> + struct ctl_table lctl;
>> int delay;
>> int ret;
>> +
>> if (!write)
>> return -EINVAL;
>>
>> - net = (struct net *)ctl->extra1;
>> - delay = net->ipv6.sysctl.flush_delay;
>> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>> + lctl = *ctl;
>> + lctl.data = &delay;
>> + ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
>> if (ret)
>> return ret;
>>
>> @@ -6368,7 +6370,6 @@ static struct ctl_table ipv6_route_table_template[] = {
>> },
>> {
>> .procname = "flush",
>> - .data = &init_net.ipv6.sysctl.flush_delay,
>> .maxlen = sizeof(int),
>> .mode = 0200,
>> .proc_handler = ipv6_sysctl_rtcache_flush
>> @@ -6444,7 +6445,6 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
>> if (table) {
>> table[0].data = &net->ipv6.sysctl.ip6_rt_max_size;
>> table[1].data = &net->ipv6.ip6_dst_ops.gc_thresh;
>> - table[2].data = &net->ipv6.sysctl.flush_delay;
>> table[2].extra1 = net;
>> table[3].data = &net->ipv6.sysctl.ip6_rt_gc_min_interval;
>> table[4].data = &net->ipv6.sysctl.ip6_rt_gc_timeout;
>> @@ -6521,7 +6521,6 @@ static int __net_init ip6_route_net_init(struct net *net)
>> #endif
>> #endif
>>
>> - net->ipv6.sysctl.flush_delay = 0;
>> net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
>> net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
>> net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
>>
>> base-commit: 2bfcfd584ff5ccc8bb7acde19b42570414bf880b
>> --
>> 2.35.3
>>
>>