[Question] neighbor entry doesn't switch to the STALE state after the reachable timer expires
From: Zhang Changzhong
Date: Sat Jan 28 2023 - 22:08:48 EST
Hi,
We got the following weird neighbor cache entry on a machine that's been running for over a year:
172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE
350520 seconds have elapsed since this entry was last updated, but it is still in the REACHABLE
state (base_reachable_time_ms is 30000), preventing lladdr from being updated through probe.
After some analysis, we found a scenario that may cause such a neighbor entry:
Entry used DELAY_PROBE_TIME expired
NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
|
| DELAY_PROBE_TIME not expired
v
NUD_REACHABLE
The neigh_timer_handler() use time_before_eq() to compare 'now' with 'neigh->confirmed +
NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but time_before_eq() only works if delta < ULONG_MAX/2.
This means that if an entry stays in the NUD_STALE state for more than ULONG_MAX/2 ticks, it enters
the NUD_RACHABLE state directly when it is used again and cannot be switched to the NUD_STALE state
(the timer is set too long).
On 64-bit machines, ULONG_MAX/2 ticks are a extremely long time, but in my case (32-bit machine and
kernel compiled with CONFIG_HZ=250), ULONG_MAX/2 ticks are about 99.42 days, which is possible in
reality.
Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
such a neighbor entry?
-----
Best Regards,
Changzhong Zhang