[Question] neighbor entry doesn't switch to the STALE state after the reachable timer expires

From: Zhang Changzhong
Date: Sat Jan 28 2023 - 22:08:48 EST


Hi,

We got the following weird neighbor cache entry on a machine that's been running for over a year:
172.16.1.18 dev bond0 lladdr 0a:0e:0f:01:12:01 ref 1 used 350521/15994171/350520 probes 4 REACHABLE

350520 seconds have elapsed since this entry was last updated, but it is still in the REACHABLE
state (base_reachable_time_ms is 30000), preventing lladdr from being updated through probe.

After some analysis, we found a scenario that may cause such a neighbor entry:

           Entry used             DELAY_PROBE_TIME expired
NUD_STALE ------------> NUD_DELAY ------------------------> NUD_PROBE
                            |
                            | DELAY_PROBE_TIME not expired
                            v
                      NUD_REACHABLE

neigh_timer_handler() uses time_before_eq() to compare 'now' with 'neigh->confirmed +
NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME)', but time_before_eq() only gives the correct result if
the two values are less than ULONG_MAX/2 ticks apart.

This means that if an entry stays in the NUD_STALE state for more than ULONG_MAX/2 ticks, it enters
the NUD_REACHABLE state directly when it is used again, and it cannot switch back to the NUD_STALE
state (the timer is armed too far in the future).
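
To illustrate the wraparound, here is a minimal user-space sketch. It is not the kernel code:
jiffies32_t, time_before_eq32() and looks_recently_confirmed() are hypothetical names that only
emulate a 32-bit 'unsigned long' and the signed-difference comparison that time_before_eq()
performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for 'unsigned long' jiffies on a 32-bit machine. */
typedef uint32_t jiffies32_t;

/* The sketch relies on the usual two's-complement conversion to int32_t. */
static_assert((int32_t)UINT32_C(0x80000000) < 0, "unexpected int32_t conversion");

/* Same idea as time_before_eq(a, b): look at the sign of the wrapped difference b - a. */
static bool time_before_eq32(jiffies32_t a, jiffies32_t b)
{
    return (int32_t)(b - a) >= 0;
}

/*
 * Models the check made in the NUD_DELAY state: returns true when the entry
 * looks as if it was confirmed within DELAY_PROBE_TIME (all values in ticks).
 */
static bool looks_recently_confirmed(jiffies32_t now, jiffies32_t confirmed,
                                     jiffies32_t delay_probe_time)
{
    return time_before_eq32(now, confirmed + delay_probe_time);
}
```

With confirmed = 1000 and a delay of 1250 ticks (5 seconds at HZ=250), now = 2251 gives false as
expected, but now = 1000 + 0x80000000 + 2000 gives true again, although the confirmation is more
than ULONG_MAX/2 ticks old.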

On 64-bit machines, ULONG_MAX/2 ticks is an extremely long time, but in my case (32-bit machine and
kernel compiled with CONFIG_HZ=250), ULONG_MAX/2 ticks is about 99.42 days, which can easily happen
in practice.
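
The 99.42-day figure can be checked with a small sketch. half_range_days() is a hypothetical
helper, not a kernel function; it approximates ULONG_MAX/2 as 2^(bits-1) ticks and converts that
to days for a given HZ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate ULONG_MAX/2 ticks, expressed in days, for an 'unsigned long'
 * of the given width (in bits) and a given tick rate (HZ).
 */
static double half_range_days(unsigned bits, unsigned hz)
{
    assert(bits >= 2 && bits <= 64 && hz > 0);

    double half_range_ticks = (double)(UINT64_C(1) << (bits - 1));
    double seconds = half_range_ticks / (double)hz;

    return seconds / 86400.0;   /* 86400 seconds per day */
}
```

half_range_days(32, 250) is about 99.42 days. Under the same assumptions, HZ=1000 shortens the
window to about 24.86 days, while with a 64-bit 'unsigned long' it is far beyond any realistic
uptime.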

Does anyone have a good idea to solve this problem? Or are there other scenarios that might cause
such a neighbor entry?

-----
Best Regards,
Changzhong Zhang