Re: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting

From: Qiuxu Zhuo
Date: Thu Mar 09 2023 - 03:37:42 EST


Hi Thomas,

Some comments on the comments.
If I'm wrong, please correct me ;-).

> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> To: LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxx>,
> x86@xxxxxxxxxx, Wangyang Guo <wangyang.guo@xxxxxxxxx>,
> Arjan van De Ven <arjan@xxxxxxxxxxxxxxx>,
> "David S. Miller" <davem@xxxxxxxxxxxxx>,
> Eric Dumazet <edumazet@xxxxxxxxxx>,
> Jakub Kicinski <kuba@xxxxxxxxxx>, Paolo Abeni <pabeni@xxxxxxxxxx>,
> netdev@xxxxxxxxxxxxxxx, Will Deacon <will@xxxxxxxxxx>,
> Peter Zijlstra <peterz@xxxxxxxxxxxxx>,
> Boqun Feng <boqun.feng@xxxxxxxxx>,
> Mark Rutland <mark.rutland@xxxxxxx>,
> Marc Zyngier <maz@xxxxxxxxxx>,
> Arjan Van De Ven <arjan.van.de.ven@xxxxxxxxx>
> Subject: [patch V2 3/4] atomics: Provide rcuref - scalable reference counting
>
> atomic_t based reference counting, including refcount_t, uses
> atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
> implemented with a atomic_try_cmpxchg() loop. High contention of the
> reference count leads to retry loops and scales badly. There is nothing to
> improve on this implementation as the semantics have to be preserved.
>
> Provide rcuref as a scalable alternative solution which is suitable for RCU
> managed objects. Similar to refcount_t it comes with overflow and underflow
> detection and mitigation.
>
> rcuref treats the underlying atomic_t as an unsigned integer and partitions
> this space into zones:
>
> 0x00000000 - 0x7FFFFFFF valid zone (1 .. INT_MAX references)

From rcuref_read()'s point of view:
0x00000000 encodes 1 reference, ..., so 0x7FFFFFFF should encode INT_MAX + 1
references (not INT_MAX).

> 0x80000000 - 0xBFFFFFFF saturation zone
> 0xC0000000 - 0xFFFFFFFE dead zone
> 0xFFFFFFFF no reference
>
> rcuref_get() unconditionally increments the reference count with
> atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
> reference count with atomic_add_negative_release().
>
> This unconditional increment avoids the inc_not_zero() problem, but
> requires a more complex implementation on the put() side when the count
> drops from 0 to -1.
>
> When this transition is detected then it is attempted to mark the reference
> count dead, by setting it to the midpoint of the dead zone with a single
> atomic_cmpxchg_release() operation. This operation can fail due to a
> concurrent rcuref_get() elevating the reference count from -1 to 0 again.
>
> If the unconditional increment in rcuref_get() hits a reference count which
> is marked dead (or saturated) it will detect it after the fact and bring
> back the reference count to the midpoint of the respective zone. The zones
> provide enough tolerance which makes it practically impossible to escape
> from a zone.

[...]

> + * Why not refcount?
> + * =================
> + *
> + * In principle it should be possible to make refcount use the rcuref
> + * scheme, but the destruction race described below cannot be prevented
> + * unless the protected object is RCU managed.
> + *
> + * Theory of operation
> + * ===================
> + *
> + * rcuref uses an unsigned integer reference counter. As long as the
> + * counter value is greater than or equal to RCUREF_ONEREF and not larger
> + * than RCUREF_MAXREF the reference is alive:
> + *
> + * ONEREF MAXREF SATURATED RELEASED DEAD NOREF
> + * 0 0x7FFFFFFF 0x8000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF
> + * <---valid --------> <-------saturation zone-------> <-----dead zone----->
> + *
> + * The get() and put() operations do unconditional increments and
> + * decrements. The result is checked after the operation. This optimizes
> + * for the fast path.
> + *
> + * If the reference count is saturated or dead, then the increments and
> + * decrements are not harmful as the reference count still stays in the
> + * respective zones and is always set back to STATURATED resp. DEAD. The
> + * zones have room for 2^28 racing operations in each direction, which
> + * makes it practically impossible to escape the zones.
> + *
> + * Once the last reference is dropped the reference count becomes
> + * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The
> + * slowpath then tries to set the reference count from RCUREF_NOREF to
> + * RCUREF_DEAD via a cmpxchg(). This opens a small window where a
> + * concurrent rcuref_get() can acquire the reference count and bring it
> + * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD.
> + *
> + * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in
> + * DEAD + 1, which is inside the dead zone. If that happens the reference
> + * count is put back to DEAD.
> + *
> + * The actual race is possible due to the unconditional increment and
> + * decrements in rcuref_get() and rcuref_put():
> + *
> + * T1 T2
> + * get() put()
> + * if (atomic_add_negative(1, &ref->refcnt))

For T2 put() here:
"if (atomic_add_negative(1, &ref->refcnt))" ->
"if (atomic_add_negative(-1, &ref->refcnt))"

> + * succeeds-> atomic_cmpxchg(&ref->refcnt, -1, DEAD);

Would it be more readable with 's/-1/NOREF/g'?

> + *
> + * atomic_add_negative(1, &ref->refcnt); <- Elevates refcount to DEAD + 1
> + *
> + * As the result of T1's add is negative, the get() goes into the slow path
> + * and observes refcnt being in the dead zone which makes the operation fail.
> + *
> + * Possible critical states:
> + *
> + * Context Counter References Operation
> + * T1 0 1 init()
> + * T2 1 2 get()
> + * T1 0 1 put()
> + * T2 -1 0 put() tries to mark dead
> + * T1 0 1 get()
> + * T2 0 1 put() mark dead fails
> + * T1 -1 0 put() tries to mark dead
> + * T1 DEAD 0 put() mark dead succeeds
> + * T2 DEAD+1 0 get() fails and puts it back to DEAD
> + *
> + * Of course there are more complex scenarios, but the above illustrates
> + * the working principle. The rest is left to the imagination of the
> + * reader.
> + *
> + * Deconstruction race
> + * ===================
> + *
> + * The release operation must be protected by prohibiting a grace period in
> + * order to prevent a possible use after free:
> + *
> + * T1 T2
> + * put() get()
> + * // ref->refcnt = ONEREF
> + * if (atomic_add_negative(-1, &ref->cnt))

For T1 put() here:
"if (atomic_add_negative(-1, &ref->cnt))" ->
"if (!atomic_add_negative(-1, &ref->cnt))"

> + * return false; <- Not taken
> + *
> + * // ref->refcnt == NOREF
> + * --> preemption
> + * // Elevates ref->c to ONEREF

s/ref->c/ref->refcnt/g

> + * if (!atomic_add_negative(1, &ref->refcnt))
> + * return true; <- taken
> + *
> + * if (put(&p->ref)) { <-- Succeeds
> + * remove_pointer(p);
> + * kfree_rcu(p, rcu);
> + * }
> + *
> + * RCU grace period ends, object is freed
> + *
> + * atomic_cmpxchg(&ref->refcnt, NONE, DEAD); <- UAF

s/NONE/NOREF/g

[...]