Re: [PATCH v2] mm/damon: replace damon_rand() with a per-ctx lockless PRNG
From: Jiayuan Chen
Date: Wed May 06 2026 - 19:42:04 EST
Hello SJ,
On 5/7/26 12:37 AM, SeongJae Park wrote:
Hello Jiayuan,
Thank you for posting this second version!
On Tue, 5 May 2026 22:52:06 +0800 Jiayuan Chen <jiayuan.chen@xxxxxxxxx> wrote:
From: Jiayuan Chen <jiayuan.chen@xxxxxxxxxx>
damon_rand() on the sampling_addr hot path called
get_random_u32_below(), which takes a local_lock_irqsave() around a
per-CPU batched entropy pool and periodically refills it with
ChaCha20. At elevated nr_regions counts (20k+), the lock_acquire /
local_lock pair plus __get_random_u32_below() dominate kdamond perf
profiles.
Replace the helper with a lockless lfsr113 generator (struct rnd_state)
held per damon_ctx and seeded from get_random_u64() in damon_new_ctx().
kdamond is the single consumer of a given ctx, so no synchronization
is required. Range mapping uses Lemire's (u64)rnd * span >> 32 on the
fast path;
Could we add some links to the algorithm? I found a blog [1] via Google and
am guessing that is what you're saying, but I'm not really sure since I
failed at finding time to thoroughly read it.
The link covers a related family of techniques, but I don't
think we need an external reference. The pattern is just the
multiply-shift fast path that the kernel already uses in
__get_random_u32_below() (drivers/char/random.c) -- (u64)rnd *
span >> 32 -- which the existing comment there calls "traditional
reciprocal multiplication". Of course I can add a simple comment like
"traditional reciprocal multiplication, similar to get_random_u32_below()".
for spans larger than U32_MAX (only reachable on 64-bit) the
slow path combines two u32 outputs and uses mul_u64_u64_shr() at 64-bit
width. On 32-bit the slow path is dead code and gets eliminated by
the compiler.
The new helper takes a ctx parameter; damon_split_regions_of() and
the kunit tests that call it directly are updated accordingly.
lfsr113 is a linear PRNG and MUST NOT be used for anything
security-sensitive. DAMON's sampling_addr is not exposed to userspace
and is only consumed as a probe point for PTE accessed-bit sampling,
so a non-cryptographic PRNG is appropriate here.
Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU
usage reduced from ~72% to ~50% of one core.
Link: https://lore.kernel.org/damon/20260426173346.86238-1-sj@xxxxxxxxxx/T/#m4f1fd74112728f83a41511e394e8c3fef703039c
Why are you adding the above link? It would be good to add description of the
link on the commit message.
Other than that, commit message looks good to me.
Will move it to the patch changelog (after ---) instead of the
commit message.
Cc: Jiayuan Chen <jiayuan.chen@xxxxxxxxx>
Signed-off-by: Jiayuan Chen <jiayuan.chen@xxxxxxxxxx>
---
 include/linux/damon.h       | 27 +++++++++++++++++++++------
 mm/damon/core.c             | 12 ++++++++----
 mm/damon/paddr.c            |  8 ++++----
 mm/damon/tests/core-kunit.h | 28 ++++++++++++++++++++++------
 mm/damon/vaddr.c            |  7 ++++---
 5 files changed, 59 insertions(+), 23 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index f2cdb7c3f5e6..e16012a7f41a 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -8,8 +8,10 @@
 #ifndef _DAMON_H_
 #define _DAMON_H_
+#include <linux/math64.h>
 #include <linux/memcontrol.h>
 #include <linux/mutex.h>
+#include <linux/prandom.h>
 #include <linux/time64.h>
 #include <linux/types.h>
 #include <linux/random.h>
Maybe we can remove random.h?
Right!
@@ -19,12 +21,6 @@
 /* Max priority score for DAMON-based operation schemes */
 #define DAMOS_MAX_SCORE	(99)
-/* Get a random number in [l, r) */
-static inline unsigned long damon_rand(unsigned long l, unsigned long r)
-{
-	return l + get_random_u32_below(r - l);
-}
-
 /**
  * struct damon_addr_range - Represents an address region of [@start, @end).
  * @start:	Start address of the region (inclusive).
@@ -843,8 +839,27 @@ struct damon_ctx {
 	struct list_head adaptive_targets;
 	struct list_head schemes;
+
+	/* Per-ctx PRNG state for damon_rand(); kdamond is the sole consumer. */
+	struct rnd_state rnd_state;
 };
+/* Get a random number in [@l, @r) using @ctx's lockless PRNG. */
+static inline unsigned long damon_rand(struct damon_ctx *ctx,
+				       unsigned long l, unsigned long r)
+{
+	unsigned long span = r - l;
+	u64 rnd;
+
+	if (span <= U32_MAX) {
+		rnd = prandom_u32_state(&ctx->rnd_state);
+		return l + (unsigned long)((rnd * span) >> 32);
+	}
+	rnd = ((u64)prandom_u32_state(&ctx->rnd_state) << 32) |
+	      prandom_u32_state(&ctx->rnd_state);
+	return l + mul_u64_u64_shr(rnd, span, 64);
+}
+
I was unable to find time to thoroughly read the link I found, or understand
the algorithm with only the code, so tested by myself like below.
'''
$ cat ./foo.c
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char *argv[])
{
unsigned long span;
int i;
if (argc != 2) {
printf("Usage: %s <span>\n", argv[0]);
return -1;
}
span = atoi(argv[1]);
for (i = 0; i < 1000; i++) {
uint64_t rnd = rand();
printf("%lu\n", (uint64_t)((rnd * span) >> 32));
}
return 0;
}
$ gcc ./foo.c
$ ./a.out 10 | sort -n | uniq --count
195 0
194 1
188 2
223 3
200 4
'''
So I expected the program to generate numbers 0-9 randomly in similar
proportions, but it generates only numbers 0-4. I confirmed that doubling
the 'span' value makes it work the way I expected. Should we do that here, too?
I wrote a kmod to verify. The helper is a verbatim copy of
damon_rand() from the patch:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/math64.h>
#include <linux/prandom.h>
#include <linux/random.h>

#define SPAN		10UL
#define ITERATIONS	1000UL

static inline unsigned long damon_rand_test(struct rnd_state *state,
unsigned long l, unsigned long r)
{
unsigned long span = r - l;
u64 rnd;
if (span <= U32_MAX) {
rnd = prandom_u32_state(state);
return l + (unsigned long)((rnd * span) >> 32);
}
rnd = ((u64)prandom_u32_state(state) << 32) |
prandom_u32_state(state);
return l + mul_u64_u64_shr(rnd, span, 64);
}
static int __init damon_rand_test_init(void)
{
struct rnd_state state;
unsigned long counts[SPAN] = {0};
unsigned long i, v;
int j;
prandom_seed_state(&state, get_random_u64());
for (i = 0; i < ITERATIONS; i++) {
v = damon_rand_test(&state, 0, SPAN);
if (v >= SPAN) {
pr_err("out of range: %lu\n", v);
return -EINVAL;
}
counts[v]++;
}
pr_info("damon_rand_test: span=%lu iterations=%lu\n",
SPAN, ITERATIONS);
for (j = 0; j < SPAN; j++)
pr_info("damon_rand_test: %d -> %lu\n", j, counts[j]);
return 0;
}
module_init(damon_rand_test_init);
MODULE_LICENSE("GPL");
dmesg output (span=10, 1000 iterations):
damon_rand_test: span=10 iterations=1000
damon_rand_test: 0 -> 99
damon_rand_test: 1 -> 107
damon_rand_test: 2 -> 79
damon_rand_test: 3 -> 94
damon_rand_test: 4 -> 97
damon_rand_test: 5 -> 97
damon_rand_test: 6 -> 94
damon_rand_test: 7 -> 100
damon_rand_test: 8 -> 114
damon_rand_test: 9 -> 119
All 10 buckets get hits, roughly uniform.
Looking into the userspace test program, the issue is that libc rand()
returns a value in [0, RAND_MAX]. On glibc RAND_MAX is
2^31 - 1 (not 2^32 - 1), so the upper bit is missing and
the multiply-shift only fills the lower half of the output
range. prandom_u32_state() returns a full u32 in [0, 2^32),
so the in-kernel formula behaves correctly without doubling.
[...]
Other than above, code changes look good to me.
[1] https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/
Thanks,
SJ