Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
From: Raghavendra K T
Date: Tue Mar 25 2025 - 02:37:14 EST
+kinseyho and yuanchu
On 3/22/2025 2:05 AM, Davidlohr Bueso wrote:
On Fri, 21 Mar 2025, Raghavendra K T wrote:
But a longer-running / larger-memory workload may make more of a difference.
I will come back with that number.
           base NUMAB=2      Patched NUMAB=0
           time in sec       time in sec
===================================================
8G:        134.33 (0.19)     119.88 ( 0.25)
16G:       292.24 (0.60)     325.06 (11.11)
32G:       585.06 (0.24)     546.15 ( 0.50)
64G:      1278.98 (0.27)    1221.41 ( 1.54)
We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.
Thanks. Since this might vary across workloads, another important metric
here is numa hit/misses statistics.
Hello David, sorry for coming back late.
Yes I did collect some of the other stats along with this (posting for
8GB only). I did not see much difference in total numa_hit. But there are
differences in numa_local etc. (not pasted here)
# grep -A2 completed abench_cxl_6.14.0-rc6-kmmscand+_8G.log \
      abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
120292376.0 us, Total thread execution time 7490922681.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6376927
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
119583939.0 us, Total thread execution time 7461705291.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6373409
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
119784117.0 us, Total thread execution time 7482710944.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6378384
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
134481344.0 us, Total thread execution time 8409840511.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6303300
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
133967260.0 us, Total thread execution time 8352886349.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6304063
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
134554911.0 us, Total thread execution time 8444951713.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6302506
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
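For convenience, the three runs per kernel above can be averaged with a small script. This is only a sketch: the numbers are copied by hand from the grep output, and the dictionary keys are labels chosen here, not names from the benchmark.

```python
from statistics import mean

# Values copied from the grep output above (three runs per kernel).
runs = {
    "kmmscand (patched)": {
        "elapsed_us": [120292376.0, 119583939.0, 119784117.0],
        "numa_hit": [6376927, 6373409, 6378384],
    },
    "cxlfix NUMAB=2 (base)": {
        "elapsed_us": [134481344.0, 133967260.0, 134554911.0],
        "numa_hit": [6303300, 6304063, 6302506],
    },
}

# Mean wall-clock time (in seconds) and mean numa_hit per kernel.
summary = {
    name: {
        "elapsed_s": mean(v["elapsed_us"]) / 1e6,
        "numa_hit": mean(v["numa_hit"]),
    }
    for name, v in runs.items()
}

for name, s in summary.items():
    print(f"{name}: {s['elapsed_s']:.2f} s, mean numa_hit {s['numa_hit']:.0f}")
```

The means come out at roughly 119.9 s and 134.3 s, in line with the 8G row of the table, and the mean numa_hit values differ by only about 1.2%.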
fyi I have also been trying this series to get some numbers as well, but
noticed overnight things went south (so no chance before LSFMM):
This issue looks to be different. Could you please let me know of any way
to reproduce it?
I tested perf bench numa mem and did not find anything.
The issue I know of currently is:
kmmscand:
  for_each_mm
    for_each_vma
      scan_vma and get accessed_folio_list
      add to migration_list() // does not check for duplicate

kmmmigrated:
  for_each_folio in migration_list
    migrate_misplaced_folio()

there is also
  cleanup_migration_list() in mm teardown
migration_list is protected by a single lock, and kmmscand is too
aggressive and can potentially flood migration_list (a practical
workload may generate fewer pages, though). That results in a non-fatal
softlockup, which will be fixed with mmslot as I noted somewhere.
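To illustrate the duplicate and single-lock concerns above, here is a minimal userspace model in Python (not kernel code). The class name MigrationList, the add()/drain() methods, the dedup set and the batch size are all made up for illustration. It does not model mmslot, which is the planned fix for the softlockup.

```python
import threading
from collections import deque


class MigrationList:
    """Toy model of a single-lock migration list with duplicate suppression."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queue = deque()
        self._queued = set()  # folio ids currently pending (hypothetical dedup index)

    def add(self, folio_id):
        """Scanner side: queue a folio unless it is already pending."""
        with self._lock:
            if folio_id in self._queued:
                return False
            self._queued.add(folio_id)
            self._queue.append(folio_id)
            return True

    def drain(self, batch=32):
        """Migrator side: take at most `batch` folios so the lock is held briefly."""
        with self._lock:
            n = min(batch, len(self._queue))
            out = [self._queue.popleft() for _ in range(n)]
            self._queued.difference_update(out)
        return out
```

A scanner that reports the same folio twice before the migrator runs then adds only one entry, and draining in bounded batches limits how long either side holds the lock.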
But the main challenge to solve in kmmscand now is that it generates:
t1-> migration_list1 (of recently accessed folios)
t2-> migration_list2
How do I get the union of migration_list1 and migration_list2 so that,
instead of migrating on first access, we can get a hotter page to
promote?
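As a rough illustration of the question, here is one way to combine per-scan lists in a userspace Python sketch (not kernel code). The function hot_folios(), the min_hits threshold and the example addresses are assumptions made up here. The sketch counts in how many scan windows each folio was seen and keeps only those seen at least min_hits times.

```python
from collections import Counter


def hot_folios(scan_lists, min_hits=2):
    """Return folios seen in at least `min_hits` of the given scan windows."""
    hits = Counter()
    for accessed in scan_lists:
        hits.update(set(accessed))  # count each folio once per window
    return {folio for folio, n in hits.items() if n >= min_hits}


migration_list1 = [0x1000, 0x2000, 0x3000]  # accessed during scan t1
migration_list2 = [0x2000, 0x3000, 0x4000]  # accessed during scan t2

hot = hot_folios([migration_list1, migration_list2])
print(sorted(hex(f) for f in hot))  # prints ['0x2000', '0x3000']
```

With min_hits=1 this degenerates to the plain union, i.e. migrating on first access.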
I had a few solutions in mind (on which I wanted to get opinions /
suggestions from experts during LSFMM):
1. Reusing DAMON VA scanning. Scanning params are controlled in KMMSCAND
(current heuristics).
2. Can we use LRU information to filter the access list (LRU active /
folio is in the (n-1) generation)?
(I do see Kinseyho just posted an LRU-based approach.)
3. Can we split the address range into 2MB chunks to monitor? PMD-level
access monitoring.
4. Any possible ways of using Bloom filters for list1, list2?
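
For option 4, a minimal userspace sketch in Python (not kernel code) of what a Bloom filter over list1 could look like. The BloomFilter class, its sizes and the hashing scheme are all assumptions for illustration. The idea is to record what scan t1 saw in a small bitmap and keep only those folios from scan t2 that the filter reports as already seen.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integer keys (e.g. folio addresses)."""

    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits = nbits  # must be a multiple of 8
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        # Derive nhashes bit positions from one digest (double hashing).
        d = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little") | 1
        return [(h1 + i * h2) % self.nbits for i in range(self.nhashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


seen_t1 = BloomFilter()
for folio in (0x1000, 0x2000, 0x3000):  # list1: accessed during scan t1
    seen_t1.add(folio)

list2 = [0x2000, 0x3000, 0x4000]  # accessed during scan t2
candidates = [folio for folio in list2 if folio in seen_t1]
```

A Bloom filter has no false negatives, so 0x2000 and 0x3000 are always kept. A folio seen only in t2 can occasionally pass as a false positive, which here would just mean an early promotion.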
- Raghu
[snip...]