Re: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports

From: Akinobu Mita

Date: Fri May 29 2026 - 23:02:27 EST

Hello Ravi and SeongJae,

On Sat, May 30, 2026 at 9:05 SeongJae Park <sj@xxxxxxxxxx> wrote:
>
> On Fri, 29 May 2026 09:56:34 -0700 Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxx> wrote:
>
> > This series introduces a vendor and PMU-agnostic substrate inside DAMON
> > that consumes hardware-sampled access reports through the standard
> > perf-event interface. Userspace selects the PMU through sysfs (raw
> > type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
> > IBS Op sampling.
> >
> > Why a unified perf-event substrate
> >
> > An earlier hardware-sampled access-monitoring proposal [1] took an AMD
> > IBS-specific module backend path, owning its own probe configuration,
> > sysfs knobs, and lifecycle.
> >
> > SeongJae Park has previously highlighted the advantage of Akinobu
> > Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
> > events and consume samples from any sampling PMU that perf core knows
> > about. This series builds on that direction
>
> Ah great, so we have no unclear challenge (additional loadable module support
> and conflicts with other IBS modules) on our road for now! That is, we can
> reuse the stable perf event interface and achieve all our goals! As I
> previously shared [1], it would take time, but I'm very optimistic about the
> success of this project. I don't like promising too much, but this project
> looks like something that we can "consider it done".
>
> We can also say that the current candidate of the first
> damon_report_access()-based data attributes monitoring (milestone 2 [1] final
> deliverable) is the perf event based monitoring.

That's good!

From a quick look, it seems to have all the features I need, so I'd like to
base my evaluation on Ravi's patches. If the extensions I need require changes
to them, I will send those back as feedback.

Ravi,
You can add my Co-developed-by and Signed-off-by tags to the appropriate
patches, so please go ahead and post the series to the mailing list.

I am currently working on a change to allow selecting perf events from the damo
tool by specifying the event name, similar to the perf record -e option (e.g.,
"cpu/mem-loads,ldlat=30,freq=5000/P" or "cpu/mem-stores,freq=5000/P").
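
To illustrate the kind of string the tool has to accept, here is a minimal
sketch that only splits such a specification into its PMU name, terms, and
modifiers. It is not damo's code: parse_event_spec is a hypothetical helper,
and resolving an alias such as mem-loads into actual type/config values is
assumed to be left to perf itself.

```python
import re


def _to_value(text):
    """Return an int for numeric text (decimal or 0x-prefixed), else the text."""
    try:
        return int(text, 0)
    except ValueError:
        return text


def parse_event_spec(spec):
    """Split 'pmu/term,key=value,.../modifiers' into its three parts.

    Hypothetical helper for illustration only; it does no PMU lookup.
    """
    m = re.fullmatch(r"([^/]+)/([^/]*)/([A-Za-z]*)", spec)
    if m is None:
        raise ValueError(f"unsupported event spec: {spec!r}")
    pmu, body, modifiers = m.groups()
    terms = {}
    for term in filter(None, body.split(",")):
        key, sep, value = term.partition("=")
        # A bare term such as 'mem-loads' is an event alias; record it as True.
        terms[key] = _to_value(value) if sep else True
    return {"pmu": pmu, "terms": terms, "modifiers": modifiers}


if __name__ == "__main__":
    print(parse_event_spec("cpu/mem-loads,ldlat=30,freq=5000/P"))
```

Keeping the split separate from the lookup means the tool can reject malformed
input early and hand only well-formed strings to perf for resolution.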

I'll share my progress once it is in reasonable shape. It needs a small change
to perf's Python binding (tools/perf/util/python.c), shown in the attached
patch, but I believe it can be handled without changing Ravi's current patch
set.
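
For reference, a minimal sketch of how a tool could turn already-resolved
attribute values into writes against the per-event sysfs files used in Ravi's
setup quoted below. perf_event_writes is a hypothetical helper, the file names
(type, config, sample_phys_addr, freq, sample_freq, sample_period) are taken
from that quoted setup and may change, and the function only builds the list of
writes without touching sysfs.

```python
def perf_event_writes(pe_dir, index, type_, config,
                      freq=None, period=None, phys_addr=False):
    """Return (path, value) pairs for one perf event entry.

    Hypothetical helper: pe_dir is the perf_events directory, e.g.
    .../monitoring_attrs/sample/perf_events in the quoted setup.
    """
    if (freq is None) == (period is None):
        raise ValueError("give exactly one of freq or period")
    ev = f"{pe_dir}/{index}"
    writes = [
        (f"{ev}/type", str(type_)),
        (f"{ev}/config", hex(config)),
        (f"{ev}/sample_phys_addr", "1" if phys_addr else "0"),
    ]
    if freq is not None:
        # Frequency-based sampling, as in the quoted PEBS configuration.
        writes += [(f"{ev}/freq", "1"), (f"{ev}/sample_freq", str(freq))]
    else:
        # Period-based sampling, as in the quoted IBS Op configuration.
        writes += [(f"{ev}/freq", "0"), (f"{ev}/sample_period", str(period))]
    return writes


if __name__ == "__main__":
    for path, value in perf_event_writes("perf_events", 0, 4, 0x20d1, freq=5003):
        print(f"echo {value} > {path}")
```

Returning the writes instead of performing them keeps the sketch easy to check
by eye against the shell commands in the cover letter.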

> > with the changes we
> > needed to run it cross-vendor:
> >
> > - a per-CPU lockless ring between the NMI sample handler and the
> > kdamond drain,
> > - per-CPU events that follow CPU hotplug cleanly,
> > - events fire only while the monitor is running -- created disabled,
> > armed when kdamond starts, disarmed and drained when it stops,
> > - all-or-nothing init across CPUs: a partial-CPU create failure rolls
> > the whole event back rather than leaving silent gaps,
> > - safe handling of vendor sample-validity flags so a stale or
> > unpopulated address is never mistaken for a valid sample.
> >
> > What the series adds
> >
> > Patch 1 introduces the substrate's data types: a per-event
> > configuration struct and a per-context list to hang them on. A
> > CONFIG_PERF_EVENTS=n build folds to no-op stubs.
> >
> > Patch 2 exposes those types through sysfs. Each entry maps to one
> > perf event and lets userspace pick the PMU and how to sample it: the
> > raw PMU type/config, addressing flags, and period or frequency. The
> > defaults are tuned for Intel PEBS; userspace overrides them for other
> > PMUs.
> >
> > Patch 3 wires the sysfs apply path so configured events get attached
> > to the running monitoring context.
> >
> > Patch 4 is the core of the series. It replaces the mutex-protected
> > report queue with a per-CPU lockless ring fed from NMI by the perf
> > overflow handler and drained once per sample tick by the kdamond.
> > Drained reports are matched to monitored regions by binary search
> > over a per-tick snapshot. The patch also wires the per-event
> > lifecycle into kdamond: events arm when the monitor starts, disarm
> > and drain when it stops, roll back cleanly when per-CPU init fails on
> > some CPUs, and a second context that asks for the substrate while
> > it is in use is rejected with -EBUSY.
> >
> > Patch 5 is the perf-event backend. Two stateless overflow handlers
> > (one vaddr-keyed, one paddr-keyed) are picked at event creation time
> > and submit samples into the per-CPU ring. Vendor-specific sample
> > validity is honored at this layer.
> >
> > Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
> > evaluation so userspace can watch goal convergence without polling
> > sysfs.
> >
> > Userspace setup model
> >
> > Userspace selects the sampling PMU by pointing the perf event's
> > `type` / `config` at it, and chooses the scheme topology that suits
> > the address space the PMU reports on. No module load or unload step
> > is involved; `echo on > state` arms the substrate, `echo off > state`
> > disarms it.
> >
> > Two configurations were used for validation.
> >
> > Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
> >
> > IBS Op stamps samples with physical addresses, so DAMON reasons over
> > every backing page in the system regardless of which task or guest
> > touched it -- the substrate becomes a system-wide tiering controller.
> >
> > Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
> >
> > echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > echo 1 > $D/contexts/nr_contexts
> > echo paddr > $D/contexts/0/operations
> >
> > # Two regions, one per NUMA node (DRAM + CXL). PA ranges
> > # are derived per host from /proc/iomem; omitted here.
> > echo 1 > $D/contexts/0/targets/nr_targets
> > echo 2 > $D/contexts/0/targets/0/regions/nr_regions
> > echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
> > echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
> > echo <CXL_LO> > $D/contexts/0/targets/0/regions/1/start
> > echo <CXL_HI> > $D/contexts/0/targets/0/regions/1/end
> >
> > # IBS Op event, period-based, paddr-stamped:
> > PE=$D/contexts/0/monitoring_attrs/sample/perf_events
> > echo 1 > $PE/nr_perf_events
> > echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
> > echo 0 > $PE/0/config
> > echo 1 > $PE/0/sample_phys_addr
> > echo 0 > $PE/0/freq
> > echo 262144 > $PE/0/sample_period
> > echo 0 > $PE/0/exclude_kernel
> > echo 0 > $PE/0/exclude_hv
>
> FYI, and as you may already know, the current plan [1] is to use the attributes
> probe interface. With it, the above IBS Op event setup part would look like,
>
> mon_attr=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
> echo 1 > $mon_attr/probes/nr_probes
> probe=$mon_attr/probes/0
> echo 1 > $probe/filters/nr_filters
> filter=$probe/filters/0
> echo perf_event > $filter/type
> echo ibs_op > $filter/perf_event_type
> echo Y > $filter/allow
>
> Of course, more details could change later.
>
> >
> > # PULL scheme: migrate_hot toward DRAM, gated on
> > # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
> > # addr filter restricts source to the CXL range.
> > # PUSH scheme: migrate_hot toward CXL, gated on
> > # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
> > # addr filter restricts source to the DRAM range.
> > # Both schemes are migrate_hot; they converge from opposite
> > # directions on the same hot working set.
> >
> > echo on > $D/state
> >
> > Userspace tunes the steady-state DRAM:CXL split by writing the goal
> > `target_value`s; DAMON's quota autotuner drives migration intensity
> > to match.
> >
> > Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
> > multichase multiload threads each touching a 4 GiB working set
> > (~128 GiB aggregate) with the memcpy-libc kernel. The guest sees
> > a flat single-NUMA layout and has no direct view of the host's
> > tiering topology, yet its hot pages are migrated to DRAM and cold
> > pages pushed to CXL by host-side DAMON acting on IBS-stamped
> > physical addresses -- the application inside the guest benefits
> > from tiering it never had to be aware of. Validated on AMD Turin
> > (132-CPU EPYC). The configuration converged to its target ratio
> > in seconds and remained stable for 7+ hours continuously, with no
> > perf core auto-throttle and no measurable drift in the achieved
> > interleave ratio.
> >
> > Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
> >
> > PEBS reports vaddr samples in the context of the running task.
> > DAMON's vaddr ops monitors a specific PID.
> >
> > Setup (abridged):
> >
> > echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > echo 1 > $D/contexts/nr_contexts
> > echo vaddr > $D/contexts/0/operations
> >
> > echo 1 > $D/contexts/0/targets/nr_targets
> > echo $PID > $D/contexts/0/targets/0/pid_target
> > echo 0 > $D/contexts/0/targets/0/regions/nr_regions
> >
> > # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
> > echo 1 > $PE/nr_perf_events
> > echo 4 > $PE/0/type # PERF_TYPE_RAW
> > echo 0x20d1 > $PE/0/config # umask=0x20 event=0xd1
> > echo 0 > $PE/0/sample_phys_addr
> > echo 1 > $PE/0/freq
> > echo 5003 > $PE/0/sample_freq
> > echo 2 > $PE/0/precise_ip
> > echo 1 > $PE/0/wakeup_events
> >
> > # Single migrate_hot scheme with two weighted destinations
> > # (DRAM + CXL). Userspace tunes the steady-state interleave by
> > # writing dests/{0,1}/weight.
> >
> > echo on > $D/state
> >
> > Workload: 32 multichase multiload threads with a 4 GiB working set
> > each (~128 GiB aggregate) running directly on the host, monitored
> > by DAMON via the multiload PID. Validated on Intel Granite Rapids
> > (144-CPU). Convergence is fast and the system is stable.
>
> Thank you so much for sharing the great prototype implementation and test
> results!
>
> I will try to make fast progress on milestone 1. I will hold reviewing details
> of this series for now, as there could be more changes. But in the high level,
> this looks promising.
>
> >
> > [1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@xxxxxxxxx/
> > [2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@xxxxxxxxx
>
> [1] https://lore.kernel.org/20260525225208.1179-1-sj@xxxxxxxxxx/
>
>
> Thanks,
> SJ
>
> [...]
From b11bb44d228d51272c6138559cb772297ba471ac Mon Sep 17 00:00:00 2001
From: Akinobu Mita <mita@xxxxxxxxxxxx>
Date: Wed, 27 May 2026 19:00:41 +0900
Subject: [PATCH 10/10] perf python: Add member access to config1 and config2
 of evsel

This change is necessary to specify the same PMU event selection as the
'perf record' -e option from DAMON's userspace tools.

For example, a Python script like the following will allow you to obtain
the values to be set in the type, config, config1, and config2 members of
perf_event_attr by providing a symbolic event name.

    import perf

    if __name__ == '__main__':
        evlist = perf.parse_events("cpu/mem-loads,ldlat=30/P")
        for evsel in evlist:
            print(f"{evsel}: type={evsel.type} config={evsel.config}",
                  f"config1={evsel.config1} config2={evsel.config2}")

Signed-off-by: Akinobu Mita <akinobu.mita@xxxxxxxxx>
---
tools/perf/util/python.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index cc1019d29a5d..56903617ba4c 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1241,6 +1241,8 @@ static PyMemberDef pyrf_evsel__members[] = {
 	evsel_attr_member_def(sample_type, T_ULONGLONG, "attribute sample_type."),
 	evsel_attr_member_def(read_format, T_ULONGLONG, "attribute read_format."),
 	evsel_attr_member_def(wakeup_events, T_UINT, "attribute wakeup_events."),
+	evsel_attr_member_def(config1, T_ULONGLONG, "attribute config1."),
+	evsel_attr_member_def(config2, T_ULONGLONG, "attribute config2."),
 	{ .name = NULL, },
 };

--
2.43.0