Re: [RFC PATCH 0/6] mm/damon: hardware-sampled access reports

From: Ravi Jonnalagadda

Date: Sat May 30 2026 - 01:03:41 EST

Hi SeongJae and Akinobu,

Thank you both for the warm reception and for the clear direction.

On Fri, May 29, 2026 at 8:02 PM Akinobu Mita <akinobu.mita@xxxxxxxxx> wrote:
>
> Hello Ravi and SeongJae,
>
> On Sat, May 30, 2026 at 9:05 SeongJae Park <sj@xxxxxxxxxx> wrote:
> >
> > On Fri, 29 May 2026 09:56:34 -0700 Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxx> wrote:
> >
> > > This series introduces a vendor and PMU-agnostic substrate inside DAMON
> > > that consumes hardware-sampled access reports through the standard
> > > perf-event interface. Userspace selects the PMU through sysfs (raw
> > > type/config knobs), driving either Intel PEBS L3-miss sampling or AMD
> > > IBS Op sampling.
> > >
> > > Why a unified perf-event substrate
> > >
> > > An earlier hardware-sampled access-monitoring proposal [1] took an
> > > AMD IBS-specific module backend path, owning its own probe
> > > configuration, sysfs knobs, and lifecycle.
> > >
> > > SeongJae Park has previously highlighted the advantage of Akinobu
> > > Mita's perf-event proposal [2]: let DAMON register kernel-counter perf
> > > events and consume samples from any sampling PMU that perf core knows
> > > about. This series builds on that direction
> >
> > Ah great, so we have no unclear challenge (additional loadable module support
> > and conflicts with other IBS modules) on our road for now! That is, we can
> > reuse the stable perf event interface and achieve all our goals! As I
> > previously shared [1], it would take time, but I'm very optimistic about the
> > success of this project. I don't like promising too much, but this project
> > looks like something that we can "consider it done".
> >
> > We can also say that the current candidate of the first
> > damon_report_access()-based data attributes monitoring (milestone 2 [1] final
> > deliverable) is the perf event based monitoring.

Glad this aligns with the milestone roadmap.

>
> That's good!
>
> From a quick look, it seems to have all the features I need, so I'd like to
> evaluate it based on Ravi's patch. If any extensions require changes, I will
> let you know as feedback.

Great, please do. Happy to fold any feedback into v2.

>
> Ravi,
> You can also add my Co-developed-by and Signed-off-by tags to the appropriate
> patch, so please post to the mailing list.
>

Will do. In v2 I will add your Co-developed-by and Signed-off-by tags
to patches 1, 4, and 5:

- Patch 1 (`struct damon_perf_event{,_attr}` + per-ctx list)
- Patch 4 (per-CPU SPSC ring drain + perf-event lifecycle)
- Patch 5 (vaddr/paddr perf-event backend)

Patches 2 and 3 are the sysfs surface that will move to the
probes/filters interface; patch 6 is the unrelated
`damos_node_eligible_mem_bp` tracepoint.

> I am currently working on a change to allow selecting perf events from the damo
> tool by specifying the event name, similar to the perf record -e option (e.g.,
> "cpu/mem-loads,ldlat=30,freq=5000/P" or "cpu/mem-stores,freq=5000/P").
>
> I'll share the progress once it reaches a certain point. A change to the perf
> file, as shown in the attachment, will be necessary, but I believe it can be
> handled without changing Ravi's current patch set.

Nice. When the damo side is ready I will rerun the existing AMD IBS
and Intel PEBS configurations through it.
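
For anyone following along, here is a minimal sketch of how an event
spec like the ones above splits into a PMU name, a term list, and
modifiers. Both helpers are hypothetical names written for this mail;
they are not damo's or perf's actual parser, and they only cover the
simple `pmu/terms/modifiers` shape. The sysfs location is the same
`/sys/bus/event_source/devices/<pmu>/type` file used in the setup
commands in the cover letter.

```python
import re


def parse_event_spec(spec):
    """Split 'pmu/term,term=value/modifiers' into its three parts.

    Hypothetical helper for illustration; not damo's or perf's parser.
    """
    m = re.fullmatch(r"([^/]+)/([^/]*)/([A-Za-z]*)", spec)
    if m is None:
        raise ValueError(f"unsupported event spec: {spec!r}")
    pmu, body, modifiers = m.groups()
    terms = {}
    for term in filter(None, body.split(",")):
        key, sep, value = term.partition("=")
        # A bare term such as 'mem-loads' names an event alias.
        terms[key] = value if sep else None
    return pmu, terms, modifiers


def pmu_type_path(pmu, root="/sys/bus/event_source/devices"):
    """Path of the sysfs file holding the PMU's numeric perf type."""
    return f"{root}/{pmu}/type"
```

The numeric value read from that file is what the current patch set
expects in the `type` knob.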

>
> > > with the changes we
> > > needed to run it cross-vendor:
> > >
> > > - a per-CPU lockless ring between the NMI sample handler and the
> > > kdamond drain,
> > > - per-CPU events that follow CPU hotplug cleanly,
> > > - events fire only while the monitor is running -- created disabled,
> > > armed when kdamond starts, disarmed and drained when it stops,
> > > - all-or-nothing init across CPUs: a partial-CPU create failure rolls
> > > the whole event back rather than leaving silent gaps,
> > > - safe handling of vendor sample-validity flags so a stale or
> > > unpopulated address is never mistaken for a valid sample.
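
To make the ring item in the list above concrete, here is a
single-threaded Python model of the index handling in a
single-producer/single-consumer ring. It is a sketch under stated
assumptions, not the code in patch 4: the class name, the
power-of-two size, and the drop-on-full policy are all assumptions
made for illustration, and the memory barriers a real NMI producer
and kdamond consumer would need are not modeled.

```python
class SpscRing:
    """Single-producer/single-consumer ring, modeled without threads.

    Illustrative only: a real ring needs memory barriers between the
    NMI producer and the kdamond consumer, which Python cannot show.
    """

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * size
        self.mask = size - 1
        self.head = 0      # written only by the producer
        self.tail = 0      # written only by the consumer
        self.dropped = 0   # assumption: overflow drops the new sample

    def push(self, sample):
        """Producer side: never blocks, drops when the ring is full."""
        if self.head - self.tail == len(self.slots):
            self.dropped += 1
            return False
        self.slots[self.head & self.mask] = sample
        self.head += 1     # publish only after the slot is filled
        return True

    def drain(self):
        """Consumer side: take everything published so far."""
        out = []
        head = self.head   # snapshot; later pushes wait for the next drain
        while self.tail != head:
            out.append(self.slots[self.tail & self.mask])
            self.tail += 1
        return out
```

Because each index has exactly one writer, neither side needs a lock,
which is the property that makes this shape usable from NMI context.
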
> > >
> > > What the series adds
> > >
> > > Patch 1 introduces the substrate's data types: a per-event
> > > configuration struct and a per-context list to hang them on. A
> > > CONFIG_PERF_EVENTS=n build folds to no-op stubs.
> > >
> > > Patch 2 exposes those types through sysfs. Each entry maps to one
> > > perf event and lets userspace pick the PMU and how to sample it: the
> > > raw PMU type/config, addressing flags, and period or frequency. The
> > > defaults are tuned for Intel PEBS; userspace overrides them for other
> > > PMUs.
> > >
> > > Patch 3 wires the sysfs apply path so configured events get attached
> > > to the running monitoring context.
> > >
> > > Patch 4 is the core of the series. It replaces the mutex-protected
> > > report queue with a per-CPU lockless ring fed from NMI by the perf
> > > overflow handler and drained once per sample tick by the kdamond.
> > > Drained reports are matched to monitored regions by binary search
> > > over a per-tick snapshot. The patch also wires the per-event
> > > lifecycle into kdamond: events arm when the monitor starts, disarm
> > > and drain when it stops, roll back cleanly when per-CPU init fails on
> > > some CPUs, and a second context that asks for the substrate while
> > > it is in use is rejected with -EBUSY.
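
The region matching described above can be sketched as a binary
search over sorted region start addresses. This is a minimal Python
illustration, not the kernel code: the function name is hypothetical,
and the snapshot is assumed to be a sorted list of non-overlapping
`(start, end)` pairs with an exclusive end.

```python
from bisect import bisect_right


def count_accesses(regions, addrs):
    """Count drained sample addresses per region.

    regions: sorted, non-overlapping (start, end) pairs, end exclusive.
    Returns one counter per region; samples outside every region are
    ignored.
    """
    starts = [start for start, _ in regions]
    hits = [0] * len(regions)
    for addr in addrs:
        # Index of the last region whose start is <= addr, if any.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < regions[i][1]:
            hits[i] += 1
    return hits
```

Each lookup costs O(log n) in the number of regions, so the drain
stays cheap even when a tick delivers many samples.
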
> > >
> > > Patch 5 is the perf-event backend. Two stateless overflow handlers
> > > (one vaddr-keyed, one paddr-keyed) are picked at event creation time
> > > and submit samples into the per-CPU ring. Vendor-specific sample
> > > validity is honored at this layer.
> > >
> > > Patch 6 adds a tracepoint at every node_eligible_mem_bp quota-goal
> > > evaluation so userspace can watch goal convergence without polling
> > > sysfs.
> > >
> > > Userspace setup model
> > >
> > > Userspace selects the sampling PMU by pointing the perf event's
> > > `type` / `config` at it, and chooses the scheme topology that suits
> > > the address space the PMU reports on. No module load or unload step
> > > is involved; `echo on > state` arms the substrate, `echo off > state`
> > > disarms it.
> > >
> > > Two configurations were used for validation.
> > >
> > > Configuration A: AMD IBS Op, paddr ops, system-wide PULL+PUSH tiering
> > >
> > > IBS Op stamps samples with physical addresses, so DAMON reasons over
> > > every backing page in the system regardless of which task or guest
> > > touched it -- the substrate becomes a system-wide tiering controller.
> > >
> > > Setup (abridged; `D=/sys/kernel/mm/damon/admin/kdamonds/0`):
> > >
> > > echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > > echo 1 > $D/contexts/nr_contexts
> > > echo paddr > $D/contexts/0/operations
> > >
> > > # Two regions, one per NUMA node (DRAM + CXL). PA ranges
> > > # are derived per host from /proc/iomem; omitted here.
> > > echo 1 > $D/contexts/0/targets/nr_targets
> > > echo 2 > $D/contexts/0/targets/0/regions/nr_regions
> > > echo <DRAM_LO> > $D/contexts/0/targets/0/regions/0/start
> > > echo <DRAM_HI> > $D/contexts/0/targets/0/regions/0/end
> > > echo <CXL_LO> > $D/contexts/0/targets/0/regions/1/start
> > > echo <CXL_HI> > $D/contexts/0/targets/0/regions/1/end
> > >
> > > # IBS Op event, period-based, paddr-stamped:
> > > PE=$D/contexts/0/monitoring_attrs/sample/perf_events
> > > echo 1 > $PE/nr_perf_events
> > > echo $(cat /sys/bus/event_source/devices/ibs_op/type) > $PE/0/type
> > > echo 0 > $PE/0/config
> > > echo 1 > $PE/0/sample_phys_addr
> > > echo 0 > $PE/0/freq
> > > echo 262144 > $PE/0/sample_period
> > > echo 0 > $PE/0/exclude_kernel
> > > echo 0 > $PE/0/exclude_hv
> >
> > FYI, and as you may already know, the current plan [1] is to use the attributes
> > probe interface. With it, the above IBS Op event setup part would look like,
> >
> > mon_attr=/sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/monitoring_attrs
> > echo 1 > $mon_attr/probes/nr_probes
> > probe=$mon_attr/probes/0
> > echo 1 > $probe/filters/nr_filters
> > filter=$probe/filters/0
> > echo perf_event > $filter/type
> > echo ibs_op > $filter/perf_event_type
> > echo Y > $filter/allow
> >
> > Of course, more details could change later.

Understood. I will hold for milestone 1.

> >
> > >
> > > # PULL scheme: migrate_hot toward DRAM, gated on
> > > # node_eligible_mem_bp(nid=DRAM) goal target_value=TARGET_BP.
> > > # addr filter restricts source to the CXL range.
> > > # PUSH scheme: migrate_hot toward CXL, gated on
> > > # node_eligible_mem_bp(nid=CXL) target_value=10000-TARGET_BP.
> > > # addr filter restricts source to the DRAM range.
> > > # Both schemes are migrate_hot; they converge from opposite
> > > # directions on the same hot working set.
> > >
> > > echo on > $D/state
> > >
> > > Userspace tunes the steady-state DRAM:CXL split by writing the goal
> > > `target_value`s; DAMON's quota autotuner drives migration intensity
> > > to match.
> > >
> > > Workload: a QEMU/KVM guest pinned to one NUMA node, running 32
> > > multichase multiload threads each touching a 4 GiB working set
> > > (~128 GiB aggregate) with the memcpy-libc kernel. The guest sees
> > > a flat single-NUMA layout and has no direct view of the host's
> > > tiering topology, yet its hot pages are migrated to DRAM and cold
> > > pages pushed to CXL by host-side DAMON acting on IBS-stamped
> > > physical addresses -- the application inside the guest benefits
> > > from tiering it never had to be aware of. Validated on AMD Turin
> > > (132-CPU EPYC). The configuration converged to its target ratio
> > > in seconds and remained stable for 7+ hours continuously, with no
> > > perf core auto-throttle and no measurable drift in the achieved
> > > interleave ratio.
> > >
> > > Configuration B: Intel PEBS L3-miss, vaddr ops, per-PID weighted-dest
> > >
> > > PEBS reports vaddr samples in the context of the running task.
> > > DAMON's vaddr ops monitors a specific PID.
> > >
> > > Setup (abridged):
> > >
> > > echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
> > > echo 1 > $D/contexts/nr_contexts
> > > echo vaddr > $D/contexts/0/operations
> > >
> > > echo 1 > $D/contexts/0/targets/nr_targets
> > > echo $PID > $D/contexts/0/targets/0/pid_target
> > > echo 0 > $D/contexts/0/targets/0/regions/nr_regions
> > >
> > > # PEBS MEM_LOAD_RETIRED.L3_MISS, frequency-based, vaddr-stamped:
> > > echo 1 > $PE/nr_perf_events
> > > echo 4 > $PE/0/type # PERF_TYPE_RAW
> > > echo 0x20d1 > $PE/0/config # umask=0x20 event=0xd1
> > > echo 0 > $PE/0/sample_phys_addr
> > > echo 1 > $PE/0/freq
> > > echo 5003 > $PE/0/sample_freq
> > > echo 2 > $PE/0/precise_ip
> > > echo 1 > $PE/0/wakeup_events
> > >
> > > # Single migrate_hot scheme with two weighted destinations
> > > # (DRAM + CXL). Userspace tunes the steady-state interleave by
> > > # writing dests/{0,1}/weight.
> > >
> > > echo on > $D/state
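
As a note on the raw `config` value in the setup above: on the Intel
core PMU the event select occupies config bits 0-7 and the unit mask
bits 8-15, which is how umask=0x20 and event=0xd1 become 0x20d1. A
small sketch of that packing follows; the helper name is hypothetical,
and it covers only these two fields, not the other bits of the raw
encoding.

```python
def intel_raw_config(event, umask):
    """Pack event select (bits 0-7) and unit mask (bits 8-15)."""
    if not (0 <= event <= 0xff and 0 <= umask <= 0xff):
        raise ValueError("event and umask must each fit in 8 bits")
    return (umask << 8) | event
```

For example, `hex(intel_raw_config(0xd1, 0x20))` gives the value
written to `$PE/0/config`.
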
> > >
> > > Workload: 32 multichase multiload threads with a 4 GiB working set
> > > each (~128 GiB aggregate) running directly on the host, monitored
> > > by DAMON via the multiload PID. Validated on Intel Granite Rapids
> > > (144-CPU). Convergence is fast and the system is stable.
> >
> > Thank you so much for sharing the great prototype implementation and test
> > results!
> >
> > I will try to make fast progress on milestone 1. I will hold reviewing details
> > of this series for now, as there could be more changes. But in the high level,
> > this looks promising.
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/20260516223439.4033-1-ravis.opensrc@xxxxxxxxx/
> > > [2] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@xxxxxxxxx
> >
> > [1] https://lore.kernel.org/20260525225208.1179-1-sj@xxxxxxxxxx/
> >
> >
> > Thanks,
> > SJ
> >
> > [...]

Thanks,
Ravi.