RE: [EXT] Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
From: Ajay Joshi
Date: Wed Dec 04 2024 - 07:45:06 EST
Micron Confidential
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Sent: Wednesday, November 27, 2024 10:05 PM
>
>
> On Thu, 21 Nov 2024 10:18:41 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > The CXL specification release 3.2 is now available under a click
> > through at https://computeexpresslink.org/cxl-specification/ and it
> > brings new shiny toys.
>
> If anyone wants to play, basic emulation on my CXL QEMU staging tree
> https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
This is interesting. We will definitely try this and let you know how it goes.
>
> Branch with a few other things on top is:
> https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
>
> Note that this currently doesn't produce real data. I have a plan / initial PoC /
> hack to hook that up via an addition to the QEMU cache plugin and an
> external tool to emulate the hotness tracker counting hardware. It will be a little
> while before I get that finished, so in the meantime the above exercises the
> driver.
>
> Jonathan
>
> >
> > RFC reason
> > - Whilst trace capture with a particular configuration is potentially useful,
> >   the intent is that CXL HMU units will be used to drive various forms of
> >   hotpage migration for memory tiering setups. This driver doesn't do this
> >   (yet), but rather provides data capture etc. for experimentation and
> >   for working out how to mostly put the allocations in the right place to
> >   start with by tuning applications.
> >
> > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The
> > intent of this is to provide a way to establish which units of memory
> > (typically pages or larger) in CXL attached memory are hot. The
> > implementation details and algorithm are all implementation defined.
> > The specification simply describes the 'interface', which takes the
> > form of a ring buffer of hotness records in a PCI BAR and defined
> > capability, configuration and status registers.
> >
> > The hardware may have constraints on what it can track, granularity
> > etc. and on how accurately it tracks (e.g. counter exhaustion,
> > inaccurate trackers). Some of these constraints are discoverable from
> > the hardware registers; others, such as loss of accuracy, have no
> > universally accepted measures as they are typically access pattern
> > dependent. Sadly it is very unlikely any hardware will implement a
> > truly precise tracker given the large resource requirements for
> > tracking at a useful granularity.
> >
> > There are two fundamental operation modes:
> >
> > * Epoch based. Counters are checked after a period of time (Epoch) and
> >   if over a threshold added to the hotlist.
> > * Always on. Counters run until a threshold is reached, after that the
> >   hot unit is added to the hotlist and the counter released.
> >
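To make the difference between the two modes concrete, here is a minimal Python sketch of how they could be modelled in software. It is an illustration only: the function names, the per-epoch counter reset and the ">=" threshold comparison are assumptions, not something taken from the specification or from the driver.

```python
from collections import Counter

def epoch_based(accesses_per_epoch, threshold):
    """Epoch mode: count accesses per unit within each epoch and, when the
    epoch ends, add every unit whose count reached the threshold to the
    hotlist. Assumption: counters start from zero in each epoch."""
    hotlist = []
    for epoch in accesses_per_epoch:
        counts = Counter(epoch)
        for unit, count in sorted(counts.items()):
            if count >= threshold:
                hotlist.append((unit, count))
    return hotlist

def always_on(accesses, threshold):
    """Always-on mode: counters run until one reaches the threshold; that
    unit is then added to the hotlist and its counter is released."""
    counts = {}
    hotlist = []
    for unit in accesses:
        counts[unit] = counts.get(unit, 0) + 1
        if counts[unit] >= threshold:
            hotlist.append((unit, counts[unit]))
            del counts[unit]  # release the counter for reuse
    return hotlist

# Two epochs of unit accesses, threshold of 2.
print(epoch_based([[1, 1, 1, 2], [2, 2, 3]], threshold=2))
print(always_on([1, 2, 1, 1, 2, 1], threshold=2))
```

In this model the epoch mode reports the full per-epoch count, while the always-on mode reports a unit as soon as it crosses the threshold and can report the same unit more than once.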
> > Counting can be filtered on:
> >
> > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> >
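As an illustration of the region filter, the sketch below builds a bitmap with one bit per 256MiB of DPA space. The helper names and the bit ordering (bit 0 covers the lowest 256MiB) are assumptions made for the example; the register layout in the specification may differ.

```python
REGION_SIZE = 256 * 1024 * 1024  # 256MiB of DPA space per bitmap bit

def range_to_bitmap(dpa_base, dpa_size):
    """Return an int whose set bits cover [dpa_base, dpa_base + dpa_size).
    Assumption: bit 0 corresponds to DPA 0..256MiB."""
    if dpa_size <= 0:
        return 0
    first = dpa_base // REGION_SIZE
    last = (dpa_base + dpa_size - 1) // REGION_SIZE
    bitmap = 0
    for bit in range(first, last + 1):
        bitmap |= 1 << bit
    return bitmap

def is_tracked(bitmap, dpa):
    """True if the 256MiB region containing dpa is enabled in the bitmap."""
    return bool((bitmap >> (dpa // REGION_SIZE)) & 1)

# Track the first 1GiB of DPA space: four 256MiB regions.
print(bin(range_to_bitmap(0, 1 << 30)))
```

A range that is not aligned to 256MiB enables every region it touches, so the tracked area can be larger than the requested one.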
> > Sampling can be modified by:
> >
> > * Downsampling including potentially randomized downsampling.
> >
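A rough software model of the two downsampling flavours is sketched below. The function name and parameters are hypothetical. In particular, how the hardware picks samples and how a field such as downsampling_factor is encoded are not covered by the text above, so "keep one access in every `factor`" is an assumption.

```python
import random

def downsample(accesses, factor, randomized=False, seed=None):
    """Keep roughly one in `factor` accesses.

    Fixed mode keeps every factor-th access. Randomized mode keeps each
    access independently with probability 1/factor, which avoids locking
    onto periodic access patterns."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not randomized:
        return [a for i, a in enumerate(accesses, start=1) if i % factor == 0]
    rng = random.Random(seed)
    return [a for a in accesses if rng.randrange(factor) == 0]

# Fixed downsampling by 5 over ten accesses keeps the 5th and 10th.
print(downsample(list(range(1, 11)), 5))
```

Any counter threshold applied after downsampling is effectively scaled by the factor, which is worth keeping in mind when comparing configurations.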
> > The driver presented here is intended to be useful in its own right
> > but also to act as the first step of a possible path towards hotness
> > monitoring based hot page migration. Those steps might look like:
> >
> > 1. Gather data - drivers provide telemetry like solutions to get that
> >    data. May be enhanced, for example in this driver by providing the
> >    HPA address rather than DPA Unit Address. Userspace can access enough
> >    information to do this so maybe not.
> > 2. Userspace algorithm development, possibly combined with userspace
> >    triggered migration by PA. Working out how to use different levels
> >    of constrained hardware resources will be challenging.
> > 3. Move those algorithms into the kernel. Will require generalization
> >    across different hotpage trackers etc.
> >
> > So far this driver just gives access to the raw data. I will probably
> > kick off a longer discussion on how to do the adaptive sampling needed to
> > actually use these units for tiering etc. sometime soon (if no one
> > else beats me to it). There is a follow up topic of how to
> > virtualize this stuff for memory stranding cases (VM gets a fixed
> > mixture of fast and slow memory and should do its own tiering).
> >
> > More details in the Documentation patch but typical commands are:
> >
> > $perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> > hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> > range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> > hotness_granual=12
> >
> > $perf report --dump-raw-traces
> >
> > Example output. With a counter_width of 16 (0x10) the least
> > significant 2 bytes (4 hex digits) are the counter value and the unit
> > index is bits 16-63.
> > Here all units are over the threshold and the indexes are 0,1,2 etc.
> >
> > . ... CXL_HMU data: size 33512 bytes
> > Header 0: units: 29c counter_width 10
> > Header 1 : deadbeef
> > 0000000000000283
> > 0000000000010364
> > 0000000000020366
> > 000000000003033c
> > 0000000000040343
> > 00000000000502ff
> > 000000000006030d
> > 000000000007031a
> >
> > Which will produce a list of hotness entries.
> > Bits[N-1:0] counter value
> > Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> >   config to get to a Host Physical Address)
> >
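The record layout described above can be decoded with a few lines of Python. This is a sketch for experimenting with the raw dump, not the perf tool's decoder. `decode_record` and `unit_to_dpa` are hypothetical helper names, and reading hotness_granual=12 as a log2 unit size of 4KiB is an assumption. The final DPA to HPA step depends on the HDM decoder configuration and is not attempted here.

```python
def decode_record(record, counter_width):
    """Split a 64-bit hotness record into (unit_id, counter).
    Bits[N-1:0] are the counter, Bits[63:N] the unit ID, N = counter_width."""
    mask = (1 << counter_width) - 1
    return record >> counter_width, record & mask

def unit_to_dpa(unit_id, unit_size_log2, dpa_base=0):
    """Start DPA of a unit, assuming units are 2**unit_size_log2 bytes and
    are numbered from dpa_base (both are assumptions for this sketch)."""
    return dpa_base + (unit_id << unit_size_log2)

# Records copied from the example output above, counter_width 0x10.
raw = [
    "0000000000000283", "0000000000010364", "0000000000020366",
    "000000000003033c", "0000000000040343", "00000000000502ff",
    "000000000006030d", "000000000007031a",
]
for text in raw:
    unit, count = decode_record(int(text, 16), 16)
    print(f"unit {unit}: count {count:#x}, dpa {unit_to_dpa(unit, 12):#x}")
```

For the dump above this gives unit IDs 0 through 7 with counter values 0x283, 0x364 and so on, matching the description of the example.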
> > Specific RFC questions.
> > - What should be in the header added to the aux buffer?
> >   Currently just the minimum is provided: the number of records
> >   and the counter width needed to decode them.
> > - Should we reset the counters when doing sampling "-F X"?
> >   If the frequency is higher than the epoch we never see any hot units.
> >   If so, when should we reset them?
> >
> > Note testing has been light and on emulation only. As the perf tool is a
> > pain to build on a stripped back VM, build testing has all been on
> > arm64 so far. The driver loads though on both arm64 and x86, so any
> > problems are likely in the perf tool arch specific code, which is
> > build tested (on the wrong machine).
> >
> > The QEMU emulation needs some cleanup, but I should be able to post
> > that shortly to let people actually play with this. There are lots of
> > open questions there on how 'right' we want the emulation to be and
> > what counting uarch to emulate.
> >
> > Jonathan Cameron (4):
> >   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> >   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> >   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> >   hwtrace: Document CXL Hotness Monitoring Unit driver
> >
> >  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> >  Documentation/trace/index.rst       |   1 +
> >  drivers/cxl/Kconfig                 |   6 +
> >  drivers/cxl/Makefile                |   3 +
> >  drivers/cxl/core/Makefile           |   1 +
> >  drivers/cxl/core/core.h             |   1 +
> >  drivers/cxl/core/hmu.c              |  64 ++
> >  drivers/cxl/core/port.c             |   2 +
> >  drivers/cxl/core/regs.c             |  14 +
> >  drivers/cxl/cxl.h                   |   5 +
> >  drivers/cxl/cxlpci.h                |   1 +
> >  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> >  drivers/cxl/hmu.h                   |  23 +
> >  drivers/cxl/pci.c                   |  26 +-
> >  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> >  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> >  tools/perf/util/Build               |   1 +
> >  tools/perf/util/auxtrace.c          |   4 +
> >  tools/perf/util/auxtrace.h          |   1 +
> >  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> >  tools/perf/util/cxl-hmu.h           |  18 +
> >  21 files changed, 1748 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/trace/cxl-hmu.rst
> >  create mode 100644 drivers/cxl/core/hmu.c
> >  create mode 100644 drivers/cxl/hmu.c
> >  create mode 100644 drivers/cxl/hmu.h
> >  create mode 100644 tools/perf/util/cxl-hmu.c
> >  create mode 100644 tools/perf/util/cxl-hmu.h
> >