Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers

From: Jonathan Cameron
Date: Tue Jun 14 2022 - 12:46:25 EST


On Mon, 13 Jun 2022 10:05:06 -0400
Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > On Thu, 9 Jun 2022 16:41:04 -0400
> > Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > > Would it make more sense to have the platform/devicetree/driver
> > > provide more fine-grained distance values similar to NUMA distances,
> > > and have a driver-scope tunable to override/correct? And then have the
> > > distance value function as the unique tier ID and rank in one.
> >
> > Absolutely a good thing to provide that information, but it's black
> > magic. There are too many contradicting metrics (latency vs bandwidth etc)
> > even not including a more complex system model like Jerome Glisse proposed
> > a few years back. https://lore.kernel.org/all/20190118174512.GA3060@xxxxxxxxxx/
> > CXL 2.0 got this more right than anything else I've seen as it provides
> > discoverable topology along with details like latency to cross between
> > particular switch ports. Actually using that data (other than by throwing
> > it to userspace controls for HPC apps etc) is going to take some figuring out.
> > Even the question of what + how we expose this info to userspace is non
> > obvious.

Was offline for a few days. At risk of splitting a complex thread
even more....

>
> Right, I don't think those would be scientifically accurate - but
> neither is a number between 1 and 3.

The 3 tiers in this proposal are just a starting point (and one I'd
expect we'll move beyond very quickly) - the aim is to define a userspace
interface that is flexible enough, but then only use a tiny bit of that
flexibility to get an initial version in place. Even relatively trivial
CXL systems will include:

1) Direct connected volatile memory (similar to a memory only NUMA node / socket)
2) Direct connected non volatile (similar to pmem NUMA node, but maybe not
similar enough to fuse with socket connected pmem)
3) Switch connected volatile memory (typically a disaggregated memory device,
so huge, high bandwidth, not great latency)
4) Switch connected non volatile (typically huge, high bandwidth, even worse
latency).
5) Much more fun if we care about bandwidth, as interleaving will be going
on in hardware across either similar or mixed sets of switch connected
and direct connected memory.

Sure we might fuse some of those. But just the CXL driver is likely to have
groups separate enough we want to handle them as 4 tiers and migrate between
those tiers... Obviously might want a clever strategy for cold / hot migration!

> The way I look at it is more
> about spreading out the address space a bit, to allow expressing
> nuanced differences without risking conflicts and overlaps. Hopefully
> this results in the shipped values stabilizing over time and thus
> requiring less and less intervention and overriding from userspace.

I don't think they ever will stabilize, because the right answer isn't
definable in terms of just one number. We'll end up with the old mess of
magic values in SLIT in which systems have been tuned against particular
use cases. HMAT was meant to solve that, but it's not yet clear if it will.

>
> > > Going further, it could be useful to separate the business of hardware
> > > properties (and configuring quirks) from the business of configuring
> > > MM policies that should be applied to the resulting tier hierarchy.
> > > They're somewhat orthogonal tuning tasks, and one of them might become
> > > obsolete before the other (if the quality of distance values provided
> > > by drivers improves before the quality of MM heuristics ;). Separating
> > > them might help clarify the interface for both designers and users.
> > >
> > > E.g. a memdev class scope with a driver-wide distance value, and a
> > > memdev scope for per-device values that default to "inherit driver
> > > value". The memtier subtree would then have an r/o structure, but
> > > allow tuning per-tier interleaving ratio[1], demotion rules etc.
> >
> > Ok that makes sense. I'm not sure if that ends up as an implementation
> > detail, or affects the userspace interface of this particular element.
> >
> > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > as we keep sketching out cases where any attempt to do things automatically
> > does the wrong thing and where we need to add an extra tier to get
> > everything to work. Short of having a lot of tiers I'm not sure how
> > we could have the default work well. Maybe a lot of "tiers" is fine
> > though perhaps we need to rename them if going this way and then they
> > don't really work as current concept of tier.
> >
> > Imagine a system with subtle difference between different memories such
> > as 10% latency increase for same bandwidth. To get an advantage from
> > demoting to such a tier will require really stable usage and long
> > run times. Whilst you could design a demotion scheme that takes that
> > into account, I think we are a long way from that today.
>
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
>
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.

I think we'll need something along those lines, though I was envisioning
it sitting at the level of what we do with the tiers, rather than how
we create them. So particular usecases would decide to treat
sets of tiers as if they were one. Have enough tiers and we'll end up
with k-means or similar to figure out the groupings. Of course there
is then a sort of 'tier group for use XX' concept so maybe not much
difference until we have a bunch of usecases.
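
A minimal sketch of the cutoff grouping described in the quoted text, in
Python for brevity rather than kernel C. The function name
group_into_tiers, the device names, and the rule "start a new tier when a
device is more than cutoff away from the first device in the current tier"
are all assumptions made for illustration; the thread does not pin down
the exact grouping rule and nothing here comes from the patch set.

```python
def group_into_tiers(distances, cutoff):
    """Group devices into tiers by distance.

    distances: dict mapping a (hypothetical) device name to its distance.
    cutoff:    largest allowed gap between a device and the first
               (closest) device of the tier it joins.
    Returns a list of tiers, closest first; each tier is a list of names.
    """
    tiers = []
    anchor = None  # distance of the first device in the current tier
    # Sort by distance, then by name so the result is deterministic.
    for name, dist in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        if anchor is None or dist - anchor > cutoff:
            tiers.append([])
            anchor = dist
        tiers[-1].append(name)
    return tiers


# The example from the quoted text: with cutoff=20, distances 10 and 20
# share a tier, while 10 and 100 do not.
print(group_into_tiers({"dev_a": 10, "dev_b": 20, "dev_c": 100}, cutoff=20))
# prints [['dev_a', 'dev_b'], ['dev_c']]
```

Anchoring each tier on its closest device, rather than on the gap between
neighbours, stops a chain of small steps from merging very different
devices into one tier. Either rule matches the 10/20/100 example.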

>
> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.

This feels like it might make tier assignments a bit less stable
and hence run into the question of how to hook up accounting. Not my
area of expertise, but it was put forward as one of the reasons
we didn't want hotplug to potentially end up shuffling other tiers
around. The desire was for a 'stable' entity. Can avoid that with
'space' between them but then we sort of still have rank, just in a
form that makes updating it messy (need to create a new tier to do
it).
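
A small sketch of that stability concern, again in Python with
hypothetical device names and distances, and an assumed grouping rule (a
new tier starts when a device is more than cutoff away from the first
device of the current tier). If tier numbers are derived purely from
sorted distances, hotplugging a device with an intermediate distance
renumbers the tiers behind it.

```python
def tier_index_by_device(distances, cutoff):
    """Map each (hypothetical) device name to a dense tier index, 0 = closest."""
    index = {}
    tier, anchor = -1, None  # anchor: distance of the current tier's first device
    for name, dist in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        if anchor is None or dist - anchor > cutoff:
            tier += 1
            anchor = dist
        index[name] = tier
    return index


before = tier_index_by_device({"dram": 10, "pmem": 100}, cutoff=20)
# Hotplug a device whose distance falls between the existing two.
after = tier_index_by_device({"dram": 10, "cxl_mem": 50, "pmem": 100}, cutoff=20)

print(before["pmem"], after["pmem"])  # prints 1 2
```

Here pmem moves from tier1 to tier2 without anything about pmem itself
changing, which is the kind of shuffle that makes per-tier accounting
awkward.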

>
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
>
> Can you think of scenarios where that scheme would fall apart?

Simplest (I think) is the GPU one. Often those have very nice
memory that we CPU software developers would love to use, but
some pesky GPGPU folk think it is for GPU related data. Anyhow, folk
who care about GPUs have requested that it be in a tier that
is lower rank than main memory.

If you just categorize it by performance (from CPUs) then it
might well end up elsewhere. These folk do want to demote
to CPU attached DRAM though. Which raises the question of
'where is your distance between?'

Definitely a policy decision, and one we can't get from perf
characteristics. It's a blurry line. There are classes
of fairly low spec memory attached accelerators on the horizon.
For those, preventing migration to the memory they are associated
with might generally not make sense.

Tweaking policy by messing with anything that claims to be a
distance is a bit nasty as it looks like the SLIT table tuning
that still happens. Could have a per device rank though
and make it clear this isn't cleanly related to any perf
characteristics. So ultimately that moves rank to devices
and then we have to put them into nodes. Not sure it gained
us much other than seeming more complex to me.

Jonathan