On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> I'm not sure completely read only is flexible enough (though mostly RO is fine)
> as we keep sketching out cases where any attempt to do things automatically
> does the wrong thing and where we need to add an extra tier to get
> everything to work. Short of having a lot of tiers I'm not sure how
> we could have the default work well. Maybe a lot of "tiers" is fine
> though perhaps we need to rename them if going this way and then they
> don't really work as current concept of tier.
>
> Imagine a system with subtle difference between different memories such
> as 10% latency increase for same bandwidth. To get an advantage from
> demoting to such a tier will require really stable usage and long
> run times. Whilst you could design a demotion scheme that takes that
> into account, I think we are a long way from that today.

Good point: there can be a clear hardware difference, but it's a
policy choice whether the MM should treat them as one or two tiers.

What do you think of a per-driver/per-device (overridable) distance
number, combined with a configurable distance cutoff for what
constitutes separate tiers? E.g. cutoff=20 means two devices with
distances of 10 and 20 respectively would be in the same tier, while
devices with 10 and 100 would be in separate ones. The kernel then
generates the tiers from the distances and the grouping cutoff, and
populates the memtier directory tree and nodemasks in sysfs.
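
To make the grouping concrete, here is a rough userspace sketch in
Python (not kernel code). The exact rule is an assumption on my part:
nodes are sorted by distance, and a node stays in the current tier as
long as its distance is less than the cutoff above that tier's nearest
node. The function name and the node-to-distance dict are made up for
illustration.

```python
def group_into_tiers(distances, cutoff):
    """Group node ids into tiers by distance.

    distances: dict of node id -> distance (hypothetical per-device value)
    cutoff:    assumed rule: a node joins the current tier while its
               distance is less than `cutoff` above the tier's nearest node
    Returns a list of node-id lists; index 0 is the nearest tier.
    """
    tiers = []
    base = None  # distance of the nearest node in the current tier
    # Sort by distance, breaking ties by node id for stable output.
    for node, dist in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        if base is None or dist - base >= cutoff:
            tiers.append([])  # gap reached the cutoff: open a new tier
            base = dist
        tiers[-1].append(node)
    return tiers


# The example from above: distances 10, 20 and 100 with cutoff=20.
nodes = {0: 10, 1: 20, 2: 100}
for i, members in enumerate(group_into_tiers(nodes, cutoff=20)):
    print(f"tier{i}: nodes {members}")
```

With these inputs, nodes 0 and 1 land in tier0 and node 2 in tier1.
Lowering the cutoff to 5 would split all three into separate tiers,
which is the kind of policy choice the cutoff is meant to expose.
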
It could be simple tier0, tier1, tier2 numbering again, but the
numbers now would mean something to the user. A rank tunable is no
longer necessary.

I think even the nodemasks in the memtier tree could be read-only
then, since corrections should only be necessary when either the
device distance or the tier grouping cutoff is wrong.

Can you think of scenarios where that scheme would fall apart?