Re: [PATCH RFC v2 14/18] dax/region: Support DAX device creation on dynamic DAX regions

From: Dan Williams
Date: Wed Sep 13 2023 - 14:00:08 EST


Ira Weiny wrote:
[..]
> >
> > Given that one of the expected DCD use cases is to provide just in time
> > memory for specific jobs the "first-available" search for free capacity
> > in a Sparse DAX Region collides with the need to keep allocations
> > bounded by tag.
>
> How does it collide?
>
> My attempt here is to leave dax devices 'unlabeled'. As such they will use
> space on a 'first-available' search regardless of extent labels.
>
> Effectively I have defined 'no label' as being 'any label'. I apologize
> for this detail being implicit and not explicit.
>
> My envisioned path would be that older daxctl would continue to work like
> this because the kernel would not restrict unlabeled dax device creation.
>
> Newer daxctl could use dax device labels to control the extents used. But
> only when dax device labeling is introduced in a future kernel. Use of a
> newer daxctl on an older DCD kernel could continue to work sans label.
>
> In this way I envisioned a path where the policy is completely dictated by
> user space restricted only by the software available.

Tags are a core concept in DCD. "Allocate by tag" does not feel like
something that can come later at least in terms of when the DCD ABI is
ready for upstream. So, yes, it can remain out of this patchset, but the
upstream merge of all of DCD would be gated on that facility arriving.
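
To make the collision concrete, here is a minimal Python sketch (not kernel code) that contrasts a "first-available" search with a tag-bounded one. The `Extent` class, the `alloc_first_available()` and `alloc_by_tag()` helpers, and the tag strings are all hypothetical names invented for illustration. The real allocation ABI is exactly what is still under discussion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extent:
    """Hypothetical model of a free extent in a Sparse DAX Region."""
    tag: str   # tag the extent arrived with (assumed to be a plain string)
    free: int  # free capacity remaining in this extent


def alloc_first_available(extents: List[Extent], size: int) -> List[Tuple[str, int]]:
    """Take capacity from extents in list order, ignoring tags entirely."""
    if sum(e.free for e in extents) < size:
        raise ValueError("not enough free capacity")
    taken = []
    for e in extents:
        if size == 0:
            break
        n = min(e.free, size)
        if n:
            e.free -= n
            size -= n
            taken.append((e.tag, n))
    return taken


def alloc_by_tag(extents: List[Extent], size: int, tag: str) -> List[Tuple[str, int]]:
    """Same search, but restricted to extents carrying the requested tag."""
    matching = [e for e in extents if e.tag == tag]
    return alloc_first_available(matching, size)
```

With extents tagged "job-a", "job-b", "job-a", a first-available request that is larger than the first extent spills into "job-b" capacity, while the tag-bounded variant skips it and stays within "job-a".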

> > I agree with Jonathan that unless and until the allocation scheme is
> > updated to be tag aware then there is no reason for allocate by tag to
> > exist in the interface.
>
> I will agree that it was perhaps premature to introduce labels on the
> extents. However, I did so to give tags a space to be informationally
> surfaced.
>
> IMO we must have a plan forward or wait until that plan is fully formed
> and implemented. The size of this set is rather large. Therefore, I was
> hoping that a plan would be enough to move forward.

Leave it out for now to focus on the core mechanisms and then we can
circle back to it.

> > That said, the next question, "is DCD enabling considered a toy until
> > the ability to allocate by tag is present?" I think yes, to the point
> > where old daxctl binaries should be made to fail to create device instances
> > by forcing a tag to be selected at allocation time for Sparse DAX
> > Regions.
>
> Interesting. I was not considering allocate by label to be a requirement
> but rather an enhancement. Labels IMO are a further refinement of the
> memory space allocation. I can see a very valid use case (not toy use
> case) where all the DCD memory allocated to a node is dedicated to a
> singular job and is done without tags or even ignoring tags. Many HPC
> sites run with singular jobs per host.

Is HPC going to use DCD? My impression is that HPC is statically
provisioned per node and that DCD is more targeted at Cloud use cases
where dynamic provisioning is common.

> > The last question is whether *writable* tags are needed to allow for
> > repurposing memory allocated to a host without needing to round trip it
> > through the FM to get it re-tagged. While that is something the host and
> > orchestrator can figure out on their own, it looks like a nice to have
> > until the above questions are answered.
>
> Needed? No. Of course not. As you said the orchestrator software can
> keep iterating with the FM until it gets what it wants. It was you who
> had the idea of writable labels and I agreed.

Yeah, it was an idea for how to solve the problem of repurposing a tag
without needing to round trip with the FM.

> "Seemed like a good idea at the time..." ;-)
>
> As I have reviewed and rewritten this message I worry that writable labels
> are a bad idea. Interleaving will most likely depend on grouping extent
> tags into the CXL/DAX extent. With this in mind adjusting extents is
> potentially going to require an FM interaction to get things set up
> anyway.
>
> [Again re-reading my message I thought of another issue. What
> happens if the user decides to change the label on an extent after
> some dax device has been created with the old label? That seems like an additional
> complication which is best left out by not allowing extent labels
> to be writable.]

At least on this point, extents cannot be relabeled while allocated to
an instance.
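
As a rough illustration of that rule, the sketch below rejects a relabel whenever the extent still backs a device instance. It is a Python model only, under the assumption that writable labels exist at all. `Extent`, `ExtentBusyError`, `relabel()`, and the `allocated_to` field are hypothetical names, not anything in the patchset.

```python
from dataclasses import dataclass, field
from typing import Set


class ExtentBusyError(Exception):
    """Raised when a relabel is attempted on an extent that is in use."""


@dataclass
class Extent:
    """Hypothetical extent with a writable label."""
    label: str
    # Names of dax device instances currently using this extent (assumed model).
    allocated_to: Set[str] = field(default_factory=set)


def relabel(extent: Extent, new_label: str) -> None:
    """Change the label only if no device instance is using the extent."""
    if extent.allocated_to:
        raise ExtentBusyError(
            f"extent is allocated to {sorted(extent.allocated_to)}; release it first"
        )
    extent.label = new_label
```

Under this model the label of an in-use extent never changes underneath a device, which sidesteps the "old label" question raised above.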

[..]
> My current view is:
> 1) No. Current dax devices can be defined as 'no label'
> 2) I'm not sure. I can see both ways having benefits.
> 3) No I think the ROI is not worth it.
> 4) The use of 'any extent label' in #2 means that available size
> retains its meaning for no label dax devices. Labeled dax
> devices would require a future enhancement to size information.

If the ABI is going to change in the future I don't want every debug
session to start with "which version of daxctl were you using", or "do
your scripts comprehend Sparse DAX Regions?". This stance is motivated
by having seen the problems that the current ABI causes for people that want
to do things like mitigate the "noisy neighbor" phenomenon in memory
side caches. The allocation ABI is too simple and DCD seems to need
more.

The kernel-enforced requirement for Sparse DAX Region aware tooling just
makes it easier on us to maintain. If it means waiting until we have
agreement on the allocation ABI I think that's a simple release valve.
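
One way to picture that enforcement, purely as a sketch: device creation on a Sparse DAX Region fails unless the caller names a tag, so a tool that predates tags gets an error instead of silently taking first-available capacity. The `create_dax_device()` function, its parameters, and the returned dictionary are hypothetical and written in Python for brevity. They do not describe the existing sysfs ABI or any daxctl option.

```python
from typing import Optional


def create_dax_device(region_is_sparse: bool, size: int,
                      tag: Optional[str] = None) -> dict:
    """Hypothetical creation path that forces tag selection on sparse regions."""
    if size <= 0:
        raise ValueError("size must be positive")
    if region_is_sparse and tag is None:
        # Tag-unaware tooling never passes a tag, so it fails here.
        raise ValueError("Sparse DAX Region: a tag must be selected at allocation time")
    return {"size": size, "tag": tag}
```

Static regions keep working without a tag in this model, so only tooling that targets Sparse DAX Regions has to change.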

The fundamental mechanisms can be reviewed in the meantime.