Re: [RFC PATCH 02/14] mm/hms: heterogeneous memory system (HMS) documentation

From: Jerome Glisse
Date: Wed Dec 05 2018 - 13:01:36 EST


On Wed, Dec 05, 2018 at 10:25:31AM -0700, Logan Gunthorpe wrote:
>
>
> On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
> >>
> >> This came up before for apis even better defined than HMS as well as
> >> more limited scope, i.e. experimental ABI availability only for -rc
> >> kernels. Linus said this:
> >>
> >> "There are no loopholes. No "but it's been only one release". No, no,
> >> no. The whole point is that users are supposed to be able to *trust*
> >> the kernel. If we do something, we keep on doing it.
> >>
> >> And if it makes it harder to add new user-visible interfaces, then
> >> that's a *good* thing." [1]
> >>
> >> The takeaway being don't land work-in-progress ABIs in the kernel.
> >> Once an application depends on it, there are no more incompatible
> >> changes possible regardless of the warnings, experimental notices, or
> >> "staging" designation. DAX is experimental because there are cases
> >> where it currently does not work with respect to another kernel
> >> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> >> to hide behind an experimental designation, and fix them in a way that
> >> preserves the user visible behavior that has already been exposed,
> >> i.e. no regressions.
> >>
> >> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> >
> > So I guess I am heading down the vXX road... such is my life :)
>
> I recommend against it. I really haven't been convinced by any of your
> arguments for having a second topology tree. The existing topology tree
> in sysfs already better describes the links between hardware right now,
> except for the missing GPU links (and those should be addressable within
> the GPU community). Plus, maybe, some other enhancements to socket/NUMA
> node descriptions if there's something missing there.
>
> Then, 'hbind' is another issue, but I suspect it would be better
> implemented as an ioctl on existing GPU interfaces. I certainly can't
> see any benefit in using it myself.
>
> It's better to take an approach that would be less controversial with
> the community than to browbeat them with a patch set 20+ times until
> they take it.

So here is what I am gonna do, because I need this code now. I am gonna
split the helper code that does policy and hbind out from its sysfs
counterpart and turn it into helpers that each device driver can use.
I will move the sysfs and syscall interfaces into a patchset of their
own, built on top of the exact same infrastructure.
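
To make the split concrete, here is a minimal sketch of what such
driver-side helpers might look like. Every name in it (hms_policy,
hms_policy_for_device, hms_hbind_range) is hypothetical, not the
actual HMS code; it only illustrates the policy/hbind core becoming
plain helpers, with the sysfs/syscall frontends as later callers of
the same infrastructure:

#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/mm_types.h>

/*
 * Hypothetical sketch only -- names and signatures are illustrative.
 * The policy/hbind core becomes plain kernel helpers that a device
 * driver can call directly from its own ioctl path.
 */
struct hms_policy;	/* opaque placement policy built by the core */

/* Build a policy that targets a single device's memory. */
struct hms_policy *hms_policy_for_device(struct device *dev, gfp_t gfp);

/* Bind a range of a process address space to that policy. */
int hms_hbind_range(struct mm_struct *mm, unsigned long start,
		    unsigned long len, struct hms_policy *policy);

/* Drop the reference taken by hms_policy_for_device(). */
void hms_policy_put(struct hms_policy *policy);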

This means I am losing a feature: userspace can no longer provide a
list of multiple device memories to use (which is much more common
than you might think), but at least I can provide something for the
single-device case through an ioctl.
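
For that single-device case, a per-driver ioctl could look something
like the sketch below. This is only an illustration; the struct
layout, flags, and ioctl number are made up, not part of any real
driver uapi:

#include <linux/ioctl.h>
#include <linux/types.h>

/*
 * Hypothetical uapi sketch -- not a real interface. A per-driver
 * ioctl can only name the one device it is issued against, which is
 * exactly why the multi-device-memory case still needs the
 * syscall/sysfs route.
 */
struct foo_hbind_args {
	__u64 start;	/* page-aligned start of the range to bind */
	__u64 length;	/* length of the range in bytes */
	__u32 flags;	/* e.g. migrate now vs. migrate on next fault */
	__u32 pad;	/* keep the layout the same on 32/64-bit */
};

#define FOO_IOCTL_HBIND _IOW('F', 0x42, struct foo_hbind_args)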

I am not giving up on the sysfs or syscall interface, as it is needed
long term, so I am gonna keep improving it, port existing userspace
(OpenCL, ROCm, ...) to use it (in a branch), and demonstrate how end
applications use it. I will beat on it again and again until either I
convince people through hard evidence or I get bored. I do not get
bored easily :)

Cheers,
Jérôme