Glad to hear that more sharable code is desirable.
IMHO, for a common MM subsystem, it is more beneficial for
GMEM to extend core MM instead of building a separate one.
As stated at the beginning of my RFC letter, MM systems are
large and similar. Even a sophisticated one like Linux MM
that has evolved over decades still suffers from an increasing
number of bugs[1]. So, directly extending core MM to support
devices not only avoids opening a new box of bugs, but also
allows the community to concentrate on maintaining one single
MM system. On the other hand, GMEM does no harm to core MM
if a CPU process is not attached to any device context.
@Christian, could you provide more information on what AMD
proposed with KFD and why it was rejected?
[1] Huang, Jian, Moinuddin K. Qureshi, and Karsten Schwan. "An Evolutionary Study of Linux Memory Management for Fun and Profit." 2016 USENIX Annual Technical Conference (USENIX ATC 16), 2016.
Thanks,
Weixi
-----Original Message-----
From: Dave Airlie <airlied@xxxxxxxxx>
Sent: Wednesday, November 29, 2023 1:15 PM
To: Christian König <christian.koenig@xxxxxxx>
Cc: zhuweixi <weixi.zhu@xxxxxxxxxx>; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; weixi.zhu@xxxxxxxxxxxx; mgorman@xxxxxxx; jglisse@xxxxxxxxxx; rcampbell@xxxxxxxxxx; jhubbard@xxxxxxxxxx; apopple@xxxxxxxxxx; mhairgrove@xxxxxxxxxx; ziy@xxxxxxxxxx; alexander.deucher@xxxxxxx; Xinhui.Pan@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Felix.Kuehling@xxxxxxx; ogabbay@xxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; jgg@xxxxxxxxxx; leonro@xxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; zhi.a.wang@xxxxxxxxx; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; jani.nikula@xxxxxxxxxxxxxxx; joonas.lahtinen@xxxxxxxxxxxxxxx; rodrigo.vivi@xxxxxxxxx; tvrtko.ursulin@xxxxxxxxxxxxxxx
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
On Tue, 28 Nov 2023 at 23:07, Christian König <christian.koenig@xxxxxxx> wrote:
Am 28.11.23 um 13:50 schrieb Weixi Zhu:
The problem:
Well that is pretty much exactly what AMD has already proposed with KFD
Accelerator driver developers are forced to reinvent external MM subsystems
case by case, because Linux core MM only considers host memory resources.
These reinvented MM subsystems have similar orders of magnitude of LoC as
Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU has
30K. Meanwhile, more and more vendors are implementing their own
accelerators, e.g. Microsoft's Maia 100. At the same time,
application-level developers suffer from poor programmability -- they must
consider parallel address spaces and be careful about the limited device
DRAM capacity. This can be alleviated if a malloc()-ed virtual address can
be shared by the accelerator, or the abundant host DRAM can further
transparently back up the device local memory.
These external MM systems share similar mechanisms except for the
hardware-dependent part, so reinventing them is effectively introducing
redundant code (14K~70K for each case). Such development and maintenance is not
cheap. Furthermore, to share a malloc()-ed virtual address, device drivers
need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU
notifiers/HMM. This raises the bar for driver development, since developers
must understand how Linux MM works. Further, it creates code maintenance
problems -- any changes to Linux MM potentially require coordinated changes
to accelerator drivers using low-level MM APIs.
Putting a cache-coherent bus between host and device will not make these
external MM subsystems disappear. For example, a throughput-oriented
accelerator will not tolerate executing a heavy memory access workload with
a host MMU/IOMMU via a remote bus. Therefore, devices will still have
their own MMU and pick a simpler page table format for lower address
translation overhead, requiring external MM subsystems.
--------------------
What GMEM (Generalized Memory Management [1]) does:
GMEM extends Linux MM to share its machine-independent MM code. Only a
high-level interface is provided for device drivers. This prevents
accelerator drivers from reinventing the wheel, but relies on drivers to
implement their hardware-dependent functions declared by GMEM. GMEM's key
interfaces include gm_dev_create(), gm_as_create(), gm_as_attach() and
gm_dev_register_physmem(). Here is a brief description of how a device
driver uses them:
1. At boot time, call gm_dev_create() and register the implementation of
hardware-dependent functions as declared in struct gm_mmu.
- If the device has local DRAM, call gm_dev_register_physmem() to
register available physical addresses.
2. When a device context is initialized (e.g. triggered by ioctl), check if
the current CPU process has been attached to a gmem address space
(struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
to it.
3. Call gm_as_attach() to attach the device context to a gmem address space.
4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
device computation happens.
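
The four steps above can be sketched as a user-space mock. This is only an illustration of the call order: the function names come from the description above, but every signature, struct layout and the mock_mm stand-in for current->mm are assumptions made up for this sketch, not the declarations in the RFC patches.

```c
/* User-space mock of the driver flow. All signatures and struct fields are
 * hypothetical stand-ins; only the function names come from the RFC text. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct gm_mmu { int (*peer_map)(unsigned long va, unsigned long pa); };
struct gm_dev { const struct gm_mmu *mmu; unsigned long phys_begin, phys_end; };
struct gm_as  { int nr_contexts; };
struct mock_mm { struct gm_as *gm_as; };   /* stands in for current->mm */

static int mock_peer_map(unsigned long va, unsigned long pa)
{
    (void)va; (void)pa;                    /* a real driver would program its MMU */
    return 0;
}
static const struct gm_mmu mock_mmu = { .peer_map = mock_peer_map };

static struct gm_dev *gm_dev_create(const struct gm_mmu *mmu)
{
    struct gm_dev *dev = calloc(1, sizeof(*dev));
    if (dev)
        dev->mmu = mmu;
    return dev;
}

static int gm_dev_register_physmem(struct gm_dev *dev,
                                   unsigned long begin, unsigned long end)
{
    if (!dev || begin >= end)
        return -1;
    dev->phys_begin = begin;
    dev->phys_end = end;
    return 0;
}

static struct gm_as *gm_as_create(void)
{
    return calloc(1, sizeof(struct gm_as));
}

static int gm_as_attach(struct gm_as *as, struct gm_dev *dev)
{
    if (!as || !dev)
        return -1;
    as->nr_contexts++;
    return 0;
}

static int gm_dev_fault(struct mock_mm *mm, unsigned long va, struct gm_dev *dev)
{
    if (!mm->gm_as || !dev->mmu)
        return -1;
    /* Mock resolution: map the faulting address to the start of device DRAM. */
    return dev->mmu->peer_map(va, dev->phys_begin);
}

/* Walks through steps 1-4 in order; returns 0 on success. */
static int demo_driver_flow(struct mock_mm *mm)
{
    int ret = -1;
    struct gm_dev *dev;

    assert(mm != NULL);
    dev = gm_dev_create(&mock_mmu);                      /* step 1 */
    if (!dev)
        return -1;
    if (gm_dev_register_physmem(dev, 0x100000UL, 0x200000UL))
        goto out;
    if (!mm->gm_as)                                      /* step 2 */
        mm->gm_as = gm_as_create();
    if (!mm->gm_as)
        goto out;
    if (gm_as_attach(mm->gm_as, dev))                    /* step 3 */
        goto out;
    ret = gm_dev_fault(mm, 0x1000UL, dev);               /* step 4 */
out:
    free(dev);
    return ret;
}
```

In the mock, a second device context opened by the same process reuses the existing address space instead of creating a new one, which is the point of the check in step 2.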
GMEM has changed the following assumptions in Linux MM:
1. An mm_struct not only handles a single CPU context, but may also handle
external memory contexts encapsulated as gm_context listed in
mm->gm_as. An external memory context can include a few or all of the
following parts: an external MMU (that requires TLB invalidation), an
external page table (that requires PTE manipulation) and external DRAM
(that requires physical memory management).
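
One possible way to picture this changed assumption is the data-structure sketch below. The names gm_as and gm_context appear in the text above; the flag enum, the linked-list layout, the mock_mm stand-in for mm_struct and the helper function are all hypothetical and exist only to illustrate the idea that a context may carry some or all of the three parts.

```c
/* Illustrative layout only: field names, flags and the helper are assumptions,
 * not the RFC's definitions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum gm_ctx_part {
    GM_CTX_MMU        = 1 << 0,  /* external MMU: needs TLB invalidation */
    GM_CTX_PAGE_TABLE = 1 << 1,  /* external page table: needs PTE manipulation */
    GM_CTX_DRAM       = 1 << 2,  /* external DRAM: needs physical memory management */
};

struct gm_context {
    unsigned int parts;          /* bitmask of enum gm_ctx_part */
    struct gm_context *next;     /* next context attached to the same gm_as */
};

struct gm_as   { struct gm_context *contexts; };
struct mock_mm { struct gm_as *gm_as; };   /* stands in for mm_struct */

/* Returns true if any attached context has an external MMU to invalidate. */
static bool mm_needs_external_tlb_flush(const struct mock_mm *mm)
{
    assert(mm != NULL);
    if (!mm->gm_as)
        return false;            /* plain CPU process: no extra work */
    for (const struct gm_context *c = mm->gm_as->contexts; c; c = c->next)
        if (c->parts & GM_CTX_MMU)
            return true;
    return false;
}
```

A process that never attaches a device context keeps gm_as at NULL, so under these assumptions the helper returns immediately for ordinary CPU-only processes.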
and was rejected for rather good reasons.
MMU functions
The MMU functions peer_map() and peer_unmap() overlap other functions,
leaving a question if the MMU functions should be decoupled as more basic
operations. Decoupling them could potentially prevent device drivers
from coalescing these basic steps within a single host-device communication
operation, while coupling them makes it more difficult for device drivers
to utilize the GMEM interface.
Well to be honest all of this sounds like history to me. We have already
seen the same basic approach in KFD, HMM and to some extent in TTM as well.
And all of them more or less failed. Why should this here be different?
Any info we have on why this has failed to work in the past would be
useful to provide. This is one of those cases where we may not have
documented the bad ideas to stop future developers from thinking they
are good.
I do think we would want more common code in this area, but I would
think we'd have it more on the driver infrastructure side, than in the
core mm.
Dave.