Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall
From: Andy Lutomirski
Date: Tue Sep 19 2023 - 14:52:55 EST
On Tue, Sep 19, 2023, at 11:20 AM, Gregory Price wrote:
> On Tue, Sep 19, 2023 at 10:59:33AM -0700, Andy Lutomirski wrote:
>>
>> I'm not complaining about the name. I'm objecting about the semantics.
>>
>> Apparently you have a system to collect usage statistics of physical addresses, but you have no idea what those pages map to (without crawling /proc or /sys, anyway). But that means you have no idea when the logical contents of those pages *change*. So you fundamentally have a nasty race: anything else that swaps or migrates those pages will mess up your statistics, and you'll start trying to migrate the wrong thing.
>
> How does this change if I use virtual address based migration?
>
> I could do sampling based on virtual address (page faults, IBS/PEBS,
> whatever), and by the time I make a decision, the kernel could have
> migrated the data or even my task from Node A to Node B. The sample I
> took is now stale, and I could make a poor migration decision.
The window is a lot narrower. If you’re sampling by VA, you collect stats and associate them with the logical page (the tuple (mapping, VA), for example). The kernel can do this without races from page fault handlers. If you sample based on PA, you fundamentally race against anything that does migration.
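Roughly, I mean something like this (a hypothetical userspace-side sketch only; none of these names exist in the patch set): the hotness record is keyed by the logical page, so a migration of the backing frame doesn't invalidate it.

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical illustration: hotness keyed by the logical page
 * (mapping identity + page-aligned VA) instead of by the physical
 * frame.  If the kernel migrates the backing frame, this record is
 * still about the same data. */
struct hot_page_key {
        pid_t    pid;           /* or an inode/mapping identifier */
        uint64_t vaddr;         /* page-aligned virtual address */
};

struct hot_page_stat {
        struct hot_page_key key;
        uint64_t accesses;      /* samples attributed to this page */
};

/* A table keyed by PFN, by contrast, ends up describing whatever page
 * happens to occupy that frame after a swap or migration. */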
>
> If I do move_pages(pid, some_virt_addr, some_node) and it migrates the
> page from NodeA to NodeB, then the device-side collection is likewise
> no longer valid. This problem doesn't change just because I used a
> virtual address rather than a physical address.
Sure it does, as long as you collect those samples when you migrate. And I think the kernel migrating to or from device memory (or more generally allocating and freeing device memory and possibly even regular memory) *should* be aware of whatever hotness statistics are in use.
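For reference, the existing VA-based interface the quoted text mentions looks roughly like this (a minimal sketch using move_pages(2) via libnuma's numaif.h; error handling is trimmed, the destination node 1 and the helper name are arbitrary):

#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <sys/types.h>

/* Migrate one page, identified by its *virtual* address in the target
 * process, to NUMA node 1.  The kernel resolves VA -> page under the
 * appropriate locks, so a concurrent migration can't redirect the move
 * to an unrelated page. */
static int move_one_page(pid_t pid, void *vaddr)
{
        void *pages[1]  = { vaddr };
        int   nodes[1]  = { 1 };        /* arbitrary destination node */
        int   status[1] = { -1 };

        if (move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE) < 0) {
                perror("move_pages");
                return -1;
        }
        printf("page %p: status %d\n", vaddr, status[0]);
        return 0;
}

A PA-based call would instead name a frame, which is exactly where the "which page am I actually moving?" question above comes in.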
>
> But if I have a 512GB memory device, and I can see a wide swath of that
> 512GB is hot, while a good chunk of my local DRAM is not - then I
> probably don't care *what* gets migrated up to DRAM, I just care that the
> vast majority of that hot data does.
>
> The goal here isn't 100% precision; you will never get there. The goal
> here is broad-scope performance enhancements of the overall system
> while minimizing the cost to compute the migration actions to be taken.
>
> I don't think the contents of the page are always relevant. The entire
> concept here is to enable migration without caring about what programs
> are using the memory for - just so long as the memcgs and zoning are
> respected.
>
At the very least I think you need to be aware of page *size*. And if you want to avoid excessive fragmentation, you probably also want to be aware of the boundaries of a logical allocation.
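To illustrate just the page-size part: a PA-driven migrator in userspace would have to consult something like /proc/kpageflags even to know whether a frame is a base page or part of a compound page before deciding what unit to move. A rough sketch (root-only interface; the helper name is made up, and the bit numbers are as listed in the kernel's pagemap documentation):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Bits per Documentation/admin-guide/mm/pagemap.rst (/proc/kpageflags). */
#define KPF_COMPOUND_HEAD 15
#define KPF_COMPOUND_TAIL 16
#define KPF_THP           22

/* Return >0 if the frame at 'pfn' is part of a compound/huge page,
 * 0 if it is a base page, -1 on error. */
static int pfn_is_compound(uint64_t pfn)
{
        uint64_t flags;
        int fd = open("/proc/kpageflags", O_RDONLY);

        if (fd < 0)
                return -1;
        if (pread(fd, &flags, sizeof(flags),
                  pfn * sizeof(flags)) != sizeof(flags)) {
                close(fd);
                return -1;
        }
        close(fd);
        return !!(flags & ((1ULL << KPF_COMPOUND_HEAD) |
                           (1ULL << KPF_COMPOUND_TAIL) |
                           (1ULL << KPF_THP)));
}

And that still says nothing about where the surrounding logical allocation begins and ends.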
I think that doing this entire process by PA, blind, from userspace will end up stuck in a not-so-good solution; the ABI will be set in stone, and it will not be a great situation for long-term maintainability or performance.