Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

From: David Hildenbrand
Date: Tue Jul 02 2024 - 09:50:53 EST


Changing to this new behaviour would only be a partial solution for your use
case, since you would still have 2 faults. But it would remove the cost of the
shattering and ensure you have a THP immediately after the write fault. But I
can't think of a reason why this wouldn't be a generally useful change
regardless? Any thoughts?

The "let's read before we write" as used by QEMU migration code is the desire
to not waste memory by populating the zeropages. Deferring consuming memory
until really required.

     /*
      * We read one byte of each page; this will preallocate page tables if
      * required and populate the shared zeropage on MAP_PRIVATE anonymous
      * memory where no page was populated yet. This might require adaption
      * when supporting other mappings, like shmem.
      */
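
For illustration, the pattern that comment describes boils down to something
like this (a sketch, not QEMU's actual code; the names are made up):

    #include <stddef.h>

    /* Sketch: prefault a region via read faults. On MAP_PRIVATE
     * anonymous memory each read fault maps the shared zeropage, so no
     * real memory is consumed. */
    static void populate_read(char *start, size_t len, size_t page_size)
    {
        for (size_t off = 0; off < len; off += page_size)
            (void)*(volatile char *)(start + off); /* value discarded */
    }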

So QEMU is concerned with preallocating page tables? I would have thought you
could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
through the allocator, fewer pud/pmd lock/unlocks, etc).

I think we are only concerned about the "shared zeropage" part. Everything else
is just unnecessary detail that adds confusion here :) One requires the other
(populating the zeropage implies allocating the page tables).
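
(As an aside, a single-syscall variant of that exists these days:
MADV_POPULATE_READ (Linux 5.14+) prefaults a whole range readable, which on
anonymous MAP_PRIVATE memory maps the shared zeropage. A sketch, assuming
such a mapping and headers that define the flag:)

    #include <sys/mman.h>

    /* Sketch: prefault an entire region with one syscall instead of one
     * read fault per page; anonymous MAP_PRIVATE memory gets the shared
     * zeropage, so no real memory is consumed. Linux >= 5.14 only. */
    static int populate_read_region(void *start, size_t len)
    {
        return madvise(start, len, MADV_POPULATE_READ);
    }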


BTW, I was quoting the wrong QEMU code. The relevant QEMU commit that first added
that handling is:

commit 211ea74022f51164a7729030b28eec90b6c99a08
Author: Peter Lieven <pl@xxxxxxx>
Date: Mon Jun 10 12:14:20 2013 +0200

    migration: do not overwrite zero pages

    on incoming migration do not memset pages to zero if they already read as
    zero. this will allocate a new zero page and consume memory unnecessarily.
    even if we madvise a MADV_DONTNEED later this will only deallocate the
    memory asynchronously.

(note that the MADV_DONTNEED handling in that form does not really apply
anymore AFAIK)
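
For illustration, the idea from that commit in a few lines (a sketch, not
the actual QEMU code; QEMU uses an optimized helper, buffer_is_zero(), for
the check):

    #include <stdbool.h>
    #include <string.h>

    /* Sketch: does the page read as all zeroes? */
    static bool page_is_zero(const char *page, size_t size)
    {
        return page[0] == 0 && !memcmp(page, page + 1, size - 1);
    }

    /* When the source sent an all-zero page, only memset the destination
     * if it does not already read as zero. An untouched destination page
     * reads as zero via the (shared) zeropage, and skipping the write
     * keeps it that way instead of allocating real memory. */
    static void handle_incoming_zero_page(char *host, size_t page_size)
    {
        if (!page_is_zero(host, page_size))
            memset(host, 0, page_size);
    }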


Sorry I don't quite follow your comment. As I understand it, the zeropage
concept is intended as a memory-saving mechanism for applications that read
memory but never write it.

"never written" -- then we wouldn't support COW faults on it, right? :P

I don't think that really applies in your migration
case, because you are definitely going to write all the memory eventually, I
think?

Especially with memory ballooning and things like free-page-reporting we might
be migrating a whole bunch of zero-memory and only have to make sure that the
destination is actually zero. We don't want to consume memory.

I recently fixed that handling for s390x where customers were running into
precisely that issue (and zeropages in s390x VMs were disabled for historical
reasons).

So I guess you are not interested in the "memory-saving" property, but in
the side-effect, which is the pre-allocation of page tables? (If you just
wanted the memory-saving property, why not just skip the "read one byte of
each page" op?) It's not important though, so let's not go down the rabbit
hole.

There are cases where, during an incoming migration, some memory pages might
already have been populated and might not be zero. And IIRC there are some cases
(postcopy -> uffd) where I think it is important that we actually do have a page
in place. But I forgot some of the details around userfaultfd handling in QEMU.



Note that this is from migration code where we're supposed to write a single
page we received from the migration source right now (not more). And we want to
avoid allocating memory where we can (usually for overcommit).




TBH I always assumed in the past that the huge zero page is only useful because
it's a placeholder for a real THP that would be populated on write. But that's
obviously not the case at the moment. So other than a hack to preallocate the
pgtables with only 1 fault per 2M, what other benefits does it have?

I don't quite understand that question. [2] has some details on why the huge
zeropage was added -- without it, with THP enabled we would never have received
huge zeropages at all, but always anon THPs directly on read.




Without THP this works as expected. With THP this currently also works as
expected, but of course with the price [1] of not getting anon THP
immediately, which usually we don't care about. As you note, khugepaged might
fix this up later.

If we disable the huge zeropage, we would get anon THPs when reading instead of
small zeropages.
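
For illustration, a small demo of that difference (a sketch; it assumes THP
is enabled, the names and layout are made up, and the write-fault behaviour
is as discussed above):

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t thp = 2UL << 20; /* 2 MiB */

        /* Over-allocate so we can pick a 2MiB-aligned window. */
        char *raw = mmap(NULL, 2 * thp, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
            return 1;
        char *p = (char *)(((uintptr_t)raw + thp - 1) & ~(thp - 1));

        volatile char c = p[0]; /* read fault: maps the huge zeropage */
        p[0] = c + 1;           /* write fault: currently shatters to a
                                   PTE-mapped 4K page, no anon THP yet */

        /* Check AnonHugePages in /proc/<pid>/smaps at this point. */
        printf("pid %d: inspect smaps, then press enter\n", (int)getpid());
        getchar();
        return 0;
    }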

I wasn't aware of that behaviour either. Although that sounds like another
reason why allocating a THP in place of the huge zero page on write fault
would be the "more consistent" behaviour.

Reading [2], I think the huge zeropage was added to avoid allocating THPs
on read. Maybe really only for large readable regions; I'm not sure why
exactly.

I might raise this on the THP call tomorrow, if Kirill joins, and get his view.

Makes sense.

--
Cheers,

David / dhildenb