Changing to this new behaviour would only be a partial solution for your use
case, since you would still have 2 faults. But it would remove the cost of the
shattering and ensure you have a THP immediately after the write fault. And I
can't think of a reason why this wouldn't be a generally useful change
regardless? Any thoughts?
The "let's read before we write" as used by QEMU migration code is the desire
to not waste memory by populating the zeropages. Deferring consuming memory
until really required.
/*
 * We read one byte of each page; this will preallocate page tables if
 * required and populate the shared zeropage on MAP_PRIVATE anonymous
 * memory where no page was populated yet. This might require adaption
 * when supporting other mappings, like shmem.
 */
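For completeness, here's a minimal sketch of what such a prefault loop boils
down to (illustrative only, not QEMU's actual code; the function name and
parameters are mine):

#include <stddef.h>

static void prefault_by_reading(volatile const unsigned char *start,
                                size_t length, size_t page_size)
{
    size_t offset;

    /*
     * Touch one byte of each host page. The read fault preallocates
     * page tables and maps the shared (or huge) zeropage on MAP_PRIVATE
     * anonymous memory, so no "real" memory is consumed yet.
     */
    for (offset = 0; offset < length; offset += page_size) {
        (void)start[offset];
    }
}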
So QEMU is concerned with preallocating page tables? I would have thought you
could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
through the allocator, fewer pud/pmd lock/unlocks, etc).
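To make that concrete: as far as I know there is no MADV_POPULATE_PGTABLE
today, but MADV_POPULATE_READ (Linux 5.14+) looks like the closest existing
interface -- it populates the whole range via read faults in a single syscall,
so MAP_PRIVATE anonymous memory ends up with the (huge) zeropage and its page
tables without a user-space loop. Rough sketch (the helper name and error
handling are mine):

#include <stdio.h>
#include <sys/mman.h>

/* Needs kernel/libc headers new enough to define MADV_POPULATE_READ. */
static int prefault_with_madvise(void *start, size_t length)
{
    /*
     * One syscall for the whole range instead of one read fault per
     * chunk; the kernel sets up page tables and zeropage mappings in
     * a single pass.
     */
    if (madvise(start, length, MADV_POPULATE_READ) != 0) {
        perror("madvise(MADV_POPULATE_READ)");
        return -1;
    }
    return 0;
}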
I think we are only concerned about the "shared zeropage" part. Everything else
is just unnecessary detail that adds confusion here :) One requires the other.
Sorry I don't quite follow your comment. As I understand it, the zeropage
concept is intended as a memory-saving mechanism for applications that read
memory but never write it.
I don't think that really applies in your migration
case, because you are definitely going to write all the memory eventually, I
think?
So I guess you are not interested in the "memory-saving" property, but in
the side-effect, which is the pre-allocation of page tables? (If you just wanted
the memory-saving property, why not just skip the "read one byte of each page"
op?) It's not important though, so let's not go down the rabbit hole.
Note that this is from migration code where we're supposed to write a single
page we received from the migration source right now (not more). And we want to
avoid allocating memory where we can (usually for overcommit).
TBH, in the past I always assumed that the huge zero page is only useful because
it's a placeholder for a real THP that would be populated on write. But that's
obviously not the case at the moment. So other than a hack to preallocate the
pgtables with only 1 fault per 2M, what other benefits does it have?
I don't quite understand that question. [2] has some details on why the huge
zeropage was added -- because otherwise, with THP enabled, we would never have
gotten zeropages on read at all, but always anon THPs directly.
Without THP this works as expected. With THP this currently also works as
expected, but of course at the price [1] of not getting anon THPs
immediately, which we usually don't care about. As you note, khugepaged might
fix this up later.
If we disable the huge zeropage, we would get anon THPs when reading instead of
small zeropages.
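FWIW, this is easy to poke at from user space. Rough test sketch (my own, just
for observing the behaviour; it assumes THP is set to "always", and a real test
would match the exact address range in smaps instead of dumping every
AnonHugePages line):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL * 1024 * 1024)

/* Crude: prints every AnonHugePages line in /proc/self/smaps. */
static void dump_anon_huge_pages(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/smaps", "r");

    if (!f)
        return;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "AnonHugePages:", 14))
            fputs(line, stdout);
    }
    fclose(f);
}

int main(void)
{
    /*
     * Over-map so we can carve out a 2M-aligned, THP-eligible range.
     * With THP set to "madvise" you'd also need
     * madvise(mem, SZ_2M, MADV_HUGEPAGE) on the aligned range.
     */
    char *map = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    volatile char *mem;

    if (map == MAP_FAILED)
        return 1;
    mem = (volatile char *)(((uintptr_t)map + SZ_2M - 1) & ~(SZ_2M - 1));

    (void)mem[0];   /* read fault: huge zeropage (small zeropages if disabled) */
    mem[0] = 1;     /* write fault: today this does not give an anon THP */
    dump_anon_huge_pages();
    return 0;
}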
I wasn't aware of that behaviour either. Although that sounds like another
reason why allocating a THP over the huge zero page on write fault should be the
"more consistent" behaviour.
Reading [2], I think the huge zeropage was added to avoid the allocation of THPs
on read -- maybe really only for large readable regions, I'm not sure why exactly.
I might raise this on the THP call tomorrow, if Kyril joins, and get his view.