On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
Currently, copy-on-write is only used for the mapped memory; the child
process still needs to copy the entire page table from the parent
process during forking. The parent process might take a lot of time and
memory to copy the page table when the parent has a big page table
allocated. For example, the memory usage of a process after forking with
1 GB mapped memory is as follows:
For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:
Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds
AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds
Given that the page table is not copied for the child, I was expecting
the performance to be better with the COW kernel, and also not to
depend on the size of the parent.
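
For readers who want to repeat this kind of measurement, here is a
minimal sketch of a fork() latency test. It is not the program that
produced the numbers above; the function name fork_latency_per_gib()
and the choice of an anonymous private mapping populated with memset()
are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the time spent in fork(), in seconds per
 * GiB of populated anonymous memory, or a negative value on error.
 */
double fork_latency_per_gib(size_t bytes)
{
    struct timespec t0, t1;
    char *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    /* Touch every page so the parent's page tables are fully populated. */
    memset(buf, 1, bytes);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);   /* child: leave immediately, never touch the mapping */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (pid < 0) {
        munmap(buf, bytes);
        return -1.0;
    }
    waitpid(pid, NULL, 0);
    munmap(buf, bytes);

    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs / ((double)bytes / (1024.0 * 1024.0 * 1024.0));
}
```

Calling fork_latency_per_gib() from main() with a 1 GiB size and
printing the result with "%f" gives output in the same shape as the
figures quoted above. Only the parent-side cost of fork() is timed; the
child exits without writing, so no COW faults are included.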
Yes, the child won't duplicate the page table, but fork will still
traverse all the page table entries to do the accounting.
And, since this patch extends COW to the PTE table level, the
granularity is no longer a single mapped page (page table entry), so we
have to guarantee that every mapped page in such a page table is
available for COW mapping.
This kind of checking also costs some time.
As a result, because of the accounting and the checking, the COW PTE
fork still depends on the size of the parent, so the improvement might
not be significant.
The current version of the series does not provide any performance
improvements for fork(). I would recommend removing claims from the
cover letter about better fork() performance, as this may be
misleading for those looking for a way to speed up forking.
From v3 to v4, I changed the implementation of the COW fork() part to do
the accounting and checking. At that time, I also removed most of the
descriptions of better fork() performance. Maybe that was not enough
and some of the text is still misleading. I will fix this in the next
version.
Thanks.
In my case, I was looking to speed up Redis OSS, which relies on fork() to
create consistent snapshots for driving replicates/backups. The O(N)
per-page operation causes fork() to be slow, so I was hoping that this
series, which does not duplicate the VA during fork(), would make the
operation much quicker.
Indeed, at first, I tried to avoid the O(N) per-page operation by
deferring the accounting and the swap stuff to the page fault. But,
as I mentioned, it's not suitable for the mainline.
Honestly, for improving fork(), I have an idea to skip the per-page
operation without breaking the logic. However, this would introduce a
complicated mechanism and may add overhead for other features. It
might not be worth it. It's hard to strike a balance between an
over-complicated mechanism with (probably) better performance and data
consistency with the page status. So, I would focus on a safe and
stable approach at first.
Actually, in RFC v1 and v2, we proposed a version that skipped that
work, and we got a significant improvement. You can see the numbers in
the RFC v2 cover letter [1]:
"In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
for normal fork"
I suspect the 93% improvement (when the mapcount was not updated) was
only for VAs with 4K pages. With 2M mappings this series did not
provide any benefit. Is this correct?
Yes. In this case, the COW PTE performance is similar to the normal
fork().
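
To put rough numbers on the 4K versus 2M difference, here is a small
sketch of the arithmetic. It assumes x86-64 with 4K base pages, 2M
huge pages mapped directly at the PMD level, and 512 entries per PTE
table; the helper names are made up for illustration and are not
kernel functions.

```c
#include <stddef.h>

/* Assumption: x86-64, where one PTE table holds 512 entries. */
#define ENTRIES_PER_PTE_TABLE 512UL

/* Number of leaf mappings needed to cover `bytes` with `page_size` pages. */
size_t leaf_entries(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Number of PTE tables needed when `bytes` is mapped with 4K pages. */
size_t pte_tables_4k(size_t bytes)
{
    size_t ptes = leaf_entries(bytes, 4096);
    return (ptes + ENTRIES_PER_PTE_TABLE - 1) / ENTRIES_PER_PTE_TABLE;
}
```

Under these assumptions, 1 GiB mapped with 4K pages needs 262144 PTEs
spread over 512 PTE tables, while the same 1 GiB mapped with 2M pages
needs only 512 PMD-level entries and no PTE tables at all. That would
be consistent with sharing PTE tables having little left to save in
the 2M case.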