On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote:
On Tue, 26 Nov 2019, Yang Shi wrote:
I don't have a firm position here. Maybe you are right and we should try
On 11/25/19 11:33 AM, Yang Shi wrote:
Sorry, I haven't managed to set aside enough time for this until now.
On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
Currently, when truncating a shmem file, if the range only partially covers
a THP (start or end is in the middle of the THP), the pages are just cleared
rather than freed; they are freed only if the range covers the whole THP.
Even if all the subpages are truncated (randomly or sequentially), the THP
may still be kept in the page cache. This might be fine for some use cases
which prefer preserving THP.
But when doing balloon inflation, QEMU actually does hole punch or
MADV_DONTNEED at base page size granularity if hugetlbfs is not used. So,
when using shmem THP as the memory backend, QEMU inflation doesn't work as
expected since it doesn't free memory, yet the inflation use case really
needs to get the memory freed. Anonymous THP will not get freed right away
either, but it will be freed eventually when all subpages are unmapped,
whereas shmem THP would still stay in the page cache.
To protect the use cases which may prefer preserving THP, introduce a new
fallocate mode, FALLOC_FL_SPLIT_HPAGE, which means splitting the THP is the
preferred behavior when truncating a partial THP. This mode only makes
sense for tmpfs for the time being.
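For reference, a minimal userspace sketch of the base-page-granularity hole
punch described above, using only the existing FALLOC_FL_PUNCH_HOLE and
FALLOC_FL_KEEP_SIZE modes on a memfd (which is shmem-backed). The helper
names make_shmem_file(), punch_base_page() and allocated_blocks() are
hypothetical and exist only for illustration; the proposed
FALLOC_FL_SPLIT_HPAGE mode is deliberately not used, since it is not part of
any released uapi header. Whether st_blocks actually drops after the punch
is expected to depend on the kernel and on the shmem THP settings, which is
the point under discussion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>          /* fallocate() */
#include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <string.h>
#include <sys/mman.h>       /* memfd_create() */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a shmem-backed file of len bytes, filled
 * with 0xAA so that every page is actually allocated. Returns fd or -1. */
static int make_shmem_file(size_t len)
{
    char buf[4096];
    size_t done = 0;
    int fd = memfd_create("balloon-demo", 0);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0xAA, sizeof(buf));
    while (done < len) {
        size_t chunk = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, (off_t)done);

        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    return fd;
}

/* Hypothetical helper: punch exactly one base page at a page-aligned
 * offset, keeping the file size unchanged. Returns 0 or -1 (errno set). */
static int punch_base_page(int fd, off_t offset, long page_size)
{
    assert(offset % page_size == 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, (off_t)page_size);
}

/* Hypothetical helper: number of 512-byte blocks currently accounted to
 * the file, or -1 on error. Compare before and after punching. */
static long long allocated_blocks(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_blocks;
}
```

A caller would create the file with make_shmem_file(), record
allocated_blocks(), call punch_base_page() for each page it wants to give
back, and compare allocated_blocks() again. The punched range reads back as
zeroes either way; only the accounting shows whether memory was freed.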
First off, let me say that I firmly believe this punch-split behavior
should be the standard behavior (like in my huge tmpfs implementation),
and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
But I don't know if I'll be able to persuade Kirill of that.
If the caller wants to write zeroes into the file, she can do so with the
write syscall: the caller has asked to punch a hole or truncate the file,
and in our case, like your QEMU case, hopes that memory and memcg charge
will be freed by doing so. I'll be surprised if changing the behavior
to yours and mine turns out to introduce a regression, but if it does,
I guess we'll then have to put it behind a sysctl or whatever.
IIUC the reason that it's currently implemented by clearing the hole
is because split_huge_page() (unlike in older refcounting days) cannot
be guaranteed to succeed. Which is unfortunate, and none of us is very
keen to build a filesystem on unreliable behavior; but the failure cases
appear in practice to be rare enough, that it's on balance better to give
the punch-hole-truncate caller what she asked for whenever possible.
to split pages right away.
It might be useful to consider cases wider than shmem.
On a traditional filesystem with backing storage, the semantics of the same
punch hole operation are somewhat different. It doesn't have explicit
implications for memory footprint. It's about managing persistent storage.
With shmem/tmpfs the two are lumped together.
It might be nice to write down pages that can be discarded under memory
pressure and leave the huge page intact until then...
[ I don't see a problem with your patch as long as we agree that it's
the desired semantics for the interface. ]