Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE

From: Andrea Arcangeli
Date: Wed May 24 2017 - 11:23:08 EST


Hello,

On Wed, May 24, 2017 at 05:27:36PM +0300, Mike Rapoport wrote:
> khugepaged does skip over VMAs which have userfault. We could register the
> regions with userfault before populating them to avoid collapses in the
> transition period. But then we'll have to populate these regions with
> UFFDIO_COPY which adds quite an overhead.

Yes, in fact with postcopy-only mode, there's no issue because of the
above.

The case where THP has to be temporarily disabled by CRIU is before
postcopy/userfaults engage, i.e. during the precopy phase of a
precopy+postcopy mode.

QEMU's preferred mode is to do one pass of precopy before starting
postcopy/userfaults. During the QEMU precopy phase VM_HUGEPAGE is set
for maximum performance and to back as many readonly (i.e. not
redirtied by the source) pages as possible with THP in the
destination. The dirty
logging in the source happens at 4k granularity by forcing the KVM
shadow MMU to map all pages at 4k granularity and by tracking the
dirty bit in software for the updates happening through the primary
MMU (the Linux pagetable dirty bits are ignored because soft dirty
would be too slow: its complexity is O(N) where N is the size of the
VM, not the number of pages re-dirtied in a precopy pass). After
that we track which 4k pages aren't uptodate on the
destination and we zap them at 4k granularity with MADV_DONTNEED (we
badly need madvisev in fact to reduce the totally unnecessary flood of
4k wide MADV_DONTNEED there). So before issuing the MADV_DONTNEED
flood, QEMU sets VM_NOHUGEPAGE, and after calling UFFDIO_REGISTER QEMU
sets VM_HUGEPAGE back (as the UFFDIO registration will keep khugepaged
at bay until postcopy completes). QEMU then finally calls
UFFDIO_UNREGISTER and khugepaged starts compacting everything that was
migrated through 4k wide userfaults.
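
As a rough illustration of the first half of that sequence, the sketch
below sets VM_NOHUGEPAGE with madvise(MADV_NOHUGEPAGE) and then issues
one MADV_DONTNEED per not-uptodate page. The helper name
zap_stale_pages() and the layout of the uptodate bitmap are made up for
the example and are not QEMU's actual code; the UFFDIO_REGISTER step is
left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not QEMU code.
 * Bit i of `uptodate` is set when page i of the range is uptodate.
 * Returns the number of pages zapped, or -1 if a MADV_DONTNEED fails.
 */
static long zap_stale_pages(void *base, size_t npages,
                            const unsigned char *uptodate)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page, i;
    long zapped = 0;

    assert(ps > 0);
    page = (size_t)ps;

    /*
     * Set VM_NOHUGEPAGE first so khugepaged does not collapse the
     * range while it has small-page holes. The result is ignored
     * because a kernel built without THP rejects the advice.
     */
    (void)madvise(base, npages * page, MADV_NOHUGEPAGE);

    /* The MADV_DONTNEED flood: one call per stale page. */
    for (i = 0; i < npages; i++) {
        if (uptodate[i / 8] & (1u << (i % 8)))
            continue; /* page is uptodate, keep it */
        if (madvise((char *)base + i * page, page, MADV_DONTNEED) != 0)
            return -1;
        zapped++;
    }
    return zapped;
}
```

Once UFFDIO_REGISTER has succeeded on the range, the caller would
follow up with madvise(base, len, MADV_HUGEPAGE) to set VM_HUGEPAGE
again.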

To keep things simple, CRIU doesn't attempt to populate the
destination with THP at all, but the problem is similar. It still has
to set VM_NOHUGEPAGE somehow for the whole precopy phase, precisely
to avoid having to call MADV_DONTNEED to zap 4k not-uptodate
fragments.

QEMU gets away with setting VM_NOHUGEPAGE and then going back to
VM_HUGEPAGE without any issue because it's cooperative. CRIU, by
contrast, has to restore the same vm_flags that the vma had in the
source to avoid changing the behavior of the app after
precopy+postcopy completes. This is where the need to clear the
VM_*HUGEPAGE bits from vm_flags comes into play.
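
To make the restore problem concrete, here is a minimal sketch. It
assumes the source vma state is read from the VmFlags line of
/proc/<pid>/smaps, where "hg" marks VM_HUGEPAGE and "nh" marks
VM_NOHUGEPAGE. The enum and function names are illustrative only and
are not taken from CRIU.

```c
#include <assert.h>
#include <string.h>

/* The three THP states a source vma can be in. */
enum thp_state {
    THP_DEFAULT,  /* neither VM_HUGEPAGE nor VM_NOHUGEPAGE set */
    THP_ADVISED,  /* VM_HUGEPAGE set   ("hg" in VmFlags) */
    THP_DISABLED  /* VM_NOHUGEPAGE set ("nh" in VmFlags) */
};

/*
 * Hypothetical helper: classify a "VmFlags:" line from smaps.
 * Tokens are separated by whitespace. The "VmFlags:" prefix is
 * longer than two characters, so it can never match a flag.
 */
static enum thp_state thp_state_from_vmflags(const char *line)
{
    enum thp_state st = THP_DEFAULT;
    const char *p = line;

    assert(line != NULL);

    while (*p) {
        size_t len;

        p += strspn(p, " \t\n");     /* skip separators */
        len = strcspn(p, " \t\n");   /* length of the next token */
        if (len == 2 && memcmp(p, "hg", 2) == 0)
            st = THP_ADVISED;
        else if (len == 2 && memcmp(p, "nh", 2) == 0)
            st = THP_DISABLED;
        p += len;
    }
    return st;
}
```

THP_ADVISED and THP_DISABLED can be restored with MADV_HUGEPAGE and
MADV_NOHUGEPAGE. THP_DEFAULT is the case with no existing madvise
once VM_NOHUGEPAGE has been set for precopy, which is what the
proposed MADV_CLR_HUGEPAGE is meant to cover.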

Thanks,
Andrea