Re: kvm splat in mmu_spte_clear_track_bits

From: Andrea Arcangeli
Date: Tue Aug 29 2017 - 10:09:43 EST


Hello,

On Tue, Aug 29, 2017 at 02:59:23PM +0200, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 02:45:41PM +0200, Takashi Iwai wrote:
> > [Put more people to Cc, sorry for growing too much...]
>
> We're all interested in 4.13.0 not crashing on us, so that's ok.
>
> > On Tue, 29 Aug 2017 11:19:13 +0200,
> > Bernhard Held wrote:
> > >
> > > On 08/28/2017 at 06:56 PM, Nadav Amit wrote:
> > > > Don't blame me for the TLB stuff... My money is on aac2fea94f7a .
> > >
> > > Amit, thanks for your courage to expose your patch!
> > >
> > > I'm more and more confident that aac2fea94f7a is the culprit. Maybe it
> > > just accelerates the triggering of the splat. To be more sure, the
> > > kernel needs to be tested for a couple of days. It would be great if
> > > others could assist in testing aac2fea94f7a.
> >
> > I've been testing with the revert for a while and it seems to work.
>
> With nothing but aac2fea94f7a reverted, no explosions for me either.

The aforementioned commit has 3 bugs.

1) mmu_notifier_invalidate_range cannot be used as a replacement for
mmu_notifier_invalidate_range_start/end. For KVM,
mmu_notifier_invalidate_range is a noop, and rightfully so: an MMU
notifier implementation has to implement either the
->invalidate_range method or the invalidate_range_start/end
methods, not both. And if you implement invalidate_range_start/end,
as KVM is forced to do, a call to mmu_notifier_invalidate_range in
common code is a noop for KVM.

For those MMU notifiers that can get away with implementing only
->invalidate_range, ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetables with the primary MMU (like AMD
iommuv2) can get away with implementing only ->invalidate_range.
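
To see why, this is roughly what the range_end side does (a
paraphrased sketch of mm/mmu_notifier.c from that era, with the SRCU
locking omitted; not the verbatim source):

	void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
						 unsigned long start,
						 unsigned long end)
	{
		struct mmu_notifier *mn;

		hlist_for_each_entry(mn, &mm->mmu_notifier_mm->list,
				     hlist) {
			/*
			 * ->invalidate_range fires here too, so
			 * secondary MMUs that only implement
			 * ->invalidate_range (AMD iommuv2) still see
			 * invalidates issued through the start/end
			 * pair, while KVM is served by its
			 * ->invalidate_range_end method.
			 */
			if (mn->ops->invalidate_range)
				mn->ops->invalidate_range(mn, mm,
							  start, end);
			if (mn->ops->invalidate_range_end)
				mn->ops->invalidate_range_end(mn, mm,
							      start, end);
		}
	}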

So all cases (THP on/off) are broken right now.

To fix this it is enough to replace mmu_notifier_invalidate_range
with a mmu_notifier_invalidate_range_start/mmu_notifier_invalidate_range_end
pair. Either that, or go back to calling mmu_notifier_invalidate_page
multiple times as before.
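
A minimal sketch of the first option at the try_to_unmap_one() call
site (the surrounding context is illustrative, not a verbatim patch):

	/* broken: a noop for KVM, which has no ->invalidate_range */
	mmu_notifier_invalidate_range(mm, address,
				      address + PAGE_SIZE);

	/* fixed: the pair reaches KVM's start/end methods */
	mmu_notifier_invalidate_range_start(mm, address,
					    address + PAGE_SIZE);
	/* ... clear the pte and flush ... */
	mmu_notifier_invalidate_range_end(mm, address,
					  address + PAGE_SIZE);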

2) address + (1UL << compound_order(page)) is buggy, it should be
address + (PAGE_SIZE << compound_order(page)): the range end is in
bytes, not pages, so for a 2M THP the length is 2M, not 512.
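
Spelled out for a 2M THP with 4k pages (compound_order(page) == 9):

	unsigned long wrong = 1UL << compound_order(page);
	/* 1 << 9 == 512 bytes, not even one full 4k pte */

	unsigned long right = PAGE_SIZE << compound_order(page);
	/* 4096 << 9 == 2097152 bytes == 2M, the whole THP */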

3) The whole invalidate_range thing was an attempt to issue a single
invalidate while walking the multiple 4k ptes that map the same THP
(after a pmd virtual split without a physical split of the compound
page). It's unclear whether the rmap_walk will always pass a
2M-aligned address to try_to_unmap_one in the presence of THP. I
think it also needs an
address &= ~((PAGE_SIZE << compound_order(page)) - 1) to be safe.
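
A sketch of that defensive alignment (whether rmap_walk already
guarantees it is exactly the open question above):

	/*
	 * Align down to the compound page boundary so a single
	 * invalidate covers the whole THP, whichever 4k subpage
	 * the rmap walk handed us.
	 */
	unsigned long size  = PAGE_SIZE << compound_order(page);
	unsigned long start = address & ~(size - 1);
	unsigned long end   = start + size;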

The other bug, where you can reproduce the same corruption with OOM,
is unrelated and caused by the OOM reaper. The OOM reaper was even
corrupting data if a task was writing to disk and got stuck in OOM
inside the write() syscall or an async io write.

To fix the KVM corruption in the OOM reaper, it needs to call
mmu_notifier_invalidate_range_start/end around
oom_kill.c:unmap_page_range. This additional
mmu_notifier_invalidate_range_start will not be good for the OOM
reaper, because it's yet another case (like the mmap_sem held for
writing) that will prevent the OOM reaper from running, hindering its
ability to hide XFS OOM deadlocks and making those resurface. That is
not a problem in the KVM case, because KVM serializes against the
secondary MMU activity with a spinlock and the KVM critical section
under that spinlock isn't going to allocate memory; but in general
range_start can schedule or block on slow hardware where the
secondary MMU is accessed through PCI (not the KVM case).
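
A minimal sketch of that OOM reaper fix, assuming the per-vma unmap
loop of mm/oom_kill.c from that era (treat the exact context and
arguments as illustrative):

	/* in the vma loop of __oom_reap_task_mm() */
	mmu_notifier_invalidate_range_start(mm, vma->vm_start,
					    vma->vm_end);
	unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
			 NULL);
	mmu_notifier_invalidate_range_end(mm, vma->vm_start,
					  vma->vm_end);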

My preference is still to make the OOM reaper a config option and let
it grow into the VM at zero cost when disabled, while at the same
time keeping the option of a simpler VM and spending the time fixing
the filesystem bugs instead (while still being able to reproduce them
more easily with the OOM reaper disabled).

Thanks,
Andrea