Re: [PATCH 08 of 11] anon-vma-rwsem

From: Andrea Arcangeli
Date: Wed May 07 2008 - 19:41:22 EST

Next message: Yinghai Lu: "Re: [PATCH] mptsas: add mptsas_shutdown to call pci_disable_msi"
Previous message: Nishanth Aravamudan: "Re: [PATCH] x86: fix PAE pmd_bad bootup warning"
In reply to: Linus Torvalds: "Re: [PATCH 08 of 11] anon-vma-rwsem"
Next in thread: Linus Torvalds: "Re: [PATCH 08 of 11] anon-vma-rwsem"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Andrew,

On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote:
> CPU0: CPU1:
>
> spin_lock(global_lock)
> spin_lock(a->lock); spin_lock(b->lock);
================== mmu_notifier_register()
> spin_lock(b->lock); spin_unlock(b->lock);
> spin_lock(a->lock);
> spin_unlock(a->lock);
>
> also OK.

But the problem is that we've to stop the critical section in the
place I marked with "========" while mmu_notifier_register
runs. Otherwise the driver calling mmu_notifier_register won't know if
it's safe to start establishing secondary sptes/tlbs. If the driver
will establish sptes/tlbs with get_user_pages/follow_page the page
could be freed immediately later when zap_page_range starts.

So if CPU1 doesn't take the global_lock before proceeding in
zap_page_range (inside vmtruncate i_mmap_lock that is represented as
b->lock above) we're in trouble.

What we can do is to replace the mm_lock with a
spin_lock(&global_lock) only if all places that takes i_mmap_lock
takes the global lock first and that hurts scalability of the fast
paths that are performance critical like vmtruncate and
anon_vma->lock. Perhaps they're not so performance critical, but
surely much more performant critical than mmu_notifier_register ;).

The idea of polluting various scalable paths like truncate() syscall
in the VM with a global spinlock frightens me, I'd rather return to
invalidate_page() inside the PT lock removing both
invalidate_range_start/end. Then all serialization against the mmu
notifiers will be provided by the PT lock that the secondary mmu page
fault also has to take in get_user_pages (or follow_page). In any case
that is a better solution that won't slowdown the VM when
MMU_NOTIFIER=y even if it's a bit slower for GRU, for KVM performance
is about the same with or without invalidate_range_start/end. I didn't
think anybody could care about how long mmu_notifier_register takes
until it returns compared to all heavyweight operations that happens
to start a VM (not only in the kernel but in the guest too).

Infact if it's security that we worry about here, can put a cap of
_time_ that mmu_notifier_register can take before it fails, and we
fail to start a VM if it takes more than 5sec, that's still fine as
the failure could happen for other reasons too like vmalloc shortage
and we already handle it just fine. This 5sec delay can't possibly happen in
practice anyway in the only interesting scenario, just like the
vmalloc shortage. This is obviously a superior solution than polluting
the VM with an useless global spinlock that will destroy truncate/AIM
on numa.

Anyway Christoph, I uploaded my last version here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v16

(applies and runs fine on 26-rc1)

You're more than welcome to takeover from it, I kind of feel my time
now may be better spent to emulate the mmu-notifier-core with kprobes.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Yinghai Lu: "Re: [PATCH] mptsas: add mptsas_shutdown to call pci_disable_msi"
Previous message: Nishanth Aravamudan: "Re: [PATCH] x86: fix PAE pmd_bad bootup warning"
In reply to: Linus Torvalds: "Re: [PATCH 08 of 11] anon-vma-rwsem"
Next in thread: Linus Torvalds: "Re: [PATCH 08 of 11] anon-vma-rwsem"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]