Re: [PATCH 00/10] perf/uprobe: Optimize uprobes

From: Andrii Nakryiko
Date: Tue Aug 06 2024 - 00:08:36 EST


On Sun, Aug 4, 2024 at 4:22 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sat, Aug 3, 2024 at 1:53 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Aug 02, 2024 at 10:47:15PM -0700, Andrii Nakryiko wrote:
> >
> > > Is there any reason why the approach below won't work?
> >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 8be9e34e786a..e21b68a39f13 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -2251,6 +2251,52 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
> > > struct uprobe *uprobe = NULL;
> > > struct vm_area_struct *vma;
> > >
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE, vm_flags;
> > > + struct file *vm_file;
> > > + struct inode *vm_inode;
> > > + unsigned long vm_pgoff, vm_start, vm_end;
> > > + int vm_lock_seq;
> > > + loff_t offset;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + vma = vma_lookup(mm, bp_vaddr);
> > > + if (!vma)
> > > + goto retry_with_lock;
> > > +
> > > + vm_lock_seq = READ_ONCE(vma->vm_lock_seq);
> >
> > So vma->vm_lock_seq is only updated on vma_start_write()
>
> yep, I've looked a bit more at the implementation now
>
> >
> > > +
> > > + vm_file = READ_ONCE(vma->vm_file);
> > > + vm_flags = READ_ONCE(vma->vm_flags);
> > > + if (!vm_file || (vm_flags & flags) != VM_MAYEXEC)
> > > + goto retry_with_lock;
> > > +
> > > + vm_inode = READ_ONCE(vm_file->f_inode);
> > > + vm_pgoff = READ_ONCE(vma->vm_pgoff);
> > > + vm_start = READ_ONCE(vma->vm_start);
> > > + vm_end = READ_ONCE(vma->vm_end);
> >
> > None of those are written with WRITE_ONCE(), so this buys you nothing.
> > Compiler could be updating them one byte at a time while you load some
> > franken-update.
> >
> > Also, if you're in the middle of split_vma() you might not get a
> > consistent set.
>
> I used READ_ONCE() only to prevent the compiler from re-reading those
> values. We assume those values are garbage anyway and double-check
> everything, so the lack of WRITE_ONCE() doesn't matter. The same goes
> for inconsistency if we are in the middle of split_vma().
>
> We use the result of all this speculative calculation only if we find
> a valid uprobe (which could be a false positive) *and* if we detect
> that nothing about the VMA changed (which is what I got wrong, but
> honestly I was actually betting on others to help me get this right
> anyway).
>
> >
> > > + if (bp_vaddr < vm_start || bp_vaddr >= vm_end)
> > > + goto retry_with_lock;
> > > +
> > > + offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > + uprobe = find_uprobe_rcu(vm_inode, offset);
> > > + if (!uprobe)
> > > + goto retry_with_lock;
> > > +
> > > + /* now double check that nothing about VMA changed */
> > > + if (vm_lock_seq != READ_ONCE(vma->vm_lock_seq))
> > > + goto retry_with_lock;
> >
> > Since vma->vm_lock_seq is only ever updated at vma_start_write(), you're
> > checking that you're in or after the same modification cycle.
> >
> > The point of sequence locks is to check whether you're *IN* a modification
> > cycle and retry if you are. You're now explicitly continuing if you're in a
> > modification.
> >
> > You really need:
> >
> > seq++;
> > wmb();
> >
> > ... do modification
> >
> > wmb();
> > seq++;
> >
> > vs
> >
> > do {
> > s = READ_ONCE(seq) & ~1;
> > rmb();
> >
> > ... read stuff
> >
> > } while (rmb(), seq != s);
> >
> >
> > The thing to note is that seq will be odd while inside a modification
> > and even outside; further, if the pre and post seq are both even but not
> > identical, you've crossed a modification and also need to retry.
> >
>
> Ok, I don't think I got everything you have written above, sorry. But
> let me explain what I think I need to do and please correct what I
> (still) got wrong.
>
> a) before starting speculation,
> a.1) read and remember current->mm->mm_lock_seq (using
> smp_load_acquire(), right?)
> a.2) read vma->vm_lock_seq (using smp_load_acquire() I presume)
> a.3) if vm_lock_seq is odd, we are already modifying VMA, so bail
> out, try with proper mmap_lock
> b) proceed with the inode pointer fetch and offset calculation as I've coded it
> c) lookup uprobe by inode+offset, if failed -- bail out (if succeeded,
> this could still be wrong)
> d) re-read vma->vm_lock_seq, if it changed, we started modifying/have
> already modified VMA, bail out
> e) re-read mm->mm_lock_seq, if that changed -- presume VMA got
> modified, bail out
>
> At this point we should have a guarantee that nothing about the mm
> changed and that the VMA didn't start being modified during our
> speculative calculation + uprobe lookup. So if we found a valid
> uprobe, it must be the correct one that we need.
>
> Is that enough? Any holes in the approach? And thanks for thoroughly
> thinking about this, btw!

Ok, with slight modifications to the details of the above (e.g., there
is actually no "odd means VMA is being modified" state for
vm_lock_seq), I ended up with the implementation below. Basically we
validate that mm->mm_lock_seq didn't change and that vm_lock_seq !=
mm_lock_seq (which would otherwise mean "VMA is being modified").
There is a possibility that vm_lock_seq == mm_lock_seq just by
accident, but that is not a correctness problem: we'll just fall back
to the locked implementation until something about the VMA or the
mm_struct itself changes. Which is fine, and if mm folks ever change
this locking scheme, this caveat might go away.
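
For comparison, the classic sequence-count pattern Peter described
(odd while a modification is in progress, even otherwise) maps onto
the kernel's generic seqcount API roughly as in the sketch below. This
is purely illustrative: obj_seq/obj_a/obj_b are made-up names, and the
helpers in the patch further down compare mm_lock_seq/vm_lock_seq
directly instead of using a seqcount_t.

/*
 * Illustrative sketch only, not part of the patch: the generic
 * seqcount_t read/retry pattern. Writers must be serialized
 * externally (e.g. by a lock); readers retry if they raced with a
 * writer.
 */
#include <linux/seqlock.h>

static seqcount_t obj_seq = SEQCNT_ZERO(obj_seq);
static unsigned long obj_a, obj_b;

static void obj_update(unsigned long a, unsigned long b)
{
	write_seqcount_begin(&obj_seq);		/* sequence becomes odd */
	obj_a = a;
	obj_b = b;
	write_seqcount_end(&obj_seq);		/* sequence becomes even again */
}

static void obj_read(unsigned long *a, unsigned long *b)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&obj_seq);	/* spins while seq is odd */
		*a = obj_a;
		*b = obj_b;
	} while (read_seqcount_retry(&obj_seq, seq));	/* seq moved => retry */
}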

If this seems on the right track, I think we can just move
mm_start_vma_speculation()/mm_end_vma_speculation() into
include/linux/mm.h.

And after thinking a bit more about READ_ONCE() usage, I changed those
accesses to data_race() so they don't trigger KCSAN warnings.
Initially I kept READ_ONCE() only around the vma->vm_file access, but
given that we never change vm_file until the vma is freed and reused
(which is prevented here by guard(rcu)), I dropped READ_ONCE() and
only added data_race(). And even data_race() is probably not
necessary.
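
To make that annotation distinction concrete, here is a tiny sketch
(not from the patch; peek_vm_file() is a hypothetical helper) of what
each annotation buys:

/*
 * Illustrative sketch only: READ_ONCE() forces a single, non-torn,
 * non-repeated load, while data_race() merely tells KCSAN that the
 * racy plain load is intentional. Per the reasoning above, vm_file is
 * not changed until the vma is freed and reused, and guard(rcu)()
 * keeps the vma from being freed during the lookup, so the weaker
 * data_race() annotation is considered sufficient here.
 */
#include <linux/compiler.h>
#include <linux/fs.h>
#include <linux/mm_types.h>

static struct file *peek_vm_file(struct vm_area_struct *vma)
{
	return data_race(vma->vm_file);	/* plain load, KCSAN-silenced */
}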

Anyway, please see the patch below. It would be nice if mm folks
(Suren?) could confirm that this is not broken.



Author: Andrii Nakryiko <andrii@xxxxxxxxxx>
Date: Fri Aug 2 22:16:40 2024 -0700

uprobes: add speculative lockless VMA to inode resolution

Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 3de311c56d47..bee7a929ff02 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2244,6 +2244,70 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
return is_trap_insn(&opcode);
}

+#ifdef CONFIG_PER_VMA_LOCK
+static inline void mm_start_vma_speculation(struct mm_struct *mm, int *mm_lock_seq)
+{
+ *mm_lock_seq = smp_load_acquire(&mm->mm_lock_seq);
+}
+
+/* returns true if speculation was safe (no mm and vma modification happened) */
+static inline bool mm_end_vma_speculation(struct vm_area_struct *vma, int mm_lock_seq)
+{
+ int mm_seq, vma_seq;
+
+ mm_seq = smp_load_acquire(&vma->vm_mm->mm_lock_seq);
+ vma_seq = READ_ONCE(vma->vm_lock_seq);
+
+ return mm_seq == mm_lock_seq && vma_seq != mm_seq;
+}
+
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+ const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+ struct mm_struct *mm = current->mm;
+ struct uprobe *uprobe;
+ struct vm_area_struct *vma;
+ struct file *vm_file;
+ struct inode *vm_inode;
+ unsigned long vm_pgoff, vm_start;
+ int mm_lock_seq;
+ loff_t offset;
+
+ guard(rcu)();
+
+ mm_start_vma_speculation(mm, &mm_lock_seq);
+
+ vma = vma_lookup(mm, bp_vaddr);
+ if (!vma)
+ return NULL;
+
+ vm_file = data_race(vma->vm_file);
+ if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
+ return NULL;
+
+ vm_inode = data_race(vm_file->f_inode);
+ vm_pgoff = data_race(vma->vm_pgoff);
+ vm_start = data_race(vma->vm_start);
+
+ offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+ uprobe = find_uprobe_rcu(vm_inode, offset);
+ if (!uprobe)
+ return NULL;
+
+ /* now double check that nothing about MM and VMA changed */
+ if (!mm_end_vma_speculation(vma, mm_lock_seq))
+ return NULL;
+
+ /* happy case, we speculated successfully */
+ return uprobe;
+}
+#else /* !CONFIG_PER_VMA_LOCK */
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+ return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* assumes being inside RCU protected region */
static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
{
@@ -2251,6 +2315,10 @@ static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
struct uprobe *uprobe = NULL;
struct vm_area_struct *vma;

+ uprobe = find_active_uprobe_speculative(bp_vaddr);
+ if (uprobe)
+ return uprobe;
+
mmap_read_lock(mm);
vma = vma_lookup(mm, bp_vaddr);
if (vma) {
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..211a84ee92b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
NULL);
files_cachep = kmem_cache_create("files_cache",
sizeof(struct files_struct), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
NULL);
fs_cachep = kmem_cache_create("fs_cache",
sizeof(struct fs_struct), 0,


>
> P.S. This is basically the last big blocker toward linear uprobe
> scalability with the number of active CPUs. I have
> uretprobe+SRCU+timeout implemented and it seems to work fine; I will
> post it soon-ish.
>
> P.P.S. Also, funny enough, the line below was another big scalability
> limiter (and the last one) :) I'm not sure if we can just drop it or
> whether I should use a per-CPU counter (see the sketch after the diff
> below), but with the change below and the speculative VMA lookup
> (however buggy, it works ok for benchmarking), I finally get linear
> scaling of uprobe triggering throughput with the number of CPUs. We
> are very close.
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index f7443e996b1b..64c2bc316a08 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1508,7 +1508,7 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> int ret = 0;
>
> tu = container_of(con, struct trace_uprobe, consumer);
> - tu->nhit++;
> + //tu->nhit++;
>
> udd.tu = tu;
> udd.bp_addr = instruction_pointer(regs);
>
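For the record, the per-CPU counter alternative hinted at above could
look roughly like the sketch below. This is only a sketch under
assumptions: the nhits field and the hit_stats_*() helpers are
hypothetical and are not part of any posted patch.

/*
 * Hypothetical sketch, not from the posted patches: replacing the
 * shared tu->nhit counter with a per-CPU counter so that the hot path
 * only touches cache-local memory instead of bouncing one cache line
 * between all CPUs.
 */
#include <linux/percpu.h>
#include <linux/cpumask.h>

struct uprobe_hit_stats {
	u64 __percpu *nhits;	/* hypothetical replacement for tu->nhit */
};

static int hit_stats_init(struct uprobe_hit_stats *st)
{
	st->nhits = alloc_percpu(u64);
	return st->nhits ? 0 : -ENOMEM;
}

static void hit_stats_record(struct uprobe_hit_stats *st)
{
	/* hot path: increment this CPU's slot only */
	this_cpu_inc(*st->nhits);
}

static u64 hit_stats_sum(struct uprobe_hit_stats *st)
{
	u64 sum = 0;
	int cpu;

	/* slow path (e.g. when dumping stats): fold per-CPU values */
	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(st->nhits, cpu);
	return sum;
}

static void hit_stats_destroy(struct uprobe_hit_stats *st)
{
	free_percpu(st->nhits);
}
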
>
> > > +
> > > + /* happy case, we speculated successfully */
> > > + rcu_read_unlock();
> > > + return uprobe;
> > > +
> > > +retry_with_lock:
> > > + rcu_read_unlock();
> > > + uprobe = NULL;
> > > +#endif
> > > +
> > > mmap_read_lock(mm);
> > > vma = vma_lookup(mm, bp_vaddr);
> > > if (vma) {
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index cc760491f201..211a84ee92b4 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -3160,7 +3160,7 @@ void __init proc_caches_init(void)
> > > NULL);
> > > files_cachep = kmem_cache_create("files_cache",
> > > sizeof(struct files_struct), 0,
> > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
> > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT|SLAB_TYPESAFE_BY_RCU,
> > > NULL);
> > > fs_cachep = kmem_cache_create("fs_cache",
> > > sizeof(struct fs_struct), 0,