Re: [RFC][PATCH 0/5] Optimize percpu-rwsem

From: Linus Torvalds
Date: Tue May 26 2015 - 14:12:12 EST

Next message: Eric Dumazet: "Re: [PATCH] net: tcp: Fix a PTO timing granularity issue"
Previous message: Peter Hurley: "Re: [PATCH] n_tty: Fix auditing support for cannonical mode"
In reply to: Oleg Nesterov: "ring_buffer_attach && cond_synchronize_rcu (Was: percpu-rwsem: Optimize readers and reduce global impact)"
Next in thread: Peter Zijlstra: "Re: [RFC][PATCH 0/5] Optimize percpu-rwsem"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 26, 2015 at 4:43 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> This is a derived work of the cpu hotplug lock rework I did in 2013 which never
> really went anywhere because Linus didn't like it.
>
> This applies those same optimizations to the percpu-rwsem. Seeing how we did
> all the work it seemed a waste to not use it at all.

So I *still* don't like it.

We literally have one single percpu-rwsem IN THE WHOLE KERNEL TREE.

One.

Not "a couple". Not "hundreds". ONE.

And that single use that you are optimizing for isn't even explained.
It's not really even clear that that thing is needed: fork() already
takes the mmap_sem for writing, and I'm not at all convinced that the
uprobes code couldn't just use mmap_sem for every mm it traverses,
rather than use this global percpu lock that we use absolutely nowhere
else.

So I'm not convinced that the right thing to do is to try to optimize
that lock at ALL. I'd rather get rid of it entirely.

Is there some new use that I don't know about? Have people really
looked at that uprobes code deeply? OF COURSE global locks will have
problems, I'm not at all convinced that "let's make that global lock
really complicated and clever" is the proper solution.

For example, the write lock is only ever taken when registering an
uprobe. Not exactly likely to be timing-critical. Is there some reason
why we couldn't get rid of the magical per-cpu rwsem entirely, and
replace it with something truly simple like a hashed rmsem, where the
readers (fork), will just read-lock one hashed rwsem, and the writer
will just write-lock every hashed rwsem.

The hash might be "pick the currently executing CPU at the time of
fork(), modulo 16". That one doesn't even need to disable preemption,
since it doesn't actually matter that the CPU stays constant, it's
just a trivial heurisitic to avoid cacheline pingpongs under very
heavy fork() load.

So we'd have

#define NR_UPROBE_RWSEM 16

// One cacheline per rwsem.
struct cache_line_aligned_rw_semaphore uprobe_rwsem[NR_UPROBE_RWSEM];

and fork() would basically do

int idx = raw_smp_processor_id() & (NR_UPROBE_RWSEM-1);
struct rw_sem *usem = &uprobe_rwsem[idx].rwsem;
..
down_read(usem);
... do the dup_mmap() here ..
up_read(usem);

and the uprobe register thing could just do

for (i = 0; i < NR_UPROBE_RWSEM; i++)
down_write(&uprobe_rwsem[i].rwsem);
...
for (i = 0; ... unlock them all ..)

which doesn't look any slower than what we do now (I suspect the
fork() part would be faster), and is much simpler, and would get rid
of the percpu-rwsem entirely.

Hmm?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Eric Dumazet: "Re: [PATCH] net: tcp: Fix a PTO timing granularity issue"
Previous message: Peter Hurley: "Re: [PATCH] n_tty: Fix auditing support for cannonical mode"
In reply to: Oleg Nesterov: "ring_buffer_attach && cond_synchronize_rcu (Was: percpu-rwsem: Optimize readers and reduce global impact)"
Next in thread: Peter Zijlstra: "Re: [RFC][PATCH 0/5] Optimize percpu-rwsem"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]