Re: [PATCH] netfilter: use per-cpu recursive lock (v11)

From: Ingo Molnar
Date: Wed Apr 22 2009 - 03:36:25 EST



* Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:

> Ingo Molnar a écrit :
> >
> > Why not use the obvious solution: a _single_ wrlock for global
> > access and read_can_lock() plus per cpu locks in the fastpath?
>
> Obvious is not the qualifier I would use :)
>
> Brilliant yes :)

thanks :)

> > That way there's no global cacheline bouncing (just the
> > _reading_ of a global cacheline - which will be nicely localized
> > - on NUMA too) - and we will hold at most 1-2 locks at once!
> >
> > Something like:
> >
> > __cacheline_aligned DEFINE_RWLOCK(global_wrlock);
> >
> > DEFINE_PER_CPU(rwlock_t local_lock);
> >
> >
> > void local_read_lock(void)
> > {
> > again:
> > read_lock(&per_cpu(local_lock, this_cpu));
>
> Hmm... here we can see global_wrlock locked by on writer, while
> this cpu already called local_read_lock(), and calls again this
> function -> Deadlock, because we hold our local_lock locked.

Yeah, indeed.

I wasnt really concentrating on the nested case, i was concentrating
on the scalability and lock nesting angle. I think the code
submitted here looks rather incestous in that regard.

Allowing nested locking _on the same CPU_ is asking for trouble. Use
short critical sections and if there's any exclusion needed, use an
irq-safe lock or a softirq-safe lock. Problem solved.

> Very interesting and could be changed to use spinlock + depth per
> cpu.
>
> -> we can detect recursion and avoid the deadlock, and we only use
> one atomic operation per lock/unlock pair in fastpath (this was
> the reason we tried hard to use a percpu spinlock during this
> thread)
>
>
> __cacheline_aligned DEFINE_RWLOCK(global_wrlock);
>
> struct ingo_local_lock {
> spinlock_t lock;
> int depth;
> };
> DEFINE_PER_CPU(struct ingo_local_lock local_lock);
>
>
> void local_read_lock(void)
> {
> struct ingo_local_lock *lck;
>
> local_bh_and_preempt_disable();
> lck = &get_cpu_var(local_lock);
> if (++lck->depth > 0) /* already locked */
> return;
> again:
> spin_lock(&lck->lock);
>
> if (unlikely(!read_can_lock(&global_wrlock))) {
> spin_unlock(&lck->lock);
> /*
> * Just wait for any global write activity:
> */
> read_unlock_wait(&global_wrlock);
> goto again;
> }
> }
>
> void global_write_lock(void)
> {
> write_lock(&global_wrlock);
>
> for_each_possible_cpu(i)
> spin_unlock_wait(&per_cpu(local_lock, i));
> }
>
> Hmm ?

Yeah, this looks IRQ-nesting safe. But i really have to ask: why
does this code try so hard to allow same-CPU nesting?

Nesting on the same CPU is _bad_ for several reasons:

1) Performance: it rips apart critical sections cache-wise. Instead
of a nice:

C1 ... C2 ... C3 ... C4

serial sequence of critical sections, we get:

C1A ... ( C2 ) ... C1B ... C3 ... C4

Note that C1 got "ripped apart" into C1A and C1B with C2 injected
- reducing cache locality between C1A and C1B. We have to execute
C1B no matter what, so we didnt actually win anything in terms of
total work to do, by processing C2 out of order.

[ Preemption of work (which this kind of nesting is really about)
is _the anti-thesis of performance_, and we try to delay it as
much as possible and we try to batch up as much as possible.
For example the CPU scheduler will try _real_ hard to not
preempt a typical workload, as long as external latency
boundaries allow that. ]

2) Locking complexity and robustness. Nested locking is rank #1 in
terms of introducing regressions into the kernel.

3) Instrumentation/checking complexity. Checking locking
dependencies is good and catches a boatload of bugs before they
hit upstream, and nested locks are supported but cause an
exponential explosion in terms of dependencies to check.

Also, whenever instrumentation explodes is typically the sign of
some true, physical complexity that has been introduced into the
code. So it often is a canary for a misdesign at a fundamental
level, not a failure in the instrumentation framework.

In the past i saw lock nesting often used as a wrong solution when
the critical sections were too long (causing too long latencies for
critical work - e.g. delaying hardirq completion processing
unreasonably), or just plain out of confusion about the items above.

I dont know whether that's the case here - it could be one of the
rare exceptions calling for a new locking primitive (which should
then be introduced at the core kernel level IMHO) - i dont know the
code that well.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/