RE: [PATCH v2] futex: lower the lock contention on the HB lock during wake up
From: Zhu Jefferry
Date: Wed Sep 16 2015 - 07:13:36 EST
> On Wed, 16 Sep 2015, Zhu Jefferry wrote:
> > The application is a multi-thread program, to use the pairs of
> > mutex_lock and mutex_unlock to protect the shared data structure. The
> > type of this mutex is PTHREAD_MUTEX_PI_RECURSIVE_NP. After running
> > long time, to say several days, the mutex_lock data structure in user
> > space looks like corrupt.
> >
> > thread 0 can do mutex_lock/unlock
> > __lock = this thread | FUTEX_WAITERS
> > __owner = 0, should be this thread
>
> The kernel does not know about __owner.
Correct. It shows that the last failure was in mutex_unlock,
which clears __owner in user space.
>
> > __counter keep increasing, although there is no recursive mutex_lock
> > call.
> >
> > thread 1 will be stuck
> >
> > The primary debugging shows the content of __lock is wrong in first.
> > After a call of Mutex_unlock, the value of __lock should not be this
> > thread self. But we observed The value of __lock is still self after
> > unlock. So, other threads will be stuck,
>
> How did you observe that?
I added an assert in mutex_unlock, placed after it finishes modifying __lock
(either in user space or in kernel space) and before it returns.
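
A minimal sketch of that kind of check, assuming the usual PI futex layout
in which the low 30 bits of __lock hold the owner TID. PI_TID_MASK mirrors
FUTEX_TID_MASK from <linux/futex.h>; the helper names are hypothetical, and
the check only makes sense once the last recursion level has been released:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors FUTEX_TID_MASK: the low 30 bits of the lock word are the owner TID. */
#define PI_TID_MASK 0x3fffffffu

/* Hypothetical helper: does the lock word still name `tid` as owner? */
static bool lock_word_owned_by(uint32_t lock_word, uint32_t tid)
{
    return (lock_word & PI_TID_MASK) == tid;
}

/* Hypothetical check, run just before the final unlock returns:
 * the lock word must no longer name the calling thread as owner. */
static void check_after_final_unlock(uint32_t lock_word, uint32_t self_tid)
{
    assert(!lock_word_owned_by(lock_word, self_tid));
}
```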
>
> > This thread could lock due to recursive type and __counter keep
> > increasing, although mutex_unlock return fails, due to the wrong value
> > of __owner, but the application did not check the return value. So the
> > thread 0 looks like fine. But thread 1 will be stuck forever.
>
> Oh well. So thread 0 looks all fine, despite not checking return values.
>
Correct.
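
For illustration, a minimal sketch of what checking the unlock return value
could look like, assuming a recursive PI mutex set up through the standard
pthread attribute calls. The helper names init_pi_recursive and
unlock_checked are hypothetical:

```c
#define _GNU_SOURCE /* exposes the mutex type/protocol constants under strict -std= modes */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: initialise a recursive, priority-inheriting mutex. */
static int init_pi_recursive(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err != 0)
        return err;

    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (err == 0)
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}

/* Hypothetical helper: unlock and report a failure instead of ignoring it. */
static int unlock_checked(pthread_mutex_t *m)
{
    int err = pthread_mutex_unlock(m);
    if (err != 0)
        fprintf(stderr, "pthread_mutex_unlock failed: %s\n", strerror(err));
    return err;
}
```

Calling unlock_checked() on a recursive mutex that the caller does not own
returns EPERM, which is the kind of failure that went unnoticed here.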
Actually, I am not clear about how the futex state changes in the kernel.
I searched the Internet and found a similar failure reported by another user.
He is using kernel 2.6.38. Our customer is using kernel 2.6.34 (WindRiver Linux 4.1).
====
http://www.programdoc.com/1272_157986_1.htm
Maybe, there is a bug about pi-futex, it would let the program in
user-space going to hang.
We have a board: CPU is powerpc 8572, two core. after ran one month,
the state of pi-futex in user-space got bad:
mutex->__data.__lock is 0x8000023e,
mutex->__data.__count is 0,
mutex->__data.__owner is 0.
I cannot understand the sample failure case he mentioned, but I think
it might help you analyze the corner case.
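
For reference, a small sketch that decodes the __lock value quoted above,
assuming the standard PI futex word layout. The PI_* constants mirror
FUTEX_WAITERS, FUTEX_OWNER_DIED and FUTEX_TID_MASK from <linux/futex.h>;
the struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirror the bit layout of a PI futex word as defined in <linux/futex.h>. */
#define PI_WAITERS    0x80000000u  /* FUTEX_WAITERS: waiters are queued in the kernel */
#define PI_OWNER_DIED 0x40000000u  /* FUTEX_OWNER_DIED: owner exited holding the lock */
#define PI_TID_MASK   0x3fffffffu  /* FUTEX_TID_MASK: TID of the current owner */

/* Hypothetical container for the decoded fields. */
struct pi_lock_state {
    bool     waiters;
    bool     owner_died;
    uint32_t owner_tid;
};

/* Hypothetical helper: split a PI futex word into its three fields. */
static struct pi_lock_state decode_pi_lock(uint32_t lock_word)
{
    struct pi_lock_state s;
    s.waiters    = (lock_word & PI_WAITERS) != 0;
    s.owner_died = (lock_word & PI_OWNER_DIED) != 0;
    s.owner_tid  = lock_word & PI_TID_MASK;
    return s;
}
```

For 0x8000023e this gives: waiters bit set, owner-died bit clear, owner TID
0x23e (574). So __lock still names a thread as owner while __owner and
__count are both 0.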