Re: [PATCH RESEND v4] sched/fair: Add advisory flag for borrowing a timeslice
From: Khalid Aziz
Date: Tue Dec 23 2014 - 15:49:35 EST
On 12/23/2014 11:46 AM, Rik van Riel wrote:
On 12/23/2014 10:13 AM, Khalid Aziz wrote:
On 12/23/2014 03:52 AM, Ingo Molnar wrote:
to implement what Thomas suggested in the discussion: a proper
futex like spin mechanism? That looks like a totally acceptable
solution to me, without the disadvantages of your proposed
solution.
Hi Ingo,
Thank you for taking the time to respond. It is indeed possible to
implement a futex-like spin mechanism, and such a mechanism would
be clean and elegant. That is where I had started when I was given
this problem to solve. The trouble I run into is that the primary
application I am looking to help with this solution is a database
which implements its own locking mechanism without using POSIX
semaphores or futexes. Since the locking is entirely in userspace,
the kernel has no clue when userspace has acquired one of these
locks. So I can see only two ways to solve this - find a solution
entirely in userspace, or have userspace tell the kernel when it
acquires one of these locks. I will spend more time on finding a
way to solve it in userspace and see if I can leverage the futex
mechanism without causing significant change to the database code.
There may be a way to use priority inheritance to avoid contention.
Database performance people tell me that their testing has shown
the cost of making any system call in this code path easily offsets
any gain from optimizing for contention avoidance, so that is one
big challenge. The database vendor rewriting their locking code is
an extremely unlikely scenario. Am I missing a third option here?
An uncontended futex is taken without ever going into kernel
space. Adaptive spinning allows short duration futexes to be
taken without going into kernel space.
You are right. An uncontended futex is very fast since it never goes
into the kernel. The queuing problem happens when the lock holder has
been preempted. Adaptive spinning does the smart thing of
spin-waiting only if the lock holder is still running on another
core. If the lock holder is not scheduled on any core, even adaptive
spinning has to go into the kernel to be put on the wait queue. What
would avoid the queuing problem and reduce the cost of contention is
a combination of adaptive spinning and a way to keep the lock holder
running on one of the cores just a little longer so it can release
the lock. Without creating a special case and a new API in the
kernel, one way I can think of to accomplish the second part is to
boost the priority of the lock holder when contention happens, and
priority ceiling is meant to do exactly that. The priority ceiling
implementation in glibc boosts the priority by calling into the
scheduler, which does incur the cost of a system call. A priority
boost is a reliable solution that does not change scheduling
semantics. The solution of allowing the lock holder to use one extra
timeslice is not a definitive solution, but the TPC-C workload shows
it does work, and it works without requiring changes to the database
locking code.
Theoretically a new locking library that uses both of these
techniques would help solve the problem, but being a new locking
library, there is a big unknown of what new problems, performance
and otherwise, it will bring, and the database has to be recoded to
this new library. Nevertheless this is the path I am exploring now.
The challenge is how to do this without requiring changes to the
database code or the kernel. The hooks available to me in the
current database code are schedctl_init(), schedctl_start() and
schedctl_stop(), which are no-ops on Linux at this time. The
database folks can replace these no-ops with real code in their
library to solve the queuing problem. schedctl_start() and
schedctl_stop() are called only when one of the highly contended
locks is acquired or released. schedctl_start() is called after the
lock has been acquired, which means I cannot rely upon it to solve
the contention issue. schedctl_stop() is called after the lock has
been released.
Thanks,
Khalid
Only long held locks cause a thread to go into kernel space,
where it goes to sleep, freeing up the cpu, and increasing
the chance that the lock holder will run.
--
All rights reversed