Re: [PATCH RFC ticketlock] Auto-queued ticketlock

From: Waiman Long
Date: Tue Jun 11 2013 - 14:42:18 EST


On 06/11/2013 12:36 PM, Paul E. McKenney wrote:

I am a bit concerned about the size of the head queue table itself.
RHEL6, for example, defined CONFIG_NR_CPUS to be 4096, which means
a table size of 256. Maybe it would be better to dynamically allocate
the table at init time depending on the actual number of CPUs in the
system.
But if your kernel is built for 4096 CPUs, the 32*256=8192 bytes of memory
is way down in the noise. Systems that care about that small an amount
of memory probably have a small enough number of CPUs that they can just
turn off queueing at build time using CONFIG_TICKET_LOCK_QUEUED=n, right?

My concern is more about the latency of the table scan than about the actual memory used.


+/*
+ * Return a pointer to the queue header associated with the specified lock,
+ * or return NULL if there is no queue for the lock or if the lock's queue
+ * is in transition.
+ */
+static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *asp)
+{
+ int i;
+ int start;
+
+ start = i = tkt_q_hash(asp);
+ do
+ if (tkt_q_heads[i].ref == asp)
+ return &tkt_q_heads[i];
+ while ((i = tkt_q_next_slot(i)) != start);
+ return NULL;
+}
With a table size of 256, you may have to scan the whole table to
find the right head queue. This can be a significant overhead. I
would suggest setting a limit on how many entries it scans before
it aborts, rather than checking the whole table.
But it will scan 256 entries only if there are 256 other locks in queued
mode, which is -very- unlikely, even given 4096 CPUs. That said, if you
show me that this results in a real latency problem on a real system,
I would be happy to provide a way to limit the search.

Looking at the code more carefully, the chance of actually scanning all 256 entries is very small. However, I now have some concerns about the way you set up the initial queue.

+/*
+ * Start handling a period of high contention by finding a queue to associate
+ * with this lock. Returns true if successful (in which case we hold the
+ * lock) and false otherwise (in which case we do -not- hold the lock).
+ */
+bool tkt_q_start_contend(arch_spinlock_t *asp, struct __raw_tickets inc)
+{
+ int i;
+ int start;
+
+ /* Hash the lock address to find a starting point. */
+ start = i = tkt_q_hash(asp);
+
+ /*
+ * Each pass through the following loop attempts to associate
+ * the lock with the corresponding queue.
+ */
+ do {
+ /*
+ * Use 0x1 to mark the queue as in use, which also keeps
+ * any spinners from trying to use it before it is fully
+ * initialized.
+ */
+ if (cmpxchg(&tkt_q_heads[i].ref,
+ NULL,
+ (arch_spinlock_t *)0x1) == NULL) {
+
+ /* Succeeded, now go initialize it. */
+ return tkt_q_init_contend(i, asp, inc);
+ }
+
+ /* If someone beat us to it, go spin on their queue. */
+ if (ACCESS_ONCE(asp->tickets.head) & 0x1)
+ return tkt_q_do_spin(asp, inc);
+ } while ((i = tkt_q_next_slot(i)) != start);
+
+ /* All the queues are in use, revert to spinning on the ticket lock. */
+ return false;
+}
+

An unconditional cmpxchg() can be a source of high contention by itself. Considering that 16 threads may be doing cmpxchg() more or less simultaneously on the same cache line, it can cause a lot of contention. It would be better to check whether tkt_q_heads[i].ref is NULL first before doing the cmpxchg().

Another point is that the 16 threads may be setting up queues in consecutive slots of the head table. This is both a source of contention and a waste of effort. One possible solution is to add one more field (set to cpuid + 1, for example) to indicate that setup is in progress, with asp set to the target lock address immediately; we would need to use cmpxchg128() on 64-bit machines, though. Another solution is to have only the thread whose ticket number is a fixed distance from head (e.g. 16*2) do the queue setup, while the rest wait until the setup is done before spinning on the queue.

As my colleague Davidlohr has reported, there are more regressions than performance improvements in the AIM7 benchmark. I believe that queue setup contention is a likely source of those regressions.

Regards,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/