Re: [RFC patch 4/7] futex: Add support for attached futexes

From: Rasmus Villemoes
Date: Sat Apr 02 2016 - 19:48:16 EST

Next message: kbuild test robot: "arch/xtensa/kernel/coprocessor.S:93: Error: unknown opcode or format name 'rur.ae_ovf_sar'"
Previous message: KY Srinivasan: "RE: [PATCH 0/6] Drivers: hv: vmbus: Cleanup and mmio management."
In reply to: Andy Lutomirski: "Re: [RFC patch 4/7] futex: Add support for attached futexes"
Next in thread: Thomas Gleixner: "Re: [RFC patch 4/7] futex: Add support for attached futexes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Apr 02 2016, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> The standard futex mechanism in the Linux kernel uses a global hash to store
> transient state. Collisions on that hash can lead to performance degradation
> and on real-time enabled kernels even to priority inversions.
>
> To guarantee futexes without collisions on the global kernel hash, we provide
> a mechanism to attach to a futex. This creates futex private state which
> avoids hash collisions and on NUMA systems also cross node memory access.

Hi,

A few minor comments inline below, and a question about the design:

How is an application supposed to handle it when the kernel fails to
achieve the no collision-goal? With any reasonable upper bound on the
size of the local hash table (which of course has to be there, whether
sysctl'ed or not), and regardless of the hashing scheme used, it
seems inevitable that someone is going to get -ENOSPC when trying to
attach. Moreover, since different threads can attach to different sets
of futexes, one thread may succesfully attach to a futex, while another
fails - the second thread is then permanently prevented from operating
on that futex (?).

Why not use some sort of tree instead? Or fall back to a traditional
chained hash table once we're no longer allowed to increase the table
size? Of course these have worse lookup performance, and maybe failing
the attach in the rare case is better than penalizing the common case,
but it would be nice to have some mention of this in the change log.

Alternatively [this is not really thought through], maybe one could move
the decision and the complexity to userspace: On succesful FUTEX_ATTACH,
return an index into a small per-task array of struct futex_state*. On
subsequent FUTEX_ATTACHED operations on that futex, userspace passes in
this index somehow (either instead of uaddr, in which case the kernel
array would have to include this in addition to the futex_state pointer,
or by making uaddr actually point to a struct { int *futex_addr; int
attach_idx; }, or...) Then each thread would have to maintain a (futex
address => index) mapping, but that's more or less what the kernel
otherwise has to do.

> +
> +static unsigned int hash_prime(unsigned int size)
> +{
> + switch(size) {
> + case 16:
> + default: return 13;
> + case 32: return 31;
> + case 64: return 61;
> + case 128: return 127;
> + case 256: return 251;
> + case 512: return 509;
> + case 1024: return 1021;
> + case 2048: return 2039;
> + case 4096: return 4093;
> + }
> +}
> +

There should probably be some mention of TASK_CACHE_{BASE,MAX}_SIZE here
so that anyone updating those would also look here, so we don't end up
using 13 out of 8192 slots... BUILD_BUG_ON(TASK_CACHE_{BASE,MAX}_SIZE
!= {16,4096}) should do it.

> +static struct futex_cache *futex_alloc_cache(int cache_size)
> +{
> + struct futex_cache *tc;
> + size_t size;
> +
> + /* Allocate a new task cache */
> + size = sizeof(*tc) + cache_size * sizeof(struct futex_cache_slot);
> + tc = kzalloc_node(size, GFP_KERNEL, numa_node_id());
> + if (tc) {
> + tc->hash_prime = hash_prime(cache_size);
> + tc->cache_size = cache_size;
> + }
> + return tc;
> +}
> +
> +static int
> +futex_rehash_task_cache(struct futex_cache *tc, struct futex_cache *tcnew)
> +{
> + unsigned long *newmap = tcnew->cache_map;
> + unsigned int prime = tcnew->hash_prime;
> + unsigned long *map = tc->cache_map;
> + unsigned int size = tc->cache_size;
> + unsigned int slot, newslot;
> +
> + slot = find_first_bit(map, size);
> + for (; slot < size; slot = find_next_bit(map, size, slot + 1)) {
> + newslot = hash_local_futex(tc->slots[slot].uaddr, prime);
> + /*
> + * Paranoia. Rehashing to a larger cache should not result in
> + * collisions which did not exist in the small one.
> + */

This doesn't sound right when you're doing mod prime hashing; two
numbers can easily have the same remainder mod 31 while having distinct
remainders mod 13.

> + if (__test_and_set_bit(newslot, newmap))
> + return -ENOSPC;
> + /* Copy uaddr and futex state pointer */
> + tcnew->slots[newslot] = tc->slots[slot];
> + }
> + return 0;
> +}
> +
> +/**
> + * futex_get_task_cache_slot - Get a slot in the tasks local cache
> + *
> + * If the cache is not yet available it's allocated. If the existing cache is
> + * too small the cache is extended.
> + *
> + * Returns a valid slot or an error code
> + */
> +static int futex_get_task_cache_slot(u32 __user *uaddr)
> +{
> + struct futex_cache *tcnew, *tc = current->futex_cache;
> + unsigned int cache_size;
> + int slot;
> +
> + /* First caller allocates the initial cache */
> + if (!tc) {
> + tc = futex_alloc_cache(TASK_CACHE_BASE_SIZE);
> + if (!tc)
> + return -ENOMEM;
> + current->futex_cache = tc;
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> + return slot;
> + }
> +
> + slot = hash_local_futex(uaddr, tc->hash_prime);
> +
> + /* Check whether the slot is populated already */
> + if (!test_bit(slot, tc->cache_map))
> + return slot;
> +
> + /* Was this futex attached already ? */
> + if (tc->slots[slot].uaddr == uaddr)
> + return -EEXIST;
> +
> + cache_size = tc->cache_size;
> +retry:
> + /* Task has reached max cache size? */
> + if (cache_size >= TASK_CACHE_MAX_SIZE)
> + return -ENOSPC;
> +
> + cache_size *= 2;
> + tcnew = futex_alloc_cache(cache_size);
> + if (!tcnew)
> + return -ENOMEM;
> +
> + /*
> + * If the rehashing fails or the slot for uaddr is busy after
> + * rehashing, try with a larger cache.
> + */
> + slot = hash_local_futex(uaddr, tcnew->hash_prime);
> +
> + if (futex_rehash_task_cache(tc, tcnew) ||
> + test_bit(slot, tcnew->cache_map)) {
> + kfree(tcnew);
> + goto retry;
> + }
> +
> + /* Populate the new cache and return the slot number */
> + current->futex_cache = tcnew;
> + return slot;
> +}

I may be misreading it, but this seems to leak the old ->futex_cache?

Rasmus

Next message: kbuild test robot: "arch/xtensa/kernel/coprocessor.S:93: Error: unknown opcode or format name 'rur.ae_ovf_sar'"
Previous message: KY Srinivasan: "RE: [PATCH 0/6] Drivers: hv: vmbus: Cleanup and mmio management."
In reply to: Andy Lutomirski: "Re: [RFC patch 4/7] futex: Add support for attached futexes"
Next in thread: Thomas Gleixner: "Re: [RFC patch 4/7] futex: Add support for attached futexes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]