Re: [PATCH v9 00/11] futex: Add support task local hash maps.

From: Sebastian Andrzej Siewior
Date: Tue Mar 04 2025 - 10:05:29 EST


On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:

> You avoided the two states by dropping refcount only there is no !new
> pointer. That should work.

> My first few tests succeeded. And I have a few RCU annotations, which I
> post once I complete them and finish my requeue-pi tests.

get_futex_key() has this:
|…
| if (!fshared) {
|…
| if (IS_ENABLED(CONFIG_MMU))
| key->private.mm = mm;
| else
| key->private.mm = NULL;
|
| key->private.address = address;
|

and now __futex_hash_private() has this:
| {
| if (!futex_key_is_private(key))
| return NULL;
|
| if (!fph)
| fph = rcu_dereference(key->private.mm->futex_phash);

Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
to !CONFIG_BASE_SMALL && CONFIG_MMU.

Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
prctl not blocking (make futex_hash_allocate(, false)). This is simulate
the fork resize. The backtrace:
| [ T8658] BUG: unable to handle page fault for address: fffffffffffffff0
| [ T8658] #PF: supervisor read access in kernel mode
| [ T8658] #PF: error_code(0x0000) - not-present page
| [ T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
| [ T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
| [ T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
| [ T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
| [ T8658] RIP: 0010:plist_check_list+0xb/0xa0
| [ T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7 74 27 48 8b 4f 08 50 49 89 f8 48 89
| [ T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
| [ T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
| [ T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
| [ T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
| [ T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
| [ T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
| [ T8658] FS: 00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
| [ T8658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| [ T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
| [ T8658] Call Trace:
| [ T8658] <TASK>
| [ T8658] plist_del+0x28/0x100
| [ T8658] __futex_unqueue+0x29/0x40
| [ T8658] futex_unqueue_pi+0x1f/0x40
| [ T8658] futex_lock_pi+0x24d/0x420
| [ T8658] do_futex+0x57/0x190
| [ T8658] __x64_sys_futex+0xfe/0x1a0

It takes about 1h+ to reproduce. And only on one particular stubborn
box. This originates from futex_unqueue_pi() after
futex_q_lockptr_lock(). I have another crash within
futex_q_lockptr_lock() (in spin_lock()).

This looks like the locking task was not enqueued in the hash bucket
during the resize. This means there was a timeout and the unlocking task
removed it while looking for the next owner. But the unlocking part
acquired an additional reference to avoid a resize in that case. So,
confused I am.
I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
on number of threads."), have the another "always resize hack" and so
far it looks good.
Looking at __futex_pivot_hash() there is this:
| if (fph) {
| if (rcuref_read(&fph->users) != 0) {
| mm->futex_phash_new = new;
| return false;
| }
|
| futex_rehash_private(fph, new);
| }

So we stash the new pointer as long as rcuref_read() does not return 0.
How stable is rcuref_read()'s 0 return actually? The code says:

| static inline unsigned int rcuref_read(rcuref_t *ref)
| {
| unsigned int c = atomic_read(&ref->refcnt);
|
| /* Return 0 if within the DEAD zone. */
| return c >= RCUREF_RELEASED ? 0 : c + 1;
| }

so if it got negative on its final put, the c becomes -1/ 0xff…ff. This
+1 will be 0 and we do a resize. But it is negative and did not reach
RCUREF_DEAD yet so it can be bumbed back to positive. It will not be
deconstructed because the cmpxchg in rcuref_put_slowpath() fails. So it
will remains active. But we do a resize here and end up with to private
hash. That is why I had the `released' member.

Sebastian