Re: [PATCH v9 00/11] futex: Add support task local hash maps.
From: Peter Zijlstra
Date: Mon Mar 10 2025 - 11:58:37 EST
On Tue, Mar 04, 2025 at 03:58:37PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-03 17:40:16 [+0100], To Peter Zijlstra wrote:
> …
> > You avoided the two states by dropping refcount only there is no !new
> > pointer. That should work.
> …
> > My first few tests succeeded. And I have a few RCU annotations, which I
> > post once I complete them and finish my requeue-pi tests.
>
> get_futex_key() has this:
> |…
> | if (!fshared) {
> |…
> | if (IS_ENABLED(CONFIG_MMU))
> | key->private.mm = mm;
> | else
> | key->private.mm = NULL;
> |
> | key->private.address = address;
> |
>
> and now __futex_hash_private() has this:
> | {
> | if (!futex_key_is_private(key))
> | return NULL;
> |
> | if (!fph)
> | fph = rcu_dereference(key->private.mm->futex_phash);
>
> Dereferencing mm won't work on !CONFIG_MMU. We could limit private hash
> to !CONFIG_BASE_SMALL && CONFIG_MMU.
Humph, yeah, not sure we should care about !MMU.
> Ignoring this, I managed to crash the box on top of 49fd6b8f5d59
> ("futex: Implement FUTEX2_MPOL"). I had one commit on top to make the
> prctl not blocking (make futex_hash_allocate(, false)). This is simulate
> the fork resize. The backtrace:
> | [ T8658] BUG: unable to handle page fault for address: fffffffffffffff0
> | [ T8658] #PF: supervisor read access in kernel mode
> | [ T8658] #PF: error_code(0x0000) - not-present page
> | [ T8658] PGD 2c5a067 P4D 2c5a067 PUD 2c5c067 PMD 0
> | [ T8658] Oops: Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
> | [ T8658] CPU: 6 UID: 1001 PID: 8658 Comm: thread-create-l Not tainted 6.14.0-rc4+ #188 676565269ee73396c27dead3a66b3f774bd9af57
> | [ T8658] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
> | [ T8658] RIP: 0010:plist_check_list+0xb/0xa0
> | [ T8658] Code: cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 41 54 49 89 fc 55 53 48 83 ec 10 <48> 8b 1f 48 8b 43 08 48 39 c7 74 27 48 8b 4f 08 50 49 89 f8 48 89
> | [ T8658] RSP: 0018:ffffc90022e27c90 EFLAGS: 00010286
> | [ T8658] RAX: 0000000000000000 RBX: ffffc90022e27e00 RCX: 0000000000000000
> | [ T8658] RDX: ffff888558da02a8 RSI: ffff888558da02a8 RDI: fffffffffffffff0
> | [ T8658] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8885680dc980
> | [ T8658] R10: 0000031e8e1a7200 R11: ffff888574990028 R12: fffffffffffffff0
> | [ T8658] R13: ffff888558da02a8 R14: ffffc90022e27e48 R15: ffffc90022e27d38
> | [ T8658] FS: 00007f741af9e6c0(0000) GS:ffff8885a7c2b000(0000) knlGS:0000000000000000
> | [ T8658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> | [ T8658] CR2: fffffffffffffff0 CR3: 00000006d7aca005 CR4: 00000000000626f0
> | [ T8658] Call Trace:
> | [ T8658] <TASK>
> | [ T8658] plist_del+0x28/0x100
> | [ T8658] __futex_unqueue+0x29/0x40
> | [ T8658] futex_unqueue_pi+0x1f/0x40
> | [ T8658] futex_lock_pi+0x24d/0x420
> | [ T8658] do_futex+0x57/0x190
> | [ T8658] __x64_sys_futex+0xfe/0x1a0
>
> It takes about 1h+ to reproduce. And only on one particular stubborn
> box. This originates from futex_unqueue_pi() after
> futex_q_lockptr_lock(). I have another crash within
> futex_q_lockptr_lock() (in spin_lock()).
>
> This looks like the locking task was not enqueued in the hash bucket
> during the resize. This means there was a timeout and the unlocking task
> removed it while looking for the next owner. But the unlocking part
> acquired an additional reference to avoid a resize in that case. So,
> confused I am.
Yeah, weird that.
> I reverted to 50ca0ec83226 ("futex: Resize local futex hash table based
> on number of threads."), have the another "always resize hack" and so
> far it looks good.
> Looking at __futex_pivot_hash() there is this:
> | if (fph) {
> | if (rcuref_read(&fph->users) != 0) {
> | mm->futex_phash_new = new;
> | return false;
> | }
> |
> | futex_rehash_private(fph, new);
> | }
>
> So we stash the new pointer as long as rcuref_read() does not return 0.
> How stable is rcuref_read()'s 0 return actually? The code says:
>
> | static inline unsigned int rcuref_read(rcuref_t *ref)
> | {
> | unsigned int c = atomic_read(&ref->refcnt);
> |
> | /* Return 0 if within the DEAD zone. */
> | return c >= RCUREF_RELEASED ? 0 : c + 1;
> | }
>
> so if it got negative on its final put, the c becomes -1/ 0xff…ff. This
> +1 will be 0 and we do a resize. But it is negative and did not reach
> RCUREF_DEAD yet so it can be bumbed back to positive. It will not be
> deconstructed because the cmpxchg in rcuref_put_slowpath() fails. So it
> will remains active. But we do a resize here and end up with to private
> hash. That is why I had the `released' member.
I am not quite sure I follow. If rcuref_put_slowpath() returns true;
then the value has been set to DEAD (high nibble E), any concurrent
inc/dec will move it away from that a little, but it will always be set
back to DEAD (IOW, you need 1<<29 concurrent modifications into the same
direction to push it out of the DEAD range).
As long as it is within those 29 bits of DEAD, rcuref_read() should
return 0.