Re: [PATCH 3/4] rhashtable: use bit_spin_locks to protect hash bucket.

From: NeilBrown
Date: Wed Apr 10 2019 - 20:49:09 EST


On Wed, Apr 10 2019, Guenter Roeck wrote:

> Hi,
>
> On Tue, Apr 02, 2019 at 10:07:45AM +1100, NeilBrown wrote:
>> This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
>> bucket pointer to lock the hash chain for that bucket.
>>
>> The benefits of a bit spin_lock are:
>> - no need to allocate a separate array of locks.
>> - no need to have a configuration option to guide the
>> choice of the size of this array
>> - locking cost is often a single test-and-set in a cache line
>> that will have to be loaded anyway. When inserting at, or removing
>> from, the head of the chain, the unlock is free - writing the new
>> address in the bucket head implicitly clears the lock bit.
>> For __rhashtable_insert_fast() we ensure this always happens
>> when adding a new key.
>> - even when lockings costs 2 updates (lock and unlock), they are
>> in a cacheline that needs to be read anyway.
>>
>> The cost of using a bit spin_lock is a little bit of code complexity,
>> which I think is quite manageable.
>>
>> Bit spin_locks are sometimes inappropriate because they are not fair -
>> if multiple CPUs repeatedly contend of the same lock, one CPU can
>> easily be starved. This is not a credible situation with rhashtable.
>> Multiple CPUs may want to repeatedly add or remove objects, but they
>> will typically do so at different buckets, so they will attempt to
>> acquire different locks.
>>
>> As we have more bit-locks than we previously had spinlocks (by at
>> least a factor of two) we can expect slightly less contention to
>> go with the slightly better cache behavior and reduced memory
>> consumption.
>>
>> To enhance type checking, a new struct is introduced to represent the
>> pointer plus lock-bit
>> that is stored in the bucket-table. This is "struct rhash_lock_head"
>> and is empty. A pointer to this needs to be cast to either an
>> unsigned lock, or a "struct rhash_head *" to be useful.
>> Variables of this type are most often called "bkt".
>>
>> Previously "pprev" would sometimes point to a bucket, and sometimes a
>> ->next pointer in an rhash_head. As these are now different types,
>> pprev is NULL when it would have pointed to the bucket. In that case,
>> 'blk' is used, together with correct locking protocol.
>>
>> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
>
> This patch causes my qemu q800 boot test to crash reliably.
>
> Starting network: Unable to handle kernel access at virtual address (ptrval)
> Oops: 00000000
> Modules linked in:
> PC: [<00248b90>] sk_filter_trim_cap+0x56/0x158
> SR: 2000 SP: (ptrval) a2: 07a30aa0
> d0: 07836300 d1: 0783666c d2: 00000001 d3: 0025c192
> d4: 0025a636 d5: 00248b3a a0: 078363fe a1: 60000000
> Process ip (pid: 66, task=(ptrval))
> Frame format=7 eff addr=6000000c ssw=0505 faddr=6000000c
> wb 1 stat/addr/data: 0000 00000000 00000000
> wb 2 stat/addr/data: 0000 00000000 00000000
> wb 3 stat/addr/data: 0000 6000000c 00000000
> push data: 00000000 00000000 00000000 00000000
> Stack from 07a5bdec:
> 07a5be5c 0025c192 0025a636 00248b3a 00000000 ef9febc8 078363fe 0787a2d0
> 0783645c 07a5be5c 0025c192 0025a636 00248b3a 0787a2d0 07a2c000 0025c470
> 078363fe 0787a2d0 00000001 00000000 00000000 07a5be98 07a17500 00000000
> 0787a2d0 07a2c000 07a5bef8 07a5beac 7fffffff 0025c7e2 07a2c000 0787a2d0
> 00000000 00000000 00000000 0000000c ef9feb14 ef9feb14 00000000 0031a52e
> 07a5bef8 07a5bf28 0000000c 0781eba0 00000000 00000042 00000000 00000000
> Call Trace: [<0025c192>] netlink_attachskb+0x0/0x138
> [<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
> [<00248b3a>] sk_filter_trim_cap+0x0/0x158
> [<0025c192>] netlink_attachskb+0x0/0x138
> [<0025a636>] __netlink_lookup.isra.3+0x0/0xbc
> [<00248b3a>] sk_filter_trim_cap+0x0/0x158
> [<0025c470>] netlink_unicast+0x170/0x1be
> [<0025c7e2>] netlink_sendmsg+0x288/0x2b2
> [<0021a5be>] sock_sendmsg+0x1c/0x44
> [<0021b8a6>] __sys_sendto+0xac/0xd2
> [<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
> [<00100000>] ext4_read_block_bitmap_nowait+0x4d4/0x4ec
> [<00219906>] sock_alloc_file+0x50/0x80
> [<000b7a1c>] fd_install+0x12/0x18
> [<0021b1cc>] __sys_socket+0x7e/0x9c
> [<0021b8ea>] sys_sendto+0x1e/0x24
> [<00002a40>] syscall+0x8/0xc
> [<0000c004>] ATANTBL+0x618/0x800
> Code: 4a89 6604 4280 60ea 2c2b 000c 2748 000c <2869> 000c 082c 0003 0002 6728 4878 0014 7620 4873 3800 486e ffec 4eb9 002e 5b88

Thanks for testing and for the report.
The above code disassembles to:

0: 4a89 tstl %a1
2: 6604 bnes 0x8
4: 4280 clrl %d0
6: 60ea bras 0xfffffff2
8: 2c2b 000c movel %a3@(12),%d6
c: 2748 000c movel %a0,%a3@(12)
10:* 2869 000c moveal %a1@(12),%a4 <-- trapping instruction
14: 082c 0003 0002 btst #3,%a4@(2)
1a: 6728 beqs 0x44
1c: 4878 0014 pea 0x14
20: 7620 moveq #32,%d3
22: 4873 3800 pea %a3@(0000000000000000,%d3:l)
26: 486e ffec pea %fp@(-20)
2a: 4eb9 002e 5b88 jsr 0x2e5b88

And as %a1 is 60000000, it crashes.
I'm not familiar with m68k assembler, but a bit of hunting suggests that
moveal %a1@(12),%a4

means "add 12 to %1, load an address from there, then dereference that
address again.
I think that both skb->sk and filter->prog are at offsets of 12, so I
guess
8: 2c2b 000c movel %a3@(12),%d6
is
struct sock *save_sk = skb->sk;

c: 2748 000c movel %a0,%a3@(12)
is
skb->sk = sk;
and
10:* 2869 000c moveal %a1@(12),%a4 <-- trapping instruction
14: 082c 0003 0002 btst #3,%a4@(2)
is
bpf_prog_run_save_cb(filter->prog, skb);
if (unlikely(prog->cb_access)) {

cb_access might be bit 3 of the byte 2 along from *prog, which is 12
along from a1.

That makes a1 'filter', loaded from sk->sk_filter, where 'sk' is %a0,
078363fe

That address is 2-byte aligned, which is probably wrong.... If addresses
of structs aren't always 4-byte aligned, then using BIT(1) for a lock
bit isn't going to work.

.... and after googling a bit I see that 68000 require 2-byte alignment,
but not 4-byte. Oh..

That means there aren't two spare bits in an address, so I cannot use
one for the NULLS and one for a lock bit. Bother.

I might be able to find a different way forward, but for now I think we
need to drop this series.

Thanks,
NeilBrown

Attachment: signature.asc
Description: PGP signature