Re: [linus:master] [hugetlb] 7118fc2906: kernel_BUG_at_lib/list_debug.c

From: Vlastimil Babka
Date: Wed Jan 18 2023 - 09:02:17 EST


On 1/17/23 19:25, Linus Torvalds wrote:
> On Tue, Jan 17, 2023 at 4:22 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
>>
>> With the following patch to use 'O1' instead 'O2' gcc optoin for
>> page_alloc.c, the list corruption issue can't be reproduced for
>> commit 7118fc2906 in 1000 runs.
>
> Ugh.
>
> It would be lovely if you could just narrow it down with
>
> #pragma GCC optimize ("O1")
> ...
> #pragma GCC optimize ("O2")
>
> around just that prep_compound_page(), but when I tried it myself I
> get some function attribute mismatch errors.
>
>
>> As is can't be reproduced with X86_64 build, it could be i386
>> compiling related.
>
> Your particular config causes a huge amount of nasty 64-bit arithmetic
> according to the objdump code, with sequences like
>
> c13b3cbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0
> c13b3cc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4
>
> which seems to be just from some coverage profiling being on
> (CONFIG_GCOV?), or something. It makes it very hard to read the code.

I think whatever this is, it's also responsible for the bug.
Now from all the dumps I've seen so far, the struct page corruptions looked
like prep_compound_page() is called more times than it should. Which didn't
make sense as the code is very simple, a loop of 1 << order - 1 iterations.

If I look at Feng's disassembly of 48b8d744ea84 (the "good" parent commit),
I can work out that edi holds the 1 << order, ebx is initialized as 1, and
there's a nice clear "inc ebx, cmp ebx, edi, jne <loop>".

Now the disassembly of the "bad" commit is much harder to follow when it
comes to the iteration control. There are xors and or's involved so I didn't
even try to work out the meaning. Clearly it can't be deterministically
wrong or it would crash always and fast. But crucially I think it uses
memory addresses 0xc56c28f8 and 0xc56c28fc in a racy way and that could
explain the bug.

Before the loop iteration itself, it initializes several registers from
these addreses, and the values end up modifying also registers used for the
loop iteration control (ebx/esi) And crucially in the loop iteration it also
writes some values to the same addresses, and there seems to be no
synchronization whatsoever. So it seems to me, even if these addresses are
supposed to be "private" to prep_compound_page(), running two
prep_compound_page() in parallel, or even due to interrupt (Hyeonggon did
reproduce this with -smp 1 as well) can result in a race ultimately
affecting the number of iterations taken.

Attaching the disasm I annotated for more details. Everything register whose
value is coming from 0xc56c28f8/0xc56c28fc in some way is annotated as
"crap" there (sorry).

> You also have UBSAN enabled, which - again - makes for some really
> grotty asm that hides any actual logic.
>
> Finally, your objdump version also does some horrendous decoding, like
>
> c13b3e29: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi
>
> which is just a 7-byte 'nop' instruction, but again, it makes it
> rather hard to actually read the code.
>
> With the i386 defconfig, gcc generates a function that is just ~30
> instructions for me, so this makes a huge difference in the legibility
> of the code.
>
> I wonder if you can recreate the issue with a much more
> straightforward config. By all means, leave DEBUG_PAGEALLOC and SLUB
> debugging on, but without the things like UBSAN and GCOV.
>
>> I also objdumped 'prep_compound_page' for vmlinux of 7118fc2906 and
>> its parent commit 48b8d744ea84, which have big difference than the
>> simple 'set_page_count()' change, but I can't tell which part is
>> abnormal, so attach them for further check.
>
> Yeah, I can't make heads or tails of them either, see above on how
> illegible the objdump files are. And that's despite not even having
> all of prep_compound_page() in them (it's missing
> prep_compound_page.cold, which is probably just UBSAN fixup code, but
> who knows..)
>
> That said, with the i386 defconfig, the only change from adding
> set_page_count() to the loop seems to be exactly that:
>
> .L589:
> - movl $1024, 12(%eax)
> + movl $0, 28(%eax)
> addl $32, %eax
> + movl $1024, -20(%eax)
> movl %esi, -28(%eax)
> movl $0, -12(%eax)
> cmpl %edx, %eax
>
> (don't ask me why gcc does *one* access using the pre-incremented
> pointer, and then the rest to the post-incremented ones, but whatever
> - it means that it's not just "add a mov $0", it's also changing how
> the
>
> p->mapping = TAIL_MAPPING;
>
> instruction is done, which is that
>
> - movl $1024, 12(%eax)
> + movl $1024, -20(%eax)
>
> part of the change)
>
> Linus
inputs:
eax: page
edx: order

stack below ebp
-0x04 saved edi
-0x08 saved esi
-0x0c saved ebp
(sub $0x14,%esp)
-0x10 crap
-0x14 crap
-0x18 page|1 (compound_head value)
-0x1c copy of order
-0x20 page
-0x24
esp

eax = page|1 (compound_head value) / crap
edx = page / crap
edi = page / crap
ebx = 1 << order (=2) / 1 << order -2 (=0) / crap!
ecx = page + 1 (first tail page)
esi = 0 / crap

c13b3b90 <prep_compound_page>:
c13b3b90: 55 push %ebp
c13b3b91: 89 e5 mov %esp,%ebp
c13b3b93: 57 push %edi
c13b3b94: 89 c7 mov %eax,%edi edi = page
c13b3b96: 56 push %esi
c13b3b97: 53 push %ebx
c13b3b98: 83 ec 14 sub $0x14,%esp
c13b3b9b: 83 fa 1f cmp $0x1f,%edx
c13b3b9e: 89 55 e4 mov %edx,-0x1c(%ebp)
c13b3ba1: 0f 87 33 31 ed 01 ja c3286cda <prep_compound_page.cold> only when order > 31 (?)
c13b3ba7: 0f b6 4d e4 movzbl -0x1c(%ebp),%ecx ecx = order
c13b3bab: bb 01 00 00 00 mov $0x1,%ebx ebx = 1
c13b3bb0: d3 e3 shl %cl,%ebx ebx = 1 << order (=2)
c13b3bb2: 83 3f ff cmpl $0xffffffff,(%edi)
c13b3bb5: 0f 84 65 02 00 00 je c13b3e20 <prep_compound_page+0x290> only when page flags == 0xffffffff (?)
c13b3bbb: 83 05 d0 28 6c c5 01 addl $0x1,0xc56c28d0
c13b3bc2: 83 15 d4 28 6c c5 00 adcl $0x0,0xc56c28d4
c13b3bc9: 0f ba 2f 10 btsl $0x10,(%edi) sets bit 0x10 (PG_head) in page flags
c13b3bcd: 83 05 f0 28 6c c5 01 addl $0x1,0xc56c28f0
c13b3bd4: 83 15 f4 28 6c c5 00 adcl $0x0,0xc56c28f4
c13b3bdb: 83 fb 01 cmp $0x1,%ebx
c13b3bde: 0f 8e 80 00 00 00 jle c13b3c64 <prep_compound_page+0xd4> nr_pages <= 1 -> skip over the whole loop
c13b3be4: 8d 47 01 lea 0x1(%edi),%eax eax = page|1 (compound_head value)
c13b3be7: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx edx now some crap, we don't know how this was initialized !!!
c13b3bed: 89 45 e8 mov %eax,-0x18(%ebp) saves the page|1
c13b3bf0: a1 f8 28 6c c5 mov 0xc56c28f8,%eax eax now some crap, again we don't know how this was initialized !!!
c13b3bf5: 8d 4f 28 lea 0x28(%edi),%ecx ecx = edi + 0x28 - first tail page
c13b3bf8: 89 7d e0 mov %edi,-0x20(%ebp) saves page pointer
c13b3bfb: 83 c0 01 add $0x1,%eax crap + 1
c13b3bfe: 89 45 ec mov %eax,-0x14(%ebp) saves the crap
c13b3c01: 83 d2 00 adc $0x0,%edx crap in crap out
c13b3c04: a1 f8 28 6c c5 mov 0xc56c28f8,%eax reset as crap
c13b3c09: 89 55 f0 mov %edx,-0x10(%ebp) save more crap
c13b3c0c: 8b 15 fc 28 6c c5 mov 0xc56c28fc,%edx reset as crap
c13b3c12: 83 eb 02 sub $0x2,%ebx ebx is nr_pages - 2 (for order-1 was 2, now 0) - remaining tail pages after first iteration?
c13b3c15: 31 f6 xor %esi,%esi esi is 0
c13b3c17: 83 c0 02 add $0x2,%eax crap + 2
c13b3c1a: 83 d2 00 adc $0x0,%edx crap
c13b3c1d: 01 c3 add %eax,%ebx ebx is now modified by crap !!! !!!
c13b3c1f: 8b 45 ec mov -0x14(%ebp),%eax reset as crap
c13b3c22: 11 d6 adc %edx,%esi esi now also crap
c13b3c24: 8b 55 f0 mov -0x10(%ebp),%edx reset as crap
c13b3c27: 89 f7 mov %esi,%edi edi now also crap, oh my
c13b3c29: 89 de mov %ebx,%esi supposed to be the remaining tail pages, byt modified by crap
c13b3c2b: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi elaborate nop
c13b3c2f: 90 nop nop

// this should be the loop for prep_compound_tail
c13b3c30: a3 f8 28 6c c5 mov %eax,0xc56c28f8 uh now we saved some value to the memory we read previously assuming some known good value? !!!
c13b3c35: 8b 5d e8 mov -0x18(%ebp),%ebx ebx = page|1 (compound_head value)
c13b3c38: 83 c0 01 add $0x1,%eax still crap
c13b3c3b: 89 15 fc 28 6c c5 mov %edx,0xc56c28fc and now we wrote to the other address as well? !!!
c13b3c41: 83 d2 00 adc $0x0,%edx still crap
c13b3c44: 83 c1 28 add $0x28,%ecx ecx pointed to first tail pages, now points to second tail page
c13b3c47: c7 41 e4 00 04 00 00 movl $0x400,-0x1c(%ecx) (first tail page)->mapping = TAIL_MAPPING;
c13b3c4e: 89 59 dc mov %ebx,-0x24(%ecx) (first tail page)->compound_head is now set
c13b3c51: 89 fb mov %edi,%ebx ebx now crap
c13b3c53: 31 d3 xor %edx,%ebx ebx xored with more crap
c13b3c55: 89 5d ec mov %ebx,-0x14(%ebp) we might need that crap later
c13b3c58: 89 f3 mov %esi,%ebx supposed to be the remaining tail pages, byt modified by crap
c13b3c5a: 31 c3 xor %eax,%ebx yeah why not xor it with more crap
c13b3c5c: 0b 5d ec or -0x14(%ebp),%ebx and "or" it with the previously saved crap
c13b3c5f: 75 cf jne c13b3c30 <prep_compound_page+0xa0> and that's how we decided if we should do another iteration :(

c13b3c61: 8b 7d e0 mov -0x20(%ebp),%edi restore edi = page

// here we land if we skip everything at c13b3bde
c13b3c64: c6 47 30 01 movb $0x1,0x30(%edi)
// the rest shouldn't be important