Re: [GIT PULL] tracing: Fixes to bootconfig memory management

From: Vlastimil Babka
Date: Wed Sep 15 2021 - 05:28:30 EST


On 9/15/21 01:29, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 3:48 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>>
>> Well, looks like I can't. Commit 77e02cf57b6cf boots fine for me,
>> multiple times. But so does the parent commit 6a4746ba06191 now. Looks like
>> the magic is gone. I'm now surprised how deterministic it was during the
>> bisect (most bad cases manifested on the first boot, only a few on the second).
>
> Well, your report was clearly memory corruption by the invalid
> memblock_free() just ending up causing random problems later on.
>
> So it could easily be 100% deterministic with a certain memory layout
> at a particular commit. And then enough other changes later, and it's
> all gone, because the memory corruption now hits something else that
> didn't even care.
>
> The code for your oops was
>
>    0:   48 8b 17        mov    (%rdi),%rdx
>    3:   48 39 d7        cmp    %rdx,%rdi
>    6:   74 43           je     0x4b
>    8:   48 8b 47 08     mov    0x8(%rdi),%rax
>    c:   48 85 c0        test   %rax,%rax
>    f:   74 23           je     0x34
>   11:   49 89 c0        mov    %rax,%r8
>   14:*  48 8b 40 10     mov    0x10(%rax),%rax    <-- trapping instruction
>
> and that's the start of rb_next(), so what's going on is that
> "rb->rb_right" (the second word of 'struct rb_node') ends up having
> that value in %rax:
>
> RAX: 343479726f6d656d
>
> which is ASCII "44yromem" rather than a valid pointer if I looked that up right.

Yep, I was pretty sure it was related to the
"/sys/bus/memory/devices/memory44" sysfs object and bisection would lead to
kobject/sysfs or some memory hotplug related changes. So the result was a
surprise.
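
A minimal userspace sketch of the two observations above, assuming LP64
x86-64 (8-byte long and pointers). The struct only mirrors the field order
of the kernel's struct rb_node; the names rb_node_sketch and
reg_to_memory_ascii are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace mirror of the field order in the kernel's struct rb_node.
 * The kernel names the first field __rb_parent_color; it is renamed here
 * because double-underscore identifiers are reserved in userspace C.
 */
struct rb_node_sketch {
    unsigned long rb_parent_color;    /* offset 0x00 on LP64 */
    struct rb_node_sketch *rb_right;  /* offset 0x08: mov 0x8(%rdi),%rax  */
    struct rb_node_sketch *rb_left;   /* offset 0x10: mov 0x10(%rax),%rax */
};

/*
 * Hypothetical helper: write the 8 bytes of a register value into out[]
 * in the order a little-endian CPU stores them in memory (lowest byte
 * first), replacing non-printable bytes with '.', then NUL-terminate.
 */
void reg_to_memory_ascii(uint64_t reg, char out[9])
{
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        unsigned char byte = (unsigned char)(reg >> (8 * i));
        out[i] = (byte >= 0x20 && byte < 0x7f) ? (char)byte : '.';
    }
    out[8] = '\0';
}
```

Calling reg_to_memory_ascii(0x343479726f6d656dULL, buf) leaves "memory44" in
buf: read highest byte first the value spells "44yromem", but in memory
order it is the tail of the sysfs object name. The offsets in the struct
comments line up with the two loads in the oops: the first fetches rb_right
from the node in %rdi, and the trapping one tries to read rb_left through
that corrupted rb_right value.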

> And just _slightly_ different allocation patterns, and your 'struct
> rb_node' gets allocated somewhere else, and you don't see the oops at
> all, or you get it later in some different place.
>
> Most memory corruption doesn't cause oopses, because most memory isn't
> used as pointers etc.
>
> What you _could_ try if you care enough is
>
> - go back to the thing you bisected to where you can still hopefully
> recreate the problem
>
> - apply that patch at that point with no other changes
>
> and then the test would hopefully be closer to the state where you could
> re-create the problem.
>
> And hopefully it would still not reproduce, just because the bug is
> fixed, of course ;)

Yeah, that worked! Commit 40caa127f3c7 was still broken, and cherry-picking
77e02cf57b6cf on top of it fixed it. Thanks!

> The very unlikely alternative is that your bisect was just pure random
> bad luck and hit the wrong commit entirely, and the oops was due to
> some other problem.
>
> But it does seem unlikely to be something else. Usually when bisects
> go off into the weeds due to not being reproducible, they go very
> obviously off into the weeds rather than point to something that ends
> up having a very similar bug.
>
> Linus
>