Re: [GIT PULL] tracing: Fixes to bootconfig memory management

From: Linus Torvalds
Date: Tue Sep 14 2021 - 19:29:49 EST


On Tue, Sep 14, 2021 at 3:48 PM Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> Well, looks like I can't. Commit 77e02cf57b6cf does boot fine for me,
> multiple times. But so now does the parent commit 6a4746ba06191. Looks like
> the magic is gone. I'm now surprised how deterministic it was during the
> bisect (most bad cases manifested on first boot, only few at second).

Well, your report was clearly memory corruption by the invalid
memblock_free() just ending up causing random problems later on.

So it could easily be 100% deterministic with a certain memory layout
at a particular commit. And then enough other changes later, and it's
all gone, because the memory corruption now hits something else that
didn't even care.

The code for your oops was

   0:   48 8b 17                mov    (%rdi),%rdx
   3:   48 39 d7                cmp    %rdx,%rdi
   6:   74 43                   je     0x4b
   8:   48 8b 47 08             mov    0x8(%rdi),%rax
   c:   48 85 c0                test   %rax,%rax
   f:   74 23                   je     0x34
  11:   49 89 c0                mov    %rax,%r8
  14:*  48 8b 40 10             mov    0x10(%rax),%rax          <-- trapping instruction

and that's the start of rb_next(), so what's going on is that
"rb->rb_right" (the second word of 'struct rb_node') ends up having
that value in %rax:

RAX: 343479726f6d656d

which is ASCII "44yromem" rather than a valid pointer if I looked that up right.
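
One quick way to double-check what a register value like that spells is to print its bytes in memory order. A minimal userspace sketch; reg_to_ascii() is a hypothetical helper made up for illustration, not anything from the kernel tree:

```c
#include <stdint.h>

/*
 * Write the 8 bytes of a 64-bit value into 'out', least-significant
 * byte first (the order in which little-endian x86-64 stores them in
 * memory), and NUL-terminate the result. Non-printable bytes are
 * shown as '.'.
 */
static void reg_to_ascii(uint64_t val, char out[9])
{
	for (int i = 0; i < 8; i++) {
		unsigned char c = (unsigned char)((val >> (8 * i)) & 0xff);

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[8] = '\0';
}
```

Calling reg_to_ascii(0x343479726f6d656dULL, buf) with a 9-byte buffer leaves "memory44" in buf. Those are the same eight characters as "44yromem", just read in the order they sit in memory, so the word that got scribbled over the pointer looks like string data beginning with "memory".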

And just _slightly_ different allocation patterns, and your 'struct
rb_node' gets allocated somewhere else, and you don't see the oops at
all, or you get it later in some different place.
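
For reference, the offsets in the disassembly above line up with the 'struct rb_node' layout roughly as follows. This is a userspace sketch assuming the usual 64-bit layout (parent/color word, then rb_right, then rb_left); the *_sketch names are made up here, and the real definitions live in the kernel's rbtree headers and lib/rbtree.c:

```c
#include <stddef.h>

/* Assumed mirror of the kernel's struct rb_node on a 64-bit build. */
struct rb_node_sketch {
	unsigned long __rb_parent_color;	/* offset 0x00 */
	struct rb_node_sketch *rb_right;	/* offset 0x08 */
	struct rb_node_sketch *rb_left;		/* offset 0x10 */
};

/* The parent pointer is the first word with the low two bits masked off. */
static struct rb_node_sketch *rb_parent_sketch(const struct rb_node_sketch *n)
{
	return (struct rb_node_sketch *)(n->__rb_parent_color & ~3UL);
}

/* In-order successor, following the shape of rb_next(). */
static struct rb_node_sketch *rb_next_sketch(struct rb_node_sketch *node)
{
	struct rb_node_sketch *parent;

	/* mov (%rdi),%rdx; cmp %rdx,%rdi; je: node is its own parent (empty) */
	if (node->__rb_parent_color == (unsigned long)node)
		return NULL;

	/* mov 0x8(%rdi),%rax; test %rax,%rax; je: is there a right child? */
	if (node->rb_right) {
		node = node->rb_right;		/* mov %rax,%r8 */
		/* mov 0x10(%rax),%rax: this load faults if rb_right is garbage */
		while (node->rb_left)
			node = node->rb_left;
		return node;
	}

	/* No right child: climb until we arrive from a left child. */
	while ((parent = rb_parent_sketch(node)) && node == parent->rb_right)
		node = parent;

	return parent;
}
```

The first dereference of a corrupted rb_right is the node->rb_left load at offset 0x10, which matches the trapping instruction. If the overwritten word had landed in a node nobody walked, or in memory not used as a pointer at all, nothing would have trapped.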

Most memory corruption doesn't cause oopses, because most memory isn't
used as pointers etc.

What you _could_ try if you care enough is

- go back to the commit you bisected to, where you can hopefully still
recreate the problem

- apply that patch at that point with no other changes

and then the test would hopefully be closer to the state in which you
could re-create the problem.

And hopefully it would still not reproduce, just because the bug is
fixed, of course ;)

The very unlikely alternative is that your bisect was just pure random
bad luck and hit the wrong commit entirely, and the oops was due to
some other problem.

But it does seem unlikely to be something else. Usually when bisects
go off into the weeds due to not being reproducible, they go very
obviously off into the weeds rather than point to something that ends
up having a very similar bug.

Linus