Re: 2.6.25-git2: BUG: unable to handle kernel paging request atffffffffffffffff

From: Linus Torvalds
Date: Mon Apr 21 2008 - 11:57:33 EST




On Sun, 20 Apr 2008, Paul E. McKenney wrote:
> >
> > But I do concede that your version looks clearer, and has the
> > benefit that should prefetch ever be optimised out with no side-
> > effects, yours would still be correct while the current one will
> > lose the barrier completely.
>
> Agreed as well -- compilers would also be within their right to bypass
> the rcu_dereference() around the test/prefetch, which would allow
> them to refetch.

That is *not* the main problem.

If you use "rcu_dereference()" on the wrong access, it not only loses the
"smp_read_barrier_depends()" (which is a no-op on all sane architectures
anyway), but it loses the ACCESS_ONCE() thing *entirely*.

Accessign a local automatic variable through a volatile pointer has
absolutely no effect - it's a total no-op apart from possibly generating
slightly worse code (although if I were a compiler, I'd just ignore it),
since the compiler is totally free to spill and reload the local variable
to its memory location - the stack - anyway!

So the important part (for sane architectures) of rcu_dereference() is
that ACCESS_ONCE() hack, and it _only_ works if you actually do it on the
value as it gets loaded from the RCU-protected data structure, not later.

So forget about the prefetch, and forget about the barrier. They had
nothing to do with the bug. The bug existed even without the prefetch,
even in the versions that didn't have it at all. For example, look at the
"__list_for_each_rcu()" thing - the bug is there too, because it did just

pos = (head)->next ... pos = pos->next

where both of those assignments to pos were done without rcu_derefence, so
the compiler could happily decide to use the value once, forget it, and
then re-load it later (when it might have changed).

In other words, the thing I objected to was something much more
fundamental than any barriers. It was the fact that "rcu_dereference()"
simply *fundamentally* doesn't make sense when done on a local variable,
it can only make sense when actually loading the value from the data
structure.

In short:

pos = ..

rcu_dereference(pos)

is crazy and senseless, but

pos = rcu_dereference(pos->next)

actually has some logical meaning.

Now, all this said, I seriously doubt this was the source of the bug
itself. I do not actually really believe that the compiler had much room
for reloading things with or without any rcu_dereference(), and I doubt
the code generation really changes all that much in practice.

(In fact, from a quick look, it seems that the only thing that
the incorrect use of "rcu_derference()" did was to force the "node"
variable onto the stack, since it did that volatime memory access through
its pointer - and fixing the use of rcu_dereference() just means that
"node" is kept in a register over the whole loop on x86-64, but the
compiler still needs a stack slot, it just picks "str" instead. Which is a
much better choice anyway.

So what the incorrect use of rcu_dereference() really resulted in was just
this insane code (which is also seen in the BUG code):

14: 48 8b 45 d0 mov -0x30(%rbp),%rax
18: 48 8b 00 mov (%rax),%rax
1b: 48 89 45 d0 mov %rax,-0x30(%rbp)
1f: 48 8b 45 d0 mov -0x30(%rbp),%rax
23: 48 85 c0 test %rax,%rax
26: 74 18 je 0x40

and notice how insane that is, and how pointless?

First it loads %rax (node) from the ->next pointer of the previous value
of 'node' (which is a stack variable at -48(rbp)):

# node = node->next
mov -0x30(%rbp),%rax
mov (%rax),%rax

then it saves that to the stack and immediately reloads it (because of the
volatile access on "pos" in "rcu_dereference(pos)"):

# rcu_dereference(node)
mov %rax,-0x30(%rbp)
mov -0x30(%rbp),%rax

and then it tests it for being NULL:

test %rax,%rax
je 0x40

and notice how the only thing that rcu_dereference() did was a totally
unnecessary store and load to the stack? But also notice how gcc could
have done the accesses to "node" *before* this entry as multiple loads
from the original because there was nothing really holding this back.

(But also notice how there really isn't much room for that in practice,
since the code that actually uses "node->next" isn't going to do a whole
lot of exciting stuff with it).

With the corrected version, the insane "store and immediately reload from
stack" goes away" and diffstat on the assembly language actually shows
that there are two less instructions (most of the changes are just
compiler labels moving around, but there are a few real changes that
actually makes the assembler code look a bit more natural too).

So to recap: I don't think this mattered in practice. But the code was
buggy in theory, even though in practice I don't think it would ever
generate any reloads on that "next" variable simply because nothing else
than the loop logic really used it.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/