Re: 2.6.24-rc6-mm1

From: Torsten Kaiser
Date: Sun Jan 06 2008 - 05:31:03 EST


On Jan 6, 2008 9:27 AM, Jarek Poplawski <jarkao2@xxxxxxxxx> wrote:
> On Sat, Jan 05, 2008 at 03:52:32PM +0100, Torsten Kaiser wrote:
> ...
> > So my personal conclusion would be, that someone is writing to memory
> > that he no longer owns. Most probably 0-bytes. (the complete_routine
> > got NULLed and the warning about dst->__refcnt being 0).
> >
> > Use-after-free or something else?
>
> I agree: your conclusion seems to be the most probable explanation for
> this. Then it could be really hard to solve this without bisection or
> something similar. But there is some probabability this something could
> try kfree later too, but simply this list debugging triggers earlier.

As for example in the case when it dies in ieee1394-thread the list is
so corrupted that it will die anyway.

But I might try this anyway, as I don't really have a better idee.

> > > > If you think some other slub_debug might catch it, I would try this...
>
> You can try to add "U" to these other slub_debug options. As a matter
> of fact, if your above diagnose is right, it seems you risk to damage
> your system or even the box with these tests, so if you want to
> continue, you should probably turn any possible debugging on (not in
> mm only).

I did not add U, because I thought that would only needed to trace memory leaks.
And I hoped that using P (poison) would catch any later use (after free).

> BTW, you've written that some debugging options seem to delay the bug.
> Since they often change sizes of some structures than such wrong
> writes could have some 'safer' offsets. So, this could really delay
> e.g. these list's bugs, but maybe this could also let to stay 'alive'
> to such wrong kfree?

I think this bug is highly timing dependent. Its not always the same
package that dies and as this is a SMP system I would guess two CPUs
using the same data will trigger this.
And using the poison-option will definitily slow the system down and
mess up the timings.

What also speaks against the 'safer' offsets is, that after adding my
notfreed-byte to skbuff the bug still triggered in the same way.

I'm currently looking at
http://www.mail-archive.com/linux-scsi@xxxxxxxxxxxxxxx/msg12702.html
,trying to understand if this is relevant for me on x86_64.

Torsten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/