Re: Linux 3.1-rc7

From: Arnaud Lacombe
Date: Wed Sep 28 2011 - 12:39:00 EST


Hi,

On Wed, Sep 28, 2011 at 11:47 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Sep 27, 2011 at 10:34 PM, Arnaud Lacombe <lacombar@xxxxxxxxx> wrote:
>>
>> <off-topic>
>> Speaking of corruption, I'm encountering another set on an external
>> hard-drive, connected through USB.
>
> I don't think it's unrelated or off-topic.
>
>>     The same corruption pop up (at least in those text file): a
>> sequence of 4 bytes is replaced by 0x000000E0 at offset 0x1E4 of the
>> start of the file for some of them, 0x3E4 for two other (same
>> corruption though). Locating the corruption will be more tricky in
>> binary files.
>
> So it's possible that it's some rogue kernel pointer. We've certainly
> had those before. Constants offsets like that happen with some
> structure allocation that just happens to be say 1kB in size, and the
> rogue kernel pointer assigns at a fixed offset to something that has
> already been free'd.
>
> You might want to try to compile the kernel with SLUB_DEBUG_ON set,
> and possibly also DEBUG_PAGEALLOC.
>
I'll give it a try.
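
A minimal sketch for confirming that a given kernel build really has the
two options suggested above. The helper name check_debug_opts and the
default /boot/config-$(uname -r) path are assumptions (a common distro
convention), not something from this thread:

```shell
# Report whether the two debug options are set to "y" in a kernel
# config file. Hypothetical helper; the default path is an assumption.
check_debug_opts() {
    config="${1:-/boot/config-$(uname -r)}"
    for opt in CONFIG_SLUB_DEBUG_ON CONFIG_DEBUG_PAGEALLOC; do
        # Match only an explicit "=y"; "# ... is not set" lines do not count.
        if grep -q "^${opt}=y" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
        fi
    done
}
```

It can also be pointed at the .config of a build tree, e.g.
check_debug_opts /src/linux/linux/.config.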

> HOWEVER. It's quite possible that it's hardware too.
>
Yes.

>> I may not trust the drive, but the fact that only known offset are
>> corrupted (in text files), the exact same way, sounds too much of a
>> coincidence. Anyway, I started a long SMART self test to see if it
>> catches anything, as there was no DMA transfer error[0].
>
> It *could* be the disk, but it's much more likely to be something like
> memory or a bad cable. Which wouldn't show up with SMART, since that
> just tests internal disk issues.
>
At some point I did not trust the internal disk, but the SMART tests
(`short', `long', `conveyance') all passed. I'd assume that a bad cable
between the USB adapter and the disk would show up in the
UDMA_CRC_Error_Count counter (it already has), and that the resulting
corruption would be truly random. I'm not sure whether USB does any kind
of data checksumming between the host and the device; I'd assume so.
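
A rough sketch for watching that counter over time. It assumes the usual
attribute table printed by `smartctl -A` (attribute name in column 2, raw
value in column 10); the helper name crc_error_count and the /dev/sdX
placeholder are hypothetical:

```shell
# Print the raw value of UDMA_CRC_Error_Count from `smartctl -A` output
# read on stdin. Assumes the standard ATA attribute table layout, where
# the attribute name is the 2nd column and the raw value is the 10th.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Hypothetical usage; /dev/sdX is a placeholder, and many USB bridges
# need `-d sat` before they pass SMART commands through:
#   smartctl -A -d sat /dev/sdX | crc_error_count
```

Comparing the value before and after a large copy to the drive would show
whether the link between the adapter and the disk is still logging CRC
errors.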

> Do you get some occasional random SIGSEGV's too?
>
Not many over the last month; mostly chrome (unstable version) and WIP stuff:

# sed -r '/kernel:.*segfault/!d; s/.*kernel:.* ([a-z]+)\[.*/\1/' \
    /var/log/messages* | sort | uniq -c
4 chrome
5 conf
19 nconf

However, the list of programs that dumped core is different:
# sed '/core dump/!d; s/.*(\(.*\)) to .*/\1/' /var/log/messages* |
sort | uniq -c
1 /bin/zsh
1 /src/linux/linux/scripts/kconfig/conf
24 /src/linux/linux/scripts/kconfig/nconf
1 /opt/google/chrome/chrome
2 /usr/bin/evince
1 /usr/bin/mplayer
45 /usr/lib64/nspluginwrapper/npviewer.bin

For comparison, on another machine (F15, 2.6.40.4, chromium
13.0.782.215) over the same time period:

# sed -r '/kernel:.*segfault/!d; s/.*kernel:.* ([a-z\.-]+)\[.*/\1/' \
    /var/log/messages* | sort | uniq -c
1 aplay
33 chromium-browser

and no core dumps are listed there (my setup may be wrong).
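
The two segfault one-liners above differ only in the character class used
for the program name. A possible single helper is sketched below. It
assumes the usual "name[pid]: segfault at ..." kernel log format and
anchors on that part of the line, so a later "name[" token on the same
line is not picked up by the greedy match. The name count_segfaults is
hypothetical:

```shell
# Tally segfaults per program from syslog-style files (or stdin).
# Assumes kernel lines of the form
#   ... kernel: [timestamp] name[pid]: segfault at ...
# and captures the token immediately before "[pid]: segfault".
count_segfaults() {
    sed -n -E 's/.*kernel:.* ([^ ]+)\[[0-9]+\]: segfault.*/\1/p' "$@" |
        sort | uniq -c
}

# Hypothetical usage:
#   count_segfaults /var/log/messages*
```

Because the name is taken from the process field rather than from a
lowercase-only match, names containing dots, dashes or digits are counted
without adjusting the pattern per machine.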

I'll try to gather more information.
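
One way to gather it for binary files as well: a sketch that checks each
file for the pattern reported earlier in the thread (4 bytes replaced by
0x000000E0 at offset 0x1E4 or 0x3E4). The byte order is an assumption,
since the report does not say how the value is laid out in the file, and
find_corrupted is a hypothetical name:

```shell
# Print the files whose 4 bytes at offset 0x1E4 or 0x3E4 are 00 00 00 e0.
# ASSUMPTION: the value 0x000000E0 is stored in that byte order; swap the
# comparison string to "e0000000" to test the opposite layout.
find_corrupted() {
    for f in "$@"; do
        for off in 0x1E4 0x3E4; do
            # od prints nothing (error silenced) if the file is too short.
            bytes=$(od -An -tx1 -j "$off" -N 4 "$f" 2>/dev/null | tr -d ' \n')
            if [ "$bytes" = "000000e0" ]; then
                echo "$f: offset $off"
            fi
        done
    done
}

# Hypothetical usage, with /mnt/usb as a placeholder mount point:
#   find /mnt/usb -type f -exec sh -c '...' \;   # or: find_corrupted /mnt/usb/*
```

A hit is only a hint: a binary file may legitimately contain those bytes
at that offset, so matches would still need comparing against a known-good
copy.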

Thanks,
- Arnaud

>                        Linus
>