Re: Oops in 2.6.10-rc1 (almost solved)

From: Linus Torvalds
Date: Tue Nov 09 2004 - 20:04:28 EST

On Wed, 10 Nov 2004, Christian Kujau wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Matt Domsch schrieb:
> >
> > -BIOS EDD facility v0.16 2004-Jun-25, 16 devices found
> > +BIOS EDD facility v0.16 2004-Jun-25, 6 devices found
> >
> > So with the latest EDD patch noted above, it's finding more disks than
> > before. How many disks do you actually have in the system?
>
> i have one scsi disk (sda) and two atapi cdrom drives:

Interestingly, "16" is also EDD_MBR_SIG_MAX, so my suspicion is that it
overflowed some EDD data area. edd_num_devices() (which is what reports
the above number) does

	min_t(unsigned char,
	      max_t(unsigned char, edd.edd_info_nr, edd.mbr_signature_nr),
	      max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));

where EDDMAXNR is 6, and EDD_MBR_SIG_MAX is the aforementioned 16, so we
know that either edd.edd_info_nr or edd.mbr_signature_nr is actually 16
or _bigger_.
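
To make the clamping argument concrete, here is a rough standalone sketch
of that expression. The min_t()/max_t() macros below are simplified
stand-ins for the kernel's versions, and clamp_edd_count() and
check_clamp() are hypothetical names used only for illustration; the
constants 6 and 16 are the values quoted above.

```c
#include <assert.h>

/* Values quoted in the message above. */
#define EDDMAXNR        6
#define EDD_MBR_SIG_MAX 16

/* Simplified stand-ins for the kernel's min_t()/max_t(); they evaluate
 * their arguments more than once, which is fine for plain variables. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))
#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical helper mirroring the expression in edd_num_devices(). */
static unsigned char clamp_edd_count(unsigned char edd_info_nr,
				     unsigned char mbr_signature_nr)
{
	return min_t(unsigned char,
		     max_t(unsigned char, edd_info_nr, mbr_signature_nr),
		     max_t(unsigned char, EDD_MBR_SIG_MAX, EDDMAXNR));
}

/* Self-check: the result is 16 only when an input is 16 or more. */
static void check_clamp(void)
{
	assert(clamp_edd_count(6, 3) == 6);    /* below the cap: passed through */
	assert(clamp_edd_count(1, 15) == 15);  /* still below the cap */
	assert(clamp_edd_count(6, 16) == 16);  /* exactly at the cap */
	assert(clamp_edd_count(200, 0) == 16); /* bogus count: clamped to 16 */
}
```

Calling check_clamp() from a test harness exercises the four cases. The
point is only that a reported "16" cannot come from two small, sane
counts: one of the inputs has to have reached the cap.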

Which is clearly totally bogus. In fact, even your old "6 devices found"
thing looks suspiciously bogus.

> PS: do you have *any* idea how this could be related to the snd-es1371
> driver (which is producing the oops then)?

I bet it's overwriting some array, and just corrupting memory after it.
For example, the edd_info[] array only has 6 entries, and the
EDD_MBR_SIG_BUFFER is quite close to where we save the E820MAP memory map
at bootup, so if something stomps on that, the kernel might be confused
about where PCI memory can be allocated or similar. Or it might have
overwritten some ACPI memory data, who knows.

Linus