Re: Yet another hot unplug NULL pointer dereference (was Re: status of oops in sd_revalidate_disk?)
From: Stefan Richter
Date: Tue Dec 27 2011 - 08:40:52 EST
On Dec 27 Axel Theilmann wrote:
> On 12/25/2011 09:58 PM, Stefan Richter wrote:
>>> On 15 Dec 2011 at 1:59, Axel Theilmann <theilmann@xxxxxxxxxxxx> wrote:
>>>> two weeks ago Huajun Li posted a patch for a kernel oops, subject
>>>> "[PATCH] SCSI/sd: Fix NULL dereference in sd_revalidate_disk".
>>>> The patch was discussed but considered "clearly wrong". The bug shows
>>>> up for us in kernel 3.1.4 quite often when unplugging usb sticks and
>>>> it seems a few other people have the same problem:
>>>> Can anyone of you maybe give me any status update on that bug?
> > as far as I remember, all Linux releases in 2011 have been broken WRT hot
> > removal of block devices; some more severely, some less. Various patches
> > for this went in over the year, but if they fixed anything, they always
> > uncovered the next lingering unplug related bug. The presumed first Linux
> so now there are 2 known NULL-pointer problems in the cd-rom code and one in
> the scsi-disk code.
The two CD-ROM related traces which I posted seem to indicate a bug
between block layer's and SCSI core lifetime managements, rather than in
the cd-rom code particularly. When I get the time, I will try the
"1. open(), 2. remove device, 3. ioctl()" sequence on an sd_mod device
instead of an sr_mod one and see where this goes.
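The three-step sequence can be sketched as a small user-space helper. This is only an illustration, not the test program used for the traces: the name open_remove_ioctl, the wait_for_removal hook, and the choice of ioctl request are all assumptions. The request is left as a parameter because the thread does not say which ioctl reaches the affected code path.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread).
 * Returns -1 if open() fails, 0 if the ioctl succeeds,
 * otherwise the positive errno reported by the ioctl.
 */
int open_remove_ioctl(const char *path, void (*wait_for_removal)(void),
                      unsigned long request, void *arg)
{
    int fd, err = 0;

    /* 1. open(): O_NONBLOCK so that absent media does not block the open */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* 2. remove device: caller-supplied hook, e.g. prompt the user to unplug */
    if (wait_for_removal)
        wait_for_removal();

    /* 3. ioctl(): should fail with an error code rather than oops the kernel */
    if (ioctl(fd, request, arg) < 0)
        err = errno;

    close(fd);
    return err;
}
```

One possible use would be to pass a device node such as /dev/sdX, a hook that waits for a key press after the stick has been pulled, and a request such as BLKGETSIZE64 from <linux/fs.h> with a pointer to a 64-bit integer; whether that particular request exercises the buggy path is untested here.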
> Would a complete fix for this issue be a question of locating all the
> possible NULL-pointers and fixing them or do you think that the hotplug
> problem has to be fixed on a more "fundamental" level?
I don't know what my two traces tell us about what in particular is broken
and where to attack the problem.
In case of the sd_revalidate_disk oops from the thread which I hijacked
(which refers to "[PATCH] SCSI/sd: Fix NULL dereference in
sd_revalidate_disk", http://thread.gmane.org/gmane.linux.scsi/71174), the
trouble is that nobody came up with an answer to James' question on how it
could happen in the first place that sd_revalidate_disk(disk) could be
called on a disk that leads to a NULL scsi_disk. In turn, this presumably
means, among other things, that my earlier question --- what prevents
the scsi_disk from going invalid slightly after that newly added NULL pointer
check --- cannot be answered yet either.
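To illustrate why a bare NULL check does not settle the lifetime question, here is a user-space analogy, assuming a lock-protected reference count. It is not the sd driver's code and not a proposed fix; struct disk, struct disk_priv, and all function names are hypothetical stand-ins. The point is only that the check and the pinning of the object have to happen under the same lock, otherwise the object can go away right after the check.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for scsi_disk and gendisk */
struct disk_priv { int refs; int capacity; };
struct disk { struct disk_priv *priv; };

static pthread_mutex_t ref_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate private data holding one initial reference (owned by the disk) */
struct disk_priv *priv_alloc(int capacity)
{
    struct disk_priv *p = malloc(sizeof(*p));
    if (p) {
        p->refs = 1;
        p->capacity = capacity;
    }
    return p;
}

/* NULL check and reference grab happen atomically under ref_lock */
struct disk_priv *priv_get(struct disk *d)
{
    struct disk_priv *p;

    pthread_mutex_lock(&ref_lock);
    p = d->priv;
    if (p)
        p->refs++;
    pthread_mutex_unlock(&ref_lock);
    return p;
}

/* Drop one reference; free the object when the last one is gone */
void priv_put(struct disk_priv *p)
{
    int last;

    pthread_mutex_lock(&ref_lock);
    last = (--p->refs == 0);
    pthread_mutex_unlock(&ref_lock);
    if (last)
        free(p);
}

/* Removal path: unpublish the pointer, then drop the initial reference */
void disk_remove(struct disk *d)
{
    struct disk_priv *p;

    pthread_mutex_lock(&ref_lock);
    p = d->priv;
    d->priv = NULL;
    pthread_mutex_unlock(&ref_lock);
    if (p)
        priv_put(p);
}

/* Caller that survives concurrent removal: returns -1 if the disk is gone */
int revalidate(struct disk *d)
{
    struct disk_priv *p = priv_get(d);
    int cap;

    if (!p)
        return -1;
    cap = p->capacity;  /* safe: our reference keeps p alive */
    priv_put(p);
    return cap;
}
```

A plain `if (!d->priv) return;` in place of priv_get() would have the weakness asked about above: nothing stops disk_remove() from freeing the object between the check and the later dereference.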
However, I do think that the pitiful state of block device unplugging
for roughly a whole year indicates a fundamental problem indeed. But
I am only familiar with one of the SCSI transport layer drivers, not with
the kernel layers above, so what do I know.
> Even if there is a more fundamental problem below that has to be fixed, it
> would still be nice to get in fixes for the dereferences that are currently
> known to keep people's systems from crashing.
> We built a kernel with Huajun's patch included and will do some tests to see
> if the problem goes away (and no others show up).
AFAIU it is not clear whether this patch actually prevents dereference of
an invalid sdkp or only makes it considerably more unlikely.
In either case, since there is apparently an underlying issue that this
patch does not address, it is a judgment call whether such a patch is
allowed into a kernel --- distributor kernel or mainline kernel. If
somebody takes it, then at least a FIXME comment should be put there that
sd_revalidate_disk is supposed to rely on an always valid sdkp.
> > With a little bit of bad luck, udisks-daemon or in older distros hald
> > should hit the bug too. Under kernel 3.1 I typically just got processes
> > hanging in unkillable sleep. With kernel 3.2-rc7 I get an instant kernel
> > panic.
> Yes, udisks is what probably triggers the bug for us. People removing USB
> media before udisks is finished initializing the medium. With kernel 3.1.4
> we get instant kernel panics as well.
> tty, axel
Sounds like both "your" and "my" bug occur at the end of the sequence
1. open(), 2. remove device, 3. ioctl() or whatever
though perhaps with the extra twist in your case that this has to happen
before the device bring-up was entirely finished...? In my CD-ROM related
case the bug is not timing-sensitive at all; it always happens with above