On Wed, 13 Jul 2016, Mark Hounschell wrote:
Alright, so you are basically supplementing O_NDELAY flag in order to
avoid check_disk_change() being called. It's rather a coincidence that
it has worked this way, but I agree with you that we can't ignore the
fact that there is userspace relying on this behavior.
I'm not supplementing anything. The driver _did_ this on its own.
I mean, you're passing O_NDELAY to open(/dev/fd0) exactly to avoid kernel
issuing check_disk_change(). That's the only semantics O_NDELAY has for
fd.
I just expect to be able to open the drive to get a handle without the
kernel attempting to access the media. My apps manage a disk_change on
their own. I don't think its check_disk_change that gives me my pain.
There is some probe happening that fails when a floppy is installed that
is not a "standard" format. That causes the open to fail which is the
most pain. Still I should be able to get a handle without any media or
even unrecognized media installed.
Yeah, that's check_disk_change().
The original behavior of the floppy driver was correct. I have no
idea what BUG these changes were supposed to fix but the "fix"
obviously broke user land. Was this bug reported by some new ROBOT
test or something? The kernel floppy driver has been stable for
years now
That's not really true; the code is a racy mess, and this is being
uncovered only when virtualized floppy devices started to exist
(because they are much faster than a real hardware, and the different
timing reveals bugs that were not visible before).
Forgive me here as I'm ignorant about why any virtualized floppy would
require the real physical kernel floppy driver to be involved at all.
Because VMs (such as qemu) actually do emulate a FDC on a hardware level,
but don't emulate the timings of the real hardware (which are not mandated
by the spec, but "are just there").
This particular fix was because syzkaller found a way how easily corrupt
kernel memory using O_NDELAY to floppy driver; see
https://lkml.org/lkml/2016/2/2/848
so I am really confused as to why these changes were induced.
The floppy driver is in an orphan mode; no new "features" are added "just
because". Everything that's happening there is to fix real bugs in the
kernel.
I'll look into ways how to fix this, but I am afraid this is going to be
really tricky. Therefore we'd have to very likely proceed asap with revert
of 09954bad448 and coming up with a workaround that'd still avoid the bug
reported by syzkaller.
I would be happy to do some testing for you if needed. At least with regard to
our apps.
Could you please check whether my last patch that Jens queued in
linux-block.git ("floppy: fix open(O_ACCMODE) for ioctl-only open" in
for-linus branch) remedies at least some of the issues you are seeing?