Re: [PATCH] udevadm-info: Don't access sysfs 'resource<N>' files
From: Myron Stowe
Date: Tue Mar 19 2013 - 12:58:08 EST
On Mon, 2013-03-18 at 12:59 -0600, Alex Williamson wrote:
> On Mon, 2013-03-18 at 19:25 +0100, Bjørn Mork wrote:
> > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> > >On Mon, 2013-03-18 at 18:20 +0100, Bjørn Mork wrote:
> > >> Alex Williamson <alex.williamson@xxxxxxxxxx> writes:
> > >>
> > >> > At least for KVM the kernel fix is the addition of the vfio driver
> > >> > which gives us a non-sysfs way to do this. If this problem was found
> > >> > a few years later and we were ready to make the switch I'd support
> > >> > just removing these resource files. In the meantime we have
> > >> > userspace that depends on this interface, so I'm open to suggestions
> > >> > how to fix it.
> > >>
> > >> I am puzzled by a couple of things in this discussion:
> > >>
> > >> 1) do you seriously mean that a userspace application (any, not just
> > >> udevadm or qemu or whatever) should be able to read and write these
> > >> registers while the device is owned by a driver? How is that ever
> > >> going to work?
> > >
> > >The expectation is that the user doesn't mess with the device through
> > >pci-sysfs while it's running. This is really no different than config
> > >space or MMIO space in that respect.
> > But it is. That's the problem. As a user I expect to be able to run
> > e.g "grep . /sys/devices/whatever/*" with no ill effects. This holds
> > for config space or MMIO space. It does not for any reset-on-read
> > register.
> As a non-admin user you can
> > > You can use setpci to break your
> > >PCI card while it's used by the driver today. The difference is that
> > >MMIO spaces side-step the issue by only allowing mmap and config space
> > >is known not to have read side-effects.
> > Yes. And that is why there is no problem exporting those. This
> > difference is fundamental.
> So how do we side-step the problem with I/O port registers? If we
> remove them then KVM needs to run with iopl which is a pretty serious
> security hole should QEMU be exploited. We could activate the resource
> files only when the device is bound to pci-assign, but that only limits
> the scope and might break UIO drivers. We could modify the file to have
> an enable sequence, but we can't do this without breaking current
> userspace. As I mentioned, the VFIO driver is intended to replace KVM's
> use of these files, but we're not ready to rip it out, perhaps not even
> ready to declare it deprecated.
> > >> 2) is it really so that a device can be so fundamentally screwed up by
> > >> reading some registers, that a later driver probe cannot properly
> > >> reinitialize it?
> > >
> > >Never underestimate how broken hardware can be,
> > True :)
> > > though in this case
> > >reading a device register seems to be causing a system hang/reset.
> > I understand that it does so if the ahci driver is bound to the device
> > while reading the registers, but does it also hang the system with no
> > bound driver? How does it do that? By killing the bus?
> I don't know, Myron?
Yes - the system hangs when BAR1's (and likely BAR3's) I/O port space is
Here are the details that I've been able to put together from the two
linux-pci threads and various online sources -
From Robert Hancock - "... BAR5 is the MMIO region used by the AHCI
driver. BARs 0-4 are the legacy SFF-compatible ATA ports. Nothing
should be messing with those IO ports while AHCI is enabled. ..." This
likely explains why the system boots and runs fine as long as the
'udevadm ...' command is *not* run (i.e. the driver never accesses the
I/O port BARs).
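To illustrate what the patch subject means by not accessing the 'resource<N>' files, here is a minimal sketch of a name filter that a sysfs attribute walker could apply before opening anything. The function names are hypothetical and this is not the actual udevadm code; it assumes the per-BAR files are named "resource0", "resource1", and so on, optionally with a "_wc" suffix.

```python
import re

# Assumption: per-BAR sysfs files are named "resource<N>" or "resource<N>_wc".
# The plain "resource" file is a text table of BAR ranges and is left alone.
_RESOURCE_FILE = re.compile(r"^resource\d+(_wc)?$")


def is_bar_resource_file(name: str) -> bool:
    """Return True if the sysfs file name refers to a raw BAR resource file."""
    return _RESOURCE_FILE.match(name) is not None


def safe_attribute_names(names):
    """Filter a directory listing down to names that are not BAR resource files."""
    return [n for n in names if not is_bar_resource_file(n)]
```

A walker would call safe_attribute_names() on the directory listing and only then read the remaining files.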
Using a SATA controller I have access to as an example for the details
(Note: I do not have access to a system with the Marvell 9125 device):
00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port SATA AHCI Controller (rev 06) (prog-if 01 [AHCI 1.0])
Subsystem: Lenovo Device 2168
Region 0: I/O ports at 1860 [size=8]
Region 1: I/O ports at 1814 [size=4]
Region 2: I/O ports at 1818 [size=8]
Region 3: I/O ports at 1810 [size=4]
Region 4: I/O ports at 1840 [size=32]
Region 5: Memory at f2827000 (32-bit, non-prefetchable) [size=2K]
I/O port registers :
Primary IDE controller [0x1860-0x1867; 0x1814-0x1817]
BAR0 Base address for the command block registers for ATA Channel X
0x1860 (Read/Write): Data Register
0x1861 (Read): Error Register
0x1861 (Write): Features Register
0x1862 (Read/Write): Sector Count Register
0x1863 (Read/Write): LBA Low Register
0x1864 (Read/Write): LBA Mid Register
0x1865 (Read/Write): LBA High Register
0x1866 (Read/Write): Drive/Head Register
0x1867 (Read): Status Register
0x1867 (Write): Command Register
BAR1* Base address for the control register for ATA Channel X
0x1816 (Read): Alternate Status Register
0x1816 (Write): Device Control Register
* The base must be Dword aligned; a PCI requirement. The Device Control
and Alternate Status Registers are at offset 0x2 from this base.
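The register layout above can be written down as a small lookup, which makes it easy to see which ports in BAR1 are reserved. This is only a sketch using the port numbers from my example Intel controller; the names BAR0_BASE, BAR1_BASE and decode_port are made up for illustration.

```python
# Base addresses taken from the example lspci output (Regions 0 and 1).
BAR0_BASE = 0x1860  # command block, 8 bytes
BAR1_BASE = 0x1814  # control block, 4 bytes

# offset -> (register seen on read, register seen on write)
COMMAND_BLOCK = {
    0: ("Data", "Data"),
    1: ("Error", "Features"),
    2: ("Sector Count", "Sector Count"),
    3: ("LBA Low", "LBA Low"),
    4: ("LBA Mid", "LBA Mid"),
    5: ("LBA High", "LBA High"),
    6: ("Drive/Head", "Drive/Head"),
    7: ("Status", "Command"),
}
# Only offset 2 of the control block is implemented; the rest is reserved.
CONTROL_BLOCK = {2: ("Alternate Status", "Device Control")}


def decode_port(port: int, is_write: bool) -> str:
    """Name the register behind an I/O port, or report reserved/unmapped."""
    if BAR0_BASE <= port < BAR0_BASE + 8:
        names = COMMAND_BLOCK[port - BAR0_BASE]
    elif BAR1_BASE <= port < BAR1_BASE + 4:
        names = CONTROL_BLOCK.get(port - BAR1_BASE)
        if names is None:
            return "reserved"
    else:
        return "unmapped"
    return names[1 if is_write else 0] + " Register"
```

For example, decode_port(0x1816, False) names the Alternate Status Register, while the other three ports of BAR1 come back as reserved.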
From Xiangliang - executing 'udevadm ...' causes a 32-bit I/O port read
of BAR1's region. This is shown by the BE (Byte Enable) value of
0x1111. So apparently any read of this region that includes one of the
reserved bytes means "the chip will go bad."
So, only a Byte access at offset 2 is successful. I have not been able
to get any more details as to the exact cause of the hang. I would have
thought that the PCI transaction would have just timed out, or errored
out, or something but apparently the platform ends up hanging.
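To make the byte-enable point concrete, here is a rough sketch that computes which byte lanes of the Dword an access covers and whether it strays outside the one implemented byte. The mask is a logical, active-high one chosen for readability (on the bus the C/BE# lines are active low), and the helper names are hypothetical.

```python
# Logical mask of implemented byte lanes in BAR1: only offset 2 is a register.
IMPLEMENTED_LANES = 0b0100


def byte_enables(offset: int, width: int) -> int:
    """Return a 4-bit mask with bit i set if byte lane i is accessed."""
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2 or 4 bytes")
    lane = offset % 4
    if lane + width > 4:
        raise ValueError("access crosses a Dword boundary")
    return ((1 << width) - 1) << lane


def touches_reserved(offset: int, width: int) -> bool:
    """True if the access enables any lane outside the implemented byte."""
    return bool(byte_enables(offset, width) & ~IMPLEMENTED_LANES & 0xF)
```

A 4-byte read at offset 0 enables all four lanes and so touches reserved bytes; a 1-byte read at offset 2 does not.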
It appears that this device did not implement the reserved registers
such that they would return 0 on reads or something similarly sane.
Since BARs 2 and 3 are not 0 (a value of 0 would indicate the device
only supports one channel), I expect the same issue will occur when
accessing BAR3. Again, I do not have access to a system with this device
to test with.
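As a quick way to see which BARs share BAR1's shape, the Region lines from the lspci output above can be parsed and filtered for 4-byte I/O port regions. This is just a sketch over the text of my example controller; io_port_bars is a hypothetical helper and the regular expression only handles the line formats shown here.

```python
import re

# Region lines copied from the example lspci output.
LSPCI_REGIONS = """\
Region 0: I/O ports at 1860 [size=8]
Region 1: I/O ports at 1814 [size=4]
Region 2: I/O ports at 1818 [size=8]
Region 3: I/O ports at 1810 [size=4]
Region 4: I/O ports at 1840 [size=32]
Region 5: Memory at f2827000 (32-bit, non-prefetchable) [size=2K]
"""

_REGION = re.compile(
    r"Region (\d+): (I/O ports|Memory) at ([0-9a-f]+) .*\[size=(\d+)(K?)\]"
)


def io_port_bars(text: str) -> dict:
    """Map BAR number -> (base port, size in bytes) for I/O port regions only."""
    bars = {}
    for m in _REGION.finditer(text):
        if m.group(2) == "I/O ports":
            size = int(m.group(4)) * (1024 if m.group(5) else 1)
            bars[int(m.group(1))] = (int(m.group(3), 16), size)
    return bars


# The 4-byte I/O regions are the control blocks (BAR1 and BAR3 here).
control_block_bars = [bar for bar, (_, size) in io_port_bars(LSPCI_REGIONS).items() if size == 4]
```

On this listing the 4-byte regions are BARs 1 and 3, which is why I expect BAR3 to behave the same way as BAR1.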
> > >> I would have thought that the solution to all this was to return
> > >> -EINVAL on any attempt to read or write these files while a driver is
> > >> bound to the device. If userspace is going to use the API, then the
> > >> application better unbind any driver first.
> > >>
> > >> Or? Am I missing something here?
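For illustration only, the "is a driver bound?" condition in the proposal quoted above can be checked from userspace by looking at the device's 'driver' symlink in sysfs. This is a sketch of the userspace side, not the kernel-side -EINVAL check being suggested; bound_driver is a hypothetical helper.

```python
import os


def bound_driver(device_dir: str):
    """Return the name of the driver bound to a sysfs device dir, or None.

    A bound device has a 'driver' symlink pointing at its driver's directory;
    an unbound device has no such link.
    """
    link = os.path.join(device_dir, "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        # Missing link (or not a symlink): treat the device as unbound.
        return None
```

A tool could refuse to touch the resource files whenever bound_driver() returns something other than None.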
> > >
> > >That doesn't really solve anything though. Let's pretend the resource
> > >files only work while the device is bound to pci-stub. Now what happens
> > >when you run this udevadm command as admin while it's in use by the
> > >userspace driver? All we've done is limit the scope of the problem.
> > Assuming that the system hangs without driver help and that this
> > brokenness is widespread. I don't think any of those assumptions hold.
> > Do they?
> I thought it was true that for this device a system hang happened
> regardless of the host driver, but haven't seen the original bug report.
> As for widespread, this is the first I've heard of problems in the 2.5+
> years that we've supported these I/O port resource files. The rest is
> probably just FUD about random userspace apps trolling through device
> > >> > If we want to blacklist this specific device, that's fine, but as
> > >> > others have pointed out it's really a class problem. Perhaps we
> > >> > report 1 byte extra for the file length where EOF-1 is an enable
> > >> > byte? Is there anything else in file ops that we could use to make
> > >> > it slightly more complicated than open(), read() to access the
> > >> > device? Thanks,
> > >>
> > >> If there really are devices which cannot handle reading at all, and
> > >> cannot be reset to a sane state by later driver initialization, then a
> > >> blacklist could be added for those devices. This should not be a
> > >> common problem.
> > >
> > >Yes, if these are dead registers, let's blacklist and move along. I
> > >suspect though that these registers probably work fine if you access
> > >them according to the device programming model, so blacklisting just
> > >prevents full use through something like KVM device assignment.
> > Well, if the device is that broken then I think it will require the
> > kernel to police the device programming. I don't see how you can leave
> > a bomb like that because it might be useful in a rare and very
> > theoretical case.
> > Easier to just blacklist it...
> Easier, yes. But it likely just kicks the problem down the road until
> the next device. Thanks,