Re: RFC: Device isolation infrastructure

From: Alex Williamson
Date: Wed Dec 07 2011 - 01:47:11 EST


On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
>
> Alex Williamson recently posted some patches adding a new hook to
> iommu_ops to identify groups of devices which cannot reliably be
> distinguished by the iommu, and therefore can only safely be assigned
> as a unit to a guest or userspace driver.
>
> I'm not very fond of that design, because:
> - It requires something else to scan the group ids to present the
> groups in a userspace discoverable manner, rather than handling that
> in the core.

The thought was that the vast majority of users won't make use of
groups, so why bloat the core when the only user is vfio. We need bus
specific drivers to expose devices in vfio, so it's fairly natural to
have the bus drivers enumerate devices on their bus, register them with
the vfio-core, which then exposes groups in a convenient manner for
userspace.

> - It encorages manipulating devices individually, but only
> permitting certain actions if the whole group is together. This seems
> confusing and arse-backwards to me.

Hmm, this patch introduces a group level detach and reattach of devices,
but seems to lack any support for actually making use of devices in the
detached state, so it's rather hard to compare. How does someone
wanting to interact with a device within the group do so? This appears
to put devices entirely outside of the driver model when detached, which
seems just as awkward.

> - It does not provide an obvious means to prevent a kernel
> driver unsafely binding to a new device which is hotplugged into an
> iommu group already assigned to a guest.

I've added this for PCI, triggered via the device add notifier.

> - It is inherently centred arounf the iommu structure, however
> there might be other platform dependent factors which also affect
> which devices can be safely isolated from others.

Yep, MSI isolation being an example. I tend to think the iommu driver
is pretty closely tied to the platform (all current examples are), so
the iommu driver should take those factors into account when advertising
groups.

>
> These could probably each be addressed with various tweaks, but
> basically, I'm just not convinced that it represents the right
> structure for handling this. So, Alexey Kardashevskiy and myself havr
> made this patches, representing the outline of how I think this should
> be handled.
>
>
> Notes:
> * All names are negotiable. At the moment I'm thinking of this as
> the "device isolation subsystem" which handles "device isolation
> groups" (not to be confused with an iommu domain, which could contain
> one or more groups). A group is considered "detached" when it has
> been removed from the normal driver binding process, and therefore
> ready for assignment to a guest or userspace. I'm more than open to
> suggestions of better names.
> * This is just the generic infrastructure. Obviously, platform
> specific code will be needed as well to create the groups and assign
> devices to them. We have a proof of concept implementation for the
> power p5ioc2 pci host bridge, and I hope to post code for that and the
> p7ioc bridge shortly. I'll also be looking at how this would be used
> by Intel VTd, and AMD iommus.
> * isolation groups appear in /sys/isolation. Isolatable devices
> have a link to their group. The group contains links to all their
> devices, and a "detached" attribute, which can be used to view and
> control whether the group is detached.
> * At present groups are detached manually via sysfs, but the idea
> is a kernel system like vfio, could also directly request a group to
> be detached for its use.
> * when detached, all kernel drivers are unbound from every device
> in the group. No driver will match any device in the group until it
> is reattached.
>
> * There are probably stupid bugs, in particular I know some error
> paths are broken. By all means point these out to expedite fixing
> them.
> * The isolation_group structure is more or less a stub at this
> point. It will probably want a pointer to an iommu_ops and/or its own
> ops structure. The 'detached' bool may need to become some sort of
> state variable. We'll probably want some sort of "owner" field for
> what (e.g. vfio) is using the group when its detached.

This makes me nervous. One of the flaws I'm trying to correct with vfio
vs existing kvm PCI device assignment is the driver ownership model.
KVM will currently try to make use of any device it can and is only
stopped by drivers having already claimed various resources. VFIO has a
strong device-driver-group ownership model.

Isolation groups facilitate ejecting the normal device drivers (which
actually seems somewhat dangerous) but isn't yet introducing any clear
ownership model to replace *a* device being bound to *a* driver.
Devices seem to be in limbo once detached. We're not even outlining
here how we're going to attach group-drivers to groups or facilitate
matching individual devices to bus specific drivers, which seems like
it'll get us back into the individual device manipulation you complain
about above.

> * I've also considered adding an "isolation strength" bitmask to
> indicate how well isolated the group is - e.g DMA isolated (iommu),
> irq isolated (irq remapping iommu, or other mechanism), error isolated
> (bus errors from this device can't affect other devices), ..

This sounds like a maintenance burden, which is why I've been trying to
push that iommu drivers should not expose groups that aren't fully
isolated. It's easy to add parameters per iommu driver where the user
can opt-in to certain deficiencies.

> Anyway, on with the code..
>
> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
> Signed-off-by: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
>
> ---
> drivers/Kconfig | 2 +
> drivers/Makefile | 2 +
> drivers/base/base.h | 5 +
> drivers/base/core.c | 7 +
> drivers/base/init.c | 2 +
> drivers/isolation/Kconfig | 3 +
> drivers/isolation/Makefile | 2 +
> drivers/isolation/device_isolation.c | 216 ++++++++++++++++++++++++++++++++++
> include/linux/device.h | 5 +
> include/linux/device_isolation.h | 79 ++++++++++++
> 10 files changed, 323 insertions(+), 0 deletions(-)
> create mode 100644 drivers/isolation/Kconfig
> create mode 100644 drivers/isolation/Makefile
> create mode 100644 drivers/isolation/device_isolation.c
> create mode 100644 include/linux/device_isolation.h
>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 95b9e7e..2ce5717 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
>
> source "drivers/virt/Kconfig"
>
> +source "drivers/isolation/Kconfig"
> +
> endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 7fa433a..741874a 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/
>
> # Virtualization drivers
> obj-$(CONFIG_VIRT_DRIVERS) += virt/
> +
> +obj-$(CONFIG_DEVICE_ISOLATION) += isolation/
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index a34dca0..66a4ed2 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -24,6 +24,9 @@
> * bus_type/class to be statically allocated safely. Nothing outside of the
> * driver core should ever touch these fields.
> */
> +
> +#include <linux/device_isolation.h>
> +
> struct subsys_private {
> struct kset subsys;
> struct kset *devices_kset;
> @@ -108,6 +111,8 @@ extern int driver_probe_device(struct device_driver *drv, struct device *dev);
> static inline int driver_match_device(struct device_driver *drv,
> struct device *dev)
> {
> + if (isolation_device_is_detached(dev))
> + return 0;
> return drv->bus->match ? drv->bus->match(dev, drv) : 1;
> }
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index bc8729d..45201c5 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -22,6 +22,7 @@
> #include <linux/kallsyms.h>
> #include <linux/mutex.h>
> #include <linux/async.h>
> +#include <linux/device_isolation.h>
>
> #include "base.h"
> #include "power/power.h"
> @@ -593,6 +594,10 @@ void device_initialize(struct device *dev)
> lockdep_set_novalidate_class(&dev->mutex);
> spin_lock_init(&dev->devres_lock);
> INIT_LIST_HEAD(&dev->devres_head);
> +#ifdef CONFIG_DEVICE_ISOLATION
> + INIT_LIST_HEAD(&dev->isolation_list);
> + dev->isolation_group = NULL;
> +#endif
> device_pm_init(dev);
> set_dev_node(dev, -1);
> }
> @@ -993,6 +998,8 @@ int device_add(struct device *dev)
> class_intf->add_dev(dev, class_intf);
> mutex_unlock(&dev->class->p->class_mutex);
> }

At this point we've already attached a native driver to a potentially
in-use isolation group, so we haven't really addressed the hotplug
issue. If we're creating groups and adding devices to them via the
device add notifier, that looks pretty similar to how vfio does it, and
there we only need the trivial iommu_device_group hook. I'll give you
that the "detached" hook in the match function makes it easy to take a
device out of the driver model, but then what do you replace the driver
model with?

> +
> + isolation_group_update_links(dev);
> done:
> put_device(dev);
> return error;

We'd need something in the device_del path to block removal if the
device is in use via the group. The equivalent of the .remove for the
driver it can no longer attach to.

> diff --git a/drivers/base/init.c b/drivers/base/init.c
> index c8a934e..89f3e94 100644
> --- a/drivers/base/init.c
> +++ b/drivers/base/init.c
> @@ -8,6 +8,7 @@
> #include <linux/device.h>
> #include <linux/init.h>
> #include <linux/memory.h>
> +#include <linux/device_isolation.h>
>
> #include "base.h"
>
> @@ -24,6 +25,7 @@ void __init driver_init(void)
> devices_init();
> buses_init();
> classes_init();
> + isolation_init();
> firmware_init();
> hypervisor_init();
>
> diff --git a/drivers/isolation/Kconfig b/drivers/isolation/Kconfig
> new file mode 100644
> index 0000000..a2a7703
> --- /dev/null
> +++ b/drivers/isolation/Kconfig
> @@ -0,0 +1,3 @@
> +config DEVICE_ISOLATION
> + bool "Enable isolating devices for safe pass-through to guests or user space."
> +
> diff --git a/drivers/isolation/Makefile b/drivers/isolation/Makefile
> new file mode 100644
> index 0000000..8f1c6ec
> --- /dev/null
> +++ b/drivers/isolation/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_DEVICE_ISOLATION) := device_isolation.o
> +
> diff --git a/drivers/isolation/device_isolation.c b/drivers/isolation/device_isolation.c
> new file mode 100644
> index 0000000..2b00ff7
> --- /dev/null
> +++ b/drivers/isolation/device_isolation.c
> @@ -0,0 +1,216 @@
> +#include <linux/slab.h>
> +#include <linux/device_isolation.h>
> +
> +#define KERN_DEBUG KERN_WARNING
> +
> +
> +static struct kset *isolation_kset;
> +
> +#define to_isolation_attr(_attr) container_of(_attr, \
> + struct isolation_attribute, attr)
> +
> +static ssize_t isolation_attr_show(struct kobject *kobj,
> + struct attribute *attr, char *buf)
> +{
> + struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> + struct isolation_group *group = container_of(kobj,
> + struct isolation_group, kobj);
> + ssize_t ret = -EIO;
> +
> + if (isolation_attr->show)
> + ret = isolation_attr->show(group, buf);
> + return ret;
> +}
> +
> +static ssize_t isolation_attr_store(struct kobject *kobj,
> + struct attribute *attr, const char *buf, size_t count)
> +{
> + struct isolation_attribute *isolation_attr = to_isolation_attr(attr);
> + struct isolation_group *group = container_of(kobj,
> + struct isolation_group, kobj);
> + ssize_t ret = -EIO;
> +
> + if (isolation_attr->store)
> + ret = isolation_attr->store(group, buf, count);
> + return ret;
> +}
> +
> +static void isolation_release(struct kobject *kobj)
> +{
> + struct isolation_group *group = container_of(kobj,
> + struct isolation_group, kobj);
> +
> + pr_debug("isolation : release.\n");
> +
> + kfree(group);
> +}
> +
> +static const struct sysfs_ops isolation_sysfs_ops = {
> + .show = isolation_attr_show,
> + .store = isolation_attr_store,
> +};
> +
> +static struct kobj_type isolation_ktype = {
> + .sysfs_ops = &isolation_sysfs_ops,
> + .release = isolation_release,
> +};
> +
> +static ssize_t group_showdetached(struct isolation_group *group,
> + char *buf)
> +{
> + return sprintf(buf, "%u", !!group->detached);
> +}
> +
> +static ssize_t group_setdetached(struct isolation_group *group,
> + const char *buf, size_t count)
> +{
> + switch (buf[0]) {
> + case '0':
> + isolation_group_reattach(group);
> + break;
> + case '1':
> + isolation_group_detach(group);
> + break;
> + default:
> + count = -EINVAL;
> + }
> + return count;
> +}
> +
> +static DEVICE_ISOLATION_ATTR(detached, S_IWUSR | S_IRUSR,
> + group_showdetached, group_setdetached);
> +
> +struct isolation_group *isolation_group_new(const char *name)
> +{
> + int ret;
> + struct isolation_group *group;
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group)
> + return NULL;
> +
> + spin_lock_init(&group->lock);
> +
> + group->kobj.kset = isolation_kset;
> + ret = kobject_init_and_add(&group->kobj, &isolation_ktype, NULL, name);
> + if (ret < 0) {
> + printk(KERN_ERR "device_isolation: kobject_init_and_add failed "
> + "for %s\n", group->kobj.name);
> + return NULL;
> + }
> +
> + ret = sysfs_create_file(&group->kobj, &isolation_attr_detached.attr);
> + if (0 > ret)
> + printk(KERN_WARNING "device_isolation: create \"detached\" failed "
> + "for %s, errno=%i\n", group->kobj.name, ret);
> +
> + INIT_LIST_HEAD(&group->devices);
> + printk(KERN_DEBUG "device_isolation: group %s created\n",
> + group->kobj.name);
> +
> + return group;
> +}
> +
> +void isolation_group_dev_add(struct isolation_group *group, struct device *dev)
> +{
> + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> + dev->kobj.name, group->kobj.name);
> + spin_lock(&group->lock);
> +
> + list_add_tail(&dev->isolation_list, &group->devices);
> + dev->isolation_group = group;
> + /* Todo: remove links we already created */
> + spin_unlock(&group->lock);
> +}
> +
> +void isolation_group_dev_remove(struct device *dev)
> +{
> + list_del(&dev->isolation_list);
> +}
> +
> +int isolation_group_update_links(struct device *dev)
> +{
> + int error;
> + char *buf;
> + struct isolation_group *group = dev->isolation_group;
> +
> + if (!group) {
> + printk(KERN_DEBUG "device_isolation: no group for %s\n",
> + dev->kobj.name);
> + return 0;
> + }
> +
> + printk(KERN_DEBUG "device_isolation: adding device %s to group %s\n",
> + dev->kobj.name, group->kobj.name);
> +
> + spin_lock(&group->lock);
> +
> + error = sysfs_create_link(&dev->kobj, &group->kobj, "isolation_group");
> + if (0 > error)
> + printk(KERN_WARNING "device_isolation: create isolation_group "
> + "link failed for %s -> %s, errno=%i\n",
> + dev->kobj.name, group->kobj.name, error);
> +
> + buf = kasprintf(GFP_KERNEL, "%s-%p", dev->kobj.name, dev);
> + if (!buf) {
> + error = -ENOMEM;
> + printk(KERN_ERR "device_isolation: kasprintf failed\n");
> + goto out;
> + }
> +
> + error = sysfs_create_link(&group->kobj, &dev->kobj, buf);
> + kfree(buf);
> + if (0 > error)
> + printk(KERN_WARNING "device_isolation: create %s "
> + "link failed for %s -> %s, errno=%i\n",
> + buf, dev->kobj.name, group->kobj.name, error);
> +
> +out:
> + /* Todo: remove links we already created */
> + spin_unlock(&group->lock);
> + return error;
> +}
> +
> +int isolation_group_detach(struct isolation_group *group)
> +{
> + int error;
> + struct device *dev;
> +
> + spin_lock(&group->lock);
> +
> + group->detached = true;
> + list_for_each_entry(dev, &group->devices, isolation_list) {
> + error = device_reprobe(dev);

How do we handle blocking .remove functions? That's one reason vfio
doesn't attempt to automatically detach drivers. We can treat the
groups as a unit here, but it's hard to get away from the fact that
potentially several individual devices are being removed from their
drivers here.

> + }
> +
> + spin_unlock(&group->lock);
> +
> + return 0;
> +}
> +
> +int isolation_group_reattach(struct isolation_group *group)
> +{
> + int error;
> + struct device *dev;
> +
> + spin_lock(&group->lock);

Obviously some ownership check would have to exist here if we knew how
to claim a group.

> + group->detached = false;
> + list_for_each_entry(dev, &group->devices, isolation_list) {
> + printk("Reprobing %s\n", dev->kobj.name);
> + error = device_reprobe(dev);
> + }
> +
> + spin_unlock(&group->lock);
> +
> + return 0;
> +}
> +
> +int __init isolation_init(void)
> +{
> + isolation_kset = kset_create_and_add("isolation", NULL, NULL);
> + if (!isolation_kset)
> + return -ENOMEM;
> + return 0;
> +}
> +
> diff --git a/include/linux/device.h b/include/linux/device.h
> index c20dfbf..6238a13 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -585,6 +585,11 @@ struct device {
>
> struct dma_coherent_mem *dma_mem; /* internal for coherent mem
> override */
> +#ifdef CONFIG_DEVICE_ISOLATION
> + struct isolation_group *isolation_group;
> + struct list_head isolation_list;
> +#endif
> +
> /* arch specific additions */
> struct dev_archdata archdata;
>
> diff --git a/include/linux/device_isolation.h b/include/linux/device_isolation.h
> new file mode 100644
> index 0000000..1dc7e68
> --- /dev/null
> +++ b/include/linux/device_isolation.h
> @@ -0,0 +1,79 @@
> +#ifndef _DEVICE_ISOLATION_H_
> +#define _DEVICE_ISOLATION_H_
> +
> +#include <linux/kobject.h>
> +#include <linux/list.h>
> +#include <linux/device.h>
> +
> +struct isolation_group {
> + struct kobject kobj;
> + struct list_head devices;
> + spinlock_t lock;
> + bool detached;
> +};
> +
> +struct isolation_attribute {
> + struct attribute attr;
> + ssize_t (*show)(struct isolation_group *group, char *buf);
> + ssize_t (*store)(struct isolation_group *group, const char *buf,
> + size_t count);
> +};
> +
> +#ifdef CONFIG_DEVICE_ISOLATION
> +
> +int __init isolation_init(void);
> +
> +struct isolation_group *isolation_group_new(const char *name);
> +int isolation_group_delete(struct isolation_group *group);
> +
> +void isolation_group_dev_add(struct isolation_group *group,
> + struct device *dev);
> +void isolation_group_dev_remove(struct device *dev);
> +
> +int isolation_group_update_links(struct device *dev);
> +int isolation_group_detach(struct isolation_group *group);
> +int isolation_group_reattach(struct isolation_group *group);
> +
> +#define DEVICE_ISOLATION_ATTR(_name, _mode, _show, _store) \
> + struct isolation_attribute isolation_attr_##_name = \
> + __ATTR(_name, _mode, _show, _store)
> +
> +static inline int isolation_device_is_detached(struct device *dev)
> +{
> + return dev->isolation_group && dev->isolation_group->detached;
> +}
> +
> +#else
> +
> +static inline int __init isolation_init(void)
> +{
> + return 0;
> +}
> +
> +static inline struct isolation_group *isolation_group_new(const char *name)
> +{
> + return NULL;
> +}
> +
> +/*void isolation_group_dev_add(struct isolation_group *group,
> + struct device *dev)
> +{
> +}*/
> +
> +static inline void isolation_group_dev_remove(struct device *dev)
> +{
> +}
> +
> +static inline int isolation_group_update_links(struct device *dev)
> +{
> + return 0;
> +}
> +
> +static inline int isolation_is_detached(struct device *dev)
> +{
> + return 0;
> +}
> +
> +#endif
> +
> +#endif

So what this boils down to so far are some group objects with lists of
devices that allow us to pop off all the native drivers and prevent them
from reattaching. And for that convenience, we're going to need a whole
new group driver model to allow something like vfio to claim the group
and then maybe some group-device model to allow bus specific drivers to
easily make use of the devices within the group? And somehow along the
way we're going to avoid the group vs device awkwardness? And again,
all for an interface that the vast majority of users won't use. I'm not
convinced. Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/