Re: [PATCH v3 12/21] KVM: X86: Implement ring-based dirty memory tracking

From: Michael S. Tsirkin
Date: Sun Jan 12 2020 - 01:24:38 EST


On Fri, Jan 10, 2020 at 10:29:59AM -0500, Peter Xu wrote:
> On Thu, Jan 09, 2020 at 05:18:24PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2020 at 03:19:16PM -0500, Peter Xu wrote:
> > > > > while for virtio, both sides (the hypervisor
> > > > > and the guest driver) are trusted.
> > > >
> > > > What gave you the impression guest is trusted in virtio?
> > >
> > > Hmm... maybe from knowing that virtio can bypass the vIOMMU as long
> > > as it doesn't provide the IOMMU_PLATFORM flag? :)
> >
> > If the guest driver does not provide IOMMU_PLATFORM and the device
> > does, then negotiation fails.
>
> I mean it's still possible to specify "!IOMMU_PLATFORM" for the virtio
> device even if vIOMMU is enabled in the guest (rather than relying on
> the negotiation procedure). Again, I think that's fair, for the same
> reason that we tend to make "iommu=pt" the default for all the kernel
> drivers: we should trust the drivers as much as the kernel itself.
> The only thing we want to protect with the vIOMMU is the userspace
> driver, because we do have a line between userspace and the kernel,
> and IMHO it's the same thing here for the new kvm interface.
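
For concreteness, here is a minimal sketch of the negotiation point
being discussed (illustrative only, not the actual virtio core code;
the function name is made up):

	#include <linux/virtio_config.h>	/* virtio_has_feature() */

	/* A driver only routes its DMA through the DMA API (and hence
	 * through the vIOMMU) when the device offered
	 * VIRTIO_F_IOMMU_PLATFORM and the feature was negotiated;
	 * otherwise it uses guest physical addresses directly and the
	 * vIOMMU is bypassed.
	 */
	static bool use_platform_iommu(const struct virtio_device *vdev)
	{
		return virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM);
	}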
>
> >
> > > I think it's logical to trust a virtio guest kernel driver; could
> > > you point out what I've missed?
> >
> >
> > The guest driver is assumed to be part of the guest kernel. It can't
> > do anything the kernel can't do anyway.
>
> Right, I think everything that belongs to the kernel has the same
> level of trust. However, userspace should be treated differently,
> and that's why I tend to prefer the index solution, where we expose
> less for userspace to write (reads from userspace are far safer than
> writes).

You are mixing up different kinds of userspace here. The vIOMMU
protects the guest kernel from guest userspace.
Protecting the guest kernel against userspace hypervisors
(e.g. QEMU) is mostly futile.


> >
> > > >
> > > >
> > > > > The above means we would need to do the following to switch
> > > > > to the new design:
> > > > >
> > > > > - Allow the GFN array to be mapped as writable by userspace (so
> > > > > that userspace can publish bit 2),
> > > > >
> > > > > - Trust userspace to follow the design (just imagine what would
> > > > > happen if userspace overwrote a GFN while publishing bit 2 over
> > > > > a valid dirty gfn entry: KVM could wrongly unprotect a page for
> > > > > the guest...).
> > > >
> > > > You mean protect, right? So what?
> > >
> > > Yes. I mean that with that design, more of what userspace does is
> > > uncertain. It seems easier to me to restrict userspace to a single
> > > index.
> >
> > Dunno how to treat vague statements like this. You need to be
> > specific with threat models. Otherwise there's no way to tell whether
> > the code is secure.
> >
> > > >
> > > > > Whereas if we use the indices, we restrict userspace to writing
> > > > > a single index (the reset_index). That's all it can do to mess
> > > > > things up (and it can't even do that, as long as we properly
> > > > > validate the reset_index when it is read, which only happens
> > > > > during KVM_RESET_DIRTY_RINGS and is very rare). From that pov,
> > > > > it seems the indices solution still has its benefits.
> > > >
> > > > So if you mess up index how is this different?
> > >
> > > We can't mess up much with that. We simply check fetch_index (sorry I
> > > meant this when I said reset_index, anyway it's the only index that we
> > > expose to userspace) to make sure:
> > >
> > > reset_index <= fetch_index <= dirty_index
> > >
> > > Otherwise we fail the ioctl. With that, we're 100% safe.
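
For concreteness, a minimal sketch of that validation (struct and
function names here are illustrative, not copied from the patches):

	#include <linux/types.h>	/* u32 */
	#include <linux/errno.h>	/* EINVAL */

	/* Userspace may only move fetch_index; dirty_index and
	 * reset_index are owned by KVM.  All three are free-running u32
	 * counters, so the range check is done on differences.
	 */
	struct dirty_ring_indices {
		u32 dirty_index;	/* advanced by KVM as gfns are pushed */
		u32 reset_index;	/* advanced by KVM as gfns are re-protected */
		u32 fetch_index;	/* the only index userspace may write */
	};

	/* Enforce reset_index <= fetch_index <= dirty_index; otherwise
	 * KVM_RESET_DIRTY_RINGS fails.
	 */
	static int validate_fetch_index(const struct dirty_ring_indices *idx)
	{
		if (idx->fetch_index - idx->reset_index >
		    idx->dirty_index - idx->reset_index)
			return -EINVAL;
		return 0;
	}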
> >
> > Safe from what? Userspace can mess up guest memory trivially,
> > for example by skipping some memory or sending junk.
>
> Yes, QEMU can mess the guest up, but it should never be able to mess
> the host up, am I right? As userspace, QEMU should be seen by KVM as
> untrusted from the host's point of view. Guest security is another
> matter, imho.
>
> >
> > > >
> > > > I agree RO page kind of feels safer generally though.
> > > >
> > > > I will have to re-read how does the ring works though,
> > > > my comments were based on the old assumption of mmaped
> > > > page with indices.
> > >
> > > Yes, sorry again for a bad cover letter.
> > >
> > > It's basically the same as before, except that we only have
> > > per-vcpu rings now, and the indices are exposed through kvm_run so
> > > we don't need the extra page; the ring itself is still exposed via
> > > mmap.
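
Roughly, the shape of that interface is something like the following
(a sketch based on the description above; field names are illustrative
and not copied from the uapi headers in the series):

	#include <linux/types.h>	/* __u32, __u64 */

	/* Per-vcpu indices, exposed through the mmap'ed kvm_run page:
	 * KVM advances dirty_index as it pushes dirty gfns, userspace
	 * advances fetch_index as it collects them, and reset_index
	 * stays internal to KVM.
	 */
	struct kvm_dirty_ring_indices {
		__u32 dirty_index;
		__u32 fetch_index;
	};

	/* One entry of the per-vcpu dirty ring, mmap'ed from the vcpu
	 * fd at KVM_DIRTY_LOG_PAGE_OFFSET.
	 */
	struct kvm_dirty_gfn {
		__u32 pad;
		__u32 slot;	/* address space id | memslot id */
		__u64 offset;	/* page offset within the memslot */
	};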
> >
> > So that's why changelogs are useful.
> > Can you please write a changelog for this version so I don't
> > need to re-read all of it? Thanks!
>
> Sure, actually I've got a changelog in the cover letter for this
> version [1]... it's:
>
> V3 changelog:
>
> - fail userspace writable maps on dirty ring ranges [Jason]
> - commit message fixups [Paolo]
> - change __x86_set_memory_region to return hva [Paolo]
> - cacheline align for indices [Paolo, Jason]
> - drop waitqueue, global lock, etc., include kvmgt rework patchset
> - take lock for __x86_set_memory_region() (otherwise it triggers a
> lockdep warning in latest kvm/queue) [Paolo]
> - check KVM_DIRTY_LOG_PAGE_OFFSET in kvm_vm_ioctl_enable_dirty_log_ring
> - one more patch to drop x86_set_memory_region [Paolo]
> - one more patch to remove extra srcu usage in init_rmode_identity_map()
> - add some r-bs for Paolo
>
> I didn't include a detailed changelog for v2 because it would have
> been a long list of trivial details that could hide the major things,
> but I did put a small write-up in the cover letter mentioning the
> major changes [2].
>
> Again, I'm very sorry for missing either a complete changelog in v2
> or a high-level overview in the v3 cover letter. I'll make it better
> in v4.
>
> Thanks,
>
> [1] https://lore.kernel.org/kvm/20200109145729.32898-1-peterx@xxxxxxxxxx/
> [2] https://lore.kernel.org/kvm/20191220211634.51231-1-peterx@xxxxxxxxxx/
>
> --
> Peter Xu