On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu <tianyu.lan@xxxxxxxxx> wrote:
On 12/5/2015 1:07 AM, Alexander Duyck wrote:
We still need to support Windows guests for migration, and this is why our
patches keep all changes in the driver, since it's impossible to change
the Windows kernel.
That is a poor argument. I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration. They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.
This depends entirely on the vendors of the NIC or other devices; they
should decide whether to support migration. If they do, they would
modify the driver.
Having to modify every driver that wants to support live migration is
a bit much. In addition I don't see this being limited only to NIC
devices. You can direct assign a number of different devices, so your
solution cannot be specific to NICs.
If the target is just to call suspend/resume during migration, the feature
will be meaningless. In most cases we don't want migration to affect the
user much, so the service downtime is vital. Our target is to apply
SR-IOV NIC passthrough to cloud services and NFV (network functions
virtualization) projects, which are sensitive to network performance
and stability. In my opinion, we should give the device driver a chance
to implement its own migration job, and call the suspend and resume
callbacks in the driver only if it doesn't care about performance during
migration.
The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep. Having it burn cycles in
a power state limbo doesn't do anyone any good. If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.
Also you keep assuming you can keep the device running while you do
the migration and you can't. You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.
The following is my idea for DMA tracking.
Inject an event into the VF driver after the memory iteration stage
and before stopping the VCPU; the VF driver then marks all DMA
memory in use as dirty. Newly allocated pages also need to
be marked dirty before stopping the VCPU. All memory dirtied
in this time slot will be migrated during the stop-and-copy
stage. We also need to make sure to disable the VF by clearing its
bus master enable bit before migrating this memory.
The ordering of your explanation here doesn't quite work. What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
What the disabling of the BME does is signal to the hypervisor that
the device is now stopped. The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.
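
As a rough illustration of that ordering (stop DMA by clearing bus
master enable, then mark the pages dirty), here is a toy Python model.
It is only a sketch: ToyVF, dma_write and quiesce_and_mark are
hypothetical names made up for this example, not ixgbevf or hypervisor
code, and the "dirty" set merely stands in for a dirty page bitmap.

```python
class ToyVF:
    """Toy model of a VF; hypothetical, not the ixgbevf driver."""

    def __init__(self, dma_pages):
        self.bus_master = True           # models the bus master enable bit
        self.dma_pages = set(dma_pages)  # pages currently mapped for DMA

    def dma_write(self, memory, page, value):
        # With bus mastering disabled the device cannot initiate DMA.
        if not self.bus_master:
            return False
        memory[page] = value
        return True


def quiesce_and_mark(vf, dirty):
    """Stop DMA first, then mark every DMA-mapped page dirty."""
    vf.bus_master = False       # step 1: the device is now stopped
    dirty.update(vf.dma_pages)  # step 2: no later DMA write can land
    return dirty
```

Once quiesce_and_mark() has returned, any further dma_write() in this
model is refused, so the set of dirty pages is final.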
This is why I hope to reserve a piece of space in the DMA page for a dummy
write. This can help mark the page dirty without requiring DMA to stop and
without racing with the DMA data.
You can't and it will still race. What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work. There is
a reason why device drivers have so many memory barriers and the like
in them. The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.
If you have a contiguous block of memory you expect the device to
write into you cannot just poke a hole in it. Such a situation is not
supported by any hardware that I am aware of.
As far as writing to dirty the pages it only works so long as you halt
the DMA and then mark the pages dirty. It has to be in that order.
Any other order will result in data corruption and I am sure the NFV
customers definitely don't want that.
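
To make the race concrete, here is a small toy model in Python. It is
a sketch under stated assumptions, not real migration code: the
migrate() function, the "rx_buf" page name and the single final copy
pass are all invented for illustration. The one property it relies on
is that a device DMA write does not update the dirty set the way a
CPU write would.

```python
def migrate(halt_dma_first):
    """Return (source, dest) memory after a toy final migration pass."""
    source = {"rx_buf": "old"}
    dest = {}
    dirty = set()
    device_running = True

    def device_dma_write(value):
        # DMA bypasses CPU dirty logging, so `dirty` is not touched here.
        if device_running:
            source["rx_buf"] = value

    if halt_dma_first:
        device_running = False  # correct order: halt DMA, then mark dirty

    dirty.add("rx_buf")         # driver marks the page (e.g. a dummy write)

    # Final pass: copy every dirty page, then clear the dirty set.
    for page in sorted(dirty):
        dest[page] = source[page]
    dirty.clear()

    # A packet that arrives after the page was copied.
    device_dma_write("new")
    return source, dest
```

With halt_dma_first=False the source ends up holding data the
destination never received; with halt_dma_first=True the two copies
match.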
If we can't do that, we have to stop DMA for a short time to mark all DMA
pages dirty and then re-enable it. I am not sure how much we can gain
this way by tracking all DMA memory with the device running during
migration. I need to do some tests and compare the results with stopping
DMA directly at the last stage of migration.
We have to halt the DMA before we can complete the migration. So
please feel free to test this.
In addition I still feel you would be better off taking this in
smaller steps. I still say your first step would be to come up with a
generic solution for the dirty page tracking like the dma_mark_clean()
approach I had mentioned earlier. If I get time I might try to take
care of it myself later this week since you don't seem to agree with
that approach.
The question is how we would go about triggering it. I really don't
think the PCI configuration space approach is the right idea. I wonder
if we couldn't get away with some sort of ACPI event instead. We
already require ACPI support in order to shut down the system
gracefully, so I wonder if we couldn't get away with something similar
in order to suspend/resume the direct assigned devices gracefully.
I don't think there are such events in the current spec.
Apart from that, there are two kinds of suspend/resume callbacks:
1) System suspend/resume, called during S2RAM and S2DISK.
2) Runtime suspend/resume, called by the PM core when the device is idle.
If you want to do what you mentioned, you have to change the PM core and
the ACPI spec.
The thought I had was to somehow try to move the direct assigned
devices into their own power domain and then simulate an AC power event
where that domain is switched off. However I don't know if there are
ACPI events to support that since the power domain code currently only
appears to be in use for runtime power management.
That had also given me the thought to look at something like runtime
power management for the VFs. We would need to do a runtime
suspend/resume. The only problem is I don't know if there is any way
to get the VFs to do a quick wakeup. It might be worthwhile checking
with the ACPI experts out there to see if there is anything we can do,
as avoiding the configuration space mechanism to signal this would
definitely be worth it.