It looks like wrgsbase does save us ~20-30 cycles vs. vmload, but maybe
not enough to justify the added complexity. Additionally, since we still
need to call vmload when we exit to userspace, it ends up being a bit
slower overall, at least for this particular workload. So for now I'll
plan on sticking with vmload'ing after vmexit, and moving that to the
asm code, if there are no objections.

current v2 patch, sample 1
  ioctl entry:  1204722748832
  pre-vmrun:    1204722749408 ( +576)
  post-vmrun:   1204722750784 (+1376)
  ioctl exit:   1204722751360 ( +576)
  total cycles: 2528

current v2 patch, sample 2
  ioctl entry:  1204722754784
  pre-vmrun:    1204722755360 ( +576)
  post-vmrun:   1204722756720 (+1360)
  ioctl exit:   1204722757312 ( +592)
  total cycles: 2528

wrgsbase, sample 1
  ioctl entry:  1346624880336
  pre-vmrun:    1346624880912 ( +576)
  post-vmrun:   1346624882256 (+1344)
  ioctl exit:   1346624882912 ( +656)
  total cycles: 2576

wrgsbase, sample 2
  ioctl entry:  1346624886272
  pre-vmrun:    1346624886832 ( +560)
  post-vmrun:   1346624888176 (+1344)
  ioctl exit:   1346624888816 ( +640)
  total cycles: 2544
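
For anyone who wants to double-check the arithmetic, here is a small sketch that recomputes the per-phase deltas and totals from the raw timestamps above. It is not part of the patch; the function and key names (deltas, summarize, entry_to_vmrun, and so on) are made up for illustration, and the only real data in it are the timestamps copied from the four samples.

```python
# Raw cycle timestamps copied from the samples above, in the order:
# ioctl entry, pre-vmrun, post-vmrun, ioctl exit.
SAMPLES = {
    "current v2 patch, sample 1": (
        1204722748832, 1204722749408, 1204722750784, 1204722751360),
    "current v2 patch, sample 2": (
        1204722754784, 1204722755360, 1204722756720, 1204722757312),
    "wrgsbase, sample 1": (
        1346624880336, 1346624880912, 1346624882256, 1346624882912),
    "wrgsbase, sample 2": (
        1346624886272, 1346624886832, 1346624888176, 1346624888816),
}


def deltas(timestamps):
    """Return the cycle count between each pair of consecutive timestamps."""
    return [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]


def summarize(samples):
    """Map each sample name to its per-phase deltas and end-to-end total."""
    summary = {}
    for name, ts in samples.items():
        entry_to_vmrun, vmrun, vmrun_to_exit = deltas(ts)
        summary[name] = {
            "entry_to_vmrun": entry_to_vmrun,  # ioctl entry -> pre-vmrun
            "vmrun": vmrun,                    # pre-vmrun -> post-vmrun
            "vmrun_to_exit": vmrun_to_exit,    # post-vmrun -> ioctl exit
            "total": ts[-1] - ts[0],           # ioctl entry -> ioctl exit
        }
    return summary


if __name__ == "__main__":
    for name, phases in summarize(SAMPLES).items():
        print(name)
        for phase, cycles in phases.items():
            print(f"  {phase:15s} {cycles:5d}")
```

Read this way, the pre-vmrun to post-vmrun window drops from 1360-1376 cycles to 1344 with wrgsbase, but the post-vmrun to ioctl exit path grows from 576-592 to 640-656, which is why the totals go from 2528 to 2544-2576 in these samples.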