RE: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall areas in get_vtl(void)

From: Michael Kelley
Date: Tue Dec 24 2024 - 11:45:31 EST


From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Monday, December 23, 2024 12:30 PM
>
> On 12/20/2024 2:42 PM, Michael Kelley wrote:
> > From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Friday, December 20, 2024 11:14 AM
> >>
>
> [...]
>
> >
> > I will not object to your preferred path of allocating the
> > hyperv_pcpu_output_arg when CONFIG_HYPERV_VTL_MODE is set. For
> > me, it has been a worthwhile discussion to explore the alternatives. I do,
> > however, want to get further clarification about the spec requirements
> > and whether our current Linux code is meeting those requirements. More
> > on that below. And allow me to continue the discussion about a couple
> > of points you raised.
> >
> Thank you, Michael, for helping me explore the options! From where I
> sit, we both seem to agree that an update is needed, and where our
> opinions diverge is whether we need to spend another per-CPU page or
> not. If at any point, you start believing that your "Nack" is the best
> option going forward, I'll put this patch aside.
>
> [...]
>
> > I see three uses in current upstream code. And then there are close to
> > a dozen more uses in Nuno's year-old patch set for /dev/mshv that hasn't
> > been accepted upstream yet. Does that roughly match what you are referring
> > to?
> >
> While compiling my arguments and making sure they're worth the readers'
> time, I did miss to provide references to the code. Yes, that patch set
> might be a good illustration of the direction, thank you very much for
> pointing that out!
>
> Another reference might have been the kernel used with OpenHCL (6.6.x):
> https://github.com/microsoft/OHCL-Linux-Kernel. Azure Boost runs that,
> and the kernel is delighting our customers with the rock-solid
> foundation.

This is good. I'll refer to this code base as background as more patches
come through to upstream the deltas. I had seen the blog post about
OpenHCL but didn’t mentally connect the patches as coming from
that code base.

>
> It contains lots of mshv code, and it is the code run in Azure as the
> VTL2 kernel; we're working towards upstreaming, e.g. here :) It uses the
> hypercall output page, and the performance folks beat the hell out of
> the code, and found it being up to snuff. There were few findings about
> memory consumption out of which I remember the perf folks tuning SWIOTLB
> size and tweaking the kernel memory layout to save on padding and
> alignment. As for the per-CPU data, nothing allowing to save that much,
> perhaps would be nice to be able to compile out more. Yet, as I
> understand, x86_64 systems might not always be envisioned as the
> commonplace foundation for building embedded systems so while we're
> looking to save few KiBs in hyperv, there's likely much more to save
> elsewhere.
>
> >> * dom0 and the VTL mode can share code quite naturally,
> >
> > Yep.
> >
> To me, this has been the ever-shining light in the dark maze of other
> arguments. It provides a corner stone to building the generic well-
> tested code, hence developers will rest assured when introducing changes
> that the changes are robust even when not testing the myriad ways the
> hyperv code is used in. That is, the validation wouldn't need all the
> permutations of conditional statements and all #ifdef's that would be
> needed when special-casing.
>
> >> * Not using the output page cannot acccount for all cases permitted by
> >> the TLFS clause "don't overlap input and output, and don't cross page
> >> boundaries",
> >
> > This is what I don’t understand.
> >
> I meant that when implementing a generic case, not using the output page
> prohibits from being able to receive a page of output when supplying a
> page of input. I agree that we can special-case to save a bit of memory
> as it appears we don't need that ability (4KiB of output in response to
> 4KiB input) right now. What I question though if that specialization is
> needed at all since it will lead to more code, more parameters for the
> functions, more comments and if statements therefore more validation
> work and more cognitive load, and more maintenance costs.
>
> >> * Not using the output page everywhere for consistency requires updating
> >> call sites of `hv_do_hypercall`​ and friends, i.e. every place where a
> >> hypercall is made needs to incorporate the offset/pointer at which the
> >> output should start, or perhaps do some shenanigans with macro's,
> >
> > In my musing about getting rid of hyperv_pcpu_output_arg entirely, the
> > call signature of hv_do_hypercall() and friends would remain unchanged.
> > There would still be a virtual address passed as the output argument (which
> > might be NULL, and is *not* required to be page aligned). The caller of
> > hv_do_hypercall() must only ensure that the virtual address points to an
> > output area that is large enough to contain the output without crossing a
> > page boundary. The area must also not overlap with the input area. Given
> > that we currently have only 3 uses of hyperv_pcpu_output_arg in upstream
> > code, only those would need to be tweaked to split the hyperv_pcpu_input_arg
> > page into an input and output area. All other callers of hv_do_hypercall()
> > would be unchanged.
> >
> Your approach appears being solid to me! Again, going to be beating on
> my drum :) I believe here we are deciding not only on the small
> optimization which is a clever way of saving a bit of memory, and it did
> bring a satisfying "wow, nice" to my mind.
>
> We are drafting the constrains for the future code as it appears;
> without knowing better, perhaps it is the wisest not to constrain. As
> the quote goes:
>
> "... We should forget about small efficiencies, say about 97% of the
> time: premature optimization is the root of all evil. Yet we should not
> pass up our opportunities in that critical 3%. A good programmer ...
> will be wise to look carefully at the critical code; but only after that
> code has been identified. ..."
>
>
> >> * Not using the output page leads to negligible savings,
> >>
> >> it is hard to see for me how to make not using the hypercall output page
> >> look as a profitable enginnering endeavor, really comes off as dancing
> >> to the perf scare/feature creep tune for peanuts.
> >
> > There would be some simplification in hv_common_free(), hv_common_init(),
> > and hv_common_cpu_init() if there was no hyperv_pcpu_output_arg. :-)
> >
> Agreed!
>
> >>
> >> In my opinion, we should use the hypercall output page for the VTL mode
> >> as dom0 does to share code and reduce code churn. Had we used some
> >> `hv_get_registers` in both​ instead of the specialized for no compelling
> >> imo practical reason `get_vtl`​ (as it just gets a vCPU register, nothing
> >> more, nothing less), this code path would've been better tested, and any
> >> of this patching would not have been needed.
> >
> > FWIW, the historical context is that at the time get_vtl() was implemented,
> > all VP registers that Linux needed to access were available as synthetic
> > MSRs on the x86 side. Going back to its origins 15 years ago, the Linux code
> > for Hyper-V always just used the synthetic MSRs. get_vtl() was the first time
> > that x86 code needed the HVCALL_GET_VP_REGISTERS hypercall because
> > HV_REGISTER_VSM_VP_STATUS has no synthetic MSR equivalent. There still
> > is no 'hv_get_registers' function in upstream code on the x86 side.
> > hv_get_vpreg() exists only on the arm64 side where there are no synthetic
> > MSRs or equivalent.
>
>
> >
> > In hindsight, hv_get_vpreg() could have been added on the x86 side, and
> > then get_vtl() implemented on top of that. And I made the get_vtl() situation
> > worse by giving Tianyu bad advice about overlapping the output area with
> > the input area. :-(
> >
> > I once had a manager at Microsoft who said "He reserved the right
> > to wake up smarter every day." I've had to claim that right as well
> > from time-to-time. :-)
> >
> Michael, thank you very much for your efforts! At that time, no one
> knew otherwise iiuc. You had the willpower to make things happen and the
> vision to bring the solution over the finish line, and imo the Hyper-V
> community just cannot possibly thank you enough for your incredible
> work. It is my genuine hope that putting the Fixes tag didn't overshadow
> that fact for anyone as these five letters don't tell any of the story,
> most assuredly.
>
> >>
> >> I'd wait for few days and then would likely prefer to run with Wei's
> >> permission to send this patch in v2 as-is unless some damning evidence
> >> presents itself.
> >
> > Again, I won't object to your taking that path.
> >
> > But regarding compliance with the TLFS, take a look at the code for
> > hv_pci_read_mmio() and hv_pci_write_mmio(). Do you think that code
> > violates the spec? If not, what would a violation look like in the context
> > of this discussion? If current code is in violation, then we should fix that.
> >
> Appreciate your advice! I don't think they do. The code doesn't overlap
> input and output, and to my eye, the arguments are sized and aligned
> appropriately. That does not violate TLFS. I'd think the
> `hv_pci_read_mmio` might've used the output page as it is supposed to be
> used for output to stick to one pattern, much similar to the `get_vtl`
> in question.
>
> Now, it seems the functions run in the VTL0 only (but I'd leave that to
> someone else to judge) where we currently don't allocate the output
> page, and `hv_pci_read_mmio` is the only function doing that input page
> split so special-casing might be justified, perhaps adding a comment
> would seem appropriate. Unless there are prospects to merge that with
> dom0, I'd leave the code be.

OK, my understanding is that your concern about spec conformance is
just that Linux should be able to allocate enough input and output space
for the maximum case, which is 4KiB of input *plus* 4KiB of output. If
the total amount of input plus output for a particular hypercall is less
than 4KiB, then there's no conformance problem with having the input
and output share a page, as long as the "no overlap" rule is observed.

There's an idea kicking around in my head about a different way to
handle all this that might be cleaner and less code all-around. If I
get motivated, I may code it up and see if it really works. If so,
I'll run it by you to see what you think.

Michael