Nanyong Sun wrote:

How about taking a new lock with irq disabled during BBM, like:

+void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	spin_lock_irq(NEW_LOCK);
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+	spin_unlock_irq(NEW_LOCK);
+}
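
[Editorial note: NEW_LOCK above is a placeholder. A minimal, self-contained reading of the idea follows; the lock name is invented here, and the assumption that the corresponding vmemmap fault path would synchronize on the same lock is mine, not a statement from the thread.]

#include <linux/mm.h>
#include <linux/spinlock.h>
#include <asm/tlbflush.h>

/* Hypothetical stand-in for NEW_LOCK. */
static DEFINE_SPINLOCK(vmemmap_bbm_lock);

void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
{
	/* Serialize break-before-make windows on the vmemmap. */
	spin_lock_irq(&vmemmap_bbm_lock);
	pte_clear(&init_mm, addr, ptep);
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
	set_pte_at(&init_mm, addr, ptep, pte);
	spin_unlock_irq(&vmemmap_bbm_lock);
}

Whether this can be made safe against concurrent vmemmap accesses from other CPUs is exactly what the rest of the thread debates.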

On Thu, 8 Feb 2024, Will Deacon wrote:

I really think the only maintainable way to achieve this is to avoid the
possibility of a fault altogether.

Will

On 2024/3/14 7:32, David Rientjes wrote:

Nanyong, are you still actively working on making HVO possible on arm64?
This would yield a substantial memory savings on hosts that are largely
configured with hugetlbfs. In our case, the size of this hugetlbfs pool
is actually never changed after boot, but it sounds from the thread that
there was an idea to make HVO conditional on FEAT_BBM. Is this being
pursued?
If so, any testing help needed?

On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:

I'm afraid that FEAT_BBM may not solve the problem here, because from the Arm ARM
I see that FEAT_BBM is only used for changing block size. Therefore, in this HVO
feature it can work in the split PMD stage, that is, BBM can be avoided in
vmemmap_split_pmd, but in the subsequent vmemmap_remap_pte the output address of
the PTE still needs to be changed. I'm afraid FEAT_BBM is not sufficient for this
stage. Perhaps my understanding of Arm's FEAT_BBM is wrong, and I hope someone
can correct me.

Actually, the solution I first considered was to use the stop_machine() method,
but we have products that rely on /proc/sys/vm/nr_overcommit_hugepages to
dynamically use hugepages, so I have to consider performance issues. If your
product does not change the amount of huge pages after booting, using
stop_machine() may be a feasible way. So far, I still haven't come up with a
good solution.

On 2024/6/24 13:39, Yu Zhao wrote:

I think so too -- I came across this while working on TAO [1].

[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@xxxxxxxxxx/

I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
to pause/resume remote CPUs while the local one is doing BBM.
Note that the problem of updating vmemmap for struct page[], as I see
it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
hot removal in general [2]. On arm64, we would need to support BBM on
vmemmap so that we can fix the problem with offlining memory (or to be
precise, unmapping offlined struct page[]), by mapping offlined struct
page[] to a read-only page of dummy struct page[], similar to
ZERO_PAGE(). (Or we would have to make extremely invasive changes to
the reader side, i.e., all speculative PFN walkers.)
In case you are interested in testing my approach, you can swap your
patch 2 with the following:

On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong@xxxxxxxxxx> wrote:

I don't have an NMI IPI capable ARM machine on hand, so I think this feature
depends on a higher version of the ARM CPU.

What I worried about was that other cores would occasionally be interrupted
frequently (8 times every 2M and 4096 times every 1G) and then wait for the
update of the page table to complete before resuming. If there are workloads
running on other cores, performance may be affected. This implementation
speeds up stopping and resuming other cores, but they still have to wait
for the update to finish.

Yu Zhao replied:

(Pseudo) NMI does require GICv3 (released in 2015). But that's
independent from CPU versions. Just to double check: you don't have
GICv3 (rather than not having CONFIG_ARM64_PSEUDO_NMI=y or
irqchip.gicv3_pseudo_nmi=1), is that correct?

Even without GICv3, IPIs can be masked but still work, with a less
bounded latency.

Catalin has suggested batching, and to echo what he said [1]: it's
possible to make all vmemmap changes from a single HVO/de-HVO
operation into *one batch*.

[1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@xxxxxxx/
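
[Editorial note: to make the batching idea concrete, here is a hypothetical sketch, not code from this thread: gather all PTE updates for one HVO/de-HVO operation and apply them under a single pause of the other CPUs, with one TLB flush, instead of pausing once per PTE. The struct and function names are invented, and stop_machine() stands in for the NMI-IPI pause/resume mechanism discussed above.]

#include <linux/mm.h>
#include <linux/stop_machine.h>
#include <asm/tlbflush.h>

/* One pending vmemmap PTE update; names are illustrative only. */
struct vmemmap_pte_update {
	unsigned long addr;
	pte_t *ptep;
	pte_t pte;
};

struct vmemmap_update_batch {
	struct vmemmap_pte_update *updates;
	int nr;
	unsigned long start;	/* lowest addr in the batch */
	unsigned long end;	/* highest addr + PAGE_SIZE */
};

/* Applies the whole batch while all other CPUs are held, flushing once. */
static int __vmemmap_apply_batch(void *arg)
{
	struct vmemmap_update_batch *batch = arg;
	int i;

	for (i = 0; i < batch->nr; i++)
		pte_clear(&init_mm, batch->updates[i].addr, batch->updates[i].ptep);

	flush_tlb_kernel_range(batch->start, batch->end);

	for (i = 0; i < batch->nr; i++)
		set_pte_at(&init_mm, batch->updates[i].addr,
			   batch->updates[i].ptep, batch->updates[i].pte);
	return 0;
}

static void vmemmap_apply_batch(struct vmemmap_update_batch *batch)
{
	/* One pause for the whole HVO/de-HVO operation instead of one per PTE. */
	stop_machine(__vmemmap_apply_batch, batch, NULL);
}

Assuming 4KB base pages and 64-byte struct pages, a 2MB hugepage covers 8 vmemmap pages and a 1GB hugepage covers 4096 (the counts Nanyong cites above), so batching turns those 8 or 4096 pauses into a single one per operation.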

How often does your use case trigger HVO/de-HVO operations?
For our VM use case, it's generally correlated to VM lifetimes, i.e.,
how often VM bin-packing happens. For our THP use case, it can be more
often, but I still don't think we would trigger HVO/de-HVO every
minute. So with NMI IPIs, IMO, the performance impact would be
acceptable to our use cases.

Nanyong Sun replied:

Oh, I misunderstood. Pseudo NMI is available. We have CONFIG_ARM64_PSEUDO_NMI=y.

We have many use cases, so I'm not thinking about a specific use case.