Re: [PATCH v3 4/5] KVM: arm64: Enable HDBSS support and handle HDBSSF events

From: Tian Zheng

Date: Fri Mar 06 2026 - 04:32:36 EST

Hi Leo,

On 3/4/2026 11:40 PM, Leonardo Bras wrote:

Hi Tian,

Few extra notes/questions below

On Wed, Feb 25, 2026 at 12:04:20PM +0800, Tian Zheng wrote:

From: eillon<yezhenyu2@xxxxxxxxxx>

HDBSS is enabled via an ioctl from userspace (e.g. QEMU) at the start of
migration. This feature is only supported in VHE mode.

Initially, S2 PTEs doesn't contain the DBM attribute. During migration,
write faults are handled by user_mem_abort, which relaxes permissions
and adds the DBM bit when HDBSS is active. Once DBM is set, subsequent
writes no longer trap, as the hardware automatically transitions the page
from writable-clean to writable-dirty.

KVM does not scan S2 page tables to consume DBM. Instead, when HDBSS is
enabled, the hardware observes the clean->dirty transition and records
the corresponding page into the HDBSS buffer.

During sync_dirty_log, KVM kicks all vCPUs to force VM-Exit, ensuring
that check_vcpu_requests flushes the HDBSS buffer and propagates the
accumulated dirty information into the userspace-visible dirty bitmap.

Add fault handling for HDBSS including buffer full, external abort, and
general protection fault (GPF).

Signed-off-by: eillon<yezhenyu2@xxxxxxxxxx>
Signed-off-by: Tian Zheng<zhengtian10@xxxxxxxxxx>
---
arch/arm64/include/asm/esr.h | 5 ++
arch/arm64/include/asm/kvm_host.h | 17 +++++
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/include/asm/sysreg.h | 11 ++++
arch/arm64/kvm/arm.c | 102 ++++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/switch.c | 19 ++++++
arch/arm64/kvm/mmu.c | 70 ++++++++++++++++++++
arch/arm64/kvm/reset.c | 3 +
8 files changed, 228 insertions(+)

diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 81c17320a588..2e6b679b5908 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -437,6 +437,11 @@
#ifndef __ASSEMBLER__
#include <asm/types.h>

+static inline bool esr_iss2_is_hdbssf(unsigned long esr)
+{
+ return ESR_ELx_ISS2(esr) & ESR_ELx_HDBSSF;
+}
+
static inline unsigned long esr_brk_comment(unsigned long esr)
{
return esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5d5a3bbdb95e..57ee6b53e061 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -55,12 +55,17 @@
#define KVM_REQ_GUEST_HYP_IRQ_PENDING KVM_ARCH_REQ(9)
#define KVM_REQ_MAP_L1_VNCR_EL2 KVM_ARCH_REQ(10)
#define KVM_REQ_VGIC_PROCESS_UPDATE KVM_ARCH_REQ(11)
+#define KVM_REQ_FLUSH_HDBSS KVM_ARCH_REQ(12)

#define KVM_DIRTY_LOG_MANUAL_CAPS (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
KVM_DIRTY_LOG_INITIALLY_SET)

#define KVM_HAVE_MMU_RWLOCK

+/* HDBSS entry field definitions */
+#define HDBSS_ENTRY_VALID BIT(0)
+#define HDBSS_ENTRY_IPA GENMASK_ULL(55, 12)
+
/*
* Mode of operation configurable with kvm-arm.mode early param.
* See Documentation/admin-guide/kernel-parameters.txt for more information.
@@ -84,6 +89,7 @@ int __init kvm_arm_init_sve(void);
u32 __attribute_const__ kvm_target_cpu(void);
void kvm_reset_vcpu(struct kvm_vcpu *vcpu);
void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu);
+void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu);

struct kvm_hyp_memcache {
phys_addr_t head;
@@ -405,6 +411,8 @@ struct kvm_arch {
* the associated pKVM instance in the hypervisor.
*/
struct kvm_protected_vm pkvm;
+
+ bool enable_hdbss;
};

struct kvm_vcpu_fault_info {
@@ -816,6 +824,12 @@ struct vcpu_reset_state {
bool reset;
};

+struct vcpu_hdbss_state {
+ phys_addr_t base_phys;
+ u32 size;
+ u32 next_index;
+};
+

IIUC this is used once both on enable/disable and massively on
vcpu_put/get.

What if we actually save just HDBSSBR_EL2 and HDBSSPROD_EL2 instead?
That way we avoid having masking operations in put/get as well as any
possible error we may have formatting those.

The cost is doing those operations once for enable and once for disable,
which should be fine.

Thanks for the suggestion. I actually started with storing the raw system register

values, as you proposed.

However, after discussing it with Oliver Upton in v1, we felt that keeping the base address,

size, and index as separate fields makes the state easier to understand.

Discussion link:https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@xxxxxxxxx/ <https://lore.kernel.org/linux-arm-kernel/Z8_usklidqnerurc@xxxxxxxxx/>

That's why I ended up changing the storage approach in the end.

struct vncr_tlb;

struct kvm_vcpu_arch {
@@ -920,6 +934,9 @@ struct kvm_vcpu_arch {

/* Per-vcpu TLB for VNCR_EL2 -- NULL when !NV */
struct vncr_tlb *vncr_tlb;
+
+ /* HDBSS registers info */
+ struct vcpu_hdbss_state hdbss;
};

/*
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index d968aca0461a..3fea8cfe8869 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -183,6 +183,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,

int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
+void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu);

phys_addr_t kvm_mmu_get_httbr(void);
phys_addr_t kvm_get_idmap_vector(void);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index f4436ecc630c..d11f4d0dd4e7 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -1039,6 +1039,17 @@

#define GCS_CAP(x) ((((unsigned long)x) & GCS_CAP_ADDR_MASK) | \
GCS_CAP_VALID_TOKEN)
+
+/*
+ * Definitions for the HDBSS feature
+ */
+#define HDBSS_MAX_SIZE HDBSSBR_EL2_SZ_2MB
+
+#define HDBSSBR_EL2(baddr, sz) (((baddr) & GENMASK(55, 12 + sz)) | \
+ FIELD_PREP(HDBSSBR_EL2_SZ_MASK, sz))
+
+#define HDBSSPROD_IDX(prod) FIELD_GET(HDBSSPROD_EL2_INDEX_MASK, prod)
+
/*
* Definitions for GICv5 instructions
*/
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 29f0326f7e00..d64da05e25c4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -125,6 +125,87 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
}

+void kvm_arm_vcpu_free_hdbss(struct kvm_vcpu *vcpu)
+{
+ struct page *hdbss_pg;
+
+ hdbss_pg = phys_to_page(vcpu->arch.hdbss.base_phys);
+ if (hdbss_pg)
+ __free_pages(hdbss_pg, vcpu->arch.hdbss.size);
+
+ vcpu->arch.hdbss.size = 0;
+}
+
+static int kvm_cap_arm_enable_hdbss(struct kvm *kvm,
+ struct kvm_enable_cap *cap)
+{
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+ struct page *hdbss_pg = NULL;
+ __u64 size = cap->args[0];
+ bool enable = cap->args[1] ? true : false;
+
+ if (!system_supports_hdbss())
+ return -EINVAL;
+
+ if (size > HDBSS_MAX_SIZE)
+ return -EINVAL;
+
+ if (!enable && !kvm->arch.enable_hdbss) /* Already Off */
+ return 0;
+
+ if (enable && kvm->arch.enable_hdbss) /* Already On, can't set size */
+ return -EINVAL;
+
+ if (!enable) { /* Turn it off */
+ kvm->arch.mmu.vtcr &= ~(VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA);
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ /* Kick vcpus to flush hdbss buffer. */
+ kvm_vcpu_kick(vcpu);
+
+ kvm_arm_vcpu_free_hdbss(vcpu);
+ }
+
+ kvm->arch.enable_hdbss = false;
+
+ return 0;
+ }
+
+ /* Turn it on */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ hdbss_pg = alloc_pages(GFP_KERNEL_ACCOUNT, size);
+ if (!hdbss_pg)
+ goto error_alloc;
+
+ vcpu->arch.hdbss = (struct vcpu_hdbss_state) {
+ .base_phys = page_to_phys(hdbss_pg),
+ .size = size,
+ .next_index = 0,
+ };
+ }
+
+ kvm->arch.enable_hdbss = true;
+ kvm->arch.mmu.vtcr |= VTCR_EL2_HD | VTCR_EL2_HDBSS | VTCR_EL2_HA;
+
+ /*
+ * We should kick vcpus out of guest mode here to load new
+ * vtcr value to vtcr_el2 register when re-enter guest mode.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_vcpu_kick(vcpu);
+
+ return 0;
+
+error_alloc:
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (vcpu->arch.hdbss.base_phys)
+ kvm_arm_vcpu_free_hdbss(vcpu);
+ }
+
+ return -ENOMEM;
+}
+
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
@@ -182,6 +263,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
set_bit(KVM_ARCH_FLAG_EXIT_SEA, &kvm->arch.flags);
break;
+ case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
+ mutex_lock(&kvm->lock);
+ r = kvm_cap_arm_enable_hdbss(kvm, cap);
+ mutex_unlock(&kvm->lock);
+ break;
default:
break;
}
@@ -471,6 +557,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = kvm_supports_cacheable_pfnmap();
break;

+ case KVM_CAP_ARM_HW_DIRTY_STATE_TRACK:
+ r = system_supports_hdbss();
+ break;
default:
r = 0;
}
@@ -1120,6 +1209,9 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
if (kvm_dirty_ring_check_request(vcpu))
return 0;

+ if (kvm_check_request(KVM_REQ_FLUSH_HDBSS, vcpu))
+ kvm_flush_hdbss_buffer(vcpu);

I am curious on why we need a flush-hdbss request,
Don't we have the flush function happening every time we run vcpu_put?

Oh, I see, you want to check if there is anything needed inside the inner
loop of vcpu_run, without having to vcpu_put. I think it makes sense.

But instead of having this on guest entry, does not it make more sense to
have it in guest exit? This way we flush every time (if needed) we exit the
guest, and instead of having a vcpu request, we just require a vcpu kick
and it should flush if needed.

Maybe have vcpu_put just save the registers, and add a the flush before
handle_exit.

What do you think?

Thank you for the feedback.

Indeed, in the initial version (v1), I placed the flush operation inside handle_exit and

used a vcpu_kick in kvm_arch_sync_dirty_log to trigger the flush of the HDBSS buffer.

However, during the review, Marc pointed out that calling this function on every exit

event is too frequent if it's not always needed.

Discussion link: _https://lore.kernel.org/linux-arm-kernel/86senjony9.wl-maz@xxxxxxxxxx/_

I agreed with his assessment. Therefore, in the current version, I've separated the flush

operation into more specific and less frequent points:

1. In vcpu_put

2. During dirty log synchronization, by kicking the vCPU to trigger a request that flushes

on its next exit.

3. When handling a specific HDBSSF event.

This ensures the flush happens only when necessary, avoiding the overhead of doing it

on every guest exit.

+
check_nested_vcpu_requests(vcpu);
}

@@ -1898,7 +1990,17 @@ long kvm_arch_vcpu_unlocked_ioctl(struct file *filp, unsigned int ioctl,

void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
+ /*
+ * Flush all CPUs' dirty log buffers to the dirty_bitmap. Called
+ * before reporting dirty_bitmap to userspace. Send a request with
+ * KVM_REQUEST_WAIT to flush buffer synchronously.
+ */
+ struct kvm_vcpu *vcpu;
+
+ if (!kvm->arch.enable_hdbss)
+ return;

+ kvm_make_all_cpus_request(kvm, KVM_REQ_FLUSH_HDBSS);
}

static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 9db3f11a4754..600cbc4f8ae9 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -213,6 +213,23 @@ static void __vcpu_put_deactivate_traps(struct kvm_vcpu *vcpu)
local_irq_restore(flags);
}

+static void __load_hdbss(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ u64 br_el2, prod_el2;
+
+ if (!kvm->arch.enable_hdbss)
+ return;
+
+ br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
+ prod_el2 = vcpu->arch.hdbss.next_index;
+
+ write_sysreg_s(br_el2, SYS_HDBSSBR_EL2);
+ write_sysreg_s(prod_el2, SYS_HDBSSPROD_EL2);
+
+ isb();
+}
+
void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
{
host_data_ptr(host_ctxt)->__hyp_running_vcpu = vcpu;
@@ -220,10 +237,12 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
__vcpu_load_switch_sysregs(vcpu);
__vcpu_load_activate_traps(vcpu);
__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
+ __load_hdbss(vcpu);
}

void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
{
+ kvm_flush_hdbss_buffer(vcpu);
__vcpu_put_deactivate_traps(vcpu);
__vcpu_put_switch_sysregs(vcpu);

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 070a01e53fcb..42b0710a16ce 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1896,6 +1896,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (writable)
prot |= KVM_PGTABLE_PROT_W;

+ if (writable && kvm->arch.enable_hdbss && logging_active)
+ prot |= KVM_PGTABLE_PROT_DBM;
+
if (exec_fault)
prot |= KVM_PGTABLE_PROT_X;

@@ -2033,6 +2036,70 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
return 0;
}

+void kvm_flush_hdbss_buffer(struct kvm_vcpu *vcpu)
+{
+ int idx, curr_idx;
+ u64 br_el2;
+ u64 *hdbss_buf;
+ struct kvm *kvm = vcpu->kvm;
+
+ if (!kvm->arch.enable_hdbss)
+ return;
+
+ curr_idx = HDBSSPROD_IDX(read_sysreg_s(SYS_HDBSSPROD_EL2));
+ br_el2 = HDBSSBR_EL2(vcpu->arch.hdbss.base_phys, vcpu->arch.hdbss.size);
+
+ /* Do nothing if HDBSS buffer is empty or br_el2 is NULL */
+ if (curr_idx == 0 || br_el2 == 0)
+ return;
+
+ hdbss_buf = page_address(phys_to_page(vcpu->arch.hdbss.base_phys));
+ if (!hdbss_buf)
+ return;
+
+ guard(write_lock_irqsave)(&vcpu->kvm->mmu_lock);
+ for (idx = 0; idx < curr_idx; idx++) {
+ u64 gpa;
+
+ gpa = hdbss_buf[idx];
+ if (!(gpa & HDBSS_ENTRY_VALID))
+ continue;
+
+ gpa &= HDBSS_ENTRY_IPA;
+ kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
+ }

This will mark a page dirty for both dirty_bitmap or dirty_ring, depending
on what is in use.

Out of plain curiosity, have you planned / tested for the dirty-ring as
well, or just for dirty-bitmap?

Currently, I have only tested this with dirty-bitmap mode.

I will test and ensure the HDBSS feature works correctly with dirty-ring in the next version.

+
+ /* reset HDBSS index */
+ write_sysreg_s(0, SYS_HDBSSPROD_EL2);
+ vcpu->arch.hdbss.next_index = 0;
+ isb();
+}
+
+static int kvm_handle_hdbss_fault(struct kvm_vcpu *vcpu)
+{
+ u64 prod;
+ u64 fsc;
+
+ prod = read_sysreg_s(SYS_HDBSSPROD_EL2);
+ fsc = FIELD_GET(HDBSSPROD_EL2_FSC_MASK, prod);
+
+ switch (fsc) {
+ case HDBSSPROD_EL2_FSC_OK:
+ /* Buffer full, which is reported as permission fault. */
+ kvm_flush_hdbss_buffer(vcpu);
+ return 1;

Humm, flushing in a fault handler means hanging there, in IRQ context, for
a while.

Since we already deal with this on guest_exit (vcpu_put IIUC), why not just
return in a way the vcpu has to exit the inner loop and let it flush there
instead?

Thanks!
Leo

Thanks for the feedback.

If we flush on every guest exit (by moving the flush before handle_exit, then we can

indeed drop the flush from the fault handler and from vcpu_put.

However, given Marc's earlier concern about not imposing this overhead on all vCPUs,

I'd rather avoid flushing on every exit.

My current plan is to set a request bit in kvm_handle_hdbss_fault (via kvm_make_request),

and move the actual flush to the normal exit path, where it can execute in a safe context.

This also allows us to remove the flush from the fault handler entirely.

Does that approach sound reasonable to you?

+ case HDBSSPROD_EL2_FSC_ExternalAbort:
+ case HDBSSPROD_EL2_FSC_GPF:
+ return -EFAULT;
+ default:
+ /* Unknown fault. */
+ WARN_ONCE(1,
+ "Unexpected HDBSS fault type, FSC: 0x%llx (prod=0x%llx, vcpu=%d)\n",
+ fsc, prod, vcpu->vcpu_id);
+ return -EFAULT;
+ }
+}
+
/**
* kvm_handle_guest_abort - handles all 2nd stage aborts
* @vcpu: the VCPU pointer
@@ -2071,6 +2138,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)

is_iabt = kvm_vcpu_trap_is_iabt(vcpu);

+ if (esr_iss2_is_hdbssf(esr))
+ return kvm_handle_hdbss_fault(vcpu);
+
if (esr_fsc_is_translation_fault(esr)) {
/* Beyond sanitised PARange (which is the IPA limit) */
if (fault_ipa >= BIT_ULL(get_kvm_ipa_limit())) {
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 959532422d3a..c03a4b310b53 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -161,6 +161,9 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
free_page((unsigned long)vcpu->arch.ctxt.vncr_array);
kfree(vcpu->arch.vncr_tlb);
kfree(vcpu->arch.ccsidr);
+
+ if (vcpu->kvm->arch.enable_hdbss)
+ kvm_arm_vcpu_free_hdbss(vcpu);
}

static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
--
2.33.0

Thanks!

Tian