Re: [PATCH 05/14] irqchip/gic-v3-its: Prepare shadow structures for KVM host deprivilege

From: Fuad Tabba

Date: Tue Mar 24 2026 - 10:41:05 EST


Hi Seb,

<snip>

> > You should decouple the synchronous pointer swapping (which must be
> > locked) from the hypervisor notification (which can be done outside
> > the lock). Instead of executing the callback inside the critical
> > section, its_end_deprivilege should:
> > - Lock everything.
> > - Perform the pointer swaps in the host driver structures.
> > - Save the hyp_shadow pointers to a temporary array.
> > - Unlock everything.
>
> I am afraid you can't do that because you can have dropped commands &
> timeouts between these two steps. The driver might put commands in the
> swapped queue and they will timeout.

You're right, that won't work. Simply releasing the lock between the
pointer swap and the hypercall (HVC) isn't safe for two reasons:
- Timeouts (your point): If we swap the pointers to the new shadow
queue and drop the lock, another CPU might immediately try to queue a
command. It will trap to EL2 (cwriter_write), but because the HVC
hasn't finished initializing the hypervisor's internal state
(region->priv), EL2 will
drop the MMIO write. The host will then spin in
its_wait_for_range_completion waiting for the hardware to process a
command it never saw, resulting in a timeout.
- Stage-2 Aborts: Conversely, if we try to run the HVC before swapping
the pointers, the HVC will actively unmap (donate) the original
queue's memory to EL2. If the host is not locked and tries to write to
the old queue during this window, it will trigger a Stage-2 Data Abort
and boom!

We must pause host command submission to the affected ITS while the
HVC is running. That said, I still don't think it's acceptable to do
what you propose in this patch. This holds the raw_spin_lock_irqsave
for way too long, keeping local interrupts disabled while performing
slow hypercalls for the entire system.

I had a bit of a think, and I have two ideas, one is an improvement,
but not a full solution. It's simple though, and it might be enough
for now and I am more confident that it works. The second one is a
better solution I think, assuming it works and I haven't missed
anything :)

Option A: Granular Locking (Per-ITS Lock & HVC)

In this patch its_start_depriviledge effectively locks *all* ITS nodes
in the system simultaneously. Then, its_end_depriviledge calls the HVC
for every single ITS sequentially, while the CPU is still holding all
the locks with interrupts globally disabled. This makes this critical
section very long. Making it shorter is pretty straightforward i
think...

Instead of trying to decouple the pointer swap from the HVC, we can
start by reducing the scope of the lock. We remove the global locking
in its_start_depriviledge. Inside its_end_depriviledge, we process one
ITS at a time:
1. Disable local interrupts and lock one specific ITS
(raw_spin_lock_irqsave(&its->lock)).
2. Perform the pointer swap AND the HVC for this specific ITS.
3. Unlock and re-enable local interrupts (raw_spin_unlock_irqrestore).
4. Move to the next ITS in the list.

This should be simple to implement. It guarantees zero dropped
commands and zero aborts because the swap and HVC remain atomic.
However, the CPU executing the deprivilege still holds a raw spinlock
during a hypercall, but the duration is reduced to a single ITS node
rather than the entire thing.

Option B: Software Quiescence (Driver-Level Pausing)
To make it so that the HVC runs outside of an atomic context (with
local interrupts enabled), maybe we can teach the ITS driver to
voluntarily pause command submission without holding the raw spinlock.

I was thinking that we introduce a new state flag, e.g.,
is_vmm_migrating, to struct its_node. Every ITS command goes through
the BUILD_SINGLE_CMD_FUNC macro. We modify this macro so that if a CPU
tries to send a command and sees this flag is true, it temporarily
drops the lock, re-enables its interrupts, spins (cpu_relax()), and
retries.

The deprivilege sequence per ITS then becomes:
1. Lock: Acquire its->lock.
2. Swap & Pause: Swap the pointers to the shadow queue and set
its->is_vmm_migrating = true.
3. Unlock: Drop its->lock and re-enable interrupts. (Any other CPU
trying to send a command to this ITS will now safely spin and wait).
4. The HVC: Execute the slow hypercall safely outside of atomic context.
5. Resume: Re-acquire its->lock, set its->is_vmm_migrating = false,
and drop the lock. (This wakes up any spinning CPUs, and they
immediately send their commands to the newly registered shadow queue).

The HVC runs safely with local interrupts enabled, guaranteeing that
no commands are dropped or sent to unmapped memory. If a hardware
interrupt fires on another CPU that requires sending an ITS command
exactly while the HVC is running, that CPU will be forced to spin.
However, this is no worse than aquiring locks, where that CPU would
have been spinning waiting for the raw spinlock anyway.

What do you think?

If you like, I could hack something and we could discuss it some more.

Cheers,
/fuad


>
> > - Loop through the temporary array and call the KVM cb to notify EL2.
> >
> > You should probably split this patch into two. The first patch would
> > implement the freeze/unfreeze locking mechanism, and the second would
> > swap the driver's internal memory pointers to the shadow structures,
> > and invoke the KVM callback to lock down the real hardware.
> >
> > Cheers,
> > /fuad
> >
>
> Thanks,
> Sebastian
>
> > > + if (ret) {
> > > + its_free_shadow_tables(hyp_shadow);
> > > + return ret;
> > > + }
> > > +
> > > + /* Switch the driver command queue to use the shadow and save the original */
> > > + its->cmd_write = (its->cmd_write - its->cmd_base) +
> > > + (struct its_cmd_block *)shadow.cmd_shadow;
> > > + its->cmd_base = shadow.cmd_shadow;
> > > +
> > > + /* Shadow the first level of the indirect tables */
> > > + for (i = 0; i < GITS_BASER_NR_REGS; i++) {
> > > + baser = shadow.tables[i].val;
> > > +
> > > + if (!shadow.tables[i].shadow)
> > > + continue;
> > > +
> > > + baser_phys = virt_to_phys(shadow.tables[i].shadow);
> > > + if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && (baser_phys >> 48))
> > > + baser_phys = GITS_BASER_PHYS_52_to_48(baser_phys);
> > > +
> > > + its->tables[i].val &= ~GENMASK(47, 12);
> > > + its->tables[i].val |= baser_phys;
> > > + its->tables[i].base = shadow.tables[i].shadow;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +int its_end_depriviledge(int ret_pkvm_finalize, unsigned long *flags, its_init_emulate cb)
> > > +{
> > > + struct its_node *its;
> > > + int i = 0, ret = 0;
> > > +
> > > + if (!flags || !cb)
> > > + return -EINVAL;
> > > +
> > > + list_for_each_entry(its, &its_nodes, entry) {
> > > + if (!ret_pkvm_finalize && !ret)
> > > + ret = its_switch_to_shadow_locked(its, cb);
> > > +
> > > + raw_spin_unlock_irqrestore(&its->lock, flags[i++]);
> > > + }
> > > +
> > > + kfree(flags);
> > > + raw_spin_unlock(&its_lock);
> > > +
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(its_end_depriviledge);
> > > +
> > > static int __init its_probe_one(struct its_node *its)
> > > {
> > > u64 baser, tmp;
> > > diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> > > index 0225121f3013..40457a4375d4 100644
> > > --- a/include/linux/irqchip/arm-gic-v3.h
> > > +++ b/include/linux/irqchip/arm-gic-v3.h
> > > @@ -657,6 +657,30 @@ static inline bool gic_enable_sre(void)
> > > return !!(val & ICC_SRE_EL1_SRE);
> > > }
> > >
> > > +/*
> > > + * The ITS_BASER structure - contains memory information, cached
> > > + * value of BASER register configuration and ITS page size.
> > > + */
> > > +struct its_baser {
> > > + void *base;
> > > + void *shadow;
> > > + u64 val;
> > > + u32 order;
> > > + u32 psz;
> > > +};
> > > +
> > > +struct its_shadow_tables {
> > > + struct its_baser tables[GITS_BASER_NR_REGS];
> > > + void *cmd_shadow;
> > > + void *cmd_original;
> > > + size_t cmdq_len;
> > > +};
> > > +
> > > +typedef int (*its_init_emulate)(phys_addr_t its_phys_base, struct its_shadow_tables *shadow);
> > > +
> > > +void *its_start_depriviledge(void);
> > > +int its_end_depriviledge(int ret, unsigned long *flags, its_init_emulate cb);
> > > +
> > > #endif
> > >
> > > #endif
> > > --
> > > 2.53.0.473.g4a7958ca14-goog
> > >