Re: [RFC v2] irqchip/gic-its: fix command queue pointer comparison bug
From: Marc Zyngier
Date: Tue Jun 04 2019 - 06:32:35 EST
Hi Heyi,
On 13/05/2019 12:42, Heyi Guo wrote:
> When we run several VMs with PCI passthrough and GICv4 enabled, not
> pinning vCPUs, we will occasionally see below warnings in dmesg:
>
> ITS queue timeout (65440 65504 480)
> ITS cmd its_build_vmovp_cmd failed
>
> The reason for the above issue is that in BUILD_SINGLE_CMD_FUNC:
> 1. Post the write command.
> 2. Release the lock.
> 3. Start to read GITS_CREADR to get the reader pointer.
> 4. Compare the reader pointer to the target pointer.
> 5. If reader pointer does not reach the target, sleep 1us and continue
> to try.
>
> If we have several processors running the above concurrently, other
> CPUs will post write commands while the 1st CPU is waiting the
> completion. So we may have below issue:
>
> phase 1:
> ---rd_idx-----from_idx-----to_idx--0---------
>
> wait 1us:
>
> phase 2:
> --------------from_idx-----to_idx--0-rd_idx--
>
> That is the rd_idx may fly ahead of to_idx, and if in case to_idx is
> near the wrap point, rd_idx will wrap around. So the below condition
> will not be met even after 1s:
>
> if (from_idx < to_idx && rd_idx >= to_idx)
>
> There is another theoretical issue. For a slow and busy ITS, the
> initial rd_idx may fall behind from_idx a lot, just as below:
>
> ---rd_idx---0--from_idx-----to_idx-----------
>
> This will cause the wait function exit too early.
>
> Actually, it does not make much sense to use from_idx to judge if
> to_idx is wrapped, but we need a initial rd_idx when lock is still
> acquired, and it can be used to judge whether to_idx is wrapped and
> the current rd_idx is wrapped.
That's an interesting observation. Indeed, from_idx is pretty irrelevant
here, and all we want to observe is the read pointer reaching the end of
the command set.
>
> We switch to a method of calculating the delta of two adjacent reads
> and accumulating it to get the sum, so that we can get the real rd_idx
> from the wrapped value even when the queue is almost full.
>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Jason Cooper <jason@xxxxxxxxxxxxxx>
> Cc: Marc Zyngier <marc.zyngier@xxxxxxx>
>
> Signed-off-by: Heyi Guo <guoheyi@xxxxxxxxxx>
> ---
> drivers/irqchip/irq-gic-v3-its.c | 30 ++++++++++++++++++++----------
> 1 file changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 7577755..f05acd4 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -745,32 +745,40 @@ static void its_flush_cmd(struct its_node *its, struct its_cmd_block *cmd)
> }
>
> static int its_wait_for_range_completion(struct its_node *its,
> - struct its_cmd_block *from,
> + u64 origin_rd_idx,
> struct its_cmd_block *to)
> {
> - u64 rd_idx, from_idx, to_idx;
> + u64 rd_idx, prev_idx, to_idx, sum;
> + s64 delta;
> u32 count = 1000000; /* 1s! */
>
> - from_idx = its_cmd_ptr_to_offset(its, from);
> to_idx = its_cmd_ptr_to_offset(its, to);
> + if (to_idx < origin_rd_idx)
> + to_idx += ITS_CMD_QUEUE_SZ;
> +
> + prev_idx = origin_rd_idx;
I guess you could just rename origin_rd_idx to prev_idx and drop the
extra declaration (the pr_err doesn't matter much).
> + sum = origin_rd_idx;
>
> while (1) {
> rd_idx = readl_relaxed(its->base + GITS_CREADR);
>
> - /* Direct case */
> - if (from_idx < to_idx && rd_idx >= to_idx)
> - break;
> + /* Wrap around for CREADR */
> + if (rd_idx >= prev_idx)
> + delta = rd_idx - prev_idx;
> + else
> + delta = rd_idx + ITS_CMD_QUEUE_SZ - prev_idx;
>
> - /* Wrapped case */
> - if (from_idx >= to_idx && rd_idx >= to_idx && rd_idx < from_idx)
> + sum += delta;
So "sum" isn't quite saying what it represent. My understanding is that
it is the linearized version of the read pointer, right? Just like
you've linearized to_idx at the beginning of the function.
> + if (sum >= to_idx)
> break;
>
> count--;
> if (!count) {
> pr_err_ratelimited("ITS queue timeout (%llu %llu %llu)\n",
> - from_idx, to_idx, rd_idx);
> + origin_rd_idx, to_idx, sum);
> return -1;
> }
> + prev_idx = rd_idx;
> cpu_relax();
> udelay(1);
> }
> @@ -787,6 +795,7 @@ void name(struct its_node *its, \
> struct its_cmd_block *cmd, *sync_cmd, *next_cmd; \
> synctype *sync_obj; \
> unsigned long flags; \
> + u64 rd_idx; \
> \
> raw_spin_lock_irqsave(&its->lock, flags); \
> \
> @@ -808,10 +817,11 @@ void name(struct its_node *its, \
> } \
> \
> post: \
> + rd_idx = readl_relaxed(its->base + GITS_CREADR); \
> next_cmd = its_post_commands(its); \
> raw_spin_unlock_irqrestore(&its->lock, flags); \
> \
> - if (its_wait_for_range_completion(its, cmd, next_cmd)) \
> + if (its_wait_for_range_completion(its, rd_idx, next_cmd)) \
> pr_err_ratelimited("ITS cmd %ps failed\n", builder); \
> }
>
>
If you agree with my comments above, I'm happy to take this patch and
tidy it up myself.
Now, I think there is still an annoying bug that can creep up if the
queue wraps *twice* while you're sleeping (which could perfectly happen
in a guest running on a busy host). Which means we need to account for
wrapping generations...
Thanks,
M.
--
Jazz is not dead. It just smells funny...