Ping for review. And I still can't reproduce it, even if I reduced the
timeout to 2us and tried different workloads. Any suggestions for how
to reproduce it?
Thanks,
Yabin
On Tue, Jan 10, 2023 at 1:06 PM Yabin Cui <yabinc@xxxxxxxxxx> wrote:
Do you have a reproducer for this or some more info?
For example is it a regression or has it always been there? And on which
platform.
It happens on Pixel 6 and 7. We collect ETM data periodically from some
internal dogfood devices. The problem has happened several times on
dogfood devices. But I am still trying to reproduce it locally.
We use the scatter-gather mode of ETR, and allocate a 4M buffer. In userspace,
we use simpleperf in Android to collect system wide ETM data. What is special
is, simpleperf disables and reenables perf events every 100ms to flush ETM
data to perf aux buffer.
Pixel 6 and 7 have hardware monitoring AXI traffic. The hardware finds ETR is
trying to read from or write to a low invalid address (like 0x2E0000). The
problem always happens right after the "tmc_etr: timeout while waiting for TMC
to be Ready" message. And in almost all cases, I can find a "timeout while
waiting for completion of Manual Flush" message from the previous session.
One log history is below:
[11484.610008][ C0] coresight tmc_etr0: timeout while waiting for
completion of Manual Flush
[11484.610177][ C0] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
[11484.615367][ C0] coresight tmc_etr0: timeout while waiting for
completion of Manual Flush
[11484.615534][ C0] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
[12089.486044][ C0] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
AXI error report reading from invalid address
Another log history is below:
[76709.382650][ C5] coresight tmc_etf1: timeout while waiting for
TMC to be Ready
[76709.382852][ C7] coresight tmc_etr0: timeout while waiting for
completion of Manual Flush
[76709.382995][ C7] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
[76709.384510][ C7] coresight tmc_etr0: timeout while waiting for
completion of Manual Flush
[76709.384649][ C7] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
[76709.384838][ C0] coresight tmc_etr0: timeout while waiting for
TMC to be Ready
AIX error report writing to invalid address
It seems if the previous manual flush doesn't finish gracefully, ETR may not be
ready for the next enable (even after 10min as in the first log). And if we
continue to enable ETR, an invalid AXI IO may happen.
Thanks,
Yabin
On Tue, Jan 10, 2023 at 10:04 AM Suzuki K Poulose
<suzuki.poulose@xxxxxxx> wrote:
On 10/01/2023 17:48, Mike Leach wrote:
Hi,
On Tue, 10 Jan 2023 at 09:30, James Clark <james.clark@xxxxxxx> wrote:
On 09/01/2023 23:43, Yabin Cui wrote:
Otherwise, it may cause error in AXI bus and result in a kernel panic.
Hi Yabin,
Thanks for the fix. Do you have a reproducer for this or some more info?
For example is it a regression or has it always been there? And on which
platform.
Thanks
James
Signed-off-by: Yabin Cui <yabinc@xxxxxxxxxx>
---
.../hwtracing/coresight/coresight-tmc-core.c | 4 +++-
.../hwtracing/coresight/coresight-tmc-etr.c | 18 +++++++++++++++---
drivers/hwtracing/coresight/coresight-tmc.h | 2 +-
3 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-core.c b/drivers/hwtracing/coresight/coresight-tmc-core.c
index 07abf28ad725..c106d142e632 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-core.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-core.c
@@ -31,7 +31,7 @@ DEFINE_CORESIGHT_DEVLIST(etb_devs, "tmc_etb");
DEFINE_CORESIGHT_DEVLIST(etf_devs, "tmc_etf");
DEFINE_CORESIGHT_DEVLIST(etr_devs, "tmc_etr");
-void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
+int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
{
struct coresight_device *csdev = drvdata->csdev;
struct csdev_access *csa = &csdev->access;
@@ -40,7 +40,9 @@ void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata)
if (coresight_timeout(csa, TMC_STS, TMC_STS_TMCREADY_BIT, 1)) {
dev_err(&csdev->dev,
"timeout while waiting for TMC to be Ready\n");
+ return -EBUSY;
}
+ return 0;
}
void tmc_flush_and_stop(struct tmc_drvdata *drvdata)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 867ad8bb9b0c..2da99dd41ed6 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -983,15 +983,21 @@ static void tmc_sync_etr_buf(struct tmc_drvdata *drvdata)
etr_buf->ops->sync(etr_buf, rrp, rwp);
}
-static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
+static int __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
{
u32 axictl, sts;
struct etr_buf *etr_buf = drvdata->etr_buf;
+ int rc = 0;
CS_UNLOCK(drvdata->base);
/* Wait for TMCSReady bit to be set */
- tmc_wait_for_tmcready(drvdata);
+ rc = tmc_wait_for_tmcready(drvdata);
+ if (rc) {
+ dev_err(&drvdata->csdev->dev, "not ready ETR isn't enabled\n");
+ CS_LOCK(drvdata->base);
+ return rc;
+ }
writel_relaxed(etr_buf->size / 4, drvdata->base + TMC_RSZ);
writel_relaxed(TMC_MODE_CIRCULAR_BUFFER, drvdata->base + TMC_MODE);
@@ -1032,6 +1038,7 @@ static void __tmc_etr_enable_hw(struct tmc_drvdata *drvdata)
tmc_enable_hw(drvdata);
CS_LOCK(drvdata->base);
+ return rc;
}
static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
@@ -1060,7 +1067,12 @@ static int tmc_etr_enable_hw(struct tmc_drvdata *drvdata,
rc = coresight_claim_device(drvdata->csdev);
if (!rc) {
drvdata->etr_buf = etr_buf;
- __tmc_etr_enable_hw(drvdata);
+ rc = __tmc_etr_enable_hw(drvdata);
+ if (rc) {
+ drvdata->etr_buf = NULL;
+ coresight_disclaim_device(drvdata->csdev);
+ tmc_etr_disable_catu(drvdata);
+ }
}
return rc;
diff --git a/drivers/hwtracing/coresight/coresight-tmc.h b/drivers/hwtracing/coresight/coresight-tmc.h
index 66959557cf39..01c0382a29c0 100644
--- a/drivers/hwtracing/coresight/coresight-tmc.h
+++ b/drivers/hwtracing/coresight/coresight-tmc.h
@@ -255,7 +255,7 @@ struct tmc_sg_table {
};
/* Generic functions */
-void tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
+int tmc_wait_for_tmcready(struct tmc_drvdata *drvdata);
void tmc_flush_and_stop(struct tmc_drvdata *drvdata);
void tmc_enable_hw(struct tmc_drvdata *drvdata);
void tmc_disable_hw(struct tmc_drvdata *drvdata);
There is no point in waiting for a timeout, and then carrying on even
when it is exceeded. As such this patch seems reasonable.
We should also apply the same principle to the ETF and ETB devices
which use the same tmc_wait_for_tmcready() function.
+1
I am fine with pushing this change, as it is doing the right thing.
However - the concern is that this appears to be happening on starting
the ETR - there should be no outstanding AXI operations that cause the
system to not be ready - as we will either be using this the first
time after reset, or we should have successfully stopped and flushed
the ETR from the previous operation. This warrants further
investigation - should we be extending the timeout - which is already
at a rather generous 100uS, and do we also need to check the MemErr
bit in the status register?
It would be good to dump the value of TMC_STATUS to see what is going
on.
As James says, we need details of when and how the problem occurs -
as far as I know it has not been seen on any systems we currently use
(though could have been missed given the current code)
+1
Kind regards
Suzuki
Regards
Mike
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK