Re: [PATCH v2] drm/panfrost:report the full raw fault information instead

From: Chunyou Tang
Date: Mon Jun 28 2021 - 23:55:48 EST


Hi Robin,
thinks for you reply.
I had send more detal to Steve in another mail,thinks for your
messages.

于 Mon, 28 Jun 2021 15:17:51 +0100
Robin Murphy <robin.murphy@xxxxxxx> 写道:

> On 2021-06-28 11:48, Steven Price wrote:
> > On 25/06/2021 10:49, Chunyou Tang wrote:
> >> Hi Steve,
> >> Thinks for your reply.
> >> When I only set the pte |= ARM_LPAE_PTE_SH_NS;there have
> >> no "GPU Fault",When I set the pte |= ARM_LPAE_PTE_SH_IS(or
> >> ARM_LPAE_PTE_SH_OS);there have "GPU Fault".I don't know how the pte
> >> effect this issue?
> >> Can you give me some suggestions again?
> >>
> >> Thinks.
> >>
> >> Chunyou
> >
> > Hi Chunyou,
> >
> > You haven't given me much context so I'm not entirely sure which
> > PTE you are talking about (GPU or CPU), or indeed where you are
> > changing the PTE values.
> >
> > The PTEs control whether a page is shareable or not, the GPU
> > requires that accesses are consistent (i.e. either all accesses to
> > a page are shareable or all are non-shareable) and will race a
> > fault if it detects this isn't the case. Mali also has a quirk for
> > its version of 'LPAE' where inner shareable actually means only
> > within the GPU and outer shareable means outside the GPU (which I
> > think usually means Inner Shareable on the external bus).
>
> Furthermore, the way the io-pgtable code works for ARM_MALI_LPAE
> format means that *all* GPU mappings are unconditionally
> outer-shareable, so it's not clear how the GPU could observe a
> mismatch in the first place (other than major integration issues
> causing data corruption).
>
> Robin.
>
> >
> > Steve
> >
> >> 于 Thu, 24 Jun 2021 14:22:04 +0100
> >> Steven Price <steven.price@xxxxxxx> 写道:
> >>
> >>> On 22/06/2021 02:40, Chunyou Tang wrote:
> >>>> Hi Steve,
> >>>> I will send a new patch with suitable subject/commit
> >>>> message. But I send a V3 or a new patch?
> >>>
> >>> Send a V3 - it is a new version of this patch.
> >>>
> >>>> I met a bug about the GPU,I have no idea about how to fix
> >>>> it, If you can give me some suggestion,it is perfect.
> >>>>
> >>>> You can see such kernel log:
> >>>>
> >>>> Jun 20 10:20:13 icube kernel: [ 774.566760] mvp_gpu
> >>>> 0000:05:00.0: GPU Fault 0x00000088 (SHAREABILITY_FAULT) at
> >>>> 0x000000000310fd00 Jun 20 10:20:13 icube kernel: [ 774.566764]
> >>>> mvp_gpu 0000:05:00.0: There were multiple GPU faults - some have
> >>>> not been reported Jun 20 10:20:13 icube kernel: [ 774.667542]
> >>>> mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:13 icube
> >>>> kernel: [ 774.767900] mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck
> >>>> Jun 20 10:20:13 icube kernel: [ 774.868546] mvp_gpu
> >>>> 0000:05:00.0: AS_ACTIVE bit stuck Jun 20 10:20:13 icube kernel:
> >>>> [ 774.968910] mvp_gpu 0000:05:00.0: AS_ACTIVE bit stuck Jun 20
> >>>> 10:20:13 icube kernel: [ 775.069251] mvp_gpu 0000:05:00.0:
> >>>> AS_ACTIVE bit stuck Jun 20 10:20:22 icube kernel: [ 783.693971]
> >>>> mvp_gpu 0000:05:00.0: gpu sched timeout, js=1, config=0x7300,
> >>>> status=0x8, head=0x362c900, tail=0x362c100,
> >>>> sched_job=000000003252fb84
> >>>>
> >>>> In
> >>>> https://lore.kernel.org/dri-devel/20200510165538.19720-1-peron.clem@xxxxxxxxx/
> >>>> there had a same bug like mine,and I found you at the mail list,I
> >>>> don't know how it fixed?
> >>>
> >>> The GPU_SHAREABILITY_FAULT error means that a cache line has been
> >>> accessed both as shareable and non-shareable and therefore
> >>> coherency cannot be guaranteed. Although the "multiple GPU
> >>> faults" means that this may not be the underlying cause.
> >>>
> >>> The fact that your dmesg log has PCI style identifiers
> >>> ("0000:05:00.0") suggests this is an unusual platform - I've not
> >>> previously been aware of a Mali device behind PCI. Is this device
> >>> working with the kbase/DDK proprietary driver? It would be worth
> >>> looking at the kbase kernel code for the platform to see if there
> >>> is anything special done for the platform.
> >>>
> >>> From the dmesg logs all I can really tell is that the GPU seems
> >>> unhappy about the memory system.
> >>>
> >>> Steve
> >>>
> >>>> I need your help!
> >>>>
> >>>> thinks very much!
> >>>>
> >>>> Chunyou
> >>>>
> >>>> 于 Mon, 21 Jun 2021 11:45:20 +0100
> >>>> Steven Price <steven.price@xxxxxxx> 写道:
> >>>>
> >>>>> On 19/06/2021 04:18, Chunyou Tang wrote:
> >>>>>> Hi Steve,
> >>>>>> 1,Now I know how to write the subject
> >>>>>> 2,the low 8 bits is the exception type in spec.
> >>>>>>
> >>>>>> and you can see prnfrost_exception_name()
> >>>>>>
> >>>>>> switch (exception_code) {
> >>>>>> /* Non-Fault Status code */
> >>>>>> case 0x00: return "NOT_STARTED/IDLE/OK";
> >>>>>> case 0x01: return "DONE";
> >>>>>> case 0x02: return "INTERRUPTED";
> >>>>>> case 0x03: return "STOPPED";
> >>>>>> case 0x04: return "TERMINATED";
> >>>>>> case 0x08: return "ACTIVE";
> >>>>>> ........
> >>>>>> ........
> >>>>>> case 0xD8: return "ACCESS_FLAG";
> >>>>>> case 0xD9 ... 0xDF: return "ACCESS_FLAG";
> >>>>>> case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
> >>>>>> case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
> >>>>>> }
> >>>>>> return "UNKNOWN";
> >>>>>> }
> >>>>>>
> >>>>>> the exception_code in case is only 8 bits,so if fault_status
> >>>>>> in panfrost_gpu_irq_handler() don't & 0xFF,it can't get correct
> >>>>>> exception reason,it will be always UNKNOWN.
> >>>>>
> >>>>> Yes, I'm happy with the change - I just need a patch that I can
> >>>>> apply. At the moment this patch only changes the first '0x%08x'
> >>>>> output rather than the call to panfrost_exception_name() as
> >>>>> well. So we just need a patch which does:
> >>>>>
> >>>>> - fault_status & 0xFF, panfrost_exception_name(pfdev,
> >>>>> fault_status),
> >>>>> + fault_status, panfrost_exception_name(pfdev, fault_status &
> >>>>> 0xFF),
> >>>>>
> >>>>> along with a suitable subject/commit message describing the
> >>>>> change. If you can send me that I can apply it.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>> PS. Sorry for going round in circles here - I'm trying to help
> >>>>> you get setup so you'll be able to contribute patches easily in
> >>>>> future. An important part of that is ensuring you can send a
> >>>>> properly formatted patch to the list.
> >>>>>
> >>>>> PPS. I'm still not receiving your emails directly. I don't think
> >>>>> it's a problem at my end because I'm receiving other emails, but
> >>>>> if you can somehow fix the problem you're likely to receive a
> >>>>> faster response.
> >>>>>
> >>>>>> 于 Fri, 18 Jun 2021 13:43:24 +0100
> >>>>>> Steven Price <steven.price@xxxxxxx> 写道:
> >>>>>>
> >>>>>>> On 17/06/2021 07:20, ChunyouTang wrote:
> >>>>>>>> From: ChunyouTang <tangchunyou@xxxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> of the low 8 bits.
> >>>>>>>
> >>>>>>> Please don't split the subject like this. The first line of
> >>>>>>> the commit should be a (very short) summary of the patch.
> >>>>>>> Then a blank line and then a longer description of what the
> >>>>>>> purpose of the patch is and why it's needed.
> >>>>>>>
> >>>>>>> Also you previously had this as part of a series (the first
> >>>>>>> part adding the "& 0xFF" in the panfrost_exception_name()
> >>>>>>> call). I'm not sure we need two patches for the single line,
> >>>>>>> but as it stands this patch doesn't apply.
> >>>>>>>
> >>>>>>> Also I'm still not receiving any emails from you directly
> >>>>>>> (only via the list), so it's possible I might have missed
> >>>>>>> something you sent.
> >>>>>>>
> >>>>>>> Steve
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Signed-off-by: ChunyouTang <tangchunyou@xxxxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>> drivers/gpu/drm/panfrost/panfrost_gpu.c | 2 +-
> >>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c
> >>>>>>>> b/drivers/gpu/drm/panfrost/panfrost_gpu.c index
> >>>>>>>> 1fffb6a0b24f..d2d287bbf4e7 100644 ---
> >>>>>>>> a/drivers/gpu/drm/panfrost/panfrost_gpu.c +++
> >>>>>>>> b/drivers/gpu/drm/panfrost/panfrost_gpu.c @@ -33,7 +33,7 @@
> >>>>>>>> static irqreturn_t panfrost_gpu_irq_handler(int irq, void
> >>>>>>>> *data) address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
> >>>>>>>> dev_warn(pfdev->dev, "GPU Fault 0x%08x
> >>>>>>>> (%s) at 0x%016llx\n",
> >>>>>>>> - fault_status & 0xFF,
> >>>>>>>> panfrost_exception_name(pfdev, fault_status & 0xFF),
> >>>>>>>> + fault_status,
> >>>>>>>> panfrost_exception_name(pfdev, fault_status & 0xFF),
> >>>>>>>> address);
> >>>>>>>> if (state & GPU_IRQ_MULTIPLE_FAULT)
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
> >