RE: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran RAS

From: Joshi, Mukul
Date: Wed Sep 22 2021 - 15:44:03 EST


[AMD Official Use Only]



> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Wednesday, September 22, 2021 7:41 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> mingo@xxxxxxxxxx; mchehab@xxxxxxxxxx; Ghannam, Yazen
> <Yazen.Ghannam@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCHv2 2/2] drm/amdgpu: Register MCE notifier for Aldebaran
> RAS
>
> [CAUTION: External Email]
>
> On Sun, Sep 12, 2021 at 10:13:11PM -0400, Mukul Joshi wrote:
> > On Aldebaran, GPU driver will handle bad page retirement even though
> > UMC is host managed. As a result, register a bad page retirement
> > handler on the mce notifier chain to retire bad pages on Aldebaran.
> >
> > v1->v2:
> > - Use smca_get_bank_type() to determine MCA bank.
> > - Envelope the changes under #ifdef CONFIG_X86_MCE_AMD.
> > - Use MCE_PRIORITY_UC instead of MCE_PRIO_ACCEL as we are
> > only handling uncorrectable errors.
> > - Use macros to determine UMC instance and channel instance
> > where the uncorrectable error occured.
> > - Update the headline.
>
> Same note as for the previous patch.
>
Acked.

> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > + unsigned long val, void *data) {
> > + struct mce *m = (struct mce *)data;
> > + struct amdgpu_device *adev = NULL;
> > + uint32_t gpu_id = 0;
> > + uint32_t umc_inst = 0;
> > + uint32_t ch_inst, channel_index = 0;
> > + struct ras_err_data err_data = {0, 0, 0, NULL};
> > + struct eeprom_table_record err_rec;
> > + uint64_t retired_page;
> > +
> > + /*
> > + * If the error was generated in UMC_V2, which belongs to GPU UMCs,
> > + * and error occurred in DramECC (Extended error code = 0) then only
> > + * process the error, else bail out.
> > + */
> > + if (!m || !((smca_get_bank_type(m->bank) == SMCA_UMC_V2) &&
> > + (XEC(m->status, 0x1f) == 0x0)))
> > + return NOTIFY_DONE;
> > +
> > + /*
> > + * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > + */
> > + gpu_id = GET_MCA_IPID_GPUID(m->ipid) - GPU_ID_OFFSET;
> > +
> > + adev = find_adev(gpu_id);
> > + if (!adev) {
> > + dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > + __func__, gpu_id);
> > + return NOTIFY_DONE;
> > + }
> > +
> > + /*
> > + * If it is correctable error, return.
> > + */
> > + if (mce_is_correctable(m)) {
> > + return NOTIFY_OK;
> > + }
>
> This can run before you find_adev().
>
Acked.

> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > + /*
> > + * Register the x86 notifier only once
> > + * with MCE subsystem.
> > + */
> > + if (notifier_registered == false) {
> > + mce_register_decode_chain(&amdgpu_bad_page_nb);
> > + notifier_registered = true;
> > + }
>
> I have a patchset which will get rid of the need to do this silliness - only if I had
> some time to actually prepare it for submission... :-\
>
:-\

Thank you.

Regards,
Mukul
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C7ae9c87153f7
> 4572712908d97dbdcc7a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637679076423976867%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =M5UDrSea0lnEi5%2BhU4ck0zk1dZD9kX4DUoXt95J6dJ4%3D&amp;reserved=0