Re: [PATCH v3 2/2] EDAC: add EDAC driver for DMC520

From: James Morse
Date: Fri Jun 14 2019 - 11:43:53 EST


Hi Lei,

On 13/06/2019 19:31, Lei Wang (BSP) wrote:
> Please see inline(tagged with [Lei]) below. Thanks!

Please don't do this. Top posting is discouraged[0]. Creating a reply treasure hunt is
even worse!

You probably need to change your mail client as your mail is also arriving base64 encoded.
To the maintainer's scripts its going to look like this:
https://lore.kernel.org/lkml/BYAPR21MB131946E0B469E74D6054C33390EF0@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/raw


> -----Original Message-----
> From: James Morse <james.morse@xxxxxxx>
> On 16/05/2019 03:55, Lei Wang wrote:
>> New driver supports error detection and correction on the devices with ARM DMC-520 memory controller.
>
>> diff --git a/MAINTAINERS b/MAINTAINERS index 7d1246b..23894ac 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -5573,6 +5573,12 @@ F: Documentation/driver-api/edac.rst
>> F: drivers/edac/
>> F: include/linux/edac.h
>>
>> +EDAC-DMC520
>> +M: Rui Zhao <ruizhao@xxxxxxxxxxxxx>
>> +L: linux-edac@xxxxxxxxxxxxxxx
>> +S: Supported
>> +F: drivers/edac/dmc520_edac.c
>
> Hmm, you're listing someone else as maintainer of this driver.
> I think we'd need to see an Ack from Rui Zhao...
>
> This patch was previously posted by Rui Zhao, this version has your changes and you as author. (But how you arrange the attribution is up to the two of you...)
> [Lei] And Rui and I sync-ed up on this code and this patch was after addressing his feedbacks and his Ack.

And Rui commented that you should be listed as maintainer:
https://lore.kernel.org/lkml/CY4PR21MB0279BB0E40B86CEA485CF19AB3010@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u

Please don't add someone else!


>> diff --git a/drivers/edac/dmc520_edac.c b/drivers/edac/dmc520_edac.c
>> new file mode 100644 index 0000000..c81bfcc
>> --- /dev/null
>> +++ b/drivers/edac/dmc520_edac.c

>> +static void dmc520_handle_dram_ecc_errors(struct mem_ctl_info *mci,
>> + bool is_ce)
>> +{
>> + struct ecc_error_info info;
>> + struct dmc520_edac *edac;
>> + u32 cnt;
>> + char message[EDAC_MSG_BUF_SIZE];
>> +
>> + edac = mci->pvt_info;
>> + dmc520_get_dram_ecc_error_info(edac, is_ce, &info);
>> +
>> + cnt = dmc520_get_dram_ecc_error_count(edac, is_ce);
>> +
>> + if (cnt > 0) {
>> + snprintf(message, ARRAY_SIZE(message),
>> + "rank:%d bank:%d row:%d col:%d",
>> + info.rank, info.bank,
>> + info.row, info.col);
>> +
>> + edac_mc_handle_error((is_ce ? HW_EVENT_ERR_CORRECTED :
>> + HW_EVENT_ERR_UNCORRECTED),
>> + mci, cnt, 0, 0, 0, info.rank, -1, -1,
>> + message, "");
>
> Because you have multiple interrupts, you can be calling edac_mc_handle_error() in
> parallel on different CPUs, for the same mci.
>
> edac_mc_handle_error() packs all these arguments into mci->error_desc, so two CPUs will
> stomp over each other's values.
>
> Please add a spinlock in 'struct dmc520_edac' to prevent this.
>
> [Lei] This round of patch moved away from using mci->error_desc, and the message is on stack now.

It's not just the message buffer this is a problem for. You call edac_mc_handle_error(),
you can call it with the same mci on two different CPUs, at the same time.

> edac_mc_handle_error() packs all these arguments into mci->error_desc, so two CPUs will
> stomp over each other's values.

>From drivers/edac/edac_mc.c::edac_mc_handle_error():
| struct edac_raw_error_desc *e = &mci->error_desc;
[...]
| e->error_count = error_count;
| e->top_layer = top_layer;
| e->mid_layer = mid_layer;
| e->low_layer = low_layer;
| e->page_frame_number = page_frame_number;
| e->offset_in_page = offset_in_page;
| e->syndrome = syndrome;
| e->msg = msg;
| e->other_detail = other_detail;

If this happens one two CPUs at the same time, e->msg could point to the other CPU's stack.


Thanks,

James

[0] The best summary I found after a quick search is:
https://kernelnewbies.org/PatchCulture 's 'Email etiquette.'