Re: [PATCH v1 11/11] mtd: new support oops logger based on pstore/blk

From: Miquel Raynal
Date: Fri Feb 07 2020 - 03:41:30 EST


Hi Liao,

liaoweixiong <liaoweixiong@xxxxxxxxxxxxxxxxx> wrote on Fri, 7 Feb 2020
12:13:08 +0800:

> hi Miquel Raynal,
>
> On 2020/2/6 PM 11:45, Miquel Raynal wrote:
> > Hi liao,
> >
> > liaoweixiong <liaoweixiong@xxxxxxxxxxxxxxxxx> wrote on Thu, 6 Feb 2020
> > 21:10:47 +0800:
> >
> >> hi Miquel Raynal,
> >>
> >> On 2020/1/23 AM 1:41, Miquel Raynal wrote:
> >>> Hello,
> >>>
> >>> >>>>>>>> +/*
> >>>>>>>> + * All zones will be read as pstore/blk will read zone one by one when do
> >>>>>>>> + * recover.
> >>>>>>>> + */
> >>>>>>>> +static ssize_t mtdpstore_read(char *buf, size_t size, loff_t off)
> >>>>>>>> +{
> >>>>>>>> + struct mtdpstore_context *cxt = &oops_cxt;
> >>>>>>>> + size_t retlen;
> >>>>>>>> + int ret;
> >>>>>>>> +
> >>>>>>>> + if (mtdpstore_block_isbad(cxt, off))
> >>>>>>>> + return -ENEXT;
> >>>>>>>> +
> >>>>>>>> + pr_debug("try to read off 0x%llx size %zu\n", off, size);
> >>>>>>>> + ret = mtd_read(cxt->mtd, off, size, &retlen, (u_char *)buf);
> >>>>>>>> + if ((ret < 0 && !mtd_is_bitflip(ret)) || size != retlen) {
> >>>>>>>
> >>>>>>> IIRC size != retlen does not mean it failed, but that you should
> >>>>>>> continue reading after retlen bytes, no?
> >>>>>>> >>
> >>>>>> Yes, you are right. I will fix it. Thanks.
> >>>>>> >>>>> Also, mtd_is_bitflip() does not mean that you are reading a false
> >>>>>>> buffer, but that the data has been corrected as it contained bitflips.
> >>>>>>> mtd_is_eccerr() however, would be meaningful.
> >>>>>>> >>
> >>>>>> Sure I know mtd_is_bitflip() does not mean failure, but I do not think
> >>>>>> mtd_is_eccerr() should be here since the codes are ret < 0 and NOT
> >>>>>> mtd_is_bitflip().
> >>>>>
> >>>>> Yes, just drop this check, only keep ret < 0.
> >>>>> >>
> >>>> If I don't get it wrong, it should not be dropped here. Like your words,
> >>>> "mtd_is_bitflip() does not mean that you are reading a false buffer,
> >>>> but that the data has been corrected as it contained bitflips.", the
> >>>> data I get are valid even if mtd_is_bitflip() return true. It's correct
> >>>> data and it's no need to go to handle error. To me, the codes
> >>>> should be:
> >>>> if (ret < 0 && !mit_is_bitflip())
> >>>> [error handling]
> >>>
> >>> Please check the implementation of mtd_is_bitflip(). You'll probably
> >>> figure out what I am saying.
> >>>
> >>> https://elixir.bootlin.com/linux/latest/source/include/linux/mtd/mtd.h#L585
> >>> >>
> >> How about the codes as follows:
> >>
> >> for (done = 0, retlen = 0; done < size; done += retlen) {
> >> ret = mtd_read(..., &retlen, ...);
> >> if (!ret)
> >> continue;
> >> /*
> >> * do nothing if bitflip and ecc error occurs because whether
> >> * it's bitflip or ECC error, just a small number of bits flip
> >> * and the impact on log data is so small. The mtdpstore just
> >> * hands over what it gets and user can judge whether the data
> >> * is valid or not.
> >> */
> >> if (mtd_is_bitflip(ret)) {
> >> dev_warn("bitflip at....");
> >> continue;
>
> > I don't understand why do you check for bitflips. Bitflips have been
> > corrected at this stage, you just get the information that there
> > has been bitflips, but the data integrity is fine.
> >
>
> Both of bitflip and eccerror are not real wrong in this
> case. So we must check them.
>
> > I am not against ignoring ECC errors in this case though. I would
> > propose:
> >
> > for (...) {
> > if (ret < 0) {
> > complain;
> > return;
> > }
> >
>
> -117 (-EUCLEAN) means bitflip but be corrected.
> -74 (-EBADMSG) means ecc error that uncorrectable
> All of them are negative number that smaller than 0. If it just keeps
> "ret < 0", it can never make a difference between bitflip/eccerror
> and others.

I forgot about these, your're right, so what about:

static bool mtdpstore_is_io_error(int ret)
{
return ret < 0 && !mtd_is_eccerr(ret)> && !mtd_is_bitflip(ret);
}

for (...) {
if (mtdpstore_is_io_error(ret))
return ret;

/*
* Comment explaining why we still return valid data
* despite ECC errors.
*/
if (mtd_is_eccerr(ret))
just-complain();
}

This snippet makes a difference between any "controller/bus
timeout/error : data could not be retrieved" and "ECC error, maybe we
can still read it and try to understand (risky, must be warned)".

>
> > if (mtd_is_eccerr())
> > complain;
> > }
> >
> >> } else if (mtd_is_eccerr(ret)) {
> >> dev_warn("eccerr at....");
> >> retlen = retlen == 0 ? size : retlen;
> >> continue;
> >> } else {
> >> dev_err("read failure at...");
> >> /* this zone is broken, try next one */
> >> return -ENEXT;
> >> }
> >> }
> >>
> >
> >
> > Thanks,
> > MiquÃl
> >

Thanks,
MiquÃl