Re: [PATCH 2/3] cxl/mbox: Add GET_POISON_LIST mailbox command support

From: Alison Schofield
Date: Wed Jun 15 2022 - 01:10:00 EST


On Tue, Jun 14, 2022 at 08:22:41PM -0700, Ira Weiny wrote:

Thanks for the review Ira...

> On Tue, Jun 14, 2022 at 05:10:27PM -0700, Alison Schofield wrote:
> > From: Alison Schofield <alison.schofield@xxxxxxxxx>
> >

snip

> >
> > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> > index 60d10ee1e7fc..29cf0459b44a 100644
> > --- a/drivers/cxl/cxlmem.h
> > +++ b/drivers/cxl/cxlmem.h
> > @@ -174,6 +174,7 @@ struct cxl_endpoint_dvsec_info {
> > * (CXL 2.0 8.2.8.4.3 Mailbox Capabilities Register)
> > * @lsa_size: Size of Label Storage Area
> > * (CXL 2.0 8.2.9.5.1.1 Identify Memory Device)
> > + * @poison_max_mer: maximum Media Error Records tracked in Poison List
>
> Does not match the member name below.
Got it! I intended to drop the _mer.

>
> Ira
>
> > * @mbox_mutex: Mutex to synchronize mailbox access.
> > * @firmware_version: Firmware version for the memory device.
> > * @enabled_cmds: Hardware commands found enabled in CEL.
> > @@ -204,6 +205,7 @@ struct cxl_dev_state {
> >
> > size_t payload_size;
> > size_t lsa_size;
> > + u32 poison_max;
> > struct mutex mbox_mutex; /* Protects device mailbox and firmware */
> > char firmware_version[0x10];
> > DECLARE_BITMAP(enabled_cmds, CXL_MEM_COMMAND_ID_MAX);

snip

> > +int cxl_mem_get_poison_list(struct device *dev)
> > +{
> > + struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> > + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> > + struct cxl_mbox_poison_payload_out *po;
> > + struct cxl_mbox_poison_payload_in pi;
> > + int nr_records = 0;
> > + int rc, i;
> > +
> > + if (range_len(&cxlds->pmem_range)) {
> > + pi.offset = cpu_to_le64(cxlds->pmem_range.start);
> > + pi.length = cpu_to_le64(range_len(&cxlds->pmem_range));
> > + } else {
> > + return -ENXIO;
> > + }
> > +
> > + po = kvmalloc(cxlds->payload_size, GFP_KERNEL);
> > + if (!po)
> > + return -ENOMEM;
> > +
> > + do {
> > + rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_POISON, &pi,
> > + sizeof(pi), po, cxlds->payload_size);
> > + if (rc)
> > + goto out;
> > +
> > + if (po->flags & CXL_POISON_FLAG_OVERFLOW) {
> > + time64_t o_time = le64_to_cpu(po->overflow_timestamp);
> > +
> > + dev_err(dev, "Poison list overflow at %ptTs UTC\n",
> > + &o_time);
> > + rc = -ENXIO;
> > + goto out;
>
> I guess the idea is that this return will trigger something else will clear the list,
> rebuild the list, and perform a scan media request?
>
Per CXL Spec 8.2.9.5.4.1: The poison list may be incomplete when the list
has overflowed. User can perform a Scan Media to try to clear and rebuild
the list, with no guarantee that the overflow will not recur.

So yes to what you are saying. This return value should indicate to
user space that a Scan Media should be issued. Issuing the Scan Media
to the device does lead the device to rebuild it's list, as you say.
Also, when we get the Scan Media results, the device is able to report
partial results and tell the host to collect the error records, and
then restart the scan, get results again, and on and on until the scan
is complete.

Perhaps a clarification - there is not a logical pairing of Scan Media
followed by Get Poison List. Scan Media followed by Get Scan Media
Results is the logical pairing. Get Poison List is getting a snapshot
of the poison list at a point in time. The device adds DPAs to the list
when the device detects poison, some devices run their own backround
scans and add to the poison list, and then there are the user initiated
actions (Scan Media and Poison Inject) that can affect the list.

> I'm just wondering if this loop should continue to clear the list and then let
> something else do the scan media request?

It's not like the _MORE status where the device is telling the host to
come back and gather more. I think the action of failing, and letting
user initiated a Scan Media is correct course here.

So, this response got kind of long winded. As you can see, especially
if one looks in the spec as I know you are, there are additional
commands that need to be implemented to complete the ARS feature set.
And, of course, we'll offer user space tooling (NDCTL and libcxl).

>
> Other than that question and the above typo. Looks good!
>
> Ira
>
> > + }
> > +
> > + if (po->flags & CXL_POISON_FLAG_SCANNING) {
> > + dev_err(dev, "Scan Media in Progress\n");
> > + rc = -EBUSY;
> > + goto out;
> > + }
> > +
> > + for (i = 0; i < le16_to_cpu(po->count); i++) {
> > + u64 addr = le64_to_cpu(po->record[i].address);
> > + u32 len = le32_to_cpu(po->record[i].length);
> > + int source = FIELD_GET(CXL_POISON_SOURCE_MASK, addr);
> > +
> > + if (!CXL_POISON_SOURCE_VALID(source)) {
> > + dev_dbg(dev, "Invalid poison source %d",
> > + source);
> > + source = CXL_POISON_SOURCE_INVALID;
> > + }
> > +
> > + trace_cxl_poison_list(dev, source, addr, len);
> > + }
> > +
> > + /* Protect against an uncleared _FLAG_MORE */
> > + nr_records = nr_records + le16_to_cpu(po->count);
> > + if (nr_records >= cxlds->poison_max)
> > + goto out;
> > +
> > + } while (po->flags & CXL_POISON_FLAG_MORE);
> > +
> > +out:
> > + kvfree(po);
> > + return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_mem_get_poison_list, CXL);
> > +
> > struct cxl_dev_state *cxl_dev_state_create(struct device *dev)
> > {
> > struct cxl_dev_state *cxlds;
> > --
> > 2.31.1
> >