Re: [PATCH 2/3] cxl/mbox: Add GET_POISON_LIST mailbox command support

From: Jonathan Cameron
Date: Fri Jun 17 2022 - 09:01:37 EST


On Tue, 14 Jun 2022 17:10:27 -0700
alison.schofield@xxxxxxxxx wrote:

> From: Alison Schofield <alison.schofield@xxxxxxxxx>
>
> CXL devices that support persistent memory maintain a list of locations
> that are poisoned or result in poison if the addresses are accessed by
> the host.
>
> Per the spec (CXL 2.0 8.2.8.5.4.1), the device returns this Poison
> list as a set of Media Error Records that include the source of the
> error, the starting device physical address and length. The length is
> the number of adjacent DPAs in the record and is in units of 64 bytes.
>
> Retrieve the list and log each Media Error Record as a trace event of
> type cxl_poison_list.
>
> Signed-off-by: Alison Schofield <alison.schofield@xxxxxxxxx>
> ---

> +int cxl_mem_get_poison_list(struct device *dev)
> +{
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct cxl_mbox_poison_payload_out *po;
> +	struct cxl_mbox_poison_payload_in pi;
> +	int nr_records = 0;
> +	int rc, i;
> +
> +	if (range_len(&cxlds->pmem_range)) {
> +		pi.offset = cpu_to_le64(cxlds->pmem_range.start);
> +		pi.length = cpu_to_le64(range_len(&cxlds->pmem_range));
> +	} else {
> +		return -ENXIO;
> +	}
> +
> +	po = kvmalloc(cxlds->payload_size, GFP_KERNEL);
> +	if (!po)
> +		return -ENOMEM;
> +
> +	do {
> +		rc = cxl_mbox_send_cmd(cxlds, CXL_MBOX_OP_GET_POISON, &pi,
> +				       sizeof(pi), po, cxlds->payload_size);
> +		if (rc)
> +			goto out;
> +
> +		if (po->flags & CXL_POISON_FLAG_OVERFLOW) {
> +			time64_t o_time = le64_to_cpu(po->overflow_timestamp);
> +
> +			dev_err(dev, "Poison list overflow at %ptTs UTC\n",
> +				&o_time);
> +			rc = -ENXIO;
> +			goto out;
> +		}
> +
> +		if (po->flags & CXL_POISON_FLAG_SCANNING) {
> +			dev_err(dev, "Scan Media in Progress\n");
> +			rc = -EBUSY;
> +			goto out;
> +		}
> +
> +		for (i = 0; i < le16_to_cpu(po->count); i++) {
> +			u64 addr = le64_to_cpu(po->record[i].address);
> +			u32 len = le32_to_cpu(po->record[i].length);
> +			int source = FIELD_GET(CXL_POISON_SOURCE_MASK, addr);
> +
> +			if (!CXL_POISON_SOURCE_VALID(source)) {
> +				dev_dbg(dev, "Invalid poison source %d",
> +					source);
> +				source = CXL_POISON_SOURCE_INVALID;
> +			}
> +
> +			trace_cxl_poison_list(dev, source, addr, len);
> +		}
> +
> +		/* Protect against an uncleared _FLAG_MORE */
> +		nr_records = nr_records + le16_to_cpu(po->count);
> +		if (nr_records >= cxlds->poison_max)

If this happens while _FLAG_MORE is still set, we should print an error
message, as I think that means the hardware is broken. Note that, as
written, we also take this exit whenever the device happens to have
exactly poison_max records, whether or not _FLAG_MORE is set. I hit this
in QEMU because, until now, poison_max defaulted to 0.
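
A minimal sketch of the check suggested above, written as standalone C
rather than kernel code so it can be compiled and tried on its own. The
helper names, the enum, and the flag value are all made up for
illustration; in the driver the flag would be CXL_POISON_FLAG_MORE and
the error would go through dev_err(). The idea is to look at the flag
first, so that reaching poison_max with the flag clear is a normal end of
list and only "limit reached but more is still advertised" counts as an
error.

```c
#include <stddef.h>
#include <stdio.h>

/* Placeholder for CXL_POISON_FLAG_MORE; the bit value here is an assumption. */
#define POISON_FLAG_MORE 0x1u

/* Hypothetical result of inspecting one Get Poison List response. */
enum poison_next {
	POISON_DONE,     /* device says the list is complete */
	POISON_CONTINUE, /* more records expected, still under the limit */
	POISON_BROKEN,   /* limit reached but device still advertises more */
};

/*
 * Decide what to do after a response has been processed.
 * nr_records is the running total including the current response.
 */
static enum poison_next poison_list_next(unsigned int nr_records,
					 unsigned int poison_max,
					 unsigned int flags)
{
	/* No "more" flag: the list ended normally, even at exactly poison_max. */
	if (!(flags & POISON_FLAG_MORE))
		return POISON_DONE;

	/* Flag still set at or past the limit: treat as a device error. */
	if (nr_records >= poison_max)
		return POISON_BROKEN;

	return POISON_CONTINUE;
}

/*
 * Walk a series of simulated responses (record count and flags for each).
 * Returns 0 on a clean end of list, -1 if the device looks broken.
 */
static int poison_walk(const unsigned int *counts, const unsigned int *flags,
		       size_t n, unsigned int poison_max)
{
	unsigned int nr_records = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		nr_records += counts[i];

		switch (poison_list_next(nr_records, poison_max, flags[i])) {
		case POISON_DONE:
			return 0;
		case POISON_BROKEN:
			fprintf(stderr,
				"more flag still set after %u records (max %u)\n",
				nr_records, poison_max);
			return -1;
		case POISON_CONTINUE:
			break;
		}
	}

	/* Ran out of simulated responses without hitting either exit. */
	return 0;
}
```

With this ordering, a device reporting poison_max == 0 and an empty list
(the QEMU case mentioned above) ends cleanly instead of tripping the
limit check.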


> +			goto out;
> +
> +	} while (po->flags & CXL_POISON_FLAG_MORE);
> +
> +out:
> +	kvfree(po);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_mem_get_poison_list, CXL);
> +
> struct cxl_dev_state *cxl_dev_state_create(struct device *dev)
> {
> 	struct cxl_dev_state *cxlds;