Re: [RFC PATCH 7/9] cxl/mem: Implement polled mode mailbox
From: Dan Williams
Date: Tue Nov 17 2020 - 13:38:50 EST
On Tue, Nov 17, 2020 at 10:07 AM Jonathan Cameron
<Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Tue, 17 Nov 2020 08:34:38 -0800
> Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote:
>
> > On 20-11-17 15:31:22, Jonathan Cameron wrote:
> > > On Tue, 10 Nov 2020 21:43:54 -0800
> > > Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote:
> > >
> > > > Create a function to handle sending a command, optionally with a
> > > > payload, to the memory device, polling on a result, and then optionally
> > > > copying out the payload. The algorithm for doing this come straight out
> > > > of the CXL 2.0 specification.
> > > >
> > > > Primary mailboxes are capable of generating an interrupt when submitting
> > > > a command in the background. That implementation is saved for a later
> > > > time.
> > > >
> > > > Secondary mailboxes aren't implemented at this time.
> > > >
> > > > WARNING: This is untested with actual timeouts occurring.
> > > >
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@xxxxxxxxx>
> > >
> > > Question inline for why the preempt / local timer dance is worth bothering with.
> > > What am I missing?
> > >
> > > Thanks,
> > >
> > > Jonathan
> > >
> > > > ---
> > > > drivers/cxl/cxl.h | 16 +++++++
> > > > drivers/cxl/mem.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 2 files changed, 123 insertions(+)
> > > >
> > > > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > > > index 482fc9cdc890..f49ab80f68bd 100644
> > > > --- a/drivers/cxl/cxl.h
> > > > +++ b/drivers/cxl/cxl.h
> > > > @@ -21,8 +21,12 @@
> > > > #define CXLDEV_MB_CTRL 0x04
> > > > #define CXLDEV_MB_CTRL_DOORBELL BIT(0)
> > > > #define CXLDEV_MB_CMD 0x08
> > > > +#define CXLDEV_MB_CMD_PAYLOAD_LENGTH_SHIFT 16
> > > > #define CXLDEV_MB_STATUS 0x10
> > > > +#define CXLDEV_MB_STATUS_RET_CODE_SHIFT 32
> > > > +#define CXLDEV_MB_STATUS_RET_CODE_MASK 0xffff
> > > > #define CXLDEV_MB_BG_CMD_STATUS 0x18
> > > > +#define CXLDEV_MB_PAYLOAD 0x20
> > > >
> > > > /* Memory Device */
> > > > #define CXLMDEV_STATUS 0
> > > > @@ -114,4 +118,16 @@ static inline u64 __cxl_raw_read_reg64(struct cxl_mem *cxlm, u32 reg)
> > > >
> > > > return readq(reg_addr + reg);
> > > > }
> > > > +
> > > > +static inline void cxl_mbox_payload_fill(struct cxl_mem *cxlm, u8 *input,
> > > > + unsigned int length)
> > > > +{
> > > > + memcpy_toio(cxlm->mbox.regs + CXLDEV_MB_PAYLOAD, input, length);
> > > > +}
> > > > +
> > > > +static inline void cxl_mbox_payload_drain(struct cxl_mem *cxlm,
> > > > + u8 *output, unsigned int length)
> > > > +{
> > > > + memcpy_fromio(output, cxlm->mbox.regs + CXLDEV_MB_PAYLOAD, length);
> > > > +}
> > > > #endif /* __CXL_H__ */
> > > > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > > > index 9fd2d1daa534..08913360d500 100644
> > > > --- a/drivers/cxl/mem.c
> > > > +++ b/drivers/cxl/mem.c
> > > > @@ -1,5 +1,6 @@
> > > > // SPDX-License-Identifier: GPL-2.0-only
> > > > // Copyright(c) 2020 Intel Corporation. All rights reserved.
> > > > +#include <linux/sched/clock.h>
> > > > #include <linux/module.h>
> > > > #include <linux/pci.h>
> > > > #include <linux/io.h>
> > > > @@ -7,6 +8,112 @@
> > > > #include "pci.h"
> > > > #include "cxl.h"
> > > >
> > > > +struct mbox_cmd {
> > > > + u16 cmd;
> > > > + u8 *payload;
> > > > + size_t payload_size;
> > > > + u16 return_code;
> > > > +};
> > > > +
> > > > +static int cxldev_wait_for_doorbell(struct cxl_mem *cxlm)
> > > > +{
> > > > + u64 start, now;
> > > > + int cpu, ret, timeout = 2000000000;
> > > > +
> > > > + start = local_clock();
> > > > + preempt_disable();
> > > > + cpu = smp_processor_id();
> > > > + for (;;) {
> > > > + now = local_clock();
> > > > + preempt_enable();
> > >
> > > What do we ever do with this mailbox that is particularly
> > > performance critical? I'd like to understand why we care enough
> > > to mess around with the preemption changes and local clock etc.
> > >
> >
> > It is quite obviously a premature optimization at this point (since we only
> > support a single command in QEMU). However, the polling can be anywhere from
> > instant to 2 seconds. QEMU implementation aside again, some devices may never
> > support interrupts on completion, and so I thought providing a poll function now
> > that is capable of working for most [all?] cases was wise.
>
> Definitely seems premature. I'd want to see real numbers on hardware
> to justify this sort of complexity. Maybe others disagree though.
The polling is definitely needed, but I think it can be a simple
jiffies based loop and avoid this sched_clock() complexity.