RE: [PATCH v8 5/5] remoteproc: Add initial zynqmp R5 remoteproc driver
From: Ben Levinsky
Date: Thu Aug 20 2020 - 11:13:32 EST
> -----Original Message-----
> From: Stefano Stabellini <stefano.stabellini@xxxxxxxxxx>
> Sent: Wednesday, August 19, 2020 2:21 PM
> To: Ben Levinsky <BLEVINSK@xxxxxxxxxx>
> Cc: Stefano Stabellini <stefanos@xxxxxxxxxx>; Michal Simek
> <michals@xxxxxxxxxx>; devicetree@xxxxxxxxxxxxxxx;
> mathieu.poirier@xxxxxxxxxx; Ed T. Mooring <emooring@xxxxxxxxxx>; linux-
> remoteproc@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> robh+dt@xxxxxxxxxx; Jiaying Liang <jliang@xxxxxxxxxx>; linux-arm-
> kernel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: [PATCH v8 5/5] remoteproc: Add initial zynqmp R5 remoteproc
> driver
>
> On Tue, 18 Aug 2020, Ben Levinsky wrote:
> > > > +/**
> > > > + * struct zynqmp_r5_mem - zynqmp rpu memory data
> > > > + * @pnode_id: TCM power domain ids
> > > > + * @res: memory resource
> > > > + * @node: list node
> > > > + */
> > > > +struct zynqmp_r5_mem {
> > > > + u32 pnode_id[MAX_MEM_PNODES];
> > > > + struct resource res;
> > > > + struct list_head node;
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct zynqmp_r5_pdata - zynqmp rpu remote processor private data
> > > > + * @dev: device of RPU instance
> > > > + * @rproc: rproc handle
> > > > + * @pnode_id: RPU CPU power domain id
> > > > + * @mems: memory resources
> > > > + * @is_r5_mode_set: indicate if r5 operation mode is set
> > > > + * @tx_mc: tx mailbox client
> > > > + * @rx_mc: rx mailbox client
> > > > + * @tx_chan: tx mailbox channel
> > > > + * @rx_chan: rx mailbox channel
> > > > + * @mbox_work: mbox_work for the RPU remoteproc
> > > > + * @tx_mc_skbs: socket buffers for tx mailbox client
> > > > + * @rx_mc_buf: rx mailbox client buffer to save the rx message
> > > > + */
> > > > +struct zynqmp_r5_pdata {
> > > > + struct device dev;
> > > > + struct rproc *rproc;
> > > > + u32 pnode_id;
> > > > + struct list_head mems;
> > > > + bool is_r5_mode_set;
> > > > + struct mbox_client tx_mc;
> > > > + struct mbox_client rx_mc;
> > > > + struct mbox_chan *tx_chan;
> > > > + struct mbox_chan *rx_chan;
> > > > + struct work_struct mbox_work;
> > > > + struct sk_buff_head tx_mc_skbs;
> > > > + unsigned char rx_mc_buf[RX_MBOX_CLIENT_BUF_MAX];
> > >
> > > A simple small reordering of the struct fields would lead to small
> > > memory savings due to alignment.
> > >
> > >
> > [Ben Levinsky] will do. Do you mean ordering in either ascending or
> descending order?
>
> Each field has a different alignment in the struct, so for example after
> pnode_id there are probably 4 unused bytes because mems is 64bit
> aligned.
>
>
[Ben Levinsky] ok will update this so the alignments are done with less unused memory per struct allocation.
> > > > +/*
> > > > + * TCM needs mapping to R5 relative address and xilinx platform mgmt
> call
> > > > + */
> > > > +struct rproc_mem_entry *handle_tcm_parsing(struct device *dev,
> > > > + struct reserved_mem *rmem,
> > > > + struct device_node *node,
> > > > + int lookup_idx)
> > > > +{
> > > > + void *va;
> > > > + dma_addr_t dma;
> > > > + resource_size_t size;
> > > > + int ret;
> > > > + u32 pnode_id;
> > > > + struct resource rsc;
> > > > + struct rproc_mem_entry *mem;
> > > > +
> > > > + pnode_id = tcm_addr_to_pnode[lookup_idx][1];
> > > > + ret = zynqmp_pm_request_node(pnode_id,
> > > > + ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > > > + ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > > > + if (ret < 0) {
> > > > + dev_err(dev, "failed to request power node: %u\n",
> > > pnode_id);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + ret = of_address_to_resource(node, 0, &rsc);
> > > > + if (ret < 0) {
> > > > + dev_err(dev, "failed to get resource of memory %s",
> > > > + of_node_full_name(node));
> > > > + return -EINVAL;
> > > > + }
> > > > + size = resource_size(&rsc);
> > > > + va = devm_ioremap_wc(dev, rsc.start, size);
> > > > + if (!va)
> > > > + return -ENOMEM;
> > > > +
> > > > + /* zero out tcm base address */
> > > > + if (rsc.start & 0xffe00000) {
> > > > + /* R5 can't see anything past 0xfffff so wipe it */
> > > > + rsc.start &= 0x000fffff;
> > >
> > > If that is the case why not do:
> > >
> > > rsc.start &= 0x000fffff;
> > >
> > > unconditionally? if (rsc.start & 0xffe00000) is superfluous.
> > >
> > > But I thought that actually the R5s could see TCM at both the low
> > > address (< 0x000fffff) and also at the high address (i.e. 0xffe00000).
> > >
> > >
> > [Ben Levinsky] Here yes can make rsc.start &= 0x000fffff undconditional. Will
> update in v9 as such
> > Also, this logic is because this is only for the Xilinx R5
> mappings of TCM banks that are at (from the TCM point of view) 0x0 and
> 0x2000
>
> What I meant is that as far as I understand from the TRM the RPU should
> also be able to access the same banks at the same address of the APU,
> which I imagine is in the 0xffe00000 range. But I could be wrong on
> this.
>
> If we could use the same addresses for RPU and APU, it would simplify
> this driver.
>
>
> > > > + /*
> > > > + * handle tcm banks 1 a and b (0xffe9000 and
> > > > + * 0xffeb0000)
> > > > + */
> > > > + if (rsc.start & 0x80000)
> > > > + rsc.start -= 0x90000;
> > >
> > > It is very unclear to me why we have to do this
> > >
> > >
> > [Ben Levinsky] This is for TCM bank 0B and bank 1B to map them to
> 0x00020000 so that the TCM relative addressing lines up. For example
> (0xffe90000 & 0x000fffff) - 0x90000 = 0x20000
>
> Could you please explain the mapping in an in-code comment?
> The comment currently mentions 0xffe9000 and 0xffeb0000 but:
>
> - 0xffe9000 & 0x000fffff = 0xe9000
> 0xe9000 - 0x90000 = 0x59000
>
> - 0xffeb0000 & 0x000fffff = 0xeb000
> 0xeb000 - 0x90000 = 0xeb000
>
> Either way we don't get 0x20000. What am I missing?
>
[Ben Levinsky] I apologize there is typo in the comment... it should be 0xffe90000 and 0xffeb0000
The output is:
0xffe90000 & 0x000fffff = 0x90000
0x90000 - 0x90000 = 0x0
And
0xffeb0000 & 0x000fffff = 0xB0000
0xB0000 - 0x90000 = 0x20000
So these line up for the relative addressing for RPU's view of TCMs
>
>
> > > > + }
> > > > +
> > > > + dma = (dma_addr_t)rsc.start;
> > >
> > > Given the way the dma parameter is used by
> > > rproc_alloc_registered_carveouts, I think it might be best to pass the
> > > original start address (i.e. 0xffe00000) as dma.
> > >
> > >
> > > > + mem = rproc_mem_entry_init(dev, va, dma, (int)size, rsc.start,
> > > > + NULL, zynqmp_r5_mem_release,
> > >
> > > I don't know too much about the remoteproc APIs, but shouldn't you be
> > > passing zynqmp_r5_rproc_mem_alloc to it instead of NULL?
> > >
> > >
> > [Ben Levinsky] the difference is that for TCM we have to do make the
> relative address work for TCM, so the dma input to rproc_mem_entry_init is
> different in TCM case.
>
> The dma address is the address as seen by the RPU, is that right?
> So you are trying to set the dma address to something in the range 0 -
> 0x20000?
>
>
[Ben Levinsky] yes
> > > > + rsc.name);
> > > > + if (!mem)
> > > > + return -ENOMEM;
> > > > +
> > > > + return mem;
> > > > +}
> > > > +
> > > > +static int parse_mem_regions(struct rproc *rproc)
> > > > +{
> > > > + int num_mems, i;
> > > > + struct zynqmp_r5_pdata *pdata = rproc->priv;
> > > > + struct device *dev = &pdata->dev;
> > > > + struct device_node *np = dev->of_node;
> > > > + struct rproc_mem_entry *mem;
> > > > +
> > > > + num_mems = of_count_phandle_with_args(np, "memory-region",
> > > NULL);
> > > > + if (num_mems <= 0)
> > > > + return 0;
> > > > + for (i = 0; i < num_mems; i++) {
> > > > + struct device_node *node;
> > > > + struct reserved_mem *rmem;
> > > > +
> > > > + node = of_parse_phandle(np, "memory-region", i);
> > >
> > > Check node != NULL ?
> > >
> > [Ben Levinsky] will add this in v9
> > >
> > > > + rmem = of_reserved_mem_lookup(node);
> > > > + if (!rmem) {
> > > > + dev_err(dev, "unable to acquire memory-region\n");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + if (strstr(node->name, "vdev0buffer")) {
> > >
> > > vdev0buffer is not described in the device tree bindings, is that
> > > normal/expected?
> > >
> > >
> > [Ben Levinsky] vdev0buffer is not required, as there might be simple
> firmware loading with no IPC. Vdev0buffer only needed for IPC.
>
> OK, good. It should probably still be described in the device tree
> binding as optional property.
>
>
> > > > + /* Register DMA region */
> > > > + mem = rproc_mem_entry_init(dev, NULL,
> > > > + (dma_addr_t)rmem->base,
> > > > + rmem->size, rmem->base,
> > > > + NULL, NULL,
> > > > + "vdev0buffer");
> > > > + if (!mem) {
> > > > + dev_err(dev, "unable to initialize memory-
> > > region %s\n",
> > > > + node->name);
> > > > + return -ENOMEM;
> > > > + }
> > > > + dev_dbg(dev, "parsed %s at %llx\r\n", mem->name,
> > > > + mem->dma);
> > > > + } else if (strstr(node->name, "vdev0vring")) {
> > >
> > > Same here
> > >
> > >
> > > > + int vring_id;
> > > > + char name[16];
> > > > +
> > > > + /*
> > > > + * can be 1 of multiple vring IDs per IPC channel
> > > > + * e.g. 'vdev0vring0' and 'vdev0vring1'
> > > > + */
> > > > + vring_id = node->name[14] - '0';
> > >
> > > Where does the "14" comes from? Are we sure it is not possible to have a
> > > node->name smaller than 14 chars?
> > >
> > [Ben Levinsky] Presently there are only 2 vrings used per Xilinx OpenAMP
> channel to RPU. In Xilinx kernel we have hard-coded names as these are the
> only nodes that use it. For example RPU0vdev0vring0 and RPU1vdev0vring0.
> Hence we only check for vdev0vring and not a sscanf("%*s%i") to parse out
> the vring ID or other, cleaner solution.
>
> OK, but I think it is best if we use node->name[14] only if we
> explicitly check for a string of at least 14 characters. Using strstr,
> it could return the string "vdev0vring" which is less than 14 chars,
> leading to a bug.
>
>
> > >
> > > > + snprintf(name, sizeof(name), "vdev0vring%d",
> > > vring_id);
> > > > + /* Register vring */
> > > > + mem = rproc_mem_entry_init(dev, NULL,
> > > > + (dma_addr_t)rmem->base,
> > > > + rmem->size, rmem->base,
> > > > +
> > > zynqmp_r5_rproc_mem_alloc,
> > > > +
> > > zynqmp_r5_rproc_mem_release,
> > > > + name);
> > > > + dev_dbg(dev, "parsed %s at %llx\r\n", mem->name,
> > > > + mem->dma);
> > > > + } else {
> > > > + int idx;
> > > > +
> > > > + /*
> > > > + * if TCM update address space for R5 and
> > > > + * make xilinx platform mgmt call
> > > > + */
> > > > + for (idx = 0; idx < ZYNQMP_R5_NUM_TCM_BANKS;
> > > idx++) {
> > > > + if (tcm_addr_to_pnode[idx][0] == rmem-
> > > >base)
> > > > + break;
> > >
> > > There is something I don't quite understand. If the memory region to use
> > > is TCM, why would it be also described under reserved-memory?
> > > Reserved-memory is for normal memory being reserved, while TCM is not
> > > normal memory. Am I missing something?
> > >
> > [Ben Levinsky] I can change this in v9 as discussed. That is, have no TCM
> under reserved mem. Instead have the banks as nodes in device tree with
> status="[enabled|disabled]" and if enabled, then try to add memories in
> parse_fw call.
> > >
> > > > + }
> > > > +
> > > > + if (idx != ZYNQMP_R5_NUM_TCM_BANKS) {
> > > > + mem = handle_tcm_parsing(dev, rmem, node,
> > > idx);
> > > > + } else {
> > > > + mem = rproc_mem_entry_init(dev, NULL,
> > > > + (dma_addr_t)rmem->base,
> > > > + rmem->size, rmem->base,
> > > > +
> > > zynqmp_r5_rproc_mem_alloc,
> > > > +
> > > zynqmp_r5_rproc_mem_release,
> > > > + node->name);
> > >
> > > This case looks identical to the vdev0vring above. Is the difference
> > > really just in the "name"? If so, can we merge the two cases into one?
> > > no, because the devm_ioremap_wc call has to be done before changing
> the dma address to relative for TCM banks.
> > >
> > > > + }
> > > > +
> > > > + if (!mem) {
> > > > + dev_err(dev,
> > > > + "unable to init memory-region %s\n",
> > > > + node->name);
> > > > + return -ENOMEM;
> > > > + }
> > > > + }
> > > > + rproc_add_carveout(rproc, mem);
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int zynqmp_r5_parse_fw(struct rproc *rproc, const struct
> firmware
> > > *fw)
> > > > +{
> > > > + int ret;
> > > > + struct zynqmp_r5_pdata *pdata = rproc->priv;
> > > > + struct device *dev = &pdata->dev;
> > > > +
> > > > + ret = parse_mem_regions(rproc);
> > > > + if (ret) {
> > > > + dev_err(dev, "parse_mem_regions failed %x\n", ret);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + ret = rproc_elf_load_rsc_table(rproc, fw);
> > > > + if (ret == -EINVAL) {
> > > > + dev_info(dev, "no resource table found.\n");
> > > > + ret = 0;
> > >
> > > Why do we want to continue ignoring the error in this case?
> > >
> > [Ben Levinsky] as there can be simple firmware loaded onto R5 that do not
> have resource table. Resource table only needed for specific IPC case.
>
> OK, an in-code comment would be good
>
>
> > > > + struct device *dev;
> > > > + struct zynqmp_r5_mem *mem;
> > > > + int ret;
> > > > + struct property *prop;
> > > > + const __be32 *cur;
> > > > + u32 val;
> > > > + int i;
> > > > +
> > > > + dev = &pdata->dev;
> > > > + mem = devm_kzalloc(dev, sizeof(*mem), GFP_KERNEL);
> > > > + if (!mem)
> > > > + return -ENOMEM;
> > > > + ret = of_address_to_resource(node, 0, &mem->res);
> > > > + if (ret < 0) {
> > > > + dev_err(dev, "failed to get resource of memory %s",
> > > > + of_node_full_name(node));
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + /* Get the power domain id */
> > > > + i = 0;
> > > > + if (of_find_property(node, "pnode-id", NULL)) {
> > > > + of_property_for_each_u32(node, "pnode-id", prop, cur, val)
> > > > + mem->pnode_id[i++] = val;
> > > > + }
> > > > + list_add_tail(&mem->node, &pdata->mems);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * zynqmp_r5_release() - ZynqMP R5 device release function
> > > > + * @dev: pointer to the device struct of ZynqMP R5
> > > > + *
> > > > + * Function to release ZynqMP R5 device.
> > > > + */
> > > > +static void zynqmp_r5_release(struct device *dev)
> > > > +{
> > > > + struct zynqmp_r5_pdata *pdata;
> > > > + struct rproc *rproc;
> > > > + struct sk_buff *skb;
> > > > +
> > > > + pdata = dev_get_drvdata(dev);
> > > > + rproc = pdata->rproc;
> > > > + if (rproc) {
> > > > + rproc_del(rproc);
> > > > + rproc_free(rproc);
> > > > + }
> > > > + if (pdata->tx_chan)
> > > > + mbox_free_channel(pdata->tx_chan);
> > > > + if (pdata->rx_chan)
> > > > + mbox_free_channel(pdata->rx_chan);
> > > > + /* Discard all SKBs */
> > > > + while (!skb_queue_empty(&pdata->tx_mc_skbs)) {
> > > > + skb = skb_dequeue(&pdata->tx_mc_skbs);
> > > > + kfree_skb(skb);
> > > > + }
> > > > +
> > > > + put_device(dev->parent);
> > > > +}
> > > > +
> > > > +/**
> > > > + * event_notified_idr_cb() - event notified idr callback
> > > > + * @id: idr id
> > > > + * @ptr: pointer to idr private data
> > > > + * @data: data passed to idr_for_each callback
> > > > + *
> > > > + * Pass notification to remoteproc virtio
> > > > + *
> > > > + * Return: 0. having return is to satisfy the idr_for_each() function
> > > > + * pointer input argument requirement.
> > > > + **/
> > > > +static int event_notified_idr_cb(int id, void *ptr, void *data)
> > > > +{
> > > > + struct rproc *rproc = data;
> > > > +
> > > > + (void)rproc_vq_interrupt(rproc, id);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * handle_event_notified() - remoteproc notification work funciton
> > > > + * @work: pointer to the work structure
> > > > + *
> > > > + * It checks each registered remoteproc notify IDs.
> > > > + */
> > > > +static void handle_event_notified(struct work_struct *work)
> > > > +{
> > > > + struct rproc *rproc;
> > > > + struct zynqmp_r5_pdata *pdata;
> > > > +
> > > > + pdata = container_of(work, struct zynqmp_r5_pdata, mbox_work);
> > > > +
> > > > + (void)mbox_send_message(pdata->rx_chan, NULL);
> > > > + rproc = pdata->rproc;
> > > > + /*
> > > > + * We only use IPI for interrupt. The firmware side may or may
> > > > + * not write the notifyid when it trigger IPI.
> > > > + * And thus, we scan through all the registered notifyids.
> > > > + */
> > > > + idr_for_each(&rproc->notifyids, event_notified_idr_cb, rproc);
> > >
> > > This looks expensive. Should we at least check whether the notifyid was
> > > written as first thing before doing the scan?
> > >
> > [Ben Levinsky] this will be at most 2 vrings presently per firmware-load and
> only done when the firmware is loaded so the latency so should not impact
> performace or user
>
> OK
>
>
> > > > + /* Get R5 power domain node */
> > > > + ret = of_property_read_u32(node, "pnode-id", &pdata->pnode_id);
> > > > + if (ret) {
> > > > + dev_err(dev, "failed to get power node id.\n");
> > > > + goto error;
> > > > + }
> > > > +
> > > > + /* TODO Check if R5 is running */
> > > > +
> > > > + /* Set up R5 if not already setup */
> > > > + ret = pdata->is_r5_mode_set ? 0 : r5_set_mode(pdata);
> > > > + if (ret) {
> > > > + dev_err(dev, "failed to set R5 operation mode.\n");
> > > > + return ret;
> > > > + }
> > >
> > > is_r5_mode_set is set by r5_set_mode(), which is only called here.
> > > So it looks like this check is important in cases where
> > > zynqmp_r5_probe() is called twice for the same R5 node. But I don't
> > > think that is supposed to happen?
> > >
> > [Ben Levinsky] this is needed as there are cases where user can repeatedly
> load different firmware so the check is needed in cases like this where rpu is
> already configured. It is also possible that a user might repeatedly
> load/unload the module
>
> OK