RE: [PATCH] nvme: Enable acceleration feature of A64FX processor

From: Elliott, Robert (Persistent Memory)
Date: Thu Feb 14 2019 - 15:46:10 EST

> -----Original Message-----
> From: Linux-nvme [mailto:linux-nvme-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Keith Busch
> Sent: Tuesday, February 5, 2019 8:39 AM
> To: Takao Indoh <indou.takao@xxxxxxxxxxx>
> Cc: Takao Indoh <indou.takao@xxxxxxxxxxxxxx>; sagi@xxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
> nvme@xxxxxxxxxxxxxxxxxxx; axboe@xxxxxx; hch@xxxxxx
> Subject: Re: [PATCH] nvme: Enable acceleration feature of A64FX processor
> On Tue, Feb 05, 2019 at 09:56:05PM +0900, Takao Indoh wrote:
> > On Fri, Feb 01, 2019 at 07:54:14AM -0700, Keith Busch wrote:
> > > On Fri, Feb 01, 2019 at 09:46:15PM +0900, Takao Indoh wrote:
> > > > From: Takao Indoh <indou.takao@xxxxxxxxxxx>
> > > >
> > > > Fujitsu A64FX processor has a feature to accelerate data transfer of
> > > > internal bus by relaxed ordering. It is enabled when the bit 56 of dma
> > > > address is set to 1.
> > >
> > > Wait, what? RO is a standard PCIe TLP attribute. Why would we need this?
> >
> > I should have explained this patch more carefully.
> >
> > Standard PCIe devices can use Relaxed Ordering (RO) by setting Attr
> > field in the TLP header, however, this mechanism cannot be utilized if
> > the device does not support RO feature. Fujitsu A64FX processor has an
> > alternate feature to enable RO in its Root Port by setting the bit 56 of
> > DMA address. This mechanism enables to utilize RO feature even if the
> > device does not support standard PCIe RO.
> I think you're better off just purchasing devices that support the
> capability per spec rather than with a non-standard workaround.

The PCIe and NVMe specifications don't standardize a way to tell the device
when to use RO, which leads to system workarounds like this.

The Enable Relaxed Ordering bit defined by PCIe tells the device when it
cannot use RO, but doesn't advise when it should or shall use RO.
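
To make that distinction concrete, here is a minimal sketch in C. The
function name and the device_wants_ro input are hypothetical, made up for
illustration; only the bit position (Device Control register, bit 4) comes
from PCIe. The enable bit acts purely as a gate, and the decision to
actually use RO stays with the device:

```c
#include <stdbool.h>
#include <stdint.h>

/* Enable Relaxed Ordering: bit 4 of the PCIe Device Control register. */
#define DEVCTL_RELAX_EN 0x0010u

/*
 * Hypothetical helper: decides whether a device sets the RO attribute
 * on a TLP it originates.
 *
 * devctl          - value of the Device Control register
 * device_wants_ro - the device's own internal policy (an assumption;
 *                   nothing in PCIe or NVMe lets the host drive this)
 */
static bool tlp_may_set_ro(uint16_t devctl, bool device_wants_ro)
{
    /* Bit clear: RO is forbidden. Bit set: RO is merely permitted. */
    if (!(devctl & DEVCTL_RELAX_EN))
        return false;
    return device_wants_ro;
}
```

With the bit set, the result still depends entirely on device_wants_ro,
which is the gap described above.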

For SCSI Express (SOP+PQI), we were going to allow specifying the
following on a per-command basis, to be used by the data transfers for
that command:
* TLP attributes (No Snoop, Relaxed Ordering, ID-based Ordering)
* TLP processing hints (Processing Hints and Steering Tags)

In some systems, one setting per queue or per device might suffice.
Transactions to the queues and doorbells require stronger ordering.

For this workaround:
* making an extra pass through the SGL to set the address bit is
  inefficient; it should be done as the SGL is created.
* why doesn't it support PRP Lists?
* how does this interact with an IOMMU, if there is one? Must the
  address with bit 56 also be granted permission, or is that
  stripped off before any IOMMU comparisons?
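
As a rough illustration of the first point, the sketch below sets the bit
while each descriptor is written instead of walking the finished list a
second time. All names here (sgl_desc, dma_seg, build_sgl, RO_ADDR_BIT) are
hypothetical, and the structures are simplified stand-ins rather than the
nvme driver's real ones; the only detail taken from the patch description
is that bit 56 of the DMA address selects RO:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 56 of the DMA address, per the patch description. */
#define RO_ADDR_BIT (UINT64_C(1) << 56)

/* Simplified stand-in for a mapped scatterlist entry (hypothetical). */
struct dma_seg {
    uint64_t dma_addr;
    uint32_t len;
};

/* Simplified stand-in for an SGL data block descriptor (hypothetical). */
struct sgl_desc {
    uint64_t addr;
    uint32_t length;
};

/*
 * Build the descriptor list in a single pass, applying the RO bit to
 * each address as it is stored. Returns the number of descriptors
 * written.
 */
static size_t build_sgl(struct sgl_desc *out, size_t cap,
                        const struct dma_seg *segs, size_t nsegs,
                        int use_ro)
{
    const uint64_t tag = use_ro ? RO_ADDR_BIT : 0;
    size_t i;

    assert(nsegs <= cap);   /* caller must supply enough descriptors */

    for (i = 0; i < nsegs; i++) {
        out[i].addr = segs[i].dma_addr | tag;
        out[i].length = segs[i].len;
    }
    return nsegs;
}
```

This says nothing about the IOMMU question; whether the tagged address is
what gets checked there is still open.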