Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
From: Luis R. Rodriguez
Date: Thu Apr 16 2015 - 14:54:42 EST
On Thu, Apr 16, 2015 at 01:18:37PM +0900, Hyong-Youb Kim wrote:
> On Thu, Apr 16, 2015 at 01:58:16AM +0200, Luis R. Rodriguez wrote:
> >
> > An alternative... is to just ioremap_wc() the entire region, including
> > MMIO registers for these old devices. I see one ethernet driver that does
> > this, myri10ge, and am curious how and why they ended up deciding this
> > and if they have run into any issues. I wonder if this is a reasonable
> > compromise for these 2 remaining corner cases.
> >
>
> For myri10ge, it is a performance thing. Descriptor rings are in NIC
> memory BAR0, not in host memory. Say, to send a packet, the driver
> writes the send descriptor to the ioremap'd NIC memory. It is a
> multi-word descriptor. So, to send it as one PCIE MWr transaction,
> the driver maps the whole BAR0 as WC and does "copy descriptor; wmb".
Interesting, so you burst-write the multi-word descriptors using
write-combining here for the Ethernet device.
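
To make sure I follow, here is a minimal userspace sketch of the "copy
descriptor; wmb" sequence as I understand it. Everything named demo_* is
hypothetical: the 16-byte descriptor layout is made up (I do not know the
real size yet), a plain array stands in for the ioremap_wc()'d BAR0, and a
C11 fence stands in for wmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte descriptor; the real myri10ge layout is not shown. */
struct demo_desc {
	uint32_t addr_high;
	uint32_t addr_low;
	uint32_t length;
	uint32_t flags;
};

#define DEMO_DESC_WORDS (sizeof(struct demo_desc) / sizeof(uint32_t))

/* Userspace stand-in for the kernel's wmb(). */
static inline void demo_wmb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * "copy descriptor; wmb": store every word of the descriptor back to
 * back, then fence. On a real WC mapping the stores may be combined
 * into one burst; here the slot is ordinary memory (an assumption made
 * so the sketch can run outside the kernel).
 */
static void demo_post_desc(volatile uint32_t *slot, const struct demo_desc *d)
{
	const uint32_t *src = (const uint32_t *)d;
	size_t i;

	assert(slot != NULL && d != NULL);
	for (i = 0; i < DEMO_DESC_WORDS; i++)
		slot[i] = src[i];
	demo_wmb();
}
```

In a driver the slot pointer would be an __iomem pointer into the WC
mapping, and the copy would go through the I/O accessors rather than plain
stores.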
> Without WC, descriptors would end up as multiple 4B or 8B MWr packets
> to the NIC, which has a pretty big performance impact on this
> particular NIC.
How big are the descriptors?
> Most registers that do not want WC are actually in BAR2, which is not
> mapped as WC. For registers that are in BAR0, we do "write to the
> register; wmb". If we want to wait till the NIC has seen the write,
> we do "write; wmb; read".
Interesting, thanks. Using this as a workaround to the problem sounds
plausible; however, it would likely still require just as many changes to
the ivtv and ipath drivers as doing a proper split. I do wonder, though,
whether this sort of workaround can be generalized so that others could
use it, if this sort of thing is going to become prevalent. If so, it would
serve two purposes: a workaround for the corner cases of MTRR use on Linux,
and a way to handle these sorts of device constraints.
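
For reference, the two register sequences quoted above ("write; wmb" and
"write; wmb; read") could be sketched roughly as below. This is a userspace
approximation, not driver code: the demo_* names are hypothetical, the
"register" is an ordinary variable, and a C11 fence stands in for wmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's wmb(). */
static inline void demo_wmb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* "write to the register; wmb": push the store out of the WC buffer. */
static void demo_reg_write(volatile uint32_t *reg, uint32_t val)
{
	assert(reg != NULL);
	*reg = val;
	demo_wmb();
}

/*
 * "write; wmb; read": the read-back is issued after the write, so once
 * it returns the device has seen the write (on PCIe a read request does
 * not pass earlier posted writes).
 */
static uint32_t demo_reg_write_flush(volatile uint32_t *reg, uint32_t val)
{
	demo_reg_write(reg, val);
	return *reg;
}
```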
To determine whether this is likely to be generally useful, could you
elaborate a bit more on the details of the performance issues seen when
the descriptor writes are not bursted on this device?
Even then, converting to this workaround may require device-specific
tweaks. For instance, I note that in myri10ge_submit_req(), for small
writes, you write in reverse and do the first set last, then finally the
last 32 bits are rewritten, I guess to trigger something?
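
Roughly, the ordering I am reading there looks like the sketch below. This
is my paraphrase, not the driver's code: the demo_* names and the
four-word descriptor size are hypothetical, and I am assuming the rewritten
32 bits are the last word of the first descriptor.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WORDS 4	/* hypothetical: four 32-bit words per descriptor */

/* Userspace stand-in for the kernel's wmb(). */
static inline void demo_wmb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Copy one descriptor into its ring slot, then fence. */
static void demo_copy_desc(volatile uint32_t *dst, const uint32_t *src)
{
	size_t i;

	for (i = 0; i < DEMO_WORDS; i++)
		dst[i] = src[i];
	demo_wmb();
}

/*
 * Write descriptors cnt-1 .. 1 first, then descriptor 0, then store the
 * last word of descriptor 0 once more (assumed to be what the device
 * keys off).
 */
static void demo_submit_backwards(volatile uint32_t *ring,
				  const uint32_t *src, size_t cnt)
{
	size_t i;

	assert(cnt > 0);
	for (i = cnt - 1; i > 0; i--)
		demo_copy_desc(ring + i * DEMO_WORDS, src + i * DEMO_WORDS);
	demo_copy_desc(ring, src);			/* first one goes last */
	ring[DEMO_WORDS - 1] = src[DEMO_WORDS - 1];	/* rewrite last word */
	demo_wmb();
}
```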
> This approach has worked for this device for many years. I cannot say
> whether it works for other devices, though.
I think it should, but the more interesting question is exactly *why* it
was needed for this device, and who determined that.
Luis