Re: [PATCH] vfio: Fix endianness handling for emulated BARs

From: Alexey Kardashevskiy
Date: Wed Jun 18 2014 - 23:48:14 EST


On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
>> On 06/19/2014 04:35 AM, Alex Williamson wrote:
>>> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>>> not do byte swapping and simply return values as it gets them from
>>>> PCI device.
>>>>
>>>> Instead, the existing code assumes that byte stream in read/write is
>>>> little-endian and it fixes endianness for values which it passes to
>>>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>>>> little endian and le32_to_cpu/... are stubs.
>>>
>>> vfio read32:
>>>
>>> val = cpu_to_le32(ioread32(io + off));
>>>
>>> Where the typical x86 case, ioread32 is:
>>>
>>> #define ioread32(addr) readl(addr)
>>>
>>> and readl is:
>>>
>>> __le32_to_cpu(__raw_readl(addr));
>>>
>>> So we do canceling byte swaps, which are both nops on x86, and end up
>>> returning device endian, which we assume is little endian.
>>>
>>> vfio write32 is similar:
>>>
>>> iowrite32(le32_to_cpu(val), io + off);
>>>
>>> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
>>> out, so input data is device endian, which is assumed little.
>>>
>>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>>> from the stream (@val is big endian), converts to CPU format (which should
>>>> be big endian) as it was little endian (@val becomes actually little
>>>> endian) and calls iowrite32() which does not do swapping on big endian
>>>> system.
>>>
>>> Really?
>>>
>>> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
>>> writel(), which seems to use the generic implementation, which does
>>> include a cpu_to_le32.
>>
>>
>> Ouch, wrong comment. iowrite32() does swapping. My bad.
>>
>>
>>>
>>> I also see other big endian archs like parisc doing cpu_to_le32 on
>>> iowrite32, so I don't think this statement is true. I imagine it's
>>> probably working for you because the swap cancel.
>>>
>>>> This removes byte swapping and makes use ioread32be/iowrite32be
>>>> (and 16bit versions) on big-endian systems. The "be" helpers take
>>>> native endian values and do swapping at the moment of writing to a PCI
>>>> register using one of "store byte-reversed" instructions.
>>>
>>> So now you want iowrite32() on little endian and iowrite32be() on big
>>> endian, the former does a cpu_to_le32 (which is a nop on little endian)
>>> and the latter does a cpu_to_be32 (which is a nop on big endian)...
>>> should we just be using __raw_writel() on both?
>>
>>
>> We can do that too. The beauty of iowrite32be on ppc64 is that it does not
>> swap and write separately, it is implemented via the "Store Word
>> Byte-Reverse Indexed X-form" single instruction.
>>
>> And some archs (do not know which ones) may add memory barriers in their
>> implementations of ioread/iowrite. __raw_writel is too raw :)
>>
>>> There doesn't actually
>>> seem to be any change in behavior here, it just eliminates back-to-back
>>> byte swaps, which are a nop on x86, but not power, right?
>>
>> Exactly. No dependency for QEMU.
>
> How about that:
> ===
>
> VFIO exposes BARs to user space as a byte stream so userspace can
> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
> not do byte swapping and simply return values as it gets them from
> PCI device.
>
> Instead, the existing code assumes that byte stream in read/write is
> little-endian and it fixes endianness for values which it passes to
> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
> again. Since both byte swaps are nops on little-endian host, this works.
>
> This also works for big endian but rather by an accident: it reads 4 bytes
> from the stream (@val is big endian), converts to CPU format (which should
> be big endian) as it was little endian (and @val becomes actually little
> endian) and calls iowrite32() which does swapping on big endian
> system again. So byte swap gets cancelled, __raw_writel() receives
> a native value and then
> *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
> just does the right thing.

I am wrong here, sorry. This is what happens when you watch soccer between
2am and 4am :)


>
> This removes byte swaps and makes use of ioread32be/iowrite32be
> (and 16bit versions) which do explicit byte swapping at the moment
> of write to a PCI register. PPC64 uses a special "Store Word
> Byte-Reverse Indexed X-form" instruction which does swap and store.

No swapping is done here if we use ioread32be as it calls in_be32 and that
animal does "lwz" which is simple load from memory.

So @val (16/32 bit variable on stack) will have different values on LE and
BE but since we do not handle it the host and just memcpy it to the buffer,
nothing breaks here.


So it should be like this:
===
VFIO exposes BARs to user space as a byte stream so userspace can
read it using pread()/pwrite(). Since this is a byte stream, VFIO should
not do byte swapping and simply return values as it gets them from
PCI device and copy_to_user will save bytes in the correct
same true for writes.

Instead, the existing code assumes that byte stream in read/write is
little-endian and it fixes endianness for values which it passes to
ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
again. Since both byte swaps are nops on little-endian host, this works.

This also works for big endian but rather by an accident: it reads 4 bytes
from the stream (@val is big endian), converts to CPU format (which should
be big endian) as it was little endian (and @val becomes actually little
endian) and calls iowrite32() which does swapping on big endian
system again. So byte swap in the host gets cancelled and __raw_writel()
writes the value which was swapped originally by the guest.

This removes byte swaps and makes use of ioread32be/iowrite32be
(and 16bit versions) which do not do byte swap on BE hosts.
For LE hosts, ioread32/iowrite32 are still used.

===


> ===
>
> any better?
>
>
>
>
>>>> Suggested-by: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>>>> ---
>>>> drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++++++++----
>>>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> index 210db24..f363b5a 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>>> @@ -21,6 +21,18 @@
>>>>
>>>> #include "vfio_pci_private.h"
>>>>
>>>> +#ifdef __BIG_ENDIAN__
>>>> +#define ioread16_native ioread16be
>>>> +#define ioread32_native ioread32be
>>>> +#define iowrite16_native iowrite16be
>>>> +#define iowrite32_native iowrite32be
>>>> +#else
>>>> +#define ioread16_native ioread16
>>>> +#define ioread32_native ioread32
>>>> +#define iowrite16_native iowrite16
>>>> +#define iowrite32_native iowrite32
>>>> +#endif
>>>> +
>>>> /*
>>>> * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>> * range which is inaccessible. The excluded range drops writes and fills
>>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>> if (copy_from_user(&val, buf, 4))
>>>> return -EFAULT;
>>>>
>>>> - iowrite32(le32_to_cpu(val), io + off);
>>>> + iowrite32_native(val, io + off);
>>>> } else {
>>>> - val = cpu_to_le32(ioread32(io + off));
>>>> + val = ioread32_native(io + off);
>>>>
>>>> if (copy_to_user(buf, &val, 4))
>>>> return -EFAULT;
>>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>> if (copy_from_user(&val, buf, 2))
>>>> return -EFAULT;
>>>>
>>>> - iowrite16(le16_to_cpu(val), io + off);
>>>> + iowrite16_native(val, io + off);
>>>> } else {
>>>> - val = cpu_to_le16(ioread16(io + off));
>>>> + val = ioread16_native(io + off);
>>>>
>>>> if (copy_to_user(buf, &val, 2))
>>>> return -EFAULT;
>>>
>>>
>>>
>>
>>
>
>


--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/