Re: [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

From: Andrea Arcangeli
Date: Mon Mar 11 2019 - 09:43:16 EST


On Mon, Mar 11, 2019 at 08:48:37AM -0400, Michael S. Tsirkin wrote:
> Using copyXuser is better I guess.

It certainly would be faster there, but I don't think it's needed if
that would be the only use case left that justifies supporting two
different models. On small 32bit systems with little RAM, kmap won't
perform measurably differently than it does on 64bit systems. If the
32bit host has a lot of RAM, accessing RAM above the direct mapping
gets slow anyway compared to a 64bit host kernel; that's not an issue
specific to vhost + mmu notifier + kmap, and the best way to optimize
things is to run a 64bit host kernel.

Like Christoph pointed out, the main use case for retaining the
copy-user model would be CPUs with virtually indexed, not physically
tagged, data caches (they'll still suffer from the spectre-v1 fix,
although I'd rule out that they have to suffer the SMAP
slowdown/feature). Those may require additional cache flushing beyond
what the current copy-user model requires.

As a rule of thumb, any arch where copy_user_page isn't defined as
copy_page will require some additional cache flushing after the
kmap. Supposedly with vmap, the vmap layer should have taken care of
that (I haven't verified that yet).

There are some accessories like copy_to_user_page() and
copy_from_user_page() that could work; they obviously define to a raw
memcpy on x86 (the main con is that they don't provide word granular
access), but at least on sparc they're tailored to ptrace
assumptions, so we'd need to evaluate what happens if they're used
outside of ptrace context. kmap has generally been used to access
whole pages (i.e. copy_user_page), so ptrace may actually be the only
existing user with subpage granular access. For example:

#define copy_to_user_page(vma, page, vaddr, dst, src, len)		\
	do {								\
		flush_cache_page(vma, vaddr, page_to_pfn(page));	\
		memcpy(dst, src, len);					\
		flush_ptrace_access(vma, page, vaddr, src, len, 0);	\
	} while (0)
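
If we tried to use those accessories from vhost, it would look
something like the sketch below. vhost_write_used_flags is a
hypothetical helper, not anything in the patch, and the open question
is exactly where the vma comes from outside of ptrace context:

#include <linux/highmem.h>
#include <asm/cacheflush.h>

/* Hypothetical sketch: write a 2-byte field into a GUP-pinned guest
 * page through its kernel mapping, letting copy_to_user_page() do
 * whatever cache flushing the arch needs.  Outside of ptrace we may
 * not hold a vma, which is exactly the assumption that would need
 * auditing on arches like sparc. */
static void vhost_write_used_flags(struct vm_area_struct *vma,
				   struct page *page,
				   unsigned long uaddr, u16 flags)
{
	void *kaddr = kmap(page);
	void *dst = kaddr + offset_in_page(uaddr);

	copy_to_user_page(vma, page, uaddr, dst, &flags, sizeof(flags));
	kunmap(page);
}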

So I wouldn't rule out the need for a dual model until we've solved
how to run this stably on non-x86 arches whose data caches aren't
physically tagged.

Thanks,
Andrea