Re: [patch] x86, mm: pass in 'total' to __copy_from_user_*nocache()

From: Ingo Molnar
Date: Wed Feb 25 2009 - 03:30:32 EST



* Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

> On Wednesday 25 February 2009 18:25:03 Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:
> > > On Wednesday 25 February 2009 02:52:34 Linus Torvalds wrote:
> > > > On Tue, 24 Feb 2009, Nick Piggin wrote:
> > > > > > it does make some kind of sense to try to avoid the noncached
> > > > > > versions for small writes - because small writes tend to be for
> > > > > > temp-files.
> > > > >
> > > > > I don't see the significance of a temp file. If the pagecache is
> > > > > truncated, then the cachelines remain dirty and so you can't avoid an
> > > > > eventual store back to RAM?
> > > >
> > > > No, because many small files end up being used as scratch-pads (think
> > > > shell script sequences etc), and get read back immediately again. Doing
> > > > non-temporal stores might just be bad simply because trying to play
> > > > games with caching may simply do the wrong thing.
> > >
> > > OK, for that angle it could make sense. Although as has been
> > > noted earlier, at this point of the copy, we don't have much
> > > idea about the length of the write passed into the vfs (and
> > > obviously will never know the higher level intention of
> > > userspace).
> > >
> > > I don't know if we can say a 1 page write is nontemporal, but
> > > anything smaller is temporal. And having these kinds of
> > > behavioural cutoffs I would worry will create strange
> > > performance boundary conditions in code.
> >
> > I agree in principle.
> >
> > The main artifact would be the unaligned edges around a bigger
> > write. In particular the tail portion of a big write will be
> > cached.
> >
> > For example if we write a 100,000 bytes file, we'll copy the
> > first 24 pages (98304 bytes) uncached, while the final 1696
> > bytes cached. But there is nothing that necessiates this
> > assymetry.
> >
> > For that reason it would be nice to pass down the total size of
> > the write to the assembly code. These are single-usage-site APIs
> > anyway so it should be easy.
> >
> > I.e. something like the patch below (untested). I've extended
> > the copy APIs to also pass down a 'total' size as well, and
> > check for that instead of the chunk 'len'. Note that it's
> > checked in the inlined portion so all the "total == len" special
> > cases will optimize out the new parameter completely.
>
> This does give more information, not exactly all (it could be
> a big total write with many smaller writes especially if the
> source is generated on the fly and will be in cache, or if the
> source is not in cache, then we would also want to do
> nontemporal loads from there etc etc).

A big total write with many smaller writes should already be
handled in this codepath, as long as it's properly iovec-ed.

We can do little about user-space doing stupid things as doing a
big write as a series of many smaller-than-4K writes.

> > This should express the 'large vs. small' question
> > adequately, with no artifacts. Agreed?
>
> Well, no artifacts, but it still has a boundary condition
> where one might cut from temporal to nontemporal behaviour.
>
> If it is a *really* important issue, maybe some flags should
> be incorporated into an extended API? It not, then I wonder if
> it is important enough to add such complexity for?

the iozone numbers in the old commits certainly are convincing.
The new numbers from Salman are convincing too - and his fix
should preserve the iozone [large-write] numbers as well.

I.e. i think this is a reasonable compromise, it should handle
all the sane cases.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/