On Fri, 28 Sep 2012 23:16:30 -0400 John Stultz <john.stultz@xxxxxxxxxx> wrote:

After Kernel Summit and Plumbers, I wanted to consider all the various
side-discussions and try to summarize my current thoughts here along
with sending out my current implementation for review.

Also: I'm going on four weeks of paternity leave in the very near
(but non-deterministic) future. So while I hope I still have time
for some discussion, I may have to deal with fussier complaints
than yours. :) In any case, you'll have more time to chew on
the idea and come up with amazing suggestions. :)

I wonder if you are trying to please everyone and risking pleasing no-one?
Well, maybe not quite that extreme, but you can't please all the people all
the time.

So while I do agree that I won't be able to please everyone, especially
when it comes to how this interface is implemented internally, I do want
to make sure that the userland interface really does make sense and isn't
limited by my own short-sightedness. :)

For example, allowing sub-page volatile regions seems to be above and beyond
the call of duty. You cannot mmap sub-pages, so why should they be volatile?

Although if someone marked a page and a half as volatile, would it be
reasonable to throw away the second half of that second page? That seems
unexpected to me. So we're really only marking the whole pages specified
as volatile, similar to how FALLOC_FL_PUNCH_HOLE behaves.

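To make the "only whole pages" rule concrete, here is a minimal sketch of
how a byte range could be shrunk to the complete pages it contains. The
names whole_pages_in_range and struct page_span are hypothetical and exist
only for illustration; this is not code from the patch set, and it ignores
arithmetic overflow for very large offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not part of any kernel or libc API. */
struct page_span {
    uint64_t start; /* first byte of the first whole page */
    uint64_t end;   /* one past the last byte of the last whole page */
};

/*
 * Shrink [offset, offset + len) to the whole pages it fully contains.
 * Returns false when no complete page lies inside the range.
 * page_size must be a power of two.
 */
static bool whole_pages_in_range(uint64_t offset, uint64_t len,
                                 uint64_t page_size, struct page_span *out)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uint64_t mask = page_size - 1;
    uint64_t start = (offset + mask) & ~mask; /* round the start up */
    uint64_t end = (offset + len) & ~mask;    /* round the end down */

    if (end <= start)
        return false;

    out->start = start;
    out->end = end;
    return true;
}
```

With 4096-byte pages, a range covering a page and a half (offset 0, length
6144) shrinks to the first page only, so the trailing half page is never a
candidate for being thrown away.
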
Similarly the suggestion of using madvise - while tempting - is probably a
minority interest and can probably be managed with library code. I'm glad
you haven't pursued it.

For now I see this as a lower priority, but it's something I'd like to
investigate. Depending on tmpfs has issues: there's no quota support, so
having a user-writable tmpfs partition mounted is a DoS opening, especially
on low-memory systems.

I think discarding whole ranges at a time is very sensible, and so merging
adjacent ranges is best avoided. If you require page-aligned ranges this
becomes trivial - is that right?

True. If we avoid coalescing non-whole page ranges, keeping non-overlapping
ranges independent is fairly easy.

I wonder if the oldest page/oldest range issue can be defined away by
requiring apps to touch the first page in a range when they touch the range.
Then the age of a range is the age of the first page. Non-initial pages
could even be kept off the free list .... though that might confuse NUMA
page reclaim if a range had pages from different nodes.

Not sure I followed this. Are you suggesting keeping non-initial ranges off
the vmscan LRU lists entirely?

Application to non-tmpfs files seems very unclear and so probably best
avoided.
If I understand you correctly, then you have suggested both that a volatile
range would be a "lazy hole punch" and a "don't let this get written to disk
yet" flag. It cannot really be both. The former sounds like fallocate,
the latter like fadvise.

I don't think I see the exclusivity aspect. If we say "Dear kernel, you may
punch a hole at this offset in this file whenever you want in the future"
and then later say "Cancel my earlier hole punching request" (which the
kernel can say "Sorry, too late") it has very close semantics to what I'm
describing with the abstract interface to volatile range. Maybe the only
subtlety with the hole-punching oriented worldview is that the kernel is
smart enough not to bother writing out any data that could be punched out
in the future.

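For comparison with the "lazy hole punch" wording, this is a minimal sketch
of the eager hole punch that fallocate(2) already offers on Linux. The
volatile-range interface under discussion is not shown, since its final
form is not settled here. The helper names punch_hole_now,
make_scratch_file and range_reads_zero are hypothetical, and the scratch
file is assumed to live in /tmp on a filesystem that may or may not
support hole punching.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Punch a hole immediately: the byte range is deallocated and reads back
 * as zeros, while the file size stays the same. Returns 0 on success or
 * -1 with errno set (for example EOPNOTSUPP on unsupporting filesystems).
 */
static int punch_hole_now(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Create an unlinked scratch file filled with 'x' bytes; returns fd or -1. */
static int make_scratch_file(size_t size)
{
    char path[] = "/tmp/punch-demo-XXXXXX";
    char buf[4096];
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path); /* the file goes away once fd is closed */
    memset(buf, 'x', sizeof(buf));
    while (size > 0) {
        size_t chunk = size < sizeof(buf) ? size : sizeof(buf);
        ssize_t n = write(fd, buf, chunk);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        size -= (size_t)n;
    }
    return fd;
}

/* Returns 1 if [offset, offset + len) reads as all zeros, 0 if not, -1 on error. */
static int range_reads_zero(int fd, off_t offset, size_t len)
{
    char buf[4096];

    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = pread(fd, buf, chunk, offset);
        if (n <= 0)
            return -1;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                return 0;
        offset += n;
        len -= (size_t)n;
    }
    return 1;
}
```

A caller would create a file with make_scratch_file(8192), call
punch_hole_now(fd, 0, 4096), and then see zeros in the first 4096 bytes
while the rest keeps its data. The difference from the proposal is timing:
here the data is gone as soon as the call returns, with no way to cancel.
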
I think the latter sounds more like the general purpose of volatile ranges,
but I also suspect that some journalling filesystems might be uncomfortable
providing a guarantee like that. So I would suggest firmly stating that it
is a tmpfs-only feature. If someone wants something vaguely similar for
other filesystems, let them implement it separately.

I mostly agree, as I don't have the context to see how this could be useful
to other filesystems. So I'm limiting my functionality to tmpfs. However
DaveC saw some value in allowing it to be extended to other filesystems,
and I'm not opposed to seeing the same interface be used if the semantics
are close enough.

The SIGBUS interface could have some merit if it really reduces overhead. I
worry about app bugs that could result from the non-deterministic
behaviour. A range could get unmapped while it is in use and testing for
the case of "get a SIGBUS halfway through accessing something" would not
be straightforward (SIGBUS on first step of access should be easy).
I guess that is up to the app writer, but I have never liked anything about
the signal interface and encouraging further use doesn't feel wise.

Initially I didn't like the idea, but have warmed considerably to it, mainly
due to the concern that the constant unmark/access/mark pattern would be
too much overhead, and having a lazy method will be much nicer for
performance. But yes, at the cost of additional complexity of handling the
signal, marking the faulted address range as non-volatile, restoring the
data and continuing.