ST> On 13 Jan 1998 05:59:30 -0600, ebiederm+eric@npwt.net (Eric W. Biederman) said:
>> O.K. Then I guess the basic task is as follows.
>> For writepage (which really hasn't been used yet). Revert that to use
>> an inode (instead of a dentry).
ST> We've already come across the problem that NFS needs dentries, not
ST> inodes. We solve this currently by doing all of the flushing from
ST> virtual memory to filesystem when we traverse the vm region or page
ST> tables: that way, we have the relevant vma in hand when we spot dirty
ST> data, and can get the dentry from that. We've recently got a patch
ST> which traverses vma's on a page clearing pte dirty bits whenever we
ST> msync such a page.
O.K. so that extra benefit is no longer needed.
ST> So, I guess the question is, why _exactly_ do you want to have dirty
ST> page cache pages? I'm not saying it's a bad idea, but you do need to
ST> identify precisely what problem you are trying to solve before we can
ST> say whether or not the solution is a brilliant idea!
The primary problem I'm trying to solve is:
The buffer cache can _only_ be used by block-device-based filesystems.
These filesystems have a 1-1 relationship between data in the file
and data on the disk, so the common case is cached by the buffer
cache.
For any other kind of filesystem, without a 1-1 relationship between
data in a file and data on disk (perhaps because there is no disk),
there is no premade cache, and each filesystem must grow its own.
Any time you roll your own caching, it tends to be
more work, more error prone, and less optimized
than a general solution that the rest of the kernel provides.
So a generic facility can easily increase reliability and efficiency,
and even speed of implementation.
THAT is the primary problem.
Filesystems that depart from this:
- Filesystems that use compression:
  e2compr
- Networking filesystems:
  SMBFS, NFS
- Filesystems that reside in swap:
  Mine for shared memory, Solaris tmpfs
- Filesystems that cache slower devices on faster ones:
  We keep seeing proposals.
All of these categories require file-based caching,
but we only have a block-device-based cache.
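To make the keying difference concrete, here is a rough sketch (not
the real kernel structures; the names are only illustrative) of why
the buffer cache cannot describe these cases while a page cache can:

    struct inode;

    /* Illustrative only: the buffer cache is keyed by (device, block),
     * so it can only name data that has a fixed home on a block device. */
    struct buffer_key {
            int dev;                /* block device */
            unsigned long block;    /* physical block number */
    };

    /* The page cache is keyed by (inode, offset), so it can name file
     * data for any filesystem: compressed, network, swap-backed, ... */
    struct page_key {
            struct inode *inode;    /* the file */
            unsigned long offset;   /* offset within the file */
    };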
>> Reimplement the buffer_head pointer as a generic pointer, with a flag
>> that says it's a buffer_head pointer.
ST> Possibly. The existing filesystems don't need it.
It appears NFS does (with the dentry thing), _if_ it is going to use
my dirty pages in the page cache. Also there are many other easily
conceived needs if various filesystems are going to use the page cache
to hold dirty data.
So _if_ we have dirty data in a page, either the buffer_head pointer
must be made generic or another pointer added. (And slow the struct
page shrinkage :)
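One way to picture that (purely a sketch, not a patch; the real
struct page layout and flag names are different):

    struct buffer_head;             /* block-device view of the data */

    #define PG_fs_private  0x01     /* invented flag bit for the sketch */

    /* Sketch: the one pointer slot in struct page can hold either a
     * buffer_head (block-device filesystems) or a filesystem-private
     * cookie (an NFS dentry, compression state, ...), with a page flag
     * saying which interpretation is current. */
    struct page_sketch {
            unsigned long flags;
            union {
                    struct buffer_head *bh;     /* when !PG_fs_private */
                    void *fs_data;              /* when  PG_fs_private */
            } u;
    };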
ST> You could certainly
ST> make things easier for some network filesystems by maintaining dirty
ST> data for them, but who is responsible for that dirty data? What happens
ST> when different users, or perhaps different dirent aliases for the same
ST> file, dirty the same page? These have to be filesystem policy
ST> decisions. I guess this means that the page cache changes you propose
ST> really are purely facilities for filesystem use, not to be used
ST> directly by any generic file IO routines.
Correct:
As I said, I won't set the dirty bit.
That is for updatepage to do _if_ the filesystem wants.
The basic idea (if we let the VFS do as much as it can):
When data is written with fops->write_file == generic_file_write,
generic_file_write calls updatepage.
updatepage, being filesystem specific, sets the filesystem policy
on whether or not to set the dirty bit in the page cache.
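Roughly, as a user-space model of that flow (the struct and function
names here are invented for illustration, not the real interfaces):

    #include <string.h>

    struct page  { char data[4096]; int dirty; };
    struct fsops { int (*updatepage)(struct page *p); };

    /* A filesystem that wants dirty data kept in the page cache
     * (NFS-like) just sets the dirty bit and returns; writepage will
     * flush it later.  An ext2-like filesystem could instead write
     * through to the buffer cache here and leave the page clean. */
    static int cache_dirty_updatepage(struct page *p)
    {
            p->dirty = 1;
            return 0;
    }

    /* The VFS side: copy the user's data into the page cache page,
     * then let the filesystem's updatepage decide the dirty policy. */
    static int generic_file_write_model(struct fsops *ops, struct page *p,
                                        const char *buf, size_t count)
    {
            if (count > sizeof(p->data))
                    count = sizeof(p->data);
            memcpy(p->data, buf, count);
            return ops->updatepage(p);
    }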
ST> So, once we have a page cache which can hold dirty data, who are you
ST> proposing is responsible for flushing that dirty data to backing store?
ST> I'm not sure whether you are suggesting a completely new mechanism for
ST> maintaining asynchronously written dirty data with callbacks into the
ST> filesystem, or whether the page cache extensions will do nothing extra
ST> on their own but will be called by the filesystem when it decides for
ST> itself that data needs to be written out.
There are two times when a page may be 'cleaned', that is, written out
and its dirty bit removed:
- In response to some instigation such as fsync, filesystem_sync, or
  the filesystem deciding it wants to write the page now.
- In response to a low-memory condition in which shrink_mmap is called.
In both cases the 'writepage' function (again filesystem specific) is
called, and the data is written out by the filesystem, for performance
reasons hopefully asynchronously, with the page lock set.
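As a sketch of those two paths (again invented names, user-space
pseudo-C rather than the real kernel code):

    struct page  { int dirty; int locked; };
    struct fsops { void (*writepage)(struct page *p); };

    /* Shared helper: lock the page, let the filesystem start the
     * (hopefully asynchronous) write, and clear the dirty bit.  The
     * filesystem unlocks the page when the write completes. */
    static void clean_page(struct fsops *ops, struct page *p)
    {
            if (!p->dirty)
                    return;
            p->locked = 1;
            ops->writepage(p);      /* filesystem-specific write-out */
            p->dirty = 0;
    }

    /* Path 1: explicit instigation such as fsync or a filesystem sync. */
    static void fsync_model(struct fsops *ops, struct page *p)
    {
            clean_page(ops, p);
    }

    /* Path 2: shrink_mmap, under memory pressure, must clean a dirty
     * page before it can be reclaimed. */
    static void shrink_mmap_model(struct fsops *ops, struct page *p)
    {
            clean_page(ops, p);
    }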
>> Buffer cache changes, and ext2 support, and other filesystems.
>> Once the basic code is in place.
ST> The ext2 support doesn't _need_ any changes to the page cache! It
ST> doesn't, strictly speaking, need any buffer cache changes either, but a
ST> new free-after-bdflush buffer type would be highly desirable just from a
ST> performance point of view --- there's no reason to maintain the physical
ST> view of the data past the point where it has reached disk.
_Only_ because we maintain the logical data in the page cache,
which later becomes the physical data, so writes are not frequent.
I thought there was a proposal in the works (by you) to use this as a
method to speed up ext2 fsync. If it doesn't hurt any other
performance and does speed up ext2, it is a definite plus.
And of course to gain that edge ext2 would need a little conversion.
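A rough model of such a free-after-write buffer (names invented for
the sketch) would be something like:

    #include <stdlib.h>

    struct buffer { char *data; int dirty; int free_after_write; };

    static void write_buffer_to_disk(struct buffer *b)
    {
            /* ... submit the I/O ... */
            b->dirty = 0;
    }

    /* Sketch of the bdflush side: once the physical view has reached
     * disk there is no reason to keep it, because the logical copy of
     * the data still lives in the page cache. */
    static void bdflush_model(struct buffer *b)
    {
            if (!b->dirty)
                    return;
            write_buffer_to_disk(b);
            if (b->free_after_write) {
                    free(b->data);
                    free(b);
            }
    }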
Eric