Re: GFS

From: Zach Brown
Date: Tue Aug 09 2005 - 12:17:31 EST


Pekka Enberg wrote:

> In addition, the vma walk will become an unmaintainable mess as soon
> as someone introduces another mmap() capable fs that needs similar
> locking.

Yup, I suspect that if the core kernel ends up caring about this problem
then the VFS will be involved in helping file systems sort the locks
they'll acquire around IO.

> I am not an expert so could someone please explain why this cannot be
> done with a_ops->prepare_write and friends?

I'll try, briefly.

Clustered file systems in Linux usually maintain data consistency for
normal POSIX IO by holding DLM locks for the duration of their
file->{read,write} methods. A task on one node can't read until tasks
on other nodes have finished any conflicting writes they might have
been performing, and so on. Nothing surprising here.

Now say we want to extend consistency guarantees to mmap(). This boils
down to protecting mappings with DLM locks. Say a page is mapped for
reading: the continued presence of that mapping is protected by holding
a DLM lock. If another node goes to write to that page, the read lock
is revoked and the mapping is torn down. These locks are acquired in
vm_ops->nopage as the task faults and tries to bring up the mapping.

And that's the problem. Because they're acquired in ->nopage they can
be acquired during a fault that is servicing the 'buf' argument to an
outer file->{read,write} operation which has grabbed a lock for the
target file. Acquiring multiple locks introduces the risk of ABBA
deadlocks. It's trivial to construct examples of mmap(), read(), and
write() on 2 nodes with 2 files that deadlock.
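
To make the deadlock concrete, here is a toy model in Python of the
two-node, two-file case (not kernel code). Node 1 write()s to file A from
a buffer that is a mapping of file B, while node 2 write()s to file B from
a mapping of file A. The find_deadlock() helper and the node and lock
names are invented for this sketch, and it assumes every lock is
exclusive and each node holds exactly one lock while waiting on another.

```python
# Hypothetical model only: each node holds the DLM lock for the file it is
# writing, then faults on 'buf' and needs the lock for the mapped file.

def find_deadlock(holds, wants):
    """Return the nodes forming a wait-for cycle, or [] if there is none.

    holds: node -> lock it currently holds
    wants: node -> lock it is blocked on (absent if not blocked)
    """
    owner = {lock: node for node, lock in holds.items()}
    for start in holds:
        seen = []
        node = start
        while node is not None and node not in seen:
            seen.append(node)
            # Follow the edge: this node waits on whoever owns the lock
            # it wants; None means it is not waiting on anyone.
            node = owner.get(wants.get(node))
        if node == start:
            return seen
    return []


# node1: write(A, buf mapped from B); node2: write(B, buf mapped from A)
holds = {"node1": "A", "node2": "B"}
wants = {"node1": "B", "node2": "A"}
print(find_deadlock(holds, wants))
```

Each node holds the lock the other one needs in ->nopage, so the
wait-for chain loops back on itself and neither write() can finish.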

So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?)) all
walk vmas in their file->{read,write} to discover mappings that belong
to their files so that they can preemptively sort and acquire the locks
that will be needed to cover the mappings that might be established in
->nopage. As you point out, this both relies on the mappings not
changing and gets very exciting when you mix files and mappings between
file systems that are each sorting and acquiring their own DLM locks.
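
The sort-then-acquire idea can be sketched the same way. This is a toy
Python simulation, not what any of these file systems actually does:
locks_for_io() and completes() are made-up names, lock ids are plain
strings ordered lexically, and every lock is treated as exclusive. It
assumes the vma walk has already reported which files are mapped under
'buf'.

```python
def locks_for_io(target, buf_mappings):
    """Locks to take before the IO starts, in one global order.

    target: id of the file being read or written
    buf_mappings: ids of files the (hypothetical) vma walk found under 'buf'
    """
    return sorted({target, *buf_mappings})


def completes(plans):
    """Step nodes round-robin, one acquire attempt per node per round.

    plans: node -> non-empty list of distinct lock ids, in acquisition order
    Returns True if every node finishes, False if no node can make progress.
    """
    owner = {}                       # lock id -> node holding it
    pos = {node: 0 for node in plans}
    done = set()
    while len(done) < len(plans):
        progressed = False
        for node, plan in plans.items():
            if node in done:
                continue
            lock = plan[pos[node]]
            if lock in owner:
                continue             # blocked on another node
            owner[lock] = node
            pos[node] += 1
            progressed = True
            if pos[node] == len(plan):
                # IO finished: drop everything this node holds.
                for held in plan:
                    del owner[held]
                done.add(node)
        if not progressed:
            return False             # everyone is blocked: deadlock
    return True


# Locks taken as they are needed: target first, mapping lock in ->nopage.
unsorted_plans = {"node1": ["A", "B"], "node2": ["B", "A"]}
# Locks discovered up front and taken in the shared sorted order.
sorted_plans = {
    "node1": locks_for_io("A", ["B"]),
    "node2": locks_for_io("B", ["A"]),
}
print(completes(unsorted_plans), completes(sorted_plans))
```

The sorted plans only help because both nodes agree on a single order.
Two file systems that each sort only their own locks give no such
guarantee across a mix of files, which is the exciting part mentioned
above.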

I brought this up with some people at the kernel summit but no one,
including myself, considers it a high priority. It wouldn't be too hard
to construct a patch if people want to take a look.

- z