Re: [POC/RFC PATCH] overlayfs: fix data inconsistency at copy up

From: Miklos Szeredi
Date: Fri Oct 21 2016 - 05:12:24 EST


On Thu, Oct 20, 2016 at 04:54:08PM -0400, Vivek Goyal wrote:
> On Thu, Oct 20, 2016 at 04:46:30PM -0400, Vivek Goyal wrote:
>
> [..]
> > > +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > > +{
> > > +	struct file *file = iocb->ki_filp;
> > > +	bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry));
> > > +	ssize_t ret = -EINVAL;
> > > +
> > > +	if (likely(!isupper)) {
> > > +		const struct file_operations *fop = ovl_real_fop(file);
> > > +
> > > +		if (likely(fop->read_iter))
> > > +			ret = fop->read_iter(iocb, to);
> > > +	} else {
> > > +		struct file *upperfile = filp_clone_open(file);
> > > +
> >
> > IIUC, every read of lower file will call filp_clone_open(). Looking at the
> > code of filp_clone_open(), I am concerned about the overhead of this call.
> > Is it significant? Don't want to be paying too much of a penalty for read
> > operations on lower files. That would be a common case for containers.
> >
>
> Looks like I read the code in reverse. So if I open a file read-only,
> and if it has not been copied up, I will simply call read_iter() on
> lower filesystem. But if file has been copied up, then I will call
> filp_clone_open() and pay the cost. And this will continue till this
> file is closed by caller.
>
> When file is opened again, by that time it is upper file and we will
> install real fop in file (instead of overlay fop).

Right.

The lockdep issue seems to be real: we can't take i_mutex and s_vfs_rename_mutex
while mmap_sem is held. Fortunately copy up doesn't need mmap_sem, so we can
do it with mmap_sem unlocked and then retry the mmap.

Here's an incremental workaround patch.

I don't like adding such workarounds to the VFS/MM, but they are really cheap for
the non-overlay case and there doesn't appear to be an alternative here.

Thanks,
Miklos

---
 fs/overlayfs/inode.c | 19 +++++--------------
 mm/util.c            | 22 ++++++++++++++++++++++
 2 files changed, 27 insertions(+), 14 deletions(-)

--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -419,21 +419,12 @@ static int ovl_mmap(struct file *file, s
 	bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry));
 	int err;

-	/*
-	 * Treat MAP_SHARED as hint about future writes to the file (through
-	 * another file descriptor). Caller might not have had such an intent,
-	 * but we hope MAP_PRIVATE will be used in most such cases.
-	 *
-	 * If we don't copy up now and the file is modified, it becomes really
-	 * difficult to change the mapping to match that of the file's content
-	 * later.
-	 */
 	if (unlikely(isupper || vma->vm_flags & VM_MAYSHARE)) {
-		if (!isupper) {
-			err = ovl_copy_up(file->f_path.dentry);
-			if (err)
-				goto out;
-		}
+		/*
+		 * File should have been copied up by now. See vm_mmap_pgoff().
+		 */
+		if (WARN_ON(!isupper))
+			return -EIO;

 		file = filp_clone_open(file);
 		err = PTR_ERR(file);
--- a/mm/util.c
+++ b/mm/util.c
@@ -297,6 +297,28 @@ unsigned long vm_mmap_pgoff(struct file

 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
+		/*
+		 * Special treatment for overlayfs:
+		 *
+		 * Take MAP_SHARED/PROT_READ as hint about future writes to the
+		 * file (through another file descriptor). Caller might not
+		 * have had such an intent, but we hope MAP_PRIVATE will be used
+		 * in most such cases.
+		 *
+		 * If we don't copy up now and the file is modified, it becomes
+		 * really difficult to change the mapping to match that of the
+		 * file's content later.
+		 *
+		 * Copy up needs to be done without mmap_sem since it takes vfs
+		 * locks which would potentially deadlock under mmap_sem.
+		 */
+		if ((flag & MAP_SHARED) && !(prot & PROT_WRITE)) {
+			void *p = d_real(file->f_path.dentry, NULL, O_WRONLY);
+
+			if (IS_ERR(p))
+				return PTR_ERR(p);
+		}
+
 		if (down_write_killable(&mm->mmap_sem))
 			return -EINTR;
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,