Re: [PATCH v3 12/13] dax: handle truncate of dma-busy pages
From: Dan Williams
Date: Fri Oct 20 2017 - 11:42:07 EST
On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> get_user_pages() pins file backed memory pages for access by dma
>> devices. However, it only pins the memory pages not the page-to-file
>> offset association. If a file is truncated the pages are mapped out of
>> the file and dma may continue indefinitely into a page that is owned by
>> a device driver. This breaks coherency of the file vs dma, but the
>> assumption is that if userspace wants the file-space truncated it does
>> not matter what data is inbound from the device, it is not relevant
>> anymore.
>>
>> The assumptions of the truncate-page-cache model are broken by DAX where
>> the target DMA page *is* the filesystem block. Leaving the page pinned
>> for DMA, but truncating the file block out of the file, means that the
>> filesytem is free to reallocate a block under active DMA to another
>> file!
>>
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>> 1/ Fail truncate while any file blocks might be under dma
>>
>> 2/ Block (sleep-wait) truncate while any file blocks might be under
>> dma
>>
>> 3/ Remap file blocks to a "lost+found"-like file-inode where
>> dma can continue and we might see what inbound data from DMA was
>> mapped out of the original file. Blocks in this file could be
>> freed back to the filesystem when dma eventually ends.
>>
>> 4/ Disable dax until option 3 or another long term solution has been
>> implemented. However, filesystem-dax is still marked experimental
>> for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another. Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.
I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks, but a first approximation is wait for the
administrator to delete them or auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.
>
>
>> The solution introduces a new FL_ALLOCATED lease to pin the allocated
>> blocks in a dax file while dma might be accessing them. It behaves
>> identically to an FL_LAYOUT lease save for the fact that it is
>> immediately sheduled to be reaped, and that the only path that waits for
>> its removal is the truncate path. We can not reuse FL_LAYOUT directly
>> since that would deadlock in the case where userspace did a direct-I/O
>> operation with a target buffer backed by an mmap range of the same file.
>>
>> Credit / inspiration for option 3 goes to Dave Hansen, who proposed
>> something similar as an alternative way to solve the problem that
>> MAP_DIRECT was trying to solve.
>>
>> Cc: Jan Kara <jack@xxxxxxx>
>> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
>> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
>> Cc: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>
>> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
>> Cc: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx>
>> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>> Cc: Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx>
>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> Reported-by: Christoph Hellwig <hch@xxxxxx>
>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> ---
>> fs/Kconfig | 1
>> fs/dax.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> fs/locks.c | 17 ++++-
>> include/linux/dax.h | 23 ++++++
>> include/linux/fs.h | 22 +++++-
>> mm/gup.c | 27 ++++++-
>> 6 files changed, 268 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index 7aee6d699fd6..a7b31a96a753 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
>> config FS_DAX
>> bool "Direct Access (DAX) support"
>> depends on MMU
>> + depends on FILE_LOCKING
>> depends on !(ARM || MIPS || SPARC)
>> select FS_IOMAP
>> select DAX
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b03f547b36e7..e0a3958fc5f2 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -22,6 +22,7 @@
>> #include <linux/genhd.h>
>> #include <linux/highmem.h>
>> #include <linux/memcontrol.h>
>> +#include <linux/file.h>
>> #include <linux/mm.h>
>> #include <linux/mutex.h>
>> #include <linux/pagevec.h>
>> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>> }
>> }
>> EXPORT_SYMBOL_GPL(dax_iomap_fault);
>> +
>> +enum dax_lease_flags {
>> + DAX_LEASE_PAGES,
>> + DAX_LEASE_BREAK,
>> +};
>> +
>> +struct dax_lease {
>> + struct page **dl_pages;
>> + unsigned long dl_nr_pages;
>> + unsigned long dl_state;
>> + struct file *dl_file;
>> + atomic_t dl_count;
>> + /*
>> + * Once the lease is taken and the pages have references we
>> + * start the reap_work to poll for lease release while acquiring
>> + * fs locks that synchronize with truncate. So, either reap_work
>> + * cleans up the dax_lease instances or truncate itself.
>> + *
>> + * The break_work sleepily polls for DMA completion and then
>> + * unlocks/removes the lease.
>> + */
>> + struct delayed_work dl_reap_work;
>> + struct delayed_work dl_break_work;
>> +};
>> +
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> + if (atomic_dec_and_test(&dl->dl_count)) {
>> + fput(dl->dl_file);
>> + kfree(dl);
>> + }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.
I'll take a look.
>> +
>> +static void dax_lease_unlock_one(struct work_struct *work)
>> +{
>> + struct dax_lease *dl = container_of(work, typeof(*dl),
>> + dl_break_work.work);
>> + unsigned long i;
>> +
>> + /* wait for the gup path to finish recording pages in the lease */
>> + if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
>> + schedule_delayed_work(&dl->dl_break_work, HZ);
>> + return;
>> + }
>> +
>> + /* barrier pairs with dax_lease_set_pages() */
>> + smp_mb__after_atomic();
>> +
>> + /*
>> + * If we see all pages idle at least once we can remove the
>> + * lease. If we happen to race with someone else taking a
>> + * reference on a page they will have their own lease to protect
>> + * against truncate.
>> + */
>> + for (i = 0; i < dl->dl_nr_pages; i++)
>> + if (page_ref_count(dl->dl_pages[i]) > 1) {
>> + schedule_delayed_work(&dl->dl_break_work, HZ);
>> + return;
>> + }
>> + vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
>> + put_dax_lease(dl);
>> +}
>> +
>> +static void dax_lease_reap_all(struct work_struct *work)
>> +{
>> + struct dax_lease *dl = container_of(work, typeof(*dl),
>> + dl_reap_work.work);
>> + struct file *file = dl->dl_file;
>> + struct inode *inode = file_inode(file);
>> + struct address_space *mapping = inode->i_mapping;
>> +
>> + if (mapping->a_ops->dax_flush_dma) {
>> + mapping->a_ops->dax_flush_dma(inode);
>> + } else {
>> + /* FIXME: dax-filesystem needs to add dax-dma support */
>> + break_allocated(inode, true);
>> + }
>> + put_dax_lease(dl);
>> +}
>> +
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> + struct dax_lease *dl = fl->fl_owner;
>> +
>> + if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> + atomic_inc(&dl->dl_count);
>> + schedule_delayed_work(&dl->dl_break_work, HZ);
>> + }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a use-after
> free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.
Right, we hold a reference count at the beginning of time that is only
dropped when the lease is unlocked. If the break happens before unlock
we take this reference while the break_work is running. I'll add this
as a comment.