Re: [PATCH v1 5/7] dax: Add huge page fault support

From: Dave Chinner
Date: Sun Oct 12 2014 - 21:13:58 EST

On Thu, Oct 09, 2014 at 04:47:16PM -0400, Matthew Wilcox wrote:
> On Wed, Oct 08, 2014 at 11:11:00PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 08, 2014 at 09:25:27AM -0400, Matthew Wilcox wrote:
> > > + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > + if (pgoff >= size)
> > > + return VM_FAULT_SIGBUS;
> > > + /* If the PMD would cover blocks out of the file */
> > > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > > + return VM_FAULT_FALLBACK;
> >
> > IIUC, zero padding would work too.
> The blocks after this file might be allocated to another file already.
> I suppose we could ask the filesystem if it wants to allocate them to
> this file.
> Dave, Jan, is it acceptable to call get_block() for blocks that extend
> beyond the current i_size?

In what context? XFS basically does nothing for certain cases (e.g.
read mapping for direct IO) where zeroes are always going to be
returned, so essentially filesystems right now may actually just
return a "hole" for any read mapping request beyond EOF.

If "create" is set, then we'll either create or map existing blocks
beyond EOF, because we have to reserve space or allocate blocks
before the EOF gets extended when the write succeeds fully...
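
For reference, the page arithmetic behind the two i_size checks quoted
above can be sketched in userspace C. This is only an illustration, not
the patch itself: it assumes 4KB pages and a 2MB PMD (so PG_PMD_COLOUR
works out to 511), and pmd_eof_check() and its enum are made-up names
standing in for the VM_FAULT_* return values.

```c
#include <stdint.h>

/* Assumed values: 4KB pages and 2MB PMDs, as on x86-64. */
#define PAGE_SHIFT      12
#define PAGE_SIZE       (UINT64_C(1) << PAGE_SHIFT)
#define PMD_SHIFT       21
#define PMD_SIZE        (UINT64_C(1) << PMD_SHIFT)
/* Mask of the page-index bits below one PMD (511 with the values above). */
#define PG_PMD_COLOUR   ((PMD_SIZE >> PAGE_SHIFT) - 1)

/* Hypothetical stand-ins for the VM_FAULT_* codes in the quoted hunk. */
enum pmd_check { PMD_OK, PMD_SIGBUS, PMD_FALLBACK };

/*
 * pgoff:  page index of the faulting address within the file
 * i_size: file size in bytes
 */
static enum pmd_check pmd_eof_check(uint64_t pgoff, uint64_t i_size)
{
	/* File size rounded up to whole pages. */
	uint64_t size = (i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	/* The faulting page itself lies beyond EOF. */
	if (pgoff >= size)
		return PMD_SIGBUS;

	/* The last page the PMD would cover lies beyond EOF. */
	if ((pgoff | PG_PMD_COLOUR) >= size)
		return PMD_FALLBACK;

	return PMD_OK;
}
```

With a 3MB file, a fault in the first 2MB passes both checks, a fault
in the third megabyte falls back to small pages because the PMD would
run past EOF, and a fault at or beyond page 768 gets SIGBUS.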

> > > + if (length < PMD_SIZE)
> > > + goto fallback;
> > > + if (pfn & PG_PMD_COLOUR)
> > > + goto fallback; /* not aligned */
> >
> > So, are you relying on pure luck to make get_block() allocate 2M aligned pfn?
> > Not really productive. You would need assistance from fs and
> > arch_get_unmapped_area() sides.
> Certainly ext4 and XFS will align their allocations; if you ask it for a
> 2MB block, it will try to allocate a 2MB block aligned on a 2MB boundary.

As a sweeping generalisation, that's wrong. Empty filesystems might
behave that way, but we don't *guarantee* that this sort of
alignment will occur.

XFS has several different extent alignment strategies and
none of them will always work that way. Many of them are dependent
on mkfs parameters, and even then are used only as *guidelines*.
Further, alignment is dependent on the size of the write being done
- on some filesystem configs a 2MB write might be aligned, but on
others it won't be. More complex still is that mount options can
change alignment behaviour, as can per-file extent size hints, as
can truncation that removes post-eof blocks...

IOWs, if you want the filesystem to guarantee alignment to the
underlying hardware in this way for DAX, we're going to need to make
some modifications to the allocator alignment strategy.
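
To make concrete what the allocator would have to deliver, here is a
rough userspace sketch of the two conditions the quoted hunk tests
before it will install a PMD. It assumes 4KB pages and a 2MB PMD;
pmd_mappable() is a hypothetical name, not a function from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 4KB pages and 2MB PMDs, as on x86-64. */
#define PAGE_SHIFT      12
#define PMD_SHIFT       21
#define PMD_SIZE        (UINT64_C(1) << PMD_SHIFT)
/* Mask of the pfn bits below one PMD (511 with the values above). */
#define PG_PMD_COLOUR   ((PMD_SIZE >> PAGE_SHIFT) - 1)

/*
 * pfn:    first page frame of the extent returned by the filesystem
 * length: contiguous bytes available starting at that pfn
 *
 * Returns true only if a single PMD could map the extent; any other
 * case corresponds to the "goto fallback" paths in the quoted hunk.
 */
static bool pmd_mappable(uint64_t pfn, uint64_t length)
{
	if (length < PMD_SIZE)
		return false;	/* extent too short for a 2MB mapping */
	if (pfn & PG_PMD_COLOUR)
		return false;	/* extent does not start on a 2MB boundary */
	return true;
}
```

Any extent that the allocator places even one page off a 2MB boundary,
or that is shorter than 2MB, fails this test and takes the fallback
path, which is why alignment used only as a guideline is not enough.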


Dave Chinner