Re: [PATCH] filesystem-dax: Disable PMD support

From: Dan Williams
Date: Thu Jun 27 2019 - 15:09:43 EST


On Thu, Jun 27, 2019 at 11:58 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Thu, Jun 27, 2019 at 11:29 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 27, 2019 at 9:06 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> > > > > Ever since the conversion of DAX to the XArray, a RocksDB benchmark has
> > > > > been encountering intermittent lockups. The backtraces always include
> > > > > the filesystem-DAX PMD path, multi-order entries have been a source of
> > > > > bugs in the past, and disabling the PMD path allows a test that fails in
> > > > > minutes to run for an hour.
> > > >
> > > > On May 4th, I asked you:
> > > >
> > > > Since this is provoked by a fatal signal, it must have something to do
> > > > with a killable or interruptible sleep. There's only one of those in the
> > > > DAX code: fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do
> > > > I/O with write() or through a writable mmap()? I'd like to know before
> > > > I chase too far down this fault tree analysis.
> > >
> > > RocksDB in this case is using write() for writes and mmap() for reads.
> >
> > It's not clear to me that a fatal signal is a component of the failure
> > as much as it's the way to detect that the benchmark has indeed locked
> > up.
>
> Even though db_bench is run with the mmap_read=1 option:
>
> cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
> --use_existing_db=1 \
> --mmap_read=1 \
> --num=$num_keys \
> --threads=$num_read_threads \
>
> When the lockup occurs, there are db_bench processes in the write fault path:
>
> [ 1666.635212] db_bench D 0 2492 2435 0x00000000
> [ 1666.641339] Call Trace:
> [ 1666.644072] ? __schedule+0x24f/0x680
> [ 1666.648162] ? __switch_to_asm+0x34/0x70
> [ 1666.652545] schedule+0x29/0x90
> [ 1666.656054] get_unlocked_entry+0xcd/0x120
> [ 1666.660629] ? dax_iomap_actor+0x270/0x270
> [ 1666.665206] grab_mapping_entry+0x14f/0x230
> [ 1666.669878] dax_iomap_pmd_fault.isra.42+0x14d/0x950
> [ 1666.675425] ? futex_wait+0x122/0x230
> [ 1666.679518] ext4_dax_huge_fault+0x16f/0x1f0
> [ 1666.684288] __handle_mm_fault+0x411/0x1350
> [ 1666.688961] ? do_futex+0xca/0xbb0
> [ 1666.692760] ? __switch_to_asm+0x34/0x70
> [ 1666.697144] handle_mm_fault+0xbe/0x1e0
> [ 1666.701429] __do_page_fault+0x249/0x4f0
> [ 1666.705811] do_page_fault+0x32/0x110
> [ 1666.709903] ? page_fault+0x8/0x30
> [ 1666.713702] page_fault+0x1e/0x30
>
> ...where __handle_mm_fault+0x411 is in wp_huge_pmd():
>
> (gdb) li *(__handle_mm_fault+0x411)
> 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
> 3795 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
> 3796 {
> 3797 if (vma_is_anonymous(vmf->vma))
> 3798 return do_huge_pmd_wp_page(vmf, orig_pmd);
> 3799 if (vmf->vma->vm_ops->huge_fault)
> 3800 return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> 3801
> 3802 /* COW handled on pte level: split pmd */
> 3803 VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
> 3804 __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>
> This bug feels like we failed to unlock, or unlocked the wrong entry,
> and this hunk in the bisected commit looks suspect to me. Why do we
> still need to drop the lock now that the radix_tree_preload() calls
> are gone?

Never mind: unmap_mapping_pages() takes a sleeping lock, but then I
wonder why we don't restart the lookup like the old implementation did.
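
For reference, here is roughly the shape I have in mind -- illustrative
only, not the actual grab_mapping_entry() hunk; it leaves out
dax_lock_entry(), the zero/empty-entry checks and the error paths, and
entry_needs_pmd_downgrade() is a made-up placeholder for those checks:

	/*
	 * (A) What the XArray version appears to do: drop the lock around
	 * the sleeping unmap_mapping_pages() call, then keep going with
	 * the entry that was looked up before the unlock.
	 */
	xas_unlock_irq(xas);
	unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR,
			    PG_PMD_NR, false);
	xas_reset(xas);
	xas_lock_irq(xas);
	/* ... continue using 'entry' with no revalidation ... */

	/*
	 * (B) The restart-the-lookup shape of the old radix tree code,
	 * expressed with the XArray API: anything may have changed while
	 * the lock was dropped, so go back and redo the lookup.
	 */
retry:
	xas_lock_irq(xas);
	entry = get_unlocked_entry(xas);
	if (entry_needs_pmd_downgrade(entry)) {
		xas_unlock_irq(xas);
		unmap_mapping_pages(mapping, xas->xa_index & ~PG_PMD_COLOUR,
				    PG_PMD_NR, false);
		goto retry;
	}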