Re: [PATCH] Revert "NFS: readdirplus optimization by cache mechanism" (memleak)

From: Trond Myklebust
Date: Fri Jul 12 2019 - 16:16:13 EST


On Fri, 2019-07-12 at 17:28 +0200, Greg KH wrote:
> On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote:
> > This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.
> >
> > That commit caused a severe memory leak in nfs_readdir_make_qstr().
> >
> > When listing a directory with more than 100 files (this is how many
> > struct nfs_cache_array_entry elements fit in one 4kB page), all
> > allocated file name strings past those 100 leak.
> >
> > The root of the leakage is that those string pointers are managed
> > in
> > pages which are never linked into the page cache.
> >
> > fs/nfs/dir.c puts pages into the page cache by calling
> > read_cache_page(); the callback function nfs_readdir_filler() will
> > then fill the given page struct which was passed to it, which is
> > already linked in the page cache (by do_read_cache_page() calling
> > add_to_page_cache_lru()).
> >
> > Commit be4c2d4723a4 added another (local) array of allocated pages,
> > to
> > be filled with more data, instead of discarding excess items
> > received
> > from the NFS server. Those additional pages can be used by the
> > next
> > nfs_readdir_filler() call (from within the same nfs_readdir()
> > call).
> >
> > The leak happens when some of those additional pages are never used
> > (copied to the page cache using copy_highpage()). The pages will
> > be
> > freed by nfs_readdir_free_pages(), but their contents will
> > not. The
> > commit did not invoke nfs_readdir_clear_array() (and doing so would
> > have been dangerous, because it did not track which of those pages
> > were already copied to the page cache, risking double free bugs).
> >
> > How to reproduce the leak:
> >
> > - Use a kernel with CONFIG_SLUB_DEBUG_ON.
> >
> > - Create a directory on a NFS mount with more than 100 files with
> > names long enough to use the "kmalloc-32" slab (so we can easily
> > look up the allocation counts):
> >
> > for i in `seq 110`; do touch ${i}_0123456789abcdef; done
> >
> > - Drop all caches:
> >
> > echo 3 >/proc/sys/vm/drop_caches
> >
> > - Check the allocation counter:
> >
> > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> > 30564391 nfs_readdir_add_to_array+0x73/0xd0
> > age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1
> >
> > - Request a directory listing and check the allocation counters
> > again:
> >
> > ls
> > [...]
> > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> > 30564511 nfs_readdir_add_to_array+0x73/0xd0
> > age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1
> >
> > There are now 120 new allocations.
> >
> > - Drop all caches and check the counters again:
> >
> > echo 3 >/proc/sys/vm/drop_caches
> > grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> > 30564401 nfs_readdir_add_to_array+0x73/0xd0
> > age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1
> >
> > 110 allocations are gone, but 10 have leaked and will never be
> > freed.
> >
> > Unhelpfully, those allocations are explicitly excluded from
> > KMEMLEAK,
> > that's why my initial attempts with KMEMLEAK were not successful:
> >
> > /*
> > * Avoid a kmemleak false positive. The pointer to the name is
> > stored
> > * in a page cache page which kmemleak does not scan.
> > */
> > kmemleak_not_leak(string->name);
> >
> > It would be possible to solve this bug without reverting the whole
> > commit:
> >
> > - keep track of which pages were not used, and call
> > nfs_readdir_clear_array() on them, or
> > - manually link those pages into the page cache
> >
> > But for now I have decided to just revert the commit, because the
> > real
> > fix would require complex considerations, risking more dangerous
> > (crash) bugs, which may seem unsuitable for the stable branches.
> >
> > Signed-off-by: Max Kellermann <mk@xxxxxxxxxx>
> > ---
> > fs/nfs/dir.c | 90 ++++---------------------------------------
> > ----
> > fs/nfs/internal.h | 3 +-
> > 2 files changed, 7 insertions(+), 86 deletions(-)
>
> No cc: stable@vger on this patch to get it backported?
>

I've added one. I've also backed out the 3 fixes for other problems
with the same patch that were in the linux-next tree. (St. Stephen
Rothwell please forgive me, for I have sinned...)

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx