Re: [PATCH] Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
From: Greg KH
Date: Fri Jul 12 2019 - 11:28:20 EST
On Fri, Jul 12, 2019 at 04:18:06PM +0200, Max Kellermann wrote:
> This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.
>
> That commit caused a severe memory leak in nfs_readdir_make_qstr().
>
> When listing a directory with more than 100 files (this is how many
> struct nfs_cache_array_entry elements fit in one 4kB page), all
> allocated file name strings past those 100 leak.
>
> The root of the leakage is that those string pointers are managed in
> pages which are never linked into the page cache.
>
> fs/nfs/dir.c puts pages into the page cache by calling
> read_cache_page(); the callback function nfs_readdir_filler() will
> then fill the given page struct which was passed to it, which is
> already linked in the page cache (by do_read_cache_page() calling
> add_to_page_cache_lru()).
>
> Commit be4c2d4723a4 added another (local) array of allocated pages, to
> be filled with more data, instead of discarding excess items received
> from the NFS server. Those additional pages can be used by the next
> nfs_readdir_filler() call (from within the same nfs_readdir() call).
>
> The leak happens when some of those additional pages are never used
> (copied to the page cache using copy_highpage()). The pages will be
> freed by nfs_readdir_free_pages(), but their contents will not. The
> commit did not invoke nfs_readdir_clear_array() (and doing so would
> have been dangerous, because it did not track which of those pages
> were already copied to the page cache, risking double free bugs).
>
> How to reproduce the leak:
>
> - Use a kernel with CONFIG_SLUB_DEBUG_ON.
>
> - Create a directory on a NFS mount with more than 100 files with
> names long enough to use the "kmalloc-32" slab (so we can easily
> look up the allocation counts):
>
> for i in `seq 110`; do touch ${i}_0123456789abcdef; done
>
> - Drop all caches:
>
> echo 3 >/proc/sys/vm/drop_caches
>
> - Check the allocation counter:
>
> grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> 30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1
>
> - Request a directory listing and check the allocation counters again:
>
> ls
> [...]
> grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> 30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1
>
> There are now 120 new allocations.
>
> - Drop all caches and check the counters again:
>
> echo 3 >/proc/sys/vm/drop_caches
> grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
> 30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1
>
> 110 allocations are gone, but 10 have leaked and will never be freed.
>
> Unhelpfully, those allocations are explicitly excluded from KMEMLEAK,
> that's why my initial attempts with KMEMLEAK were not successful:
>
> /*
> * Avoid a kmemleak false positive. The pointer to the name is stored
> * in a page cache page which kmemleak does not scan.
> */
> kmemleak_not_leak(string->name);
>
> It would be possible to solve this bug without reverting the whole
> commit:
>
> - keep track of which pages were not used, and call
> nfs_readdir_clear_array() on them, or
> - manually link those pages into the page cache
>
> But for now I have decided to just revert the commit, because the real
> fix would require complex considerations, risking more dangerous
> (crash) bugs, which may seem unsuitable for the stable branches.
>
> Signed-off-by: Max Kellermann <mk@xxxxxxxxxx>
> ---
> fs/nfs/dir.c | 90 ++++-------------------------------------------
> fs/nfs/internal.h | 3 +-
> 2 files changed, 7 insertions(+), 86 deletions(-)
No cc: stable@vger on this patch to get it backported?
thanks,
greg k-h