Re: [PATCH 2/2] shmem: fix splicing from a hole while it's punched

From: Michal Hocko
Date: Fri Jul 25 2014 - 10:33:52 EST


On Tue 15-07-14 03:33:02, Hugh Dickins wrote:
> shmem_fault() is the actual culprit in trinity's hole-punch starvation,
> and the most significant cause of such problems: a faulted page is one
> that then appears page_mapped(), needing unmap_mapping_range() and
> i_mmap_mutex before it can be unmapped again.
>
> But it is not the only way in which a page can be brought into a hole in
> the radix_tree while that hole is being punched; and Vlastimil's testing
> implies that if enough other processors are busy filling in the hole,
> then shmem_undo_range() can be kept from completing indefinitely.
>
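For anyone following along, the loop being reverted looked roughly like
this (a condensed sketch of the second pass of shmem_undo_range() in
3.16-rc5, not verbatim; the full context is in the diff below):

	index = start;
	for ( ; ; ) {
		cond_resched();
		pvec.nr = find_get_entries(mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				pvec.pages, indices);
		if (!pvec.nr) {
			/* hole-punch only breaks out on an empty rescan */
			if (index == start || unfalloc)
				break;
			index = start;
			continue;
		}
		/* ... free whatever was found, advancing index ... */
	}

So for hole-punch (where unfalloc is false), concurrent faulters refilling
the range behind the scan can keep that empty rescan from ever succeeding.
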
> shmem_file_splice_read() is the main other user of SGP_CACHE, which
> can instantiate shmem pagecache pages in the read-only case (without
> holding i_mutex, so perhaps concurrently with a hole-punch). It is
> probably silly not to use SGP_READ there already (substituting the
> ZERO_PAGE for holes): that ought to be safe, but might bring surprises,
> so it is not a change to be rushed.
>
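For reference, the instantiating call in question, next to what SGP_READ
does in the read path (condensed from the 3.16-rc5 sources, not verbatim):

	/* shmem_file_splice_read(): SGP_CACHE fills a hole with a new
	 * pagecache page, without i_mutex, so possibly mid-punch */
	error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL);

	/* shmem_file_read_iter() with SGP_READ: a hole stays a hole,
	 * and the caller copies from the ZERO_PAGE instead */
	error = shmem_getpage(inode, index, &page, SGP_READ, NULL);
	...
	if (!page)
		page = ZERO_PAGE(0);
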
> shmem_read_mapping_page_gfp() is an internal interface used by
> drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
> shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
> called internally by the kernel (perhaps for a stacking filesystem,
> which might rely on holes to be reserved): it's unclear whether it
> could be provoked to keep hole-punch busy or not.
>
> We could apply the same umbrella as now used in shmem_fault() to
> shmem_file_splice_read() and the others; but it looks ugly, and use
> over a range raises questions: should it actually be per page? Can
> these get starved themselves?
>
> The origin of this part of the problem is my v3.1 commit d0823576bf4b
> ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
> into shmem.c. It seemed like a nice idea at the time, to ensure
> (barring RCU lookup fuzziness) that there's an instant when the entire
> hole is empty; but the indefinitely repeated scans to ensure that make
> it vulnerable.
>
> Revert that "enhancement" to hole-punch from shmem_undo_range(), but
> retain the unproblematic rescanning when it's truncating; add a couple
> of comments there.
>
> Remove the "indices[0] >= end" test: that is now handled satisfactorily
> by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
> to be worth avoiding here.
>
> But if we do not always loop indefinitely, we do need to handle the case
> of swap swizzled back to page before shmem_free_swap() gets it: add a
> retry for that case, as suggested by Konstantin Khlebnikov; and for the
> case of page swizzled back to swap, as suggested by Johannes Weiner.
>
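For reference on the swap swizzle: shmem_free_swap() (condensed from
3.16-rc5) succeeds only if the radix_tree slot still holds the expected
swap entry, which is what makes the retry below necessary:

	static int shmem_free_swap(struct address_space *mapping,
				   pgoff_t index, void *radswap)
	{
		void *old;

		spin_lock_irq(&mapping->tree_lock);
		/* delete the slot only if it still holds this entry */
		old = radix_tree_delete_item(&mapping->page_tree,
					     index, radswap);
		spin_unlock_irq(&mapping->tree_lock);
		if (old != radswap)
			return -ENOENT;	/* swizzled to page: retry */
		free_swap_and_cache(radix_to_swp_entry(radswap));
		return 0;
	}
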
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Konstantin Khlebnikov <koct9i@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Lukas Czerner <lczerner@xxxxxxxxxx>
> Cc: Dave Jones <davej@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> [3.1+]

Reviewed-by: Michal Hocko <mhocko@xxxxxxx>

> ---
> Please replace mmotm's
> shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> by this patch, which is substantially the same, but with an updated
> commit comment, and a page retry as suggested by Hannes.
>
> mm/shmem.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> --- 3.16-rc5+/mm/shmem.c 2014-07-14 20:34:28.196153828 -0700
> +++ 3.16-rc5++/mm/shmem.c 2014-07-14 20:35:14.156154916 -0700
> @@ -468,23 +468,20 @@ static void shmem_undo_range(struct inod
> return;
>
> index = start;
> - for ( ; ; ) {
> + while (index < end) {
> cond_resched();
>
> pvec.nr = find_get_entries(mapping, index,
> min(end - index, (pgoff_t)PAGEVEC_SIZE),
> pvec.pages, indices);
> if (!pvec.nr) {
> - if (index == start || unfalloc)
> + /* If all gone or hole-punch or unfalloc, we're done */
> + if (index == start || end != -1)
> break;
> + /* But if truncating, restart to make sure all gone */
> index = start;
> continue;
> }
> - if ((index == start || unfalloc) && indices[0] >= end) {
> - pagevec_remove_exceptionals(&pvec);
> - pagevec_release(&pvec);
> - break;
> - }
> mem_cgroup_uncharge_start();
> for (i = 0; i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
> @@ -496,8 +493,12 @@ static void shmem_undo_range(struct inod
> if (radix_tree_exceptional_entry(page)) {
> if (unfalloc)
> continue;
> - nr_swaps_freed += !shmem_free_swap(mapping,
> - index, page);
> + if (shmem_free_swap(mapping, index, page)) {
> + /* Swap was replaced by page: retry */
> + index--;
> + break;
> + }
> + nr_swaps_freed++;
> continue;
> }
>
> @@ -506,6 +507,11 @@ static void shmem_undo_range(struct inod
> if (page->mapping == mapping) {
> VM_BUG_ON_PAGE(PageWriteback(page), page);
> truncate_inode_page(mapping, page);
> + } else {
> + /* Page was replaced by swap: retry */
> + unlock_page(page);
> + index--;
> + break;
> }
> }
> unlock_page(page);
>

--
Michal Hocko
SUSE Labs