Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
From: Dave Hansen
Date: Fri May 31 2024 - 12:13:16 EST
On 5/31/24 02:19, Byungchul Park wrote:
..
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..03683bf66031 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2872,6 +2872,12 @@ static inline void file_end_write(struct file *file)
> if (!S_ISREG(file_inode(file)->i_mode))
> return;
> sb_end_write(file_inode(file)->i_sb);
> +
> + /*
> + * XXX: If needed, can be optimized by avoiding luf_flush() if
> + * the address space of the file has never been involved by luf.
> + */
> + luf_flush();
> }
..
> +void luf_flush(void)
> +{
> + unsigned long flags;
> + unsigned short int ugen;
> +
> + /*
> + * Obtain the latest ugen number.
> + */
> + spin_lock_irqsave(&luf_lock, flags);
> + ugen = luf_gen;
> + spin_unlock_irqrestore(&luf_lock, flags);
> +
> + check_luf_flush(ugen);
> +}
Am I reading this right? There's now an unconditional global spinlock
acquired in the sys_write() path? How can this possibly scale?
So, yeah, I think an optimization is absolutely needed. But, on a more
fundamental level, I just don't believe these patches are being tested.
Even a simple microbenchmark should show a pretty nasty regression on
any decently large system:
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/write1.c
Second, I was just pointing out sys_write() as an example of how the
page cache could change. Couldn't a separate, read/write mmap() of the
file do the same thing and *not* go through sb_end_write()?
So:
fd = open("foo");
ptr1 = mmap(fd, PROT_READ);
ptr2 = mmap(fd, PROT_READ|PROT_WRITE);
foo = *ptr1; // populate the page cache
... page cache page is reclaimed and LUF'd
*ptr2 = bar; // new page cache page is allocated and written to
printk("*ptr1: %d\n", *ptr1);
Doesn't the printk() see stale data?
I think tglx would call all of this "tinkering". The approach to this
series is to "fix" narrow, specific cases that reviewers point out, make
it compile, then send it out again, hoping someone will apply it.
So, for me, until the approach to this series changes: NAK, for x86.
Andrew, please don't take this series. Or, if you do, please drop the
patch enabling it on x86.
I also have the feeling our VFS friends won't take kindly to having
random luf_foo() hooks in their hot paths, optimized or not. I don't
see any of them on cc.