Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

From: Nadav Amit
Date: Fri Oct 28 2022 - 19:57:51 EST


On Oct 27, 2022, at 2:44 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:

> Anyhow, admittedly, I need to give it more thought. For instance, with
> respect to the code that you mentioned (in zap_pte_range), after reading it
> again, it seems strange: how is it ok to defer the TLB flush until after the
> rmap was removed, even if it is done while the PTL is held?
> folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
> might later be re-dirtied using a stale cached PTE. Oh well.

Peter,

So flushing after the rmap has been removed does appear to be a problem. I
attach a PoC below that demonstrates it.

The problem is in the following code of zap_pte_range():

if (!PageAnon(page)) {
        if (pte_dirty(ptent)) {
                force_flush = 1;
                set_page_dirty(page);
        }

}
page_remove_rmap(page, vma, false);

Once we remove the rmap, rmap_walk() will no longer acquire the page-table
lock for this page. As a result, nothing prevents the kernel from performing
writeback and clearing the dirty bit of the page-struct before the TLB is
actually flushed. So if another thread still has the dirty PTE cached in its
TLB, it can keep writing to the page while nothing (PTE/page-struct) keeps
track of the fact that the page has been dirtied.

In other words, writes to the memory-mapped file that happen after
munmap()/MADV_DONTNEED has started can be lost.
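
To make the interleaving concrete, the schedule I am worried about looks
roughly like this (call chains abbreviated):

CPU0 (zap_pte_range)      | clear the (dirty) PTE, defer the TLB flush
                          | set_page_dirty(page)
                          | page_remove_rmap(page, vma, false)
CPU1 (writeback)          | folio_clear_dirty_for_io()
                          |   [the rmap is already gone, so nothing
                          |    serializes on the PTL and there is no PTE
                          |    left to write-protect]
CPU2 (user thread)        | keeps writing through the stale dirty TLB entry
CPU1 (writeback)          | writeback completes; the page is considered clean
CPU0 (zap_pte_range)      | flushes the TLB (force_flush) - too late, the
                          |   write from CPU2 is recorded nowhere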

This affects both munmap() and MADV_DONTNEED. One might argue that if you
keep writing after munmap()/MADV_DONTNEED has started, losing those writes is
your own fault (questionable), but if that's the case, why do we bother with
force_flush at all?

If we want it to behave correctly - i.e., for writes that race with
munmap()/MADV_DONTNEED to propagate to the file - we need to collect the
dirty pages and remove their rmap only after we flush the TLB (and before we
release the ptl). mmu_gather would probably need to be changed to support
this.
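
Just to illustrate the direction (a rough sketch, not a patch - the
"deferred" array is purely illustrative and would realistically be folded
into the mmu_gather machinery, and overflow/error handling is ignored):

if (!PageAnon(page) && pte_dirty(ptent)) {
        force_flush = 1;
        set_page_dirty(page);
        /* keep the rmap for now; remove it after the flush below */
        deferred[nr_deferred++] = page;
} else {
        page_remove_rmap(page, vma, false);
}

...

/* Still under the PTL, before pte_unmap_unlock(): flush first, ... */
if (force_flush)
        tlb_flush_mmu_tlbonly(tlb);
/* ... and only then drop the rmap, so writeback cannot race with stale
 * TLB entries. */
for (i = 0; i < nr_deferred; i++)
        page_remove_rmap(deferred[i], vma, false);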

Thoughts?

---

Some details about the PoC (below):

There are 3 threads that use a 2MB file-backed range:

1. One maintains a counter in the first 4KB of the 2MB range and checks that
its updates actually reach the memory.

2. One dirties the other pages in the 2MB range (just to make the race more
likely).

3. One either (i) maps the file again over the first 4KB with MAP_FIXED,
implicitly unmapping what was there (to test munmap indirectly); or (ii) runs
MADV_DONTNEED on the first 4KB.

In addition, a child process repeatedly runs msync() and
posix_fadvise(POSIX_FADV_DONTNEED) to write back and drop the first 4KB.

The PoC should be run with a file on a RAM block device. It manages to
trigger the issue relatively reliably (within 60 seconds) with munmap() and
slightly less reliably with MADV_DONTNEED. I have no idea whether it works in
a VM, with deep C-states, etc.

-- >8 --

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE (4096ul)
#define PAGES_PER_PMD (512)
#define HPAGE_SIZE (PAGE_SIZE * PAGES_PER_PMD)

// Comment out MUNMAP_TEST to run the MADV_DONTNEED test instead
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
        int i;

        while (!stop) {
                for (i = 1; i < PAGES_PER_PMD; i++) {
                        *(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
                }
        }
        return NULL;
}

void *checking_thread(void *arg)
{
        volatile unsigned long *ul_p = (volatile unsigned long*)p;
        unsigned long cnt = 0;

        while (!stop) {
                *ul_p = cnt;
                if (*ul_p != cnt) {
                        printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
                        kill(flusher_pid, SIGTERM);
                        exit(0);
                }
                cnt++;
        }
        return NULL;
}

void *remap_thread(void *arg)
{
        void *ptr;
        struct timespec t = {
                .tv_nsec = 10000,
        };

        while (!stop) {
#ifdef MUNMAP_TEST
                ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
                           MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
                if (ptr == MAP_FAILED)
                        handle_error("remap_thread");
#else
                if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
                        handle_error("MADV_DONTNEED");
                nanosleep(&t, NULL);
#endif
        }
        return NULL;
}

void flushing_process(void)
{
        // Remove the pages to speed up rmap_walk and allow to drop caches.
        if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
                handle_error("MADV_DONTNEED");

        while (true) {
                if (msync(p, PAGE_SIZE, MS_SYNC))
                        handle_error("msync");
                if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
                        handle_error("posix_fadvise");
        }
}

int main(int argc, char *argv[])
{
        void *(*thread_funcs[])(void*) = {
                &dirtying_thread,
                &checking_thread,
                &remap_thread,
        };
        int r, i;
        int rc1, rc2;
        unsigned long addr;
        void *ptr;
        char *page = malloc(PAGE_SIZE);
        int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
        pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
        pid_t pid;

        if (argc < 2) {
                fprintf(stderr, "usage: %s [filename]\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        fd = open(argv[1], O_RDWR|O_CREAT, 0666);
        if (fd == -1)
                handle_error("open fd");

        for (i = 0; i < PAGES_PER_PMD; i++) {
                if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
                        handle_error("write");
        }
        free(page);

        ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

        if (ptr == MAP_FAILED)
                handle_error("mmap anon");

        addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        printf("starting...\n");

        ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
                   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

        if (ptr == MAP_FAILED)
                handle_error("mmap file - start");

        p = ptr;

        for (i = 0; i < n_threads; i++) {
                r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
                if (r)
                        handle_error("pthread_create");
        }

        // Run the flushing process in a different process, so msync() would
        // not require mmap_lock.
        pid = fork();
        if (pid == 0)
                flushing_process();
        flusher_pid = pid;

        sleep(60);

        stop = true;
        for (i = 0; i < n_threads; i++)
                pthread_join(threads[i], NULL);
        kill(flusher_pid, SIGTERM);
        printf("Finished without an error\n");

        exit(0);
}