Re: [PATCH RESEND v15 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

From: Muhammad Usama Anjum
Date: Wed May 24 2023 - 07:26:53 EST


On 5/24/23 12:43 AM, Peter Xu wrote:
> Hi, Muhammad,
>
> On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
>> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
>>> On 4/26/23 7:13 PM, Peter Xu wrote:
>>>> Hi, Muhammad,
>>>>
>>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>>>>> +/* Supported flags */
>>>>>> +#define PM_SCAN_OP_GET (1 << 0)
>>>>>> +#define PM_SCAN_OP_WP (1 << 1)
>>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>>>>> be specified as needed. But PM_SCAN_OP_WP cannot be specified without
>>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>>>>
>>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>>>>> vs
>>>>> 2) UFFDIO_WRITEPROTECT
>>>>>
>>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>>>>> getting really good performance, comparable to what we get when
>>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>>>>> performance and behavior wise.
>>>>>
>>>>> I've got results from someone else showing that UFFDIO_WRITEPROTECT
>>>>> somehow blocks page faults, which PAGEMAP_IOCTL doesn't. I still need to verify this
>>>>> as I don't have tests comparing them one-to-one.
>>>>>
>>>>> What are your thoughts about it? Have you thought about making
>>>>> UFFDIO_WRITEPROTECT perform better?
>>>>>
>>>>> I'm sorry to mention the word "performance" here. Actually we want better
>>>>> performance to emulate Windows syscall. That is why we are adding this
>>>>> functionality. So either we need to see what can be improved in
>>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>>>>> pagemap_ioctl?
>>>>
>>>> I'm fine if you want to add it back if it works for you. Though before
>>>> that, could you remind me why there can be a difference on performance?
>>> I've looked at the code again and I think I've found something. Let's look
>>> at exact performance numbers:
>>>
>>> I've run 2 different tests. In the first test UFFDIO_WRITEPROTECT is used
>>> for engaging WP. In the second test PM_SCAN_OP_WP is used. I've measured
>>> the average write time to the same memory which is being WP-ed and the
>>> total execution time of these APIs:
>
> What are the steps of the test? Is it as simple as "writeprotect",
> "unprotect", then write all pages in a single thread?
>
> Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
>
> Maybe you can attach the test program here too.

I hadn't attached the test earlier because I thought you wouldn't be
interested in running it. I've attached it now. The test is multi-threaded:
main() calls the pagemap_scan ioctl to get the status of the flags and reset
them for the memory area, while N other threads write to that memory.

I usually run the test as follows, with a memory area of 100000 pages:
./win2_linux 8 100000 1 1 0
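
To make the arguments concrete: they are <nthreads> <npages> <watch_delay_ms>
<random_access> <read_reset>, and each writer thread gets an equal, contiguous
slice of the area. Below is a minimal sketch of that split. The struct slice
and slice_for_thread() names are made up for illustration and do not exist in
the attached test; the check mirrors the one main() does on npages.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 0x1000UL

/* Hypothetical helper type: one writer thread's share of the test area. */
struct slice {
	size_t offset;	/* byte offset from the start of the mapping */
	size_t npages;	/* number of pages owned by this thread */
};

/*
 * Same rule as in main(): npages must be at least nthreads and evenly
 * divisible by it. Returns 0 on success, -1 on invalid arguments.
 */
int slice_for_thread(size_t npages, size_t nthreads, size_t i,
		     struct slice *out)
{
	assert(out != NULL);

	if (nthreads == 0 || i >= nthreads)
		return -1;
	if (npages < nthreads || npages % nthreads)
		return -1;

	out->npages = npages / nthreads;
	out->offset = PAGE_SIZE * i * out->npages;
	return 0;
}
```

With 8 threads and 100000 pages, every thread writes to its own 12500 pages
and the last slice starts at byte offset 358400000.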

I'm running the tests on real hardware, and only on x86_64. The results are
pretty consistent: PM_SCAN_OP_WP beats UFFDIO_WRITEPROTECT every time.
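
For reference, the slowdown factors can be recomputed from the two tables
quoted further down. This is only a small arithmetic sketch: the numbers are
copied from those tables, and the slowdown() and mean_slowdown() helpers are
names I made up here, not part of the test program.

```c
#include <assert.h>
#include <stddef.h>

#define NCOLS 4	/* columns: 2000, 8192, 100000 and 500000 pages */

/* Numbers copied from the two tables quoted in this mail. */
static const double uffd_write[NCOLS] = { 2200, 2300, 4100, 4200 };
static const double scan_write[NCOLS] = { 2000, 2300, 2500, 2800 };
static const double uffd_exec[NCOLS]  = { 3200, 14000, 59000, 58000 };
static const double scan_exec[NCOLS]  = { 1900, 7000, 38000, 40000 };

/* Slowdown of UFFDIO_WRITEPROTECT relative to PM_SCAN_OP_WP, one column. */
double slowdown(const double *uffd, const double *scan, size_t col)
{
	assert(col < NCOLS && scan[col] > 0);
	return uffd[col] / scan[col];
}

/* Arithmetic mean of the per-column slowdowns. */
double mean_slowdown(const double *uffd, const double *scan)
{
	double sum = 0.0;

	for (size_t i = 0; i < NCOLS; i++)
		sum += slowdown(uffd, scan, i);
	return sum / NCOLS;
}
```

The mean write-time slowdown comes out at about 1.31. The execution-time
slowdown is between 1.45 and 2.0 depending on the column, and about 1.5 for
the two largest sizes.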

PM_SCAN_OP_WP cannot be used on its own (without PM_SCAN_OP_GET) in v15, so
please find the updated WIP code here:
https://gitlab.collabora.com/usama.anjum/linux-mainline/-/commits/memwatchv16/

>
>>>
>>> **avg write time:**
>>> | No of pages         | 2000 | 8192 | 100000 | 500000 |
>>> |---------------------|------|------|--------|--------|
>>> | UFFDIO_WRITEPROTECT | 2200 | 2300 | 4100   | 4200   |
>>> | PM_SCAN_OP_WP       | 2000 | 2300 | 2500   | 2800   |
>>>
>>> **Execution time measured in rdtsc:**
>>> | No of pages         | 2000 | 8192  | 100000 | 500000 |
>>> |---------------------|------|-------|--------|--------|
>>> | UFFDIO_WRITEPROTECT | 3200 | 14000 | 59000  | 58000  |
>>> | PM_SCAN_OP_WP       | 1900 | 7000  | 38000  | 40000  |
>>>
>>> The average write time is about 1.3 times slower with UFFDIO_WRITEPROTECT,
>>> and its execution time is about 1.5 times slower. So UFFDIO_WRITEPROTECT
>>> both slows down writes to the pages and takes longer to execute.
>>>
>>> This shows that PM_SCAN_OP_WP performs better than UFFDIO_WRITEPROTECT.
>>> Although PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT are implemented differently,
>>> we should have seen no difference in performance, yet the difference is
>>> large. PM_SCAN_OP_WP takes the mm read lock and uses walk_page_range(),
>>> which finds the VMAs in the address range and walks over them, with
>>> pagemap_scan_pmd_entry() handling most of the work, including tlb
>>> flushing. UFFDIO_WRITEPROTECT also takes the mm lock, iterates through
>>> the page directory levels until a pte is found, then updates the flags
>>> there and flushes the tlb for every pte.
>>>
>>> My next deduction would be that we are getting worse performance because we
>>> flush the tlb one page at a time in the case of UFFDIO_WRITEPROTECT, while
>>> we flush the tlb for (mostly) 512 pages at a time in the case of
>>> PM_SCAN_OP_WP. I've just verified this by adding some logs to
>>> change_pte_range() and pagemap_scan_pmd_entry(). Logs are attached. I've
>>> allocated memory of 1000 pages and write-protected it with
>>> UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. The logs show that
>>> UFFDIO_WRITEPROTECT flushed the tlb 1000 times, 1 page each time, while
>>> PM_SCAN_OP_WP flushed only 3 times, with bigger sizes. I've learned from
>>> past experience that a tlb flush is very expensive. Probably this is what
>>> we need to improve if we don't want to add PM_SCAN_OP_WP?
>>>
>>> UFFDIO_WRITEPROTECT uses change_pte_range(), which is a very generic
>>> function, and I'm not sure if we can avoid tlb flushes there when uffd_wp
>>> is true. We could try to flush somewhere else, ideally only once. It will
>>> not be straightforward to move away from the generic function. Thoughts?
>> I've just tested this theory by dropping the per-pte flushes and doing only
>> one flush on the entire range in uffd_wp_range(). But it didn't improve the
>> situation either. I was wrong that tlb flushes may be the cause.
>
> I had a feeling that you were trapping tlb_flush_pte_range(), which is
> actually not really sending any TLB flushes but updating mmu_gather object
> for the addr range for future invalidations.
>
> That's probably why it didn't show an effect when you comment it out.
Yeah, probably.
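
As a side note on the "3 times" in my logs for the 1000-page area: that count
is what one would expect if there is one batch per PMD entry (512 pages of
4 KiB, i.e. 2 MiB on x86_64) and the mapping does not start on a 2 MiB
boundary. A minimal sketch of the arithmetic, assuming exactly one batch per
PMD entry touched; pmd_spans() is a made-up helper, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, as in the test */
#define PMD_SHIFT  21	/* 512 pages = 2 MiB per PMD entry on x86_64 */

/* Number of PMD-sized ranges touched by npages pages starting at start. */
uint64_t pmd_spans(uint64_t start, uint64_t npages)
{
	assert(npages > 0);

	uint64_t end = start + (npages << PAGE_SHIFT);	/* exclusive */
	uint64_t first = start >> PMD_SHIFT;
	uint64_t last = (end - 1) >> PMD_SHIFT;

	return last - first + 1;
}
```

A 1000-page range starting on a 2 MiB boundary touches 2 PMD entries, while
the same range starting 1 MiB into a PMD touches 3. This only accounts for
the count seen in the logs, not for the performance difference.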

>
> I am not sure whether the wr-protect path difference can be caused by the
> arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode().
>
> On x86 I saw that it's actually hooked onto some PV calls. I had a feeling
> that this is for optimization only, but maybe it's still a good idea you
> also take that into your new code:
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> PVOP_VCALL0(mmu.lazy_mode.enter);
> }
I've just looked into it. It isn't making any difference. But I think I
should include it in the code. It must be helpful for hypervisors etc.

>
> The other thing is I think you're flushing tlb outside pgtable lock in your
> new code. IIUC that's racy, see:
>
> commit 6ce64428d62026a10cb5d80138ff2f90cc21d367
> Author: Nadav Amit <namit@xxxxxxxxxx>
> Date: Fri Mar 12 21:08:17 2021 -0800
>
> mm/userfaultfd: fix memory corruption due to writeprotect
>
> So you may want to put it at least into pgtable lock critical section, or
> IIUC you can also do inc_tlb_flush_pending() then dec_tlb_flush_pending()
> just like __tlb_gather_mmu(), to make sure do_wp_page() will properly flush
> the page when unluckily hit some of the page.
Good point. I'll release the page table lock only after the tlb flush. I've
just added this to the next WIP v16.

>
> That's also the spot (the flush_tlb_page() in 6ce64428d) that made me think
> on whether it caused the slowness on writing to those pages. But it
> really depends on your test program, e.g. if it's a single threaded I don't
> think it'll trigger because when writing mm_tlb_flush_pending() should
> start to return 0 already, so the tlb flush should logically not be needed.
> If you want maybe you can double check that.
>
> So in short, I had a feeling that the new PM_SCAN_OP_WP just misses
> something here and there so it's faster - it means even if it's faster it
> may also be prone to race conditions etc so we'd better figure it out..
The test program is multi-threaded. The performance numbers cannot be
reproduced with a single-threaded application.

>
> Thanks,
>

--
BR,
Muhammad Usama Anjum

/* _GNU_SOURCE must be defined before any header is included. */
#define _GNU_SOURCE
#define _OPEN_THREADS
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <pthread.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* getpid() returns an int-sized pid_t, so print it with %x rather than %lx. */
#define LOG(format, ...) {printf("%x:%s: " format, (unsigned int)getpid(), __func__ __VA_OPT__(,)__VA_ARGS__);}

#define MAX_THREAD_COUNT 64

#define PAGE_SIZE 0x1000

/* Test duration in milliseconds; parenthesized so it expands safely. */
#define TEST_TIME ((3.0 * 1000.0) * 10)

/* Polled by the writer threads, so it must not be cached in a register. */
static volatile bool finish;
static int nthreads;

static volatile long long raw_writes_count, writes_time;
static char *mem;

static bool random_access, read_reset;

#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
unsigned hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#endif

struct thread_info
{
volatile void *mem;
size_t npages;
};

void *thread_proc(void *data)
{
struct thread_info *info = data;
volatile unsigned char *m = info->mem;
size_t page;
unsigned long long t1, t2;

page = 0;
while (!finish)
{
t1 = rdtsc();

if (random_access)
page = rand() % info->npages;
else
page = (page + 1) % info->npages;

++*(volatile unsigned int *)(m + page * PAGE_SIZE);
__atomic_add_fetch (&raw_writes_count, 1, __ATOMIC_RELAXED);
t2 = rdtsc();
__atomic_add_fetch (&writes_time, t2 - t1, __ATOMIC_RELAXED);
}

return 0;
}

double curr_time_ms(void)
{
struct timespec current_time;

clock_gettime(CLOCK_MONOTONIC, &current_time);
/* Use floating-point arithmetic so sub-millisecond precision is kept. */
return current_time.tv_sec * 1000.0 +
current_time.tv_nsec / 1000000.0;
}

static double rdtsc_c;


//static double inline mcs_from_rdtsc_avg(void *tsc, unsigned long long count)
//{
// return (unsigned int)tsc * rdtsc_c / count;
//}

////////////////////////////////////////////////////////////////////////////////////////////////
#define UFFD_FEATURE_WP_UNPOPULATED (1<<13)
#define UFFD_FEATURE_WP_ASYNC (1<<14)

#ifndef PAGEMAP_SCAN
/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
#define PAGE_IS_WRITTEN (1 << 0)
#define PAGE_IS_FILE (1 << 1)
#define PAGE_IS_PRESENT (1 << 2)
#define PAGE_IS_SWAPPED (1 << 3)

/*
* struct page_region - Page region with bitmap flags
* @start: Start of the region
* @len: Length of the region
 * @bitmap: Bits set for the region
*/
struct page_region {
unsigned long long start;
unsigned long long len;
unsigned long long bitmap;
};

/*
* struct pm_scan_arg - Pagemap ioctl argument
* @size: Size of the structure
* @flags: Flags for the IOCTL
* @start: Starting address of the region
* @len: Length of the region (All the pages in this length are included)
* @vec: Address of page_region struct array for output
* @vec_len: Length of the page_region struct array
* @max_pages: Optional max return pages
* @required_mask: Required mask - All of these bits have to be set in the PTE
* @anyof_mask: Any mask - Any of these bits are set in the PTE
* @excluded_mask: Exclude mask - None of these bits are set in the PTE
* @return_mask: Bits that are to be reported in page_region
*/
struct pm_scan_arg {
unsigned long long size;
unsigned long long flags;
unsigned long long start;
unsigned long long len;
unsigned long long vec;
unsigned long long vec_len;
unsigned long long max_pages;
unsigned long long required_mask;
unsigned long long anyof_mask;
unsigned long long excluded_mask;
unsigned long long return_mask;
};

#define PM_SCAN_OP_GET (1 << 0)
#define PM_SCAN_OP_WP (1 << 1)

/* Pagemap ioctl */
#define PAGEMAP_SCAN _IOWR('f', 16, struct pm_scan_arg)

#endif

#define __NR_userfaultfd 323

#define PAGEMAP "/proc/self/pagemap"
int pagemap_fd;
int uffd;

static long pagemap_ioctl(void *start, int len, void *vec, int vec_len, int flag,
int max_pages, long required_mask, long anyof_mask, long excluded_mask,
long return_mask)
{
struct pm_scan_arg arg;

arg.start = (uintptr_t)start;
arg.len = len;
arg.vec = (uintptr_t)vec;
arg.vec_len = vec_len;
arg.flags = flag;
arg.size = sizeof(struct pm_scan_arg);
arg.max_pages = max_pages;
arg.required_mask = required_mask;
arg.anyof_mask = anyof_mask;
arg.excluded_mask = excluded_mask;
arg.return_mask = return_mask;

return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}

int init_uffd(void)
{
struct uffdio_api uffdio_api;

uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
if (uffd == -1)
exit(1);

uffdio_api.api = UFFD_API;
uffdio_api.features = UFFD_FEATURE_WP_UNPOPULATED | UFFD_FEATURE_WP_ASYNC |
UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
if (ioctl(uffd, UFFDIO_API, &uffdio_api))
exit(1);

if (!(uffdio_api.api & UFFDIO_REGISTER_MODE_WP) ||
!(uffdio_api.features & UFFD_FEATURE_WP_UNPOPULATED) ||
!(uffdio_api.features & UFFD_FEATURE_WP_ASYNC) ||
!(uffdio_api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM))
exit(1);

return 0;
}

int wp_init(void *lpBaseAddress, int dwRegionSize)
{
struct uffdio_register uffdio_register;
struct uffdio_writeprotect wp;

uffdio_register.range.start = (unsigned long)lpBaseAddress;
uffdio_register.range.len = dwRegionSize;
uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
exit(1);

if (!(uffdio_register.ioctls & UFFDIO_WRITEPROTECT))
exit(1);

wp.range.start = (unsigned long)lpBaseAddress;
wp.range.len = dwRegionSize;
wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;

if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
exit(1);

return 0;
}

int wp_free(void *lpBaseAddress, int dwRegionSize)
{
struct uffdio_register uffdio_register;

uffdio_register.range.start = (unsigned long)lpBaseAddress;
uffdio_register.range.len = dwRegionSize;
uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range))
exit(1);
return 0;
}

int wp_addr_range(void *lpBaseAddress, int dwRegionSize)
{
struct uffdio_writeprotect wp;

wp.range.start = (unsigned long)lpBaseAddress;
wp.range.len = dwRegionSize;
wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;

if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
exit(1);

return 0;
}

//int wp_addr_range_(void *lpBaseAddress, int dwRegionSize)
//{
// struct page_region *vec;
// int ret;
//
// vec = malloc(sizeof(struct page_region) * dwRegionSize);
//
// ret = pagemap_ioctl(lpBaseAddress, dwRegionSize, vec, dwRegionSize, PM_SCAN_OP_GET | PM_SCAN_OP_WP,
// 0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
//
// if (ret < 0)
// exit(1);
//
// free(vec);
// return 0;
//}

int wp_addr_range__(void *lpBaseAddress, int dwRegionSize)
{
int ret;

ret = pagemap_ioctl(lpBaseAddress, dwRegionSize, NULL, 0, PM_SCAN_OP_WP,
0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);

if (ret < 0)
exit(1);

return 0;
}

#define WRITE_WATCH_FLAG_RESET 1
struct page_region *buf;

int GetWriteWatch(int flag, char *start, long len, long **vec, unsigned int *ww_count,
unsigned int *granularity)
{
struct pm_scan_arg arg;
int i, j, ind = 0, ret;

arg.start = (uintptr_t)start;
arg.len = len;
arg.vec = (uintptr_t)buf; //(uintptr_t)vec;
arg.vec_len = *ww_count;
arg.flags = PM_SCAN_OP_GET;
if (flag == WRITE_WATCH_FLAG_RESET)
arg.flags |= PM_SCAN_OP_WP;
arg.size = sizeof(struct pm_scan_arg);
arg.max_pages = *ww_count;
arg.required_mask = PAGE_IS_WRITTEN;
arg.anyof_mask = 0;
arg.excluded_mask = 0;
arg.return_mask = PAGE_IS_WRITTEN;

if (granularity)
*granularity = 4096;

ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
if (ret < 0)
goto free_and_return;

*ww_count = 0;
for (i = 0; i < ret; i++) {
*ww_count += buf[i].len;

for (j = 0; j < buf[i].len; j++)
vec[ind++] = (long int *)(buf[i].start + j * 4096);
}

ret = 0;

free_and_return:
return ret;
}

//#define USE_PAGEMAP_GET_WP
//#define USE_PAGEMAP_WP

unsigned long long reset_total;
int ResetWriteWatch(void *lpBaseAddress, int dwRegionSize)
{
unsigned long long t = rdtsc();
int ret;

#ifdef USE_PAGEMAP_GET_WP
ret = wp_addr_range_(lpBaseAddress, dwRegionSize);
#elif defined USE_PAGEMAP_WP
ret = wp_addr_range__(lpBaseAddress, dwRegionSize);
#else
ret = wp_addr_range(lpBaseAddress, dwRegionSize);
#endif
reset_total += rdtsc() - t;

return ret;
}


int main(int argc, char *argv[])
{
unsigned int ww_count, ww_total, cycle_count, fault_count;
struct thread_info info[MAX_THREAD_COUNT];
unsigned long long t1, t2, rdtsc_start, rdtsc_end;
double start, curr, cycle_start;
static long **ww_addr;
long long writes_count;
unsigned long long wwread_time;
double rw_delay_ms;
unsigned int granularity;
unsigned int old_prot;
unsigned int count;
unsigned int i;
int get_count;
size_t npages;

pagemap_fd = open(PAGEMAP, O_RDWR);
if (pagemap_fd < 0) {
perror("pagemapfd");
return -EINVAL;
}

if (init_uffd())
return -1;

if (argc < 6) {
puts("Usage: win2_linux <nthreads> <npages> <watch_delay_ms> <random_access> <read_reset>");
return -1;
}

nthreads = atoi(argv[1]);
if (nthreads > MAX_THREAD_COUNT) {
LOG("Maximum of %u threads supported.\n", MAX_THREAD_COUNT);
return -1;
}
npages = atoi(argv[2]);
if (npages < nthreads || npages % nthreads) {
LOG("npages should be > nthreads and evenly divisible by nthreads.\n");
return -1;
}
rw_delay_ms = atof(argv[3]);
random_access = atoi(argv[4]);
read_reset = atoi(argv[5]);

ww_addr = malloc(sizeof(*ww_addr) * npages);

mem = mmap(NULL, npages * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
/* mmap() reports failure with MAP_FAILED, not NULL. */
if (mem == MAP_FAILED) {
LOG("Failed to allocate virtual memory.\n");
return -1;
}
wp_init(mem, npages * PAGE_SIZE);
wp_addr_range(mem, npages * PAGE_SIZE);

mem[0] = 0x28;

mem[0x1000] = 0x29;

/* vec_len can be as large as npages (or 100 in the warm-up calls). */
buf = malloc((npages > 100 ? npages : 100) * sizeof(struct page_region));

ww_count = 100;
if (GetWriteWatch(0, mem, 0x1000 * npages, ww_addr, &ww_count, &granularity)) {
LOG("GetWriteWatch() failed, GetLastError() %d.\n", errno);
return -1;
}
// LOG("count %llu, %p, %p.\n", ww_count, mem, ww_addr[0]);
mem = mmap(NULL, npages * 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
/* mmap() reports failure with MAP_FAILED, not NULL. */
if (mem == MAP_FAILED) {
perror("Error commit");
return -1;
}
wp_init(mem, npages * 0x1000);
wp_addr_range(mem, npages * 0x1000);

ww_count = 100;
if (GetWriteWatch(0, mem, 0x1000 * npages, ww_addr, &ww_count, &granularity)) {
LOG("GetWriteWatch() failed, GetLastError() %d.\n", errno);
return -1;
}
// LOG("count %llu, %p, %p.\n", ww_count, mem, ww_addr[0]);

memset(mem, 0, npages * PAGE_SIZE);
ResetWriteWatch(mem, npages * PAGE_SIZE);

ww_count = 100;
if (GetWriteWatch(0, mem, 0x1000 * npages, ww_addr, &ww_count, &granularity)) {
LOG("GetWriteWatch() failed, GetLastError() %d.\n", errno);
return -1;
}
// LOG("read bytes %lu, count after read %llu, %p, %p.\n", count, ww_count, mem, ww_addr[0]);

memset(mem, 0, npages * PAGE_SIZE);
ResetWriteWatch(mem, npages * PAGE_SIZE);

pthread_t th;
for (i = 0; i < nthreads; ++i) {
info[i].mem = mem + 0x1000 * i * npages / nthreads;
info[i].npages = npages / nthreads;

pthread_create(&th, NULL, thread_proc, &info[i]);
}

get_count = npages;

wwread_time = 0;
curr = start = curr_time_ms();
ww_total = 0;
cycle_count = 0;
rdtsc_start = rdtsc();
while (curr - start < TEST_TIME) {
char *addr, *end;

cycle_start = curr;

t1 = rdtsc();

addr = mem;
end = mem + npages * PAGE_SIZE;
// LOG("cycle %I64u.\n", cycle_count);
while (addr < end) {
ww_count = get_count;
if (GetWriteWatch(read_reset ? WRITE_WATCH_FLAG_RESET : 0, addr, end - addr,
ww_addr, &ww_count, &granularity)) {
LOG("GetWriteWatch() failed, GetLastError() %d.\n", errno);
return -1;
}
ww_total += ww_count;
if (ww_count < get_count)
break;
addr = (char *)ww_addr[ww_count - 1] + 0x1000;
LOG("addr %p, end %p, ww_count %u.\n", (void *)addr, (void *)end, ww_count);
}
if (!read_reset)
ResetWriteWatch(mem, end - mem);

t2 = rdtsc();
wwread_time += t2 - t1;
curr = curr_time_ms();
while (curr - start < TEST_TIME && curr - cycle_start < rw_delay_ms) {
sched_yield();
curr = curr_time_ms();
}
++cycle_count;
}
// rdtsc_end = rdtsc();
writes_count = raw_writes_count;
finish = true;
// rdtsc_c = 1000.0 * (curr - start) / (rdtsc_end - rdtsc_start);
// LOG("rdtsc_c %lf.\n", rdtsc_c);

sleep(1);

/* cycle_count and ww_total are unsigned int; the averages must be doubles. */
LOG("Elapsed %.1lf, cycle_count %u, writes_count %lld, writes watched %u.\n",
curr - start, cycle_count, writes_count, ww_total);
LOG("writes per thread * msec %.3lf, avg. write time %.1lf, GetWriteWatch() avg %.1lf.\n",
writes_count / (TEST_TIME * nthreads), (double)writes_time / writes_count,
(double)wwread_time / cycle_count);

printf("ResetWriteWatch() time rdtsc --> %llu M\n", reset_total/1000000);

free(buf);
return 0;
}