Re:[PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
From: wang.yi59
Date: Mon Jun 13 2022 - 23:45:56 EST
Hi Yury,
Thanks for your quick and clear response!
> On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@xxxxxxxxxx> wrote:
> >
> > Consider one situation:
> >
> > The app have two vmas which mbind() to node 1 and node3 respectively,
> > and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
> > to current bitmap_remap(), we got:
>
> Regarding the original problem - can you please confirm that
> it's reproduced on current kernels, show the execution path etc.
> From what I see on modern kernel, the only user of nodes_remap()
> is mpol_rebind_nodemask(). Is that the correct path?
Yes, it's mpol_rebind_nodemask() calls nodes_remap() from
mpol_rebind_policy(). The stacks are as follow:
[ 290.836747] bitmap_remap+0x84/0xe0
[ 290.836753] mpol_rebind_nodemask+0x64/0x2a0
[ 290.836764] mpol_rebind_mm+0x3a/0x90
[ 290.836768] update_tasks_nodemask+0x8a/0x1e0
[ 290.836774] cpuset_write_resmask+0x563/0xa00
[ 290.836780] cgroup_file_write+0x81/0x150
[ 290.836784] kernfs_fop_write_iter+0x12d/0x1c0
[ 290.836791] new_sync_write+0x109/0x190
[ 290.836800] vfs_write+0x218/0x2a0
[ 290.836809] ksys_write+0x59/0xd0
[ 290.836812] do_syscall_64+0x37/0x80
[ 290.836818] entry_SYSCALL_64_after_hwframe+0x46/0xb0
To reproduce this situation, I write a program which seems like this:
unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS;
unsigned long size = 262144 << 12;
unsigned long node1 = 2; // node 1
unsigned long node2 = 8; // node 3
p1 = vma1 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
p2 = vma2 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
assert(!mbind(vma1, size, MPOL_BIND, &node1, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
assert(!mbind(vma2, size, MPOL_BIND, &node2, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
Start the program whos name is mbind_tester, and do follow steps:
mkdir && cd /sys/fs/cgroup/cpuset/mbind
echo 0-31 > cpuset.cpus
echo 0-3 > cpuset.mems
cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
7ff77e200000 bind:1 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
echo 1,3 > cpuset.mems
cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
7ff77e200000 bind:3 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
As you see, after set cpuset.mems to 1,3, the nodes which one of vma
binded to changed from 1 to 3.
This maybe confused, the original nodes binded is 1, after modify
cpuset.mems to 1,3 which include the node 3, it changed to 3...
>
> Anyways, as per name, bitmap_remap() is intended to change bit
> positions, and it doesn't look wrong if it does so.
>
> This is not how the function is supposed to work. For example,
> old: 00111000
> new: 00011100
>
> means:
> old: 00111 000
> || \\\|||
> new: 000 11100
>
> And after this patch it would be:
> old: 001 11000
> || \|||||
> new: 000 11100
>
> Which is not the same, right?
Right.
Actually this is what makes me embarrassed. If we want to fix this
situtation, we can:
- change the bitmap_remap() as this patch did, but this changed the
behavior of this routine which looks does the right thing. One good
news is this function is only called by mpol_rebind_nodemask().
- don't change the bitmap_remap(), to be honest, I didn't figure out
a way. Any suggestions?
>
> If mpol_rebind() wants to keep previous relations, then according to
> the comment:
> * The positions of unset bits in @old are mapped to themselves
> * (the identify map).
>
> , you can just clear @old bits that already have good relations
> you'd like to preserve.
Actually this does not work for me :)
In the example above, if set cpuset.mems to 0,2 firstly, the nodes
binds will change from 1 to 2. And then set cpuset.mems to 1,3, it will
change from 2 to 3 again.
---
Best wishes
Yi Wang