On Tue, Jan 14, 2025 at 11:03:48AM -0800, Yang Shi wrote:
Yeah, I don't think we can accept this unfortunately.
On 1/14/25 10:14 AM, Lorenzo Stoakes wrote:
This is getting into realms of discussion so to risk sounding rude - to beI admit this is a concern, but I don't think this is really that bad to kill
clear - NACK.
The user-visible change to /proc/$pid/[s]maps kills this patch dead. This
is regardless of any other discussed issue.
this patch. May this change result in userspace regression? Maybe, likely
happens to some debugging and monitoring scripts, typically we don't worry
them that much. Of course, I can't completely guarantee no regression for
real life applications, it should just be unlikely IMHO.
This patch is SUPER important though even if rejected, because you've made
me realise we really need to audit all of these mmap handlers... so it's
all super appreciated regardless :)
OK that's more interesting. Though the user-facing thing remains...But more importantly, I hadn't realise mmap_zero() was on the .mmap()TBH, I don't think we need to make fault handler more complicated, it is
callback (sorry my mistake) - you're simply not permitted to change
vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and
it's broken.
To me the alternative would be to have a custom fault handler that hands
back the zero page, but I"m not sure that's workable, you'd have to install
a special mapping etc. and huge pages are weird and...
just handled as anonymous fault handler.
I understand your concern about changing those vma filed outside core mm. An
alternative is to move such change to vma.c. For example:
diff --git a/mm/vma.c b/mm/vma.c
index bb2119e5a0d0..2a7ea9901f57 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map,
struct vm_area_struct **vmap)
else
vma_set_anonymous(vma);
+ if (vma_is_anonymous(vma) && vma->vm_file) {
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+ vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
+ }
+
It's possiible we could detect that the underlying thing is a zero page and
manually print out /dev/zero, but can somebody create a zero page file
elsewhere? In which case they might find this confusing.
It's actually a nice idea to have this _explicitly_ covered off as we could
then also add a comment explaining 'hey there's this weird type of VMA' and
have it in a place where it's actually obvious to mm folk anyway.
But this maps thing is just a killer. Somebody somewhere will be
confused. And it is not for us to judge whether that's silly or not...
if (error)It's about user confusion for me really.
goto free_iter_vma;
I do appreciate you raising this especially as I was blissfully unaware,It does reference a file, but the file is /dev/zero... And if kernel already
but I don't see how this patch can possibly work, sorry :(
On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote:
You're creating a new concept of an anon but not anon but also now with
On 1/14/25 4:05 AM, Lorenzo Stoakes wrote:
+ Willy for the fs/weirdness elements of this.But the file is /dev/zero. I don't see this could break the semantic of
On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote:
When creating private mapping for /dev/zero, the driver makes it anHm yikes.
anonymous mapping by calling set_vma_anonymous(). But it just sets
vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset.
This is a special case and the VMA doesn't look like either anonymous VMAMy concern is that ostensibly there _is_ a file right? Are we certain that by
or file VMA. It confused other kernel subsystem, for example, khugepaged [1].
It seems pointless to keep such special case. Making private /dev/zero
mapping a full anonymous mapping doesn't change the semantic of
/dev/zero either.
not setting this we are not breaking something somewhere else?
Are we not creating a sort of other type of 'non-such-beast' here?
/dev/zero. The shared mapping of /dev/zero is not affected by this change,
kernel already treated private mapping of /dev/zero as anonymous mapping,
but with some weird settings in VMA. When reading the mapping, it returns 0
with zero page, when writing the mapping, a new anonymous folio is
allocated.
anon vm_pgoff and missing vm_file even though it does reference a file
and... yeah.
This is not usual :)
treated it as anonymous mapping, it sounds like the file may not matter that
much, so why not make it as a real anonymous mapping? Then we end up having
anonymous VMA and file VMA only instead of anonymous VMA, file VMA and
hybrid special VMA. So we have less thing to worry about. If VMA is
anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff
is linear pgoff. But it is not true now.
Ack.As I mentioned above, even handing back zero page should be not needed.I'm obviously not suggesting allocating a bunch of extra folios, I wasI mean already setting it anon and setting vm_file non-NULL is really strange.Yes, it is possible if someone really care whether the anonymous-like
The user visible effect is the mapping entry shown in /proc/<PID>/smapsYeah this seems like it might break somebody to be honest, it's really
and /proc/<PID>/maps.
Before the change:
ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero
After the change:
ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0
really really strange to map a file then for it not to be mapped.
mapping is mapped by /dev/zero or just created by malloc(). But I don't know
who really do...
But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as aTBH, I don't see why page cache has to be involved. When reading, 0 is
file but for it to be marked anonymous.
God what a mess.
[1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@xxxxxxxxxx/I kind of hate that we have to mitigate like this for a case that should
never ever happen so I'm inclined towards your solution but a lot more
inclined towards us totally rethinking this.
Do we _have_ to make this anonymous?? Why can't we just reference the zero
page as if it were in the page cache (Willy - feel free to correct naive
misapprehension here).
returned by zero page. When writing a CoW is triggered if page cache is
involved, but the content of the page cache should be just 0, so we copy 0
to the new folio then write to it. It doesn't make too much sense. I think
this is why private /dev/zero mapping is treated as anonymous mapping in the
first place.
thinking there would be some means of handing back the actual zero
page. But I am not sure this is workable.
Yes and I've opened a can of worms and the worms have jumped out and on toSure, hardening the VMA initialization code and making less surprisingThis is just not permitted. We maintain mmap state which contains the fileSigned-off-by: Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx>
---
drivers/char/mem.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 169eed162a7f..dae113f7fc1b 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
if (vma->vm_flags & VM_SHARED)
return shmem_zero_setup(vma);
vma_set_anonymous(vma);
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+ vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
and pgoff state which gets threaded through the mapping operation, and
simply do not expect you to change these fields.
In future we will assert on this or preferably, restrict users to only
changing VMA flags, the private field and vm_ops.
corner case is definitely helpful.
my face and were not worms but in fact an alien facehugger :P
In other words, I am going to be looking into this very seriously and
auditing this whole thing... yay for making work for myself... :>)
Ah OK, in that case fine on that front.Hmm, this might have been mremap()'d _potentially_ though? And then nowI'm not quite familiar with the subtle details and corner cases of
this will be wrong? But then we'd have no way of tracking it correctly...
meremap(). But mmap_zero() should be called by mmap(), so the VMA has not
been visible to user yet at this point IIUC. How come mremap() could move
it?
But you are not permitted to touch this field (we need to enforce this...)
I've not checked the function but do we mark this as a special mapping of
some kind?
+
return 0;
}
--
2.47.0