David Hildenbrand <david@xxxxxxxxxx> writes:
On 26.08.21 19:48, Andy Lutomirski wrote:
On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
I’ll bite. How about we attack this in the opposite direction: remove the deny write mechanism entirely.
I think that would be ok, except I can see somebody relying on it.
It's broken, it's stupid, but we've done that ETXTBSY for a _loong_ time.
Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY. Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write. So, in a multithreaded program, one thread does:
fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
write(fd, some stuff);
<--- problem is here
close(fd);
execve("some exefile");
Another thread does:
fork();
execve("something else");
In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails. Whoops. See, for example:
I propose we get rid of deny_write_access() completely to solve this.
Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
(OFD locks seem like they might have the same problem. Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
It's not like this issue is new (^2017) or relevant in practice. So no
need to hurry IMHO. One step at a time: it might make perfect sense to
remove ETXTBSY, but we have to be careful to not break other user
space that actually cares about the current behavior in practice.
It is an old enough issue that I agree there is no need to hurry.
I also ran into this issue not too long ago when I refactored the
usermode_driver code. My challenge was that, not being in userspace,
the delayed fput was not happening in my kernel thread. Which meant
that writing the file, then closing the file, then execing the file
consistently reported -ETXTBSY.
The kernel code wound up doing:
/* Flush delayed fput so exec can open the file read-only */
flush_delayed_fput();
task_work_run();
As I read the code, the delay for userspace file descriptors is
always done with task_work_add, so userspace should not hit
that kind of silliness, and should be able to actually close
the file descriptor before the exec.
On the flip side, I don't know how anything can depend upon getting an
-ETXTBSY. So I don't think there is any real risk of breaking userspace
if we remove it.