Chris Wright wrote:
How could such badness ever happen in the kernel?
* Anthony Liguori (anthony@xxxxxxxxxxxxx) wrote:
The ioctl() interface is quite bad for what you're doing. You're telling the kernel extra information about a VA range in userspace. That's what madvise is for. You're tweaking simple read/write values of kernel infrastructure. That's what sysfs is for.
I agree re: sysfs (brought it up myself before). As far as madvise vs.
ioctl, the one thing that comes from the ioctl is fops->release to
automagically unregister memory on exit.
This is precisely why ioctl() is a bad interface. fops->release isn't tied to the process but rather tied to the open file. The file can stay open long after the process exits either by a fork()'d child inheriting the file descriptor or through something more sinister like SCM_RIGHTS.
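To make the lifetime point concrete, here is a rough userspace sketch. It uses a pipe as a stand-in for the device file, which is my assumption for illustration and not the ksm code: a reader only sees EOF once the last reference to the write side is dropped, which is the same moment a driver's fops->release would run. The function name release_waits_for_last_holder() and the owner/holder process roles are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the write side of a pipe (a) is still open after the process
 * that created the holder has exited and (b) is released once the last
 * holder exits. Returns 0 otherwise, -1 on setup error.
 */
int release_waits_for_last_holder(void)
{
    int data[2], sync_pipe[2];
    char c;

    if (pipe(data) != 0 || pipe(sync_pipe) != 0)
        return -1;

    pid_t owner = fork();
    if (owner < 0)
        return -1;

    if (owner == 0) {
        /* "Owner": fork()s a child that inherits data[1], then exits. */
        pid_t holder = fork();
        if (holder == 0) {
            /* "Holder": keeps the inherited descriptor until told to quit. */
            close(sync_pipe[1]);
            if (read(sync_pipe[0], &c, 1) < 0)  /* EOF when caller closes its end */
                _exit(1);
            _exit(0);
        }
        _exit(holder < 0 ? 1 : 0);
    }

    /* Caller: drop our own references, then wait for the owner to exit. */
    close(data[1]);
    close(sync_pipe[0]);

    int status;
    if (waitpid(owner, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Owner is gone. EAGAIN (rather than EOF) means a writer still exists. */
    fcntl(data[0], F_SETFL, O_NONBLOCK);
    ssize_t n = read(data[0], &c, 1);
    assert(n <= 0);  /* nothing is ever written to this pipe */
    int open_after_owner_exit = (n < 0 && errno == EAGAIN);

    /* Let the holder exit, then block until the write side is released. */
    close(sync_pipe[1]);
    fcntl(data[0], F_SETFL, 0);
    n = read(data[0], &c, 1);
    int released_after_holder_exit = (n == 0);

    close(data[0]);
    return open_after_owner_exit && released_after_holder_exit;
}
```

Calling release_waits_for_last_holder() from main() should return 1 on Linux: the owner's exit alone does not release anything, because the fork()'d holder still references the same open file.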
In fact, a common mistake is to leak file descriptors by not closing them when exec()'ing a process. Instead of just delaying a close, if you rely on this behavior to unregister memory regions, you could potentially have badness happen in the kernel if ksm attempted to access an invalid memory region.
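For the exec() leak specifically, the usual userspace guard is the close-on-exec flag. A minimal sketch follows; the helper names closes_on_exec() and set_cloexec() are made up for this example, while fcntl() with F_GETFD/F_SETFD and FD_CLOEXEC are the standard POSIX calls. Note this only covers exec(); it does nothing about a fork()'d child or SCM_RIGHTS, so it does not rescue the release-based design.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd will be closed automatically on exec(), 0 if it would
 * leak into the new program, -1 on error. */
int closes_on_exec(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Marks an already-open descriptor close-on-exec.
 * Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

A descriptor from a plain open() starts with the flag clear, so closes_on_exec() reports 0 until set_cloexec() is called. Passing O_CLOEXEC to open() sets the flag atomically and avoids the window between the two calls.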