Re: [PATCH v2 0/3] File Sealing & memfd_create()

From: Hugh Dickins
Date: Mon May 19 2014 - 18:12:28 EST


On Mon, 19 May 2014, Jan Kara wrote:
> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm. If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> >
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced

To be honest, I'm quite glad to hear that: it is still a solution worth
considering, but I'd rather continue the search for a better solution.

> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> >
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).

We need to fix it together with your sealing patchset, because your
patchset is all about introducing a new kind of guarantee: a guarantee
which this async i/o issue makes impossible to give, as things stand.

Exasperating for you, I understand; but that's how it is.
A new feature may make new demands on the infrastructure.

I can imagine existing problems, but (I may be out of touch) I have
not heard of them as problems in practice. Certainly they would not
be recent regressions: mm-page versus fs-file has worked in this way
for as long as I've known them (pages released independently of
unmapping the file, with the understanding that i/o might still
be in progress, so care taken not to free the pages too soon).

> >
> > Shouldn't we try to fix the _cause_ of this?

Nobody is against fixing the cause: we are all looking for the
simplest way of doing so.

> >
> > Isn't there a simple way to lock/mark/.. affected vmas in
> > get_user_pages(_fast)() and release them once done? We could increase
> > i_mmap_writable on all affected address_space and decrease it on
> > release. This would at least prevent sealing and could be checked on
> > other operations, too (like setting S_IMMUTABLE).
> > This should be as easy as checking page_mapping(page) != NULL and then
> > adjusting ->i_mmap_writable in
> > get_writable_user_pages/put_writable_user_pages, right?
> Doing this would be quite a bit of work. Currently references returned by
> get_user_pages() are page references like any other and thus are released
> by put_page() or similar. Now you would make them special and they need
> special releasing and there are lots of places in kernel where
> get_user_pages() is used that would need changing.

Lots of places that would need changing, yes; but we have often
wondered in the past whether there should be a put_user_pages().
Though I'm not sure that it would actually solve anything...
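To make the proposal above concrete, here is a minimal userspace sketch
of the accounting it implies. It is a toy model, not kernel code:
struct toy_mapping and the toy_* helpers are hypothetical names invented
for illustration, and a real implementation would also need locking and
would have to find every get_user_pages() caller, which is exactly the
work Jan describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for an address_space; NOT the kernel structure. */
struct toy_mapping {
    int i_mmap_writable;  /* writable mappings plus outstanding writable pins */
    bool sealed;          /* set once a write seal has been applied */
};

/* Hypothetical analogue of get_writable_user_pages(): take a write pin. */
static int toy_get_writable_pin(struct toy_mapping *m)
{
    if (m->sealed)
        return -EPERM;        /* no new writers after sealing */
    m->i_mmap_writable++;
    return 0;
}

/* Hypothetical analogue of put_writable_user_pages(): drop a write pin. */
static void toy_put_writable_pin(struct toy_mapping *m)
{
    assert(m->i_mmap_writable > 0);
    m->i_mmap_writable--;
}

/* Sealing is refused while any writable mapping or pin is outstanding. */
static int toy_seal_write(struct toy_mapping *m)
{
    if (m->i_mmap_writable > 0)
        return -EBUSY;
    m->sealed = true;
    return 0;
}
```

In this model a seal attempt fails with -EBUSY while a pin is held and
succeeds once the matching put has run. The sketch only works because
every pin is released through the special put helper, which is why plain
put_page() callers would all need converting.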

>
> Another aspect is that it could have performance implications - if there
> are several processes using get_user_pages[_fast]() on a file, they would
> start contending on modifying i_mmap_writable.

Doing extra vma work in get_user_pages() wouldn't be so bad. But doing
any vma work in get_user_pages_fast() would upset almost all its users:
get_user_pages_fast() is a fast-path which expressly avoids the vmas,
and hates additional cachelines being added to its machinations.

If sealing had appeared before get_user_pages_fast(), maybe we wouldn't
have let get_user_pages_fast() in; but now it's the other way around.

I would be more interested in attacking from the get_user_pages() and
get_user_pages_fast() end, if I could convince myself that they do
actually delimit the problem; maybe they do, but I'm not yet convinced.

>
> One somewhat crazy idea I have is that maybe we could delay unmapping of a
> page if this was last VMA referencing it until all extra page references of
> pages in there are dropped. That would make i_mmap_writable reliable for
> you and it would also close those races with remount. Hugh, do you think
> this might be viable?

It is definitely worth pursuing further, but I'm not very hopeful on it.
In a world of free page flags and free struct page fields, maybe. (And
I don't see sealing as a feature sensibly restricted to 64-bit only.)

I think we would have to set a page flag, maybe bump a count, for every
leftover page that raises i_mmap_writable; and lower it (potentially from
interrupt context) at put_page() time. Easy to make i_mmap_writable an
atomic rather than guarded by i_mmap_mutex, but we still need to
synchronize on it falling to 0.
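A rough userspace sketch of the atomic variant, using C11 atomics. The
names (struct toy_counter and the toy_* helpers) are hypothetical, and
using a negative value to mean "writers denied" is an assumption made
for this illustration, not something taken from the patchset. The actual
wake-up of a waiting sealer is left out: the sketch only shows which
caller learns that the count fell to 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy counter: >0 means writers present, 0 idle, <0 writers denied. */
struct toy_counter {
    atomic_int writable;
};

/* Take a writable reference unless writers have been denied. */
static int toy_map_writable(struct toy_counter *c)
{
    int v = atomic_load(&c->writable);

    do {
        if (v < 0)
            return -EPERM;
        /* On failure the CAS reloads v, so the check above is repeated. */
    } while (!atomic_compare_exchange_weak(&c->writable, &v, v + 1));
    return 0;
}

/*
 * Drop a writable reference. Returns true if this call took the count
 * to 0, i.e. this is the caller that would have to wake a waiter.
 */
static bool toy_unmap_writable(struct toy_counter *c)
{
    return atomic_fetch_sub(&c->writable, 1) == 1;
}

/* Deny future writers, but only if none are outstanding right now. */
static int toy_deny_writable(struct toy_counter *c)
{
    int expected = 0;

    if (atomic_compare_exchange_strong(&c->writable, &expected, -1))
        return 0;
    return -EBUSY;
}
```

The counter itself needs no mutex here, but a sealer that gets -EBUSY
must either give up or sleep until the final toy_unmap_writable()
returns true, and that waiting is the synchronization still to be
designed.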

And how would we recognize the relevant, decrementing, put_page()?
page_count divided into read_count and write_count? Ugh!

I also have a strong instinct against adding delays into munmap+exit;
though that mainly comes from the urge to free memory, and here we are
only delaying until a page becomes freeable, so maybe I should abandon
that bias in this case.

I did start thinking in this direction last week, but got stuck somewhere
and retreated, I forget on what issue. At this moment I'm not really
in that zone, but anxious to complete my promised responses to David's
patches, which I almost but not quite completed last night.

Hugh