Re: [RFC PATCH] keys: flush work when accessing /proc/key-users

From: Eric Biggers
Date: Wed Dec 06 2023 - 21:43:29 EST


On Wed, Dec 06, 2023 at 05:55:52PM +0000, Luis Henriques wrote:
> David Howells <dhowells@xxxxxxxxxx> writes:
>
> > Luis Henriques <lhenriques@xxxxxxx> wrote:
> >
> >> This patch is mostly for getting some feedback on how to fix an fstest
> >> failing for ext4/fscrypt (generic/581). Basically, the test relies on the
> >> data read from /proc/key-users to be up-to-date regarding the number of
> >> keys a given user currently has. However, this file can't be trusted
> >> because it races against the keys GC.
> >
> > Unfortunately, I don't think your patch helps. If the GC hasn't started yet,
> > it won't achieve anything and the GC can still be triggered at any time after
> > the flush and thus race.
> >
> > What is it you're actually trying to determine?
> >
> > And is it only for doing the test?
>
> OK, let me try to describe what the generic/581 fstest does.
>
> After doing a few fscrypt related things, which involve adding and
> removing keys, the test will:
>
> 1. Get the number of keys for user 'fsgqa' from '/proc/key-users'
> 2. Set the maxkeys to 5 + <keys the user had in 1.>
> 3. In a loop, try to add 6 new keys, to confirm the last one will fail
>
> Most of the time the test passes, i.e., the 6th key fails to be added.
> However, if, for example, the test is executed in a loop, it is possible
> to have it fail because the 6th key was successfully added. The reason
> is, obviously, because the test is racy: the GC can kick in too late,
> after the maxkeys is set in step 2.
>
> So, this is mostly an issue with the test itself, but I couldn't figure
> out a way to work around it.
>
> Another solution I thought of, but didn't look too deeply into, was to try
> to move the
>
> atomic_dec(&key->user->nkeys);
>
> out of the GC, in function key_gc_unused_keys(). Decrementing it
> synchronously in key_put() (or whatever other functions could schedule GC)
> should solve the problem with this test. But as I said, I didn't go too
> far looking into that, so I don't really know if that's feasible.
>
> Finally, the test itself could be hacked so that the loop in step 3. would
> update the maxkeys value if needed, i.e. if the current number of keys for
> the user isn't what was expected in each loop iteration. But even that
> would still be racy.

If there were a function that fully and synchronously released a key's quota,
fs/crypto/ could call it before unlinking the key. key_payload_reserve(key, 0)
almost does the trick, but it would release only the key's bytes quota, not the
quota for the key itself.

However, that would only fix the flakiness of the key quota for fs/crypto/, not
for other users of the keyrings service. Maybe this suggests that key_put()
should release the key's quota right away if the key's refcount drops to 0?
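
To illustrate the difference, here is a rough userspace toy model, not kernel
code. Every name in it (toy_user, toy_key, toy_gc, and so on) is made up for
the sketch, and it ignores locking and everything else the real keyrings code
has to deal with. It only contrasts "uncharge the quota later in the GC" with
"uncharge as soon as the refcount drops to 0":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code: all names here are hypothetical. */
struct toy_user {
    int nkeys;    /* keys currently charged to the user */
    int maxkeys;  /* quota limit */
};

struct toy_key {
    struct toy_user *user;
    int refcount;
    bool charged; /* true while the key still counts against nkeys */
};

/* Charge the quota and hand out a key with one reference; false if over quota. */
static bool toy_key_alloc(struct toy_key *key, struct toy_user *user)
{
    if (user->nkeys >= user->maxkeys)
        return false;
    user->nkeys++;
    key->user = user;
    key->refcount = 1;
    key->charged = true;
    return true;
}

/* Give the key's slot back to the user's quota, at most once per key. */
static void toy_key_uncharge(struct toy_key *key)
{
    if (key->charged) {
        assert(key->user->nkeys > 0);
        key->user->nkeys--;
        key->charged = false;
    }
}

/* Deferred model: the put only drops the refcount; the GC uncharges later. */
static void toy_key_put_deferred(struct toy_key *key)
{
    assert(key->refcount > 0);
    key->refcount--;
}

/* Stand-in for a garbage collector pass over unreferenced keys. */
static void toy_gc(struct toy_key *keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (keys[i].refcount == 0)
            toy_key_uncharge(&keys[i]);
}

/* Synchronous model: release the quota as soon as the refcount hits 0. */
static void toy_key_put_sync(struct toy_key *key)
{
    assert(key->refcount > 0);
    if (--key->refcount == 0)
        toy_key_uncharge(key);
}
```

In the deferred variant, nkeys stays stale between the final put and the next
toy_gc() call, which is the window the test trips over. In the synchronous
variant there is no such window.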

Either way, note that where fs/crypto/ does key_put() on a whole keyring at
once, it would first need to call keyring_clear() to clear it synchronously.

A third solution would be to make fs/crypto/ completely stop using 'struct key',
and handle quotas itself. It would do it correctly, i.e. synchronously so that
the results are predictable. This would likely mean separate accounting, where
adding an fscrypt key counts against an fscrypt key quota, not the regular
keyrings service quota as it does now. That should be fine, though the same
limits might still need to be used, in case users are relying on the sysctls...
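
A minimal sketch of what such separate, synchronous accounting could look like,
again as a userspace toy and not a proposal for the actual code. The struct and
function names are hypothetical, and using -EDQUOT as the error is an
assumption made to match what the keyrings quota reports today:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-user fscrypt key accounting; not an existing kernel structure. */
struct toy_fscrypt_quota {
    unsigned int nkeys;   /* fscrypt keys currently charged */
    unsigned int maxkeys; /* limit, e.g. taken from the existing sysctl */
};

/* Charge one key; fails immediately if the limit is already reached. */
static int toy_fscrypt_charge_key(struct toy_fscrypt_quota *q)
{
    if (q->nkeys >= q->maxkeys)
        return -EDQUOT;
    q->nkeys++;
    return 0;
}

/* Uncharge one key synchronously, at the point the key is removed. */
static void toy_fscrypt_uncharge_key(struct toy_fscrypt_quota *q)
{
    assert(q->nkeys > 0);
    q->nkeys--;
}

/* Try to add `attempts` keys and return how many were accepted. */
static unsigned int toy_add_keys(struct toy_fscrypt_quota *q,
                                 unsigned int attempts)
{
    unsigned int added = 0;

    for (unsigned int i = 0; i < attempts; i++)
        if (toy_fscrypt_charge_key(q) == 0)
            added++;
    return added;
}
```

Because the count drops at removal time, a test that reads the count, sets the
limit to that count plus 5, and then tries 6 adds would always see exactly 5
succeed in this model.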

The last solution seems quite attractive at this point, given the number of
times that issues in the keyrings service have caused problems for fs/crypto/.
Any thoughts are appreciated, though.

- Eric