Re: [GIT PULL] x86/mm for 6.2

From: Linus Torvalds
Date: Thu Dec 15 2022 - 12:17:38 EST


On Thu, Dec 15, 2022 at 4:30 AM <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>
> My initial enabling allowed userspace to enable the feature per-thread.
> I pointed out the same potential use -- JIT -- case as you did[1].
> But it made harder to handle corner cases. Like, it is not obvious what
> LAM setup kernel thread has to use when acts on behalf of the process
> (io_uring for instance). I ended up with using the most permissive LAM
> mode (that allowed the most tag bits), but it is clearly a hack.

... and doing it per-mm causes more corner cases.

Look at the - actively buggy - uses of the mask outside the mm lock.

And no, "take the mm lock for reading every time you want to do
untagged_addr()" isn't the answer either.

I didn't look at every single use of 'untagged_addr()', but I did look
at a couple, and every single one of the ones I looked at was racy and
wrong.

Does the race _matter_? Almost certainly not. Nobody will ever care.
It's something we would have happily ignored for years, and then
sighed, and added a READ_ONCE() for to make KCSAN happy.

But when I get a new pull request, and the first thing I look at is
fundamentally racy, just take a wild guess at how happy it makes me
about that pull request?

> Thomas pointed[2] that out and following versions used per-process
> approach. It simplified the patchset substantially.

No, it sure as hell didn't.

It "simplified" things by just introducing another bug instead.

> Speaking about, untagged_addr(), how do you see the interface should look
> like? It has to work correctly when called from a kernel thread. I can
> only think of a hack that makes it branch on task->flags & PF_KTHREAD.

So I think that the current branch contents I see are simply not an
option. I do not want to pull code that when I look at it, I see a
data race on context.untag_mask .

I also find the "pass in mm" to be fundamentally objectionable,
because even if that then ends up being the right approach for x86,
this interface really isn't x86-centric.

Yes, arm64 ends up making it all unconditional, which simplifies
things enormously and just means that arm64 doesn't care at all. But
what if somebody else comes along, and really does want to make it
per-thread?

So I'd be ok with an interface that passes in 'struct task_struct',
and then x86 could use "tsk->mm" to get the mm if that really is the
interface x86 people want. Because then at least the *interface* isn't
hardcoded to be broken.

So those are my two fundamental objections against the current LAM
code. An outright bug in x86 code, and a bad interface in non-x86
code.

Here's *one* suggested solution:

- pass in the 'task' for the target for untagged_addr(), so that
we're not locked into one model

- if x86 really wants to make thing per-mm, let it, but have a "lock"
operation that says "now my setting is set in stone". You already do
that in prctl_enable_tagged_addr() with that

/* Already enabled? */
if (mm->context.lam_cr3_mask) {
ret = -EBUSY;
goto out;
}

just expose that as a thing.

- make thread creation lock it (and unlock it on creating a new mm at
execve, but not fork).

And yes, I would actually suggest that _any_ thread creation locks it,
so that you never *EVER* have any issues with "oh, now I need to
synchronize with other threads". A process can set its LAM state at
startup, not in the middle of running!

Note that io_uring will automatically do that locking, because it does
that thread creation (create_io_thread()). So io_uring threads won't
even be a special case.

You could go even further, and just make it a ELF binary flag, and
basically have that LAM thing be a "when creating the mm, this is what
it gets created as". That is probably what most people would actually
want, but it sounds painful from a bootstrapping standpoint.

Now, the above is just a suggestion off the top of my head. It
simplifies things a lot, in that now even if the state is "per mm",
you can actually *cache* it per-thread, and there are no data races.

It also means that prctl_enable_tagged_addr() doesn't have to worry
about serialization and that whole

on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);

thing.

> Normal untagged_addr() usage doesn't require serialization: IPI in
> prctl_enable_tagged_addr() makes mm->context.untag_mask visible to all
> CPUs that runs the mm before it returns.

BS.

That operation is not serialized IN ANY WAY with the other threads
reading the value. For all you know, untagged_addr() may be reading
that value multiple times due to compiler reloads, and getting
different values between two reads, and doing random things because
you use one value for one thing, and another for the later
"find_vma()" when the untagged address is actually used.

The IPI does *nothing*. It serializes the cores, but since there is
no actual locking (or even a READ_ONCE() or anything), it doesn't
actually do anything on a software level.

The code is buggy. Really. It's likely a "this works in practice
because reloads are unlikely and you wouldn't hit them anyway" kind of
bug, but no, that IPI does not help.

(Of course, the IPI helps in that now at least the cr3 value is
synchronized, so the IPI is needed for the *hardware* accesses, but
doesn't help software use of mm->context.untag_mask).

Linus