Re: [RFC PATCH V2 13/22] x86/intel_rdt: Support schemata write - pseudo-locking core

From: Reinette Chatre
Date: Tue Feb 20 2018 - 13:47:58 EST


Hi Thomas,

On 2/20/2018 9:15 AM, Thomas Gleixner wrote:
> On Tue, 13 Feb 2018, Reinette Chatre wrote:
>> static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>> {
>> bool is_new_plr = (plr == new_plr);
>> @@ -93,6 +175,23 @@ static void __pseudo_lock_region_release(struct pseudo_lock_region *plr)
>> if (!plr->deleted)
>> return;
>>
>> + if (plr->locked) {
>> + plr->d->plr = NULL;
>> + /*
>> + * Resource groups come and go. Simply returning this
>> + * pseudo-locked region's bits to the default CLOS may
>> + * result in default CLOS to become fragmented, causing
>> + * the setting of its bitmask to fail. Ensure it is valid
>> + * first. If this check does fail we cannot return the bits
>> + * to the default CLOS and userspace intervention would be
>> + * required to ensure portions of the cache do not go
>> + * unused.
>> + */
>> + if (cbm_validate_val(plr->d->ctrl_val[0] | plr->cbm, plr->r))
>> + pseudo_lock_clos_set(plr, 0,
>> + plr->d->ctrl_val[0] | plr->cbm);
>> + pseudo_lock_region_clear(plr);
>> + }
>> kfree(plr);
>> if (is_new_plr)
>> new_plr = NULL;
>
> Are you really sure that the life time rules of plr are correct vs. an
> application which still has the locked memory mapped? i.e. the following
> operation:

You are correct. I am not preventing an administrator from removing the
pseudo-locked region if it is in use. I will fix that.

> 1# create_pseudo_lock_region()
>
> 2# start_app()
> fd = open(/dev/.../lock);
> ptr = mmap(fd, .....); <- takes a ref on fd
> close(fd);
> do_stuff(ptr);
>
> 1# rmdir .../lock
>
> unmap(ptr); <- releases fd
>
> I can't see how that is protected. You already have a kref in the PLR, but
> it's in no way connected to the file descriptor lifetime. So the refcount
> logic here must be:
>
> create_lock_region()
>     plr = alloc_plr();
>     take_ref(plr);
>     if (!init_plr(plr)) {
>         drop_ref(plr);
>         ...
>     }
>
> lockdev_open(filp)
>     take_ref(plr);
>     filp->private = plr;
>
> rmdir_lock_region()
>     ...
>     drop_ref(plr);
>
> lockdev_release(filp)
>     filp->private = NULL;
>     drop_ref(plr);
>
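
The refcount rules above can be sketched in plain userspace C. This is
only an illustration under stated assumptions: struct plr_sketch,
plr_create(), plr_get() and plr_put() are hypothetical names, a plain
counter stands in for the kernel's kref, and nothing here is atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct pseudo_lock_region. */
struct plr_sketch {
	int refcount;	/* plain counter; the kernel code would use a kref */
	bool *freed;	/* set on final release, for observation only */
};

/* create_lock_region(): the directory owns the initial reference. */
static struct plr_sketch *plr_create(bool *freed)
{
	struct plr_sketch *plr = malloc(sizeof(*plr));

	if (!plr)
		return NULL;
	plr->refcount = 1;
	plr->freed = freed;
	*freed = false;
	return plr;
}

/* lockdev_open(): every open file takes its own reference. */
static void plr_get(struct plr_sketch *plr)
{
	plr->refcount++;
}

/* rmdir and lockdev_release() both drop one reference; last one frees. */
static void plr_put(struct plr_sketch *plr)
{
	assert(plr->refcount > 0);
	if (--plr->refcount == 0) {
		*plr->freed = true;
		free(plr);
	}
}
```

With this shape, an rmdir while the region is still mapped only drops the
directory's reference; the memory goes away when the last file reference
is released.
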
>> /*
>> + * Only one pseudo-locked region can be set up at a time and that is
>> + * enforced by taking the rdt_pseudo_lock_mutex when the user writes the
>> + * requested schemata to the resctrl file and releasing the mutex on
>> + * completion. The thread locking the kernel memory into the cache starts
>> + * and completes during this time so we can be sure that only one thread
>> + * can run at any time.
>> + * The functions starting the pseudo-locking thread needs to wait for its
>> + * completion and since there can only be one we have a global workqueue
>> + * and variable to support this.
>> + */
>> +static DECLARE_WAIT_QUEUE_HEAD(wq);
>> +static int thread_done;
>
> Eew. For one, you really couldn't come up with more generic and less
> relatable variable names, right?
>
> That aside, it's just wrong to build code based on current hardware
> limitations. The waitqueue and the result code belong into PLR.

Will do. This also builds on your previous suggestion to not limit the
number of uninitialized pseudo-locked regions.

>
>> +/**
>> + * pseudo_lock_fn - Load kernel memory into cache
>> + *
>> + * This is the core pseudo-locking function.
>> + *
>> + * First we ensure that the kernel memory cannot be found in the cache.
>> + * Then, while taking care that there will be as little interference as
>> + * possible, each cache line of the memory to be loaded is touched while
>> + * core is running with class of service set to the bitmask of the
>> + * pseudo-locked region. After this is complete no future CAT allocations
>> + * will be allowed to overlap with this bitmask.
>> + *
>> + * Local register variables are utilized to ensure that the memory region
>> + * to be locked is the only memory access made during the critical locking
>> + * loop.
>> + */
>> +static int pseudo_lock_fn(void *_plr)
>> +{
>> + struct pseudo_lock_region *plr = _plr;
>> + u32 rmid_p, closid_p;
>> + unsigned long flags;
>> + u64 i;
>> +#ifdef CONFIG_KASAN
>> + /*
>> + * The registers used for local register variables are also used
>> + * when KASAN is active. When KASAN is active we use a regular
>> + * variable to ensure we always use a valid pointer, but the cost
>> + * is that this variable will enter the cache through evicting the
>> + * memory we are trying to lock into the cache. Thus expect lower
>> + * pseudo-locking success rate when KASAN is active.
>> + */
>
> I'm not a real fan of this mess. But well,
>
>> + unsigned int line_size;
>> + unsigned int size;
>> + void *mem_r;
>> +#else
>> + register unsigned int line_size asm("esi");
>> + register unsigned int size asm("edi");
>> +#ifdef CONFIG_X86_64
>> + register void *mem_r asm("rbx");
>> +#else
>> + register void *mem_r asm("ebx");
>> +#endif /* CONFIG_X86_64 */
>> +#endif /* CONFIG_KASAN */
>> +
>> + /*
>> + * Make sure none of the allocated memory is cached. If it is we
>> + * will get a cache hit in below loop from outside of pseudo-locked
>> + * region.
>> + * wbinvd (as opposed to clflush/clflushopt) is required to
>> + * increase likelihood that allocated cache portion will be filled
>> + * with associated memory
>
> Sigh.
>
>> + */
>> + wbinvd();
>> +
>> + preempt_disable();
>> + local_irq_save(flags);
>
> preempt_disable() is pointless when you disable interrupts. And this
> really should be local_irq_disable(). This code is always called with
> interrupts enabled....
>
>> + /*
>> + * Call wrmsr and rdmsr as directly as possible to avoid tracing
>> + * clobbering local register variables or affecting cache accesses.
>> + */
>
> You probably want to make sure that the code below is in L1 cache already
> before the CLOSID is set to the allocation. To do this you want to put the
> preload mechanics into a separate ASM function.
>
> Then you run it with size = 1 on some other temporary memory buffer with
> the default CLOSID, which has the CBM bits of the lock region excluded.
> Then switch to the real CLOSID and run the loop with the real buffer and
> the real size.

Thank you for the suggestion. I will experiment to see how this affects
the pseudo-locking success rate.

>> + __wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
>
> This wants an explanation why the prefetcher needs to be disabled.
>
>> +static int pseudo_lock_doit(struct pseudo_lock_region *plr,
>> + struct rdt_resource *r,
>> + struct rdt_domain *d)
>> +{
>> + struct task_struct *thread;
>> + int closid;
>> + int ret, i;
>> +
>> + /*
>> + * With the usage of wbinvd we can only support one pseudo-locked
>> + * region per domain at this time.
>
> This really sucks.
>
>> + */
>> + if (d->plr) {
>> + rdt_last_cmd_puts("pseudo-locked region exists on cache\n");
>> + return -ENOSPC;
>
> This check is not sufficient for a CPU which has both L2 and L3 allocation
> capability. If there is already a L3 locked region and the current call
> sets up a L2 locked region then this will not catch it and the following
> wbinvd will wipe the L3 locked region ....
>
>> + }
>> +
>> + ret = pseudo_lock_region_init(plr, r, d);
>> + if (ret < 0)
>> + return ret;
>> +
>> + closid = closid_alloc();
>> + if (closid < 0) {
>> + ret = closid;
>> + rdt_last_cmd_puts("unable to obtain free closid\n");
>> + goto out_region;
>> + }
>> +
>> + /*
>> + * Ensure we end with a valid default CLOS. If a pseudo-locked
>> + * region in middle of possible bitmasks is selected it will split
>> + * up default CLOS which would be a fault and for which handling
>> + * is unclear so we fail back to userspace. Validation will also
>> + * ensure that default CLOS is not zero, keeping some cache
>> + * available to rest of system.
>> + */
>> + if (!cbm_validate_val(d->ctrl_val[0] & ~plr->cbm, r)) {
>> + ret = -EINVAL;
>> + rdt_last_cmd_printf("bm 0x%x causes invalid clos 0 bm 0x%x\n",
>> + plr->cbm, d->ctrl_val[0] & ~plr->cbm);
>> + goto out_closid;
>> + }
>> +
>> + ret = pseudo_lock_clos_set(plr, 0, d->ctrl_val[0] & ~plr->cbm);
>
> Fiddling with the default CBM is wrong. The lock operation should only
> succeed when the bits in that domain are not used by _ANY_ control group
> including the default one. This is a reasonable constraint.

This changes one of my original assumptions. I will rework the code
around it, also because your later design suggestions affect this part.
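
As a rough illustration of that constraint, here is a userspace sketch.
The helper names cbm_is_contiguous() and cbm_is_unused() are hypothetical,
and the caller is assumed to pass the bitmasks of all control groups in
the domain, including the default one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if cbm is non-zero and its set bits form one contiguous run. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	uint32_t low;

	if (cbm == 0)
		return false;
	low = cbm & (~cbm + 1u);	/* isolate the lowest set bit */
	/* Adding it carries through the lowest run; leftovers mean a gap. */
	return ((cbm + low) & cbm) == 0;
}

/*
 * True if none of the bits in cbm are used by any of the num_closid
 * bitmasks in ctrl_val[], the default group (index 0) included.
 */
static bool cbm_is_unused(uint32_t cbm, const uint32_t *ctrl_val,
			  size_t num_closid)
{
	size_t i;

	for (i = 0; i < num_closid; i++) {
		if (ctrl_val[i] & cbm)
			return false;
	}
	return true;
}
```

Under this rule the lock request is simply refused on overlap, so the
default CBM never has to be modified or restored.
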

>> + if (ret < 0) {
>> + rdt_last_cmd_printf("unable to set clos 0 bitmask to 0x%x\n",
>> + d->ctrl_val[0] & ~plr->cbm);
>> + goto out_closid;
>> + }
>> +
>> + ret = pseudo_lock_clos_set(plr, closid, plr->cbm);
>> + if (ret < 0) {
>> + rdt_last_cmd_printf("unable to set closid %d bitmask to 0x%x\n",
>> + closid, plr->cbm);
>> + goto out_clos_def;
>> + }
>> +
>> + plr->closid = closid;
>> +
>> + thread_done = 0;
>> +
>> + thread = kthread_create_on_node(pseudo_lock_fn, plr,
>> + cpu_to_node(plr->cpu),
>> + "pseudo_lock/%u", plr->cpu);
>> + if (IS_ERR(thread)) {
>> + ret = PTR_ERR(thread);
>> + rdt_last_cmd_printf("locking thread returned error %d\n", ret);
>> + /*
>> + * We do not return CBM to newly allocated CLOS here on
>> + * error path since that will result in a CBM of all
>> + * zeroes which is an illegal MSR write.
>
> I'm not sure what you are trying to explain here.
>
> If you remove a ctrl group then the CBM bits are not added to anything
> either. It's up to the operator to handle that. Why would this be any
> different for the pseudo-locking stuff?

It is not different, no. On failure the closid is released but the CBM
associated with it remains. Here I attempted to explain why the CBM
remains. This is the same behavior as current CAT. I will remove the
comment since it is just causing confusion.

>> + */
>> + goto out_clos_def;
>> + }
>> +
>> + kthread_bind(thread, plr->cpu);
>> + wake_up_process(thread);
>> +
>> + ret = wait_event_interruptible(wq, thread_done == 1);
>> + if (ret < 0) {
>> + rdt_last_cmd_puts("locking thread interrupted\n");
>> + goto out_clos_def;
>
> This is broken. If the thread does not get on the CPU for whatever reason
> and the process which sets up the region is interrupted then this will
> leave the thread in runnable state and once it gets on the CPU it will
> happily dereference the freed plr struct and fiddle with the freed memory.
>
> You need to make sure that the thread holds a reference on the plr struct,
> which prevents freeing. That includes the CLOSID .....

Thanks for catching this.

>
>> + }
>> +
>> + /*
>> + * closid will be released soon but its CBM as well as CBM of not
>> + * yet allocated CLOS as stored in the array will remain. Ensure
>> + * that CBM will be what is currently the default CLOS, which
>> + * excludes pseudo-locked region.
>> + */
>> + for (i = 1; i < r->num_closid; i++) {
>> + if (i == closid || !closid_allocated(i))
>> + pseudo_lock_clos_set(plr, i, d->ctrl_val[0]);
>> + }
>
> This is all magical duct tape. The overall design of this is sideways and
> not really well integrated into the existing infrastructure which creates
> these kinds of magic warts and lots of duplicated code.
>
> The deeper I read into the patch series the less I like that interface and
> the implementation.
>
> Let's look at the existing ctrl/mon groups which are each represented by a
> directory already.
>
> - Adding a 'size' file to the ctrl groups would be a natural extension
> which makes sense for regular cache allocations as well.
>
> - Adding an 'exclusive' flag would be an interesting feature even for the
> normal use case. Marking a group as exclusive prevents other groups from
> requesting CBM bits which are held by an exclusive allocation.
>
> I'd suggest to have a file 'mode' for controlling this. The valid values
> would be something like 'shareable' and 'exclusive'.
>
> When trying to set a group to exclusive mode then the schemata has to be
> checked for overlaps with the other schematas and in case of conflict
> the write fails. Once enabled subsequent writes to the schemata file
> need to be checked for conflicts as well.
>
> If the exclusive setting is enabled then the CBM bits of that group
> are excluded from being used in other control groups.
>
> Aside of that a file in the info directory which shows the (un)used CBM
> bits of all groups is really helpful for controlling all of that (even w/o
> pseudo locking). You have this in the 'avail' file, but there is no reason
> why this should only be available for pseudo locking enabled systems.
>
> Now for the pseudo locking part.
>
> What you need on top of the above is a new 'mode': 'locked'. That mode
> utilizes the 'exclusive' mode rules vs. conflict checking and the
> protection against allocating the associated CBM bits in other control
> groups.
>
> The setup would be like this:
>
> mkdir group
> echo '$CONFIG' >group/schemata
> echo 'locked' >group/mode
>
> Setting mode to locked locks down the schemata file along with the
> task/cpus/cpus_list files. The task/cpu files need to be empty when
> entering locked mode, otherwise the operation fails. I would not even
> bother handing back the CLOSID. For simplicity the CLOSID should just stay
> associated with the control group until it is destroyed as any other
> control group.

Thank you so much for taking the time to do this thorough review and to
make these suggestions. While I am still digesting the details I do
intend to follow all (as well as the ones earlier I did not explicitly
respond to).
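
To check my understanding of the proposed mode rules, here is a userspace
sketch. Everything in it is an assumption for illustration: the names
(grp_sketch, grp_set_mode(), grp_write_schemata()), the error codes, and
the choice to refuse leaving 'locked' mode, which was not discussed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum grp_mode { MODE_SHAREABLE, MODE_EXCLUSIVE, MODE_LOCKED };

/* Hypothetical per-domain view of one control group. */
struct grp_sketch {
	enum grp_mode mode;
	uint32_t cbm;
	unsigned int nr_tasks;
	unsigned int nr_cpus;
};

/* Does cbm overlap another group, given the mode 'self' would have? */
static bool grp_conflicts(const struct grp_sketch *grps, size_t n,
			  size_t self, enum grp_mode mode, uint32_t cbm)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == self || !(grps[i].cbm & cbm))
			continue;
		/* Overlap is only allowed between two shareable groups. */
		if (mode != MODE_SHAREABLE || grps[i].mode != MODE_SHAREABLE)
			return true;
	}
	return false;
}

static int grp_set_mode(struct grp_sketch *grps, size_t n, size_t self,
			enum grp_mode mode)
{
	struct grp_sketch *g = &grps[self];

	if (g->mode == MODE_LOCKED)
		return -EBUSY;	/* assumption: locked is final until rmdir */
	if (grp_conflicts(grps, n, self, mode, g->cbm))
		return -EINVAL;
	/* tasks/cpus must be empty when entering locked mode */
	if (mode == MODE_LOCKED && (g->nr_tasks || g->nr_cpus))
		return -EINVAL;
	g->mode = mode;
	return 0;
}

static int grp_write_schemata(struct grp_sketch *grps, size_t n,
			      size_t self, uint32_t cbm)
{
	struct grp_sketch *g = &grps[self];

	if (g->mode == MODE_LOCKED)
		return -EPERM;	/* schemata is locked down */
	if (grp_conflicts(grps, n, self, g->mode, cbm))
		return -EINVAL;
	g->cbm = cbm;
	return 0;
}
```

In this sketch 'locked' reuses the 'exclusive' conflict check and only
adds the empty tasks/cpus requirement and the schemata lockdown.
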

Keeping the CLOSID associated with the pseudo-locked region will surely
make the above simpler since CLOSIDs are associated with resource
groups (represented by the directories). I would like to highlight that
on some platforms only a few CLOSIDs (for example, 4) are available.
Not releasing a CLOSID would thus reduce a pool that is already
limited. These platforms do have smaller possible bitmasks though (for
example, 8 possible bits), which may lessen this concern. I mention it
only as information about the consequence of this simplification.

> Now the remaining thing is the memory allocation and the mmap itself. I
> really dislike the preallocation of memory right at setup time. Ideally
> that should be an allocation of the application itself, but the horrid
> wbinvd stuff kinda prevents that. With that restriction we are more or less
> bound to immediate allocation and population.

Acknowledged. I am not sure if the current permissions would support
such a dynamic setup though. At this time the system administrator is
the one who sets up the pseudo-locked region and can, through the
permissions of the character device, give user space applications
access to these regions.

Reinette