Re: [RFD] CAT user space interface revisited

From: Marcelo Tosatti
Date: Mon Jan 04 2016 - 12:44:39 EST


On Mon, Jan 04, 2016 at 03:20:54PM -0200, Marcelo Tosatti wrote:
> On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > Marcelo,
> >
> > On Thu, 31 Dec 2015, Marcelo Tosatti wrote:
> >
> > First of all thanks for the explanation.
> >
> > > There is one directory structure in this topic, CAT. That is the
> > > directory structure which is exposed to userspace to control the
> > > CAT HW.
> > >
> > > With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> > > x86: Intel Cache Allocation Technology Support"), the directory
> > > structure there (the files and directories exposed by that patchset)
> > > (*1) does not allow one to configure different CBM masks on each socket
> > > (that is, it forces the user to configure the same CBM mask on every
> > > socket). This is a blocker for us, and it is one of the points in your
> > > proposal.
> > >
> > > There was a call between Red Hat and Intel where it was communicated
> > > to Intel, and Intel agreed, that it was necessary to fix this (fix this
> > > == allow different CBM masks on different sockets).
> > >
> > > Now, that is one change to the current directory structure (*1).
> >
> > I don't have an idea what that would look like. The current structure is a
> > cgroups-based, hierarchy-oriented approach, which does not allow simple
> > things like
> >
> > T1 00001111
> > T2 00111100
> >
> > at least not in a way which is natural to the problem at hand.
>
>
>
> cgroupA/
>     cbm_mask              (if set, applies to all sockets)
>     socket1/cbm_mask
>     socket2/cbm_mask
>     ...
>     socketN/cbm_mask      (if set, overrides the global cbm_mask)
>
> Something along those lines.
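To make that concrete, a rough sketch of how such a layout could be
used (paths and file names here are hypothetical, just to illustrate
the per-socket override; assume the cgroup is mounted at
/sys/fs/cgroup/intel_rdt):

  # Global mask: used by every socket that has no override.
  echo 0x000ff > /sys/fs/cgroup/intel_rdt/cgroupA/cbm_mask

  # Socket 2 is special: override the global mask there only.
  echo 0x3ff00 > /sys/fs/cgroup/intel_rdt/cgroupA/socket2/cbm_mask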
>
> Do you see any problem with it?
>
> > > (*1) modified to allow for different CBM masks on different sockets,
> > > let's say (*2), is what we have been waiting for Intel to post.
> > > It would handle our use case, and all use cases which the current
> > > patchset from Intel already handles (Vikas posted emails mentioning
> > > there are happy users of the current interface; feel free to ask
> > > him for more details).
> >
> > I cannot imagine how that modification to the current interface would solve
> > that. Not to talk about per CPU associations which are not related to tasks at
> > all.
>
> Not sure what you mean by per CPU associations.
>
> If you fix a CBM mask on a given pCPU, say CPU1, and control which
> tasks run on that pCPU, then you control the CBM mask for all tasks
> (say tasklist-1) on that CPU, fine.
>
> You can achieve the same by putting all tasks from tasklist-1 into a
> cgroup.
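Roughly like this (a sketch only; the l3_cbm file name follows the
posted cgroup patches, while the mount points and the tasklist-1 file
of PIDs are assumptions made for the example):

  # Group with its own cache mask, pinned to CPU 1 via cpuset.
  mkdir /sys/fs/cgroup/intel_rdt/grp1
  echo 0x0000f > /sys/fs/cgroup/intel_rdt/grp1/intel_rdt.l3_cbm

  mkdir /sys/fs/cgroup/cpuset/grp1
  echo 1 > /sys/fs/cgroup/cpuset/grp1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/grp1/cpuset.mems

  # Move every task of tasklist-1 into both groups.
  for pid in $(cat tasklist-1); do
          echo $pid > /sys/fs/cgroup/intel_rdt/grp1/tasks
          echo $pid > /sys/fs/cgroup/cpuset/grp1/tasks
  done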
>
> > > What I have asked you, and you replied "to go Google read my previous
> > > post", is this:
> > > What are the advantages of your proposal (which is a completely
> > > different directory structure, requiring a complete rewrite)
> > > over (*2)?
> > >
> > > (My reason behind this: if you, with maintainer veto power, force your
> > > proposal to be accepted, it will be necessary to wait for another
> > > rewrite (a new set of problems, fully thinking through your proposal,
> > > testing it, ...) rather than simply modifying an already known,
> > > reviewed, already used directory structure.)
> > >
> > > And functionally, your proposal adds nothing to (*2) (other than, well,
> > > being a different directory structure).
> >
> > Sorry. I cannot see at all how a modification to the existing interface would
> > cover all the sensible use cases I described in a coherent way. I really want
> > to see a proper description of the interface before people start hacking on it
> > in a frenzy. What you described is a "let's say (*2)" modification. That's
> > pretty meager.
> >
> > > If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> > > I am fine with that. But since I doubt that will be the case, I am
> > > pushing for the interface which requires the least amount of changes
> > > (and therefore the least amount of time) to be integrated.
> > >
> > > From your email:
> > >
> > > "It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> > >
> > > We really need to make this as configurable as possible from userspace
> > > without imposing random restrictions to it. I played around with it on
> > > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > > enabled) makes it really useless if we force the ids to have the same
> > > meaning on all sockets and restrict it to per task partitioning."
> > >
> > > Yes, that's the issue we hit, that is the modification that was agreed
> > > with Intel, and that's what we are waiting for them to post.
> >
> > How do you implement the above - especially that part:
> >
> > "It would even be sufficient for particular use cases to just associate a
> > piece of cache to a given CPU and do not bother with tasks at all."
> >
> > as a "simple" modification to (*1) ?
>
> As noted above.
> >
> > > > I described a directory structure for that qos/cat stuff in my proposal and
> > > > that's complete AFAICT.
> > >
> > > Ok, let's make the submitter's job easier. You are the maintainer,
> > > so you decide.
> > >
> > > Is it enough for you to have (*2) (which was agreed with Intel), or
> > > would you prefer to integrate the directory structure from
> > > "[RFD] CAT user space interface revisited"?
> >
> > The only thing I care about as a maintainer is that we merge something
> > which actually reflects the properties of the hardware and gives the
> > admin the required flexibility to utilize it fully. I don't care at all
> > if it's my proposal or something else which allows doing the same.
> >
> > Let me copy the relevant bits from my proposal here once more and let me
> > ask questions about the various points so you can tell me how that
> > modification to (*1) is going to deal with that.
> >
> > >> At top level:
> > >>
> > >> xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
> > >> xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
>
> This can be exposed to userspace via a file.
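For example (hypothetical mount point, example values only):

  cat /sys/fs/cgroup/cat/max_cosids      # e.g. 16
  cat /sys/fs/cgroup/cat/max_maskbits    # e.g. 20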
>
> > >> xxxxxxx/cat/cdp_enable <- Depends on CDP availability
> >
> > Where is that information in (*2) and how is that related to (*1)? If you
> > think it's not required, please explain why.
>
> Intel has come up with a scheme to implement CDP. I'll go read
> that and reply to this email afterwards.

Pasting the relevant parts of the patchset submission below.
Looks fine to me: two files, one for the data cache CBM mask, another
for the instruction cache CBM mask.
Those two files would be moved to the "socket-N" directories, roughly
as sketched below.
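For illustration only (a hypothetical layout, not what the patchset
below implements):

  cgroupA/
      socket-0/dcache_cbm
      socket-0/icache_cbm
      socket-1/dcache_cbm
      socket-1/icache_cbm
      ...

  # e.g. different code/data masks per socket:
  echo 0x003 > cgroupA/socket-0/dcache_cbm
  echo 0x00c > cgroupA/socket-0/icache_cbm
  echo 0x0ff > cgroupA/socket-1/dcache_cbm
  echo 0x300 > cgroupA/socket-1/icache_cbm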

(will review the CDP patchset...).

Subject: [PATCH V2 0/5] x86: Intel Code Data Prioritization Support

This patch set supports Intel code data prioritization, which is an
extension of cache allocation and allows allocating code and data cache
separately. It also includes the cgroup interface for the user as
separate patches. The cgroup interface for cache alloc is also resent.

This patch adds enumeration support for the Code Data Prioritization
(CDP) feature found in future Intel Xeon processors. It includes CPUID
enumeration routines for CDP.

CDP is an extension to Cache Allocation and lets threads allocate a
subset of the L3 cache for code and data separately. The allocation is
represented by the code or data cache capacity bit mask (cbm) MSRs
IA32_L3_QOS_MASK_n. Each Class of Service would be associated with one
dcache_cbm and one icache_cbm MSR, and hence the number of available
CLOSids is halved with CDP. The association for a CLOSid 'n' is shown
below:

  data_cbm_address(n) = base + (n << 1)
  code_cbm_address(n) = base + (n << 1) + 1

During scheduling the kernel writes the CLOSid of the thread to the
IA32_PQR_ASSOC MSR.
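(Just to illustrate the address math above, a quick shell sketch; the
base address 0xc90 for IA32_L3_QOS_MASK_0 is an assumption taken from
the SDM, not from this patchset:)

  base=0xc90
  for n in 0 1 2 3; do
          data=$(( base + (n << 1) ))
          code=$(( base + (n << 1) + 1 ))
          printf 'CLOSid %d: data cbm MSR 0x%x, code cbm MSR 0x%x\n' \
                  $n $data $code
  done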

Adds two files, 'dcache_cbm' and 'icache_cbm', to the intel_rdt cgroup
when code data prioritization (CDP) support is present. The files
represent the data capacity bit mask (cbm) and the instruction cbm for
the L3 cache. The user can specify the data and code cbm, and the
threads belonging to the cgroup get to fill the L3 cache regions
represented by those cbms with data or code.

For example, consider a scenario where the max cbm is 10 bits and the
L3 cache size is 10MB, so each cbm bit covers 1MB: specifying
dcache_cbm = 0x3 and icache_cbm = 0xc would then give the tasks 2MB of
exclusive cache for data and 2MB for code to fill in.
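(As a concrete sequence using the files described here; the cgroup
mount point and group name are assumptions made for the example:)

  cd /sys/fs/cgroup/intel_rdt/mygroup
  # The first of these writes switches the hardware into CDP mode
  # (see below), provided enough CLOSids are free.
  echo 0x3 > dcache_cbm        # bits 0-1: 2MB exclusively for data
  echo 0xc > icache_cbm        # bits 2-3: 2MB exclusively for code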

This feature is an extension to cache allocation and lets the user
specify a capacity for code and data separately. Initially these cbms
would have the same value as the l3_cbm (which represents the common
cbm for code and data). Once the user tries to write to either the
dcache_cbm or icache_cbm, the kernel tries to enable CDP mode in
hardware by writing to the IA32_PQOS_CFG MSR. The switch is only
possible if the number of Class of Service IDs (CLOSids) in use is less
than half of the total CLOSids available at the time of the switch.
This is because the CLOSids are halved once CDP is enabled, and each
CLOSid then maps to a data IA32_L3_QOS_n MSR and a code IA32_L3_QOS_n
MSR.
Once CDP is enabled, the user can use dcache_cbm and icache_cbm just
like l3_cbm. The CLOSids are not exposed to the user and are maintained
by the kernel internally.
