Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth Monitoring (MBM) PMU
From: Matt Fleming
Date: Tue Sep 08 2015 - 07:47:30 EST
On Mon, 2015-09-07 at 20:22 +0100, Juvva, Kanaka D wrote:
> Hi Thomas,
>
> I'm sending updated patch(s). I have given details for each of
> these items below.
>
Kanaka, this email is HTML formatted and so has been blocked by
vger.kernel.org where the linux-kernel mailing list is hosted.
Please configure outlook not to send html email, or use a different mail
agent for working with upstream.
> Regards,
> -Kanaka
>
> > -----Original Message-----
> > From: Thomas Gleixner [mailto:tglx@xxxxxxxxxxxxx]
> > Sent: Wednesday, August 19, 2015 1:50 PM
> > To: Kanaka Juvva
> > Cc: Juvva, Kanaka D; Williamson, Glenn P; Fleming, Matt; Auld, Will;
> Andi Kleen;
> > LKML; Luck, Tony; Peter Zijlstra; Tejun Heo; x86@xxxxxxxxxx; Ingo
> Molnar; H.
> > Peter Anvin; Shivappa, Vikas
> > Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth
> Monitoring
> > (MBM) PMU
> >
> > On Fri, 7 Aug 2015, Kanaka Juvva wrote:
> > > +#define MBM_CNTR_MAX 0xffffff
> > > +#define MBM_SOCKET_MAX 8
> > > +#define MBM_TIME_DELTA_MAX 1000
> > > +#define MBM_TIME_DELTA_MIN 100
> >
> > What are these constants for and how are they determined? Pulled out of thin
> > air?
> >
>
> /*
> * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
> * value
> */
> #define MBM_CNTR_MAX 0xffffff
> /*
> * Max #sockets supported
> */
> #define MBM_SOCKET_MAX 8
This seems like a constant we could get by without. Do we really need to
know this at compile time?
> /*
> * Expected time interval between consecutive MSR reads for a given rmid
> */
> #define MBM_TIME_DELTA_MAX 1000
"max" and "expected" are not the same thing.
> > > #define QOS_L3_OCCUP_EVENT_ID (1 << 0)
> > > +#define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
> > > +#define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
> > > +#define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
> >
> > So we have ID values which are built with (1 << X) and then this HW variant in the
> > middle with 0x3. Of course without any explanation what the heck this stuff is.
> >
> > Last review:
> >
> > "So this wants a descriptive ID name and a comment."
> >
> >
>
> /*
> * MBM Event IDs as defined in SDM section 17.14.6
> * Event IDs used to program MSRs for reading counters
> */
> #define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
> #define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
> /*
> * Perf needs event id to be 1 << x, hence we can't use 0x3 (HW EVENT ID)
> * for MBM_LOCAL_EVENT we use next 1 << x for MBM_LOCAL_EVENT_ID
> */
> #define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
No, perf events do not need to be of the form (1 << X), that was just a
convention we used in the cqm code before we knew what values the MBM
events would take - you can change these to be whatever format you want,
but be sure to make it consistent.
The constants are very much supposed to be programmed into the MSRs,
take a look at __rmid_read().
I would suggest (as I already did privately) that you change the format
to be 0x0x for all of these event IDs.
> > > @@ static bool intel_cqm_sched_in_event(u32 rmid)
> > > return false;
> > > }
> > >
> > > +
> > > +static u32 bw_sum_calc(struct sample *bw_stat, int rmid) {
> > > + u32 val = 0, i, j, index;
> > > +
> > > + if (++bw_stat->fifoout >= mbm_window_size)
> > > + bw_stat->fifoout = 0;
> > > + index = bw_stat->fifoout;
> > > + for (i = 0; i < mbm_window_size - 1; i++) {
> > > + if (index + i >= mbm_window_size)
> > > + j = index + i - mbm_window_size;
> > > + else
> > > + j = index + i;
> > > + val += bw_stat->mbmfifo[j];
> > > + }
> >
> > This math wants a explanatory comment.
> >
> /*
> * Slide the window by 1 and calculate the sum of the last
> * mbm_window_size-1 bandwidth values.
> * fifoout is the current position of the window.
> * Increment the fifoout by 1 to slide the window by 1.
> *
> * Calcalute the bandwidth from ++fifiout to ( ++fifoout + mbm_window_size -1)
> * e.g.fifoout =1; Bandwidth1 Bandwidth2 ..... Bandwidthn are the
> * sliding window values where n is size of the sliding window
> * bandwidth sum: val = Bandwidth2 + Bandwidth3 + .. Bandwidthn
> */
Instead of these large comment blocks please comment smaller,
logically-connected chunks of code, e.g.
/* Slide the window by one */
if (++bw_stat->fifoout >= mbm_window_size)
bw_stat->fifoout = 0;
/*
* Calculate the sum of last mbm_window_size-1 values.
*/
for (i = 0; i < mbm_window_size - 1; i++) {
/* Handle wraparound at end of window */
if (index + i >= mbm_window_size)
j = index + i - mbm_window_size;
else
j = index + i;
val += bw_stat->mbminfo[j];
}
>
>
> > > + return val;
> > > +}
> > > +
> > > +static u32 __mbm_fifo_sum_lastn_out(int rmid, bool is_localbw) {
> > > + if (is_localbw)
> > > + return bw_sum_calc(&mbm_local[rmid], rmid);
> > > + else
> > > + return bw_sum_calc(&mbm_total[rmid], rmid); }
> > > +
> > > +static void __mbm_fifo_in(struct sample *bw_stat, u32 val) {
> > > + bw_stat->mbmfifo[bw_stat->fifoin] = val;
> > > + if (++bw_stat->fifoin >= mbm_window_size)
> >
> > How does that become greater than mbm_windowsize?
> >
>
> This is fixed by changing >= to ==
> Added a comment:
>
> /*
> * store current sample's bw value in sliding window at the
> * index fifoin. Increment fifoin. Check if fifoin has reached
> * max_window_size. If yes reset it to begining i.e. zero
> * e.g.
> * mbm_window_size = 10
> * mbmfifo is a circular fifo 0 1 2 3 4 5 6 7 8 9 10
> * ^ |
> * | |
> * | _ _ _ _ _ _ _ _ _ _|
> *
> * So when fifoin becomes 10, then it is reset to zero
> *
> */
I'm not sure that this comment adds anything of value that isn't already
understood by reading the code. I don't think you need a comment for
this function, it seems pretty straight forward and Thomas' question was
about the boundary limits of ->fifoin.
> > > + bw_stat->fifoin = 0;
> > > +}
> >
> > > +/*
> > > + * __rmid_read_mbm checks whether it is LOCAL or GLOBAL MBM event and
> > > +reads
> > > + * its MSR counter. Check whether overflow occurred and handles it.
> > > +Calculates
> > > + * currenet BW and updates running average.
> >
> > currenet? And please get rid of the double spaces
> >
> This is fixed now. Here is the updated comment:
>
> /*
> * rmid_read_mbm checks whether it is LOCAL or Total MBM event and reads
> * its MSR counter. Check whether overflow occured and handles it. Calculates
> * currenet BW and updates running average.
> *
^^^ You've still misspelled current.
>
> * Overflow Handling:
> * if (MSR current value < MSR previous value) it is an
> * overflow. MSR values are increasing when bandwidth consumption for the thread
> * is non-zero; When MSR values reaches MAX_COUNTER_VALUE it overflows. After overflow,
> * MSR current value goes back to zero and starts increasing again at the rate of
> * bandwidth.
> *
You don't need to provide a definition of "overflow", most people will
be familiar with it. What is more important to document is how the
overflow is handled...
>
> * Overflow handling:
> * Detect an overflow : current read value > last read value
Isn't this inverted? Overflow occurred if current < previous.
>
> * Overflow correction: if (overflow)
> * Current value = (MAX_COUNTER_VALUE - prev read value) + current read value
> * else
> * Current value = current read value
> *
Please don't write pseudocode in the comments. Use English prose to
describe the important parts of the code.
>
> * Calculation of Current Bandwidth value:
> * If MSR is read within last 100ms, then then the smaple is ignored;
> * If the MSR was Read with in last 100ms, why incur an extra overhead
> * of doing the MSR reads again. Anyway there'll be a negligible change or zero
> * change in MSR readings in 100ms.
> *
> * Bandwidth is calculated as:
> * memory bandwidth = difference of last two msr counter values/time difference.
> *
> * cum_avg = Running Average bandwidth of last 'n' bandwidth values for
> * the samples that are processed
> *
Where 'n' is 'mbm_window_size' ? If so, please use 'mbm_window_size',
not 'n'.
>
> * Sliding window is used to save the last 'n' samples. Where,
> * n = sliding_window_size and results in sliding window duration of 'n' secs.
Hmm... this confuses me a lot. Is 'n' a size or a duration? The two are
not the same thing.
>
> * The sliding window size by default set to
> * MBM_FIFO_SIZE_MIN. User can configure it to the values in the range
> * (MBM_FIFO_SIZE_MIN,MBM_FIFO_SIZE_MAX). The range for sliding window
> * is chosen based on a general criteria for monitoring duration. Example
> * for a short lived application, 10sec monitoring period gives
> * good characterization of its bandwidth consumption. For an application
> * that runs for longer duration, 300sec monitoring period gives better
> * characterization of its bandwidth consumption. Since the running average
> * calculated for total monitoring period, user gets the most accuracate
> * average bandwidth for the each monitoring period.
> *
> * Scaling:
> * cum_avg is the raw bandwidth is Bytes/sec.
> * cum_avg is converted to MB/sec by applying MBM_CONVERSION_FACTOR and
> * rounded to nearest integer. User interface gets the Bandwidth values in MB/sec.
> *
> */
> > > + *
> > > + * Overflow Handling:
> > > + * if (MSR current value < MSR previous value) it is an
> > > + * overflow. and overflow is handled.
> >
> > Wow. That's informative as hell!
> >
> Please look at the modified comment above
> > > + *
> > > + * Calculation of Current BW value:
> >
> > BW == Body Weight?
> >
>
> It is fixed now
>
> > > + * If MSR is read within last 100ms, then the value is ignored;
> > > + * this will suppress small deltas. We don't process MBM samples that
> > > + are
> > > + * within 100ms.
> >
> > WHY?
> >
> Explained in the comment. If mbm_read is called within in 100ms for the same rmid, we donât
> have to process the sample.
The key piece of information you're missing here is that skipping these
small deltas is an optimization, because we avoid performing costly
operations for what would likely be a very minor change in the MBM data,
right?
> > > +{
> > > + u64 val, tmp, diff_time, cma, bytes, index;
> > > + bool overflow = false, first = false;
> > > + ktime_t cur_time;
> > > + u32 tmp32 = rmid;
> > > + struct sample *mbm_current;
> > > + u32 vrmid = topology_physical_package_id(smp_processor_id()) *
> > > + cqm_max_rmid + rmid;
> > > +
> > > + rmid = vrmid;
> >
> > From my previous review:
> >
> > "This is completely backwards.
> >
> > tmp32 = rmid;
> > rmid = vrmid;
> > do_stuff(rmid);
> > rmid = tmp32;
> > do_other_stuff(rmid);
> >
> > Why can't you use vrmid for do_stuff() and leave rmid alone? Just
> > because it would make the code simpler to read?"
> >
> > Still applies.
> >
>
> This is now changed to
> u64 val, currentmsr, currentbw, diff_time, cma, bytes, index;
> bool overflow = false, first = false;
> ktime_t cur_time;
> u32 tmp32 = rmid, eventid;
> struct sample *mbm_current;
> u32 vrmid = rmid_2_index(rmid);
>
> rmid = vrmid;
> cur_time = ktime_get();
> if (read_mbm_local) {
> mbm_current = &mbm_local[vrmid];
> eventid = QOS_MBM_LOCAL_EVENT_ID_HW;
> wrmsr(MSR_IA32_QM_EVTSEL, QOS_MBM_LOCAL_EVENT_ID_HW, rmid);
You don't need to perform this wrmsr() here because it's taken care of
in the common code below.
> } else {
> mbm_current = &mbm_total[vrmid];
> eventid = QOS_MBM_TOTAL_EVENT_ID;
> }
> rmid = tmp32;
Why did you assign rmid to vrmid if you reassign it before it was used?
> > > + /* if current msr value < previous msr value , it means overflow */
> > > + if (val < bytes) {
> > > + val = MBM_CNTR_MAX - bytes + val;
> > > + overflow = true;
> > > + } else
> > > + val = val - bytes;
> > > +
> > > + val = (val * MBM_TIME_DELTA_MAX) / diff_time;
> > > +
> > > + if ((diff_time > MBM_TIME_DELTA_MAX) && (!cma))
> > > + /* First sample */
> > > + first = true;
> > > +
> > > + rmid = vrmid;
> >
> > And another time:
> >
> > "More obfuscation"
> >
>
> /*
> * MBM_TIME_DELTA_MAX is picked as per MBM specs. As specified in Intel Platform
> * Quality of Service Monitoring Implementer's Guide V1, Section 2.7.2. page 21,
> * overflow can occur maximum once in a second. So latest we want to read the MSR
> * counters is 1000ms. If it is less than 1000ms we can ignore the sample. Then we
> * decide since when we should ignore. If the MSR was Read with in last 100ms, why
> * process the MSR reads again. Anyway there'll be small change or zero change.
> * So ignoring MSR Reads within 100ms or less is efficient. MBM_TIME_DELTA_MIN
> * is specified as 100ms as per this guideline.
> *
> */
I suspect the document you're referring to above is only available under
NDA, which makes it unsuitable for mention in the kernel source since a
large number of people won't have access to it.
Just explain that the way the hardware is designed puts an upper limit
on how quickly the counter can overflow, which is once per second.
> > > +static void __intel_cqm_event_total_bw_count(void *info) {
> > > + struct rmid_read *rr = info;
> > > + u64 val;
> > > +
> > > + val = __rmid_read_mbm(rr->rmid, false);
> > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > > + return;
> > > + atomic64_add(val, &rr->value);
> > > +}
> > > +
> > > +static void __intel_cqm_event_local_bw_count(void *info) {
> > > + struct rmid_read *rr = info;
> > > + u64 val;
> > > +
> > > + val = __rmid_read_mbm(rr->rmid, true);
> > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
> > > + return;
> > > + atomic64_add(val, &rr->value);
> > > +}
> >
> > And once more:
> >
> > "You're really a fan of copy and paste."
> >
>
> These functions are invoked indirectly. They were written keeping intel_cqm_event_count in mind.
> Iâll change the arg to struct mbm_read{
> struct rmid_read *rr;
> u32 eventid;
> };
> Intel_cqm_event_*_bw_count(â.) needs eventid to call for decoding
No, please do not duplicate the rmid_read structure, that is not an
improvement, we don't need two different structs for reading the read
data.
Please add the event field to the existing struct rmid_read.
> > > @@ -1023,6 +1437,17 @@ static void intel_cqm_event_stop(struct
> > perf_event *event, int mode)
> > > } else {
> > > WARN_ON_ONCE(!state->rmid);
> > > }
> > > +
> > > + if (pmu) {
> > > + if (pmu->n_active > 0) {
> >
> > What's the purpose of this check? In the previous version there was a
> > WARN_ON(), which made sense. Did it trigger and you decided to "work"
> > around it?
> >
>
> We actually meant to check if there are active events
I don't follow this answer. Are you saying that the WARN_ON() doesn't
make sense here?
>
> > > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit,
> > > +"KB/sec"); EVENT_ATTR_STR(llc_local_bw.unit,
> > > +intel_cqm_llc_local_bw_unit, "KB/sec"); #endif
> >
> > > +static ssize_t
> > > +sliding_window_size_store(struct device *dev,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t count)
> > > +{
> > > + unsigned int bytes;
> > > + int ret;
> > > +
> > > + ret = kstrtouint(buf, 0, &bytes);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + mutex_lock(&cache_mutex);
> > > + if (bytes > 0 && bytes <= MBM_FIFO_SIZE_MAX)
> > > + mbm_window_size = bytes;
> >
> > So, it's valid to set the window to X where 0 < X < MBM_FIFO_SIZE_MIN.
> > What's the actual purpose of MBM_FIFO_SIZE_MIN?
> >
> This is changed to
> if (bytes >= MBM_FIFO_SIZE_MIN && bytes <= MBM_FIFO_SIZE_MAX)
> mbm_window_size = bytes;
Note that if the user passes a value outside of this range you should be
returning -EINVAL to indicate that.
> > > + pmu->timer_interval = ms_to_ktime(MBM_TIME_DELTA_MAX);
> > > + per_cpu(mbm_pmu, cpu) = pmu;
> > > + per_cpu(mbm_pmu_to_free, cpu) = NULL;
> >
> > What's the point of this? If there is still something to be free'd its leaked.
> > Otherwise that's redundant.
> per_cpu(mbm_pmu_to_free, cpu) = NULL; is removed
>
> > > + mbm_hrtimer_init(pmu);
> > > + }
> > > + return 0;
> >
> > s/0/NOTIFY_OK/ because you return that value directly.
> >
> You mean I return the âreturn codeâ
?
You should be using NOTIFY_OK here so that you follow the notifier API
convention.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/