Re: [PATCH v3 1/2] perf, x86: add Intel Memory Bandwidth Monitoring (MBM) PMU

From: kanaka . d . juvva
Date: Tue Sep 08 2015 - 15:58:57 EST


Hi All,

I have switched to Linux based email client (SquirrelMail) right now.
If required I'll resend the emails I just sent now. Sorry about the
repetition.
We have some of our team members planning for travel. I was in a rush.

Regards,
-Kanaka

> On Mon, 2015-09-07 at 20:22 +0100, Juvva, Kanaka D wrote:
>> Hi Thomas,
>>
>> I'm sending updated patch(s). I have given details for each of
>> these items below.
>>
>
> Kanaka, this email is HTML formatted and so has been blocked by
> vger.kernel.org where the linux-kernel mailing list is hosted.
>
> Please configure outlook not to send html email, or use a different mail
> agent for working with upstream.
>
>> Regards,
>> -Kanaka
>>
>> > -----Original Message-----
>> > From: Thomas Gleixner [mailto:tglx@xxxxxxxxxxxxx]
>> > Sent: Wednesday, August 19, 2015 1:50 PM
>> > To: Kanaka Juvva
>> > Cc: Juvva, Kanaka D; Williamson, Glenn P; Fleming, Matt; Auld, Will;
>> Andi Kleen;
>> > LKML; Luck, Tony; Peter Zijlstra; Tejun Heo; x86@xxxxxxxxxx; Ingo
>> Molnar; H.
>> > Peter Anvin; Shivappa, Vikas
>> > Subject: Re: [PATCH v3 1/2] perf,x86: add Intel Memory Bandwidth
>> Monitoring
>> > (MBM) PMU
>> >
>> > On Fri, 7 Aug 2015, Kanaka Juvva wrote:
>> > > +#define MBM_CNTR_MAX 0xffffff
>> > > +#define MBM_SOCKET_MAX 8
>> > > +#define MBM_TIME_DELTA_MAX 1000
>> > > +#define MBM_TIME_DELTA_MIN 100
>> >
>> > What are these constants for and how are they determined? Pulled out
>> of thin
>> > air?
>> >
>>
>> /*
>> * MBM Counter is 24bits wide. MBM_CNTR_MAX defines max counter
>> * value
>> */
>> #define MBM_CNTR_MAX 0xffffff
>> /*
>> * Max #sockets supported
>> */
>> #define MBM_SOCKET_MAX 8
>
> This seems like a constant we could get by without. Do we really need to
> know this at compile time?
>
>> /*
>> * Expected time interval between consecutive MSR reads for a given rmid
>> */
>> #define MBM_TIME_DELTA_MAX 1000
>
> "max" and "expected" are not the same thing.
>
>
>> > > #define QOS_L3_OCCUP_EVENT_ID (1 << 0)
>> > > +#define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
>> > > +#define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
>> > > +#define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
>> >
>> > So we have ID values which are built with (1 << X) and then this HW
>> variant in the
>> > middle with 0x3. Of course without any explanation what the heck this
>> stuff is.
>> >
>> > Last review:
>> >
>> > "So this wants a descriptive ID name and a comment."
>> >
>> >
>>
>> /*
>> * MBM Event IDs as defined in SDM section 17.14.6
>> * Event IDs used to program MSRs for reading counters
>> */
>> #define QOS_MBM_TOTAL_EVENT_ID (1 << 1)
>> #define QOS_MBM_LOCAL_EVENT_ID_HW 0x3
>> /*
>> * Perf needs event id to be 1 << x, hence we can't use 0x3 (HW EVENT ID)
>> * for MBM_LOCAL_EVENT we use next 1 << x for MBM_LOCAL_EVENT_ID
>> */
>> #define QOS_MBM_LOCAL_EVENT_ID (1 << 2)
>
> No, perf events do not need to be of the form (1 << X), that was just a
> convention we used in the cqm code before we knew what values the MBM
> events would take - you can change these to be whatever format you want,
> but be sure to make it consistent.
>
> The constants are very much supposed to be programmed into the MSRs,
> take a look at __rmid_read().
>
> I would suggest (as I already did privately) that you change the format
> to be 0x0x for all of these event IDs.
>
>
>> > > @@ static bool intel_cqm_sched_in_event(u32 rmid)
>> > > return false;
>> > > }
>> > >
>> > > +
>> > > +static u32 bw_sum_calc(struct sample *bw_stat, int rmid) {
>> > > + u32 val = 0, i, j, index;
>> > > +
>> > > + if (++bw_stat->fifoout >= mbm_window_size)
>> > > + bw_stat->fifoout = 0;
>> > > + index = bw_stat->fifoout;
>> > > + for (i = 0; i < mbm_window_size - 1; i++) {
>> > > + if (index + i >= mbm_window_size)
>> > > + j = index + i - mbm_window_size;
>> > > + else
>> > > + j = index + i;
>> > > + val += bw_stat->mbmfifo[j];
>> > > + }
>> >
>> > This math wants a explanatory comment.
>> >
>> /*
>> * Slide the window by 1 and calculate the sum of the last
>> * mbm_window_size-1 bandwidth values.
>> * fifoout is the current position of the window.
>> * Increment the fifoout by 1 to slide the window by 1.
>> *
>> * Calcalute the bandwidth from ++fifiout to ( ++fifoout +
>> mbm_window_size -1)
>> * e.g.fifoout =1; Bandwidth1 Bandwidth2 ..... Bandwidthn are the
>> * sliding window values where n is size of the sliding window
>> * bandwidth sum: val = Bandwidth2 + Bandwidth3 + .. Bandwidthn
>> */
>
> Instead of these large comment blocks please comment smaller,
> logically-connected chunks of code, e.g.
>
> /* Slide the window by one */
> if (++bw_stat->fifoout >= mbm_window_size)
> bw_stat->fifoout = 0;
>
> /*
> * Calculate the sum of last mbm_window_size-1 values.
> */
> for (i = 0; i < mbm_window_size - 1; i++) {
> /* Handle wraparound at end of window */
> if (index + i >= mbm_window_size)
> j = index + i - mbm_window_size;
> else
> j = index + i;
>
> val += bw_stat->mbminfo[j];
> }
>
>>
>>
>> > > + return val;
>> > > +}
>> > > +
>> > > +static u32 __mbm_fifo_sum_lastn_out(int rmid, bool is_localbw) {
>> > > + if (is_localbw)
>> > > + return bw_sum_calc(&mbm_local[rmid], rmid);
>> > > + else
>> > > + return bw_sum_calc(&mbm_total[rmid], rmid); }
>> > > +
>> > > +static void __mbm_fifo_in(struct sample *bw_stat, u32 val) {
>> > > + bw_stat->mbmfifo[bw_stat->fifoin] = val;
>> > > + if (++bw_stat->fifoin >= mbm_window_size)
>> >
>> > How does that become greater than mbm_windowsize?
>> >
>>
>> This is fixed by changing >= to ==
>> Added a comment:
>>
>> /*
>> * store current sample's bw value in sliding window at the
>> * index fifoin. Increment fifoin. Check if fifoin has reached
>> * max_window_size. If yes reset it to begining i.e. zero
>> * e.g.
>> * mbm_window_size = 10
>> * mbmfifo is a circular fifo 0 1 2 3 4 5 6 7 8 9 10
>> * ^
>> |
>> * |
>> |
>> * | _ _ _ _ _ _ _ _ _ _|
>> *
>> * So when fifoin becomes 10, then it is reset to zero
>> *
>> */
>
> I'm not sure that this comment adds anything of value that isn't already
> understood by reading the code. I don't think you need a comment for
> this function, it seems pretty straight forward and Thomas' question was
> about the boundary limits of ->fifoin.
>
>> > > + bw_stat->fifoin = 0;
>> > > +}
>> >
>> > > +/*
>> > > + * __rmid_read_mbm checks whether it is LOCAL or GLOBAL MBM event
>> and
>> > > +reads
>> > > + * its MSR counter. Check whether overflow occurred and handles it.
>> > > +Calculates
>> > > + * currenet BW and updates running average.
>> >
>> > currenet? And please get rid of the double spaces
>> >
>> This is fixed now. Here is the updated comment:
>>
>> /*
>> * rmid_read_mbm checks whether it is LOCAL or Total MBM event and reads
>> * its MSR counter. Check whether overflow occured and handles it.
>> Calculates
>> * currenet BW and updates running average.
>> *
>
> ^^^ You've still misspelled current.
>>
>> * Overflow Handling:
>> * if (MSR current value < MSR previous value) it is an
>> * overflow. MSR values are increasing when bandwidth consumption for the
>> thread
>> * is non-zero; When MSR values reaches MAX_COUNTER_VALUE it overflows.
>> After overflow,
>> * MSR current value goes back to zero and starts increasing again at the
>> rate of
>> * bandwidth.
>> *
>
> You don't need to provide a definition of "overflow", most people will
> be familiar with it. What is more important to document is how the
> overflow is handled...
>>
>> * Overflow handling:
>> * Detect an overflow : current read value > last read value
>
> Isn't this inverted? Overflow occurred if current < previous.
>
>>
>> * Overflow correction: if (overflow)
>> * Current value = (MAX_COUNTER_VALUE - prev
>> read value) + current read value
>> * else
>> * Current value = current read value
>> *
>
> Please don't write pseudocode in the comments. Use English prose to
> describe the important parts of the code.
>
>>
>> * Calculation of Current Bandwidth value:
>> * If MSR is read within last 100ms, then then the smaple is ignored;
>> * If the MSR was Read with in last 100ms, why incur an extra overhead
>> * of doing the MSR reads again. Anyway there'll be a negligible change
>> or zero
>> * change in MSR readings in 100ms.
>> *
>> * Bandwidth is calculated as:
>> * memory bandwidth = difference of last two msr counter values/time
>> difference.
>> *
>> * cum_avg = Running Average bandwidth of last 'n' bandwidth values for
>> * the samples that are processed
>> *
>
> Where 'n' is 'mbm_window_size' ? If so, please use 'mbm_window_size',
> not 'n'.
>>
>> * Sliding window is used to save the last 'n' samples. Where,
>> * n = sliding_window_size and results in sliding window duration of 'n'
>> secs.
>
> Hmm... this confuses me a lot. Is 'n' a size or a duration? The two are
> not the same thing.
>
>>
>> * The sliding window size by default set to
>> * MBM_FIFO_SIZE_MIN. User can configure it to the values in the range
>> * (MBM_FIFO_SIZE_MIN,MBM_FIFO_SIZE_MAX). The range for sliding window
>> * is chosen based on a general criteria for monitoring duration. Example
>> * for a short lived application, 10sec monitoring period gives
>> * good characterization of its bandwidth consumption. For an application
>> * that runs for longer duration, 300sec monitoring period gives better
>> * characterization of its bandwidth consumption. Since the running
>> average
>> * calculated for total monitoring period, user gets the most accuracate
>> * average bandwidth for the each monitoring period.
>> *
>> * Scaling:
>> * cum_avg is the raw bandwidth is Bytes/sec.
>> * cum_avg is converted to MB/sec by applying MBM_CONVERSION_FACTOR and
>> * rounded to nearest integer. User interface gets the Bandwidth values
>> in MB/sec.
>> *
>> */
>> > > + *
>> > > + * Overflow Handling:
>> > > + * if (MSR current value < MSR previous value) it is an
>> > > + * overflow. and overflow is handled.
>> >
>> > Wow. That's informative as hell!
>> >
>> Please look at the modified comment above
>> > > + *
>> > > + * Calculation of Current BW value:
>> >
>> > BW == Body Weight?
>> >
>>
>> It is fixed now
>>
>> > > + * If MSR is read within last 100ms, then the value is ignored;
>> > > + * this will suppress small deltas. We don't process MBM samples
>> that
>> > > + are
>> > > + * within 100ms.
>> >
>> > WHY?
>> >
>> Explained in the comment. If mbm_read is called within in 100ms for the
>> same rmid, we donâ??t
>> have to process the sample.
>
> The key piece of information you're missing here is that skipping these
> small deltas is an optimization, because we avoid performing costly
> operations for what would likely be a very minor change in the MBM data,
> right?
>
>
>> > > +{
>> > > + u64 val, tmp, diff_time, cma, bytes, index;
>> > > + bool overflow = false, first = false;
>> > > + ktime_t cur_time;
>> > > + u32 tmp32 = rmid;
>> > > + struct sample *mbm_current;
>> > > + u32 vrmid = topology_physical_package_id(smp_processor_id()) *
>> > > + cqm_max_rmid + rmid;
>> > > +
>> > > + rmid = vrmid;
>> >
>> > From my previous review:
>> >
>> > "This is completely backwards.
>> >
>> > tmp32 = rmid;
>> > rmid = vrmid;
>> > do_stuff(rmid);
>> > rmid = tmp32;
>> > do_other_stuff(rmid);
>> >
>> > Why can't you use vrmid for do_stuff() and leave rmid alone? Just
>> > because it would make the code simpler to read?"
>> >
>> > Still applies.
>> >
>>
>> This is now changed to
>> u64 val, currentmsr, currentbw, diff_time, cma, bytes, index;
>> bool overflow = false, first = false;
>> ktime_t cur_time;
>> u32 tmp32 = rmid, eventid;
>> struct sample *mbm_current;
>> u32 vrmid = rmid_2_index(rmid);
>>
>> rmid = vrmid;
>> cur_time = ktime_get();
>> if (read_mbm_local) {
>> mbm_current = &mbm_local[vrmid];
>> eventid = QOS_MBM_LOCAL_EVENT_ID_HW;
>> wrmsr(MSR_IA32_QM_EVTSEL, QOS_MBM_LOCAL_EVENT_ID_HW,
>> rmid);
>
> You don't need to perform this wrmsr() here because it's taken care of
> in the common code below.
>
>> } else {
>> mbm_current = &mbm_total[vrmid];
>> eventid = QOS_MBM_TOTAL_EVENT_ID;
>> }
>> rmid = tmp32;
>
> Why did you assign rmid to vrmid if you reassign it before it was used?
>
>
>> > > + /* if current msr value < previous msr value , it means
>> overflow */
>> > > + if (val < bytes) {
>> > > + val = MBM_CNTR_MAX - bytes + val;
>> > > + overflow = true;
>> > > + } else
>> > > + val = val - bytes;
>> > > +
>> > > + val = (val * MBM_TIME_DELTA_MAX) / diff_time;
>> > > +
>> > > + if ((diff_time > MBM_TIME_DELTA_MAX) && (!cma))
>> > > + /* First sample */
>> > > + first = true;
>> > > +
>> > > + rmid = vrmid;
>> >
>> > And another time:
>> >
>> > "More obfuscation"
>> >
>>
>> /*
>> * MBM_TIME_DELTA_MAX is picked as per MBM specs. As specified
>> in Intel Platform
>> * Quality of Service Monitoring Implementer's Guide V1, Section
>> 2.7.2. page 21,
>> * overflow can occur maximum once in a second. So latest we
>> want to read the MSR
>> * counters is 1000ms. If it is less than 1000ms we can ignore
>> the sample. Then we
>> * decide since when we should ignore. If the MSR was Read with
>> in last 100ms, why
>> * process the MSR reads again. Anyway there'll be small change
>> or zero change.
>> * So ignoring MSR Reads within 100ms or less is efficient.
>> MBM_TIME_DELTA_MIN
>> * is specified as 100ms as per this guideline.
>> *
>> */
>
> I suspect the document you're referring to above is only available under
> NDA, which makes it unsuitable for mention in the kernel source since a
> large number of people won't have access to it.
>
> Just explain that the way the hardware is designed puts an upper limit
> on how quickly the counter can overflow, which is once per second.
>
>
>> > > +static void __intel_cqm_event_total_bw_count(void *info) {
>> > > + struct rmid_read *rr = info;
>> > > + u64 val;
>> > > +
>> > > + val = __rmid_read_mbm(rr->rmid, false);
>> > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> > > + return;
>> > > + atomic64_add(val, &rr->value);
>> > > +}
>> > > +
>> > > +static void __intel_cqm_event_local_bw_count(void *info) {
>> > > + struct rmid_read *rr = info;
>> > > + u64 val;
>> > > +
>> > > + val = __rmid_read_mbm(rr->rmid, true);
>> > > + if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
>> > > + return;
>> > > + atomic64_add(val, &rr->value);
>> > > +}
>> >
>> > And once more:
>> >
>> > "You're really a fan of copy and paste."
>> >
>>
>> These functions are invoked indirectly. They were written keeping
>> intel_cqm_event_count in mind.
>> Iâ??ll change the arg to struct mbm_read{
>> struct rmid_read
>> *rr;
>> u32 eventid;
>> };
>> Intel_cqm_event_*_bw_count(â?¦.) needs eventid to call for decoding
>
> No, please do not duplicate the rmid_read structure, that is not an
> improvement, we don't need two different structs for reading the read
> data.
>
> Please add the event field to the existing struct rmid_read.
>
>
>> > > @@ -1023,6 +1437,17 @@ static void intel_cqm_event_stop(struct
>> > perf_event *event, int mode)
>> > > } else {
>> > > WARN_ON_ONCE(!state->rmid);
>> > > }
>> > > +
>> > > + if (pmu) {
>> > > + if (pmu->n_active > 0) {
>> >
>> > What's the purpose of this check? In the previous version there was a
>> > WARN_ON(), which made sense. Did it trigger and you decided to "work"
>> > around it?
>> >
>>
>> We actually meant to check if there are active events
>
> I don't follow this answer. Are you saying that the WARN_ON() doesn't
> make sense here?
>
>
>
>>
>> > > +EVENT_ATTR_STR(llc_total_bw.unit, intel_cqm_llc_total_bw_unit,
>> > > +"KB/sec"); EVENT_ATTR_STR(llc_local_bw.unit,
>> > > +intel_cqm_llc_local_bw_unit, "KB/sec"); #endif
>> >
>> > > +static ssize_t
>> > > +sliding_window_size_store(struct device *dev,
>> > > + struct device_attribute *attr,
>> > > + const char *buf, size_t count)
>> > > +{
>> > > + unsigned int bytes;
>> > > + int ret;
>> > > +
>> > > + ret = kstrtouint(buf, 0, &bytes);
>> > > + if (ret)
>> > > + return ret;
>> > > +
>> > > + mutex_lock(&cache_mutex);
>> > > + if (bytes > 0 && bytes <= MBM_FIFO_SIZE_MAX)
>> > > + mbm_window_size = bytes;
>> >
>> > So, it's valid to set the window to X where 0 < X < MBM_FIFO_SIZE_MIN.
>> > What's the actual purpose of MBM_FIFO_SIZE_MIN?
>> >
>> This is changed to
>> if (bytes >= MBM_FIFO_SIZE_MIN && bytes <= MBM_FIFO_SIZE_MAX)
>> mbm_window_size = bytes;
>
> Note that if the user passes a value outside of this range you should be
> returning -EINVAL to indicate that.
>
>
>> > > + pmu->timer_interval = ms_to_ktime(MBM_TIME_DELTA_MAX);
>> > > + per_cpu(mbm_pmu, cpu) = pmu;
>> > > + per_cpu(mbm_pmu_to_free, cpu) = NULL;
>> >
>> > What's the point of this? If there is still something to be free'd its
>> leaked.
>> > Otherwise that's redundant.
>> per_cpu(mbm_pmu_to_free, cpu) = NULL; is removed
>>
>> > > + mbm_hrtimer_init(pmu);
>> > > + }
>> > > + return 0;
>> >
>> > s/0/NOTIFY_OK/ because you return that value directly.
>> >
>> You mean I return the â??return codeâ??
>
> ?
>
> You should be using NOTIFY_OK here so that you follow the notifier API
> convention.
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/