Re: [RFC 08/10] platform/x86/intel/ifs: Add IFS sysfs interface
From: Dan Williams
Date: Mon Mar 07 2022 - 14:15:33 EST
On Mon, Mar 7, 2022 at 11:09 AM Joseph, Jithu <jithu.joseph@xxxxxxxxx> wrote:
>
>
>
> On 3/7/2022 9:38 AM, Dan Williams wrote:
> > On Fri, Mar 4, 2022 at 12:42 PM Joseph, Jithu <jithu.joseph@xxxxxxxxx> wrote:
> >>
> >>
> >>
>
> >>>> + */
> >>>> +static ssize_t details_show(struct device *dev,
> >>>> + struct device_attribute *attr,
> >>>> + char *buf)
> >>>> +{
> >>>> + unsigned int cpu = dev->id;
> >>>> + int ret;
> >>>> +
> >>>> + if (down_trylock(&ifs_sem))
> >>>> + return -EBUSY;
> >>>
> >>> What is the ifs_sem protecting? This result is immediately invalid
> >>> after the lock is dropped anyway, so why hold it over reading the
> >>> value? You can't prevent 2 threads racing each other here.
> >>
> >> percpu thread running scan_test_worker() will update per_cpu(ifs_state, cpu).scan_details. (before signalling this thread to run, this lock would be acquired)
> >> This is to protect against the scenario where if the percpu thread is running a test and if at the same time a user is querying its status, they would see busy.
> >
> > That begs the question why would userspace be polling this file? Is it
> > because it does not know when a test completes otherwise? How does it
> > know that the result it is seeing is from the test it ran and not some
> > other invocation to start a new test?
>
> I think I should have explained the need for locking at a higher level .
> It is to make sure that only one of the below action happens at any time
>
> 1. single percpu test
> 2. all-cpu test (tests all cores sequentially)
> 3. scan binary reload
> 4. querying the status
>
> (There are h/w reasons for why we restrict to a single IFS test at any time)
> If 4 happens while 1 or 2 is in progress , we return busy. My earlier explanation was trying to explain why we are preventing 4 when 1 or 2 is in progress.
>
> As to the question of why would a user do 4 while 1 or 2 is in progress, there is no reason for him to do that, both the run_test (percpu and global) are blocking.
> But if he issues a test from one shell and uses another shell to query the status (while it is still in progress) he will see busy.
...and what about the race that the semaphore cannot solve of test run
launch racing collection of previous run results?
>
>
> >>>
> >>> Just have a CPU mask as an input parameter and avoid needing to hang
> >>> ifs sysfs attributes underneath /sys/device/system/cpu/ifs.
> >>
> >> The percpu sysfs has the additional function of providing percpu status and details.
> >
> > That still does not answer the question about the input parameter for
> > selecting cores.
>
> see below
>
> >
> >> The global interface is unable to provide the status and details for all the cores in the system. It does give a summary, which
> >> guides the user to the appropriate percpu status/details
> >
> > It does not sound like the global sysfs interface is all that useful,
> > especially as it just expands the window for the global results to
> > become out of sync with the per-cpu results. With a uevent the tool
> > can get called to handle results on per-cpu / per-test,chunk basis
> > atomically. I.e. a udev rule like:
>
> The global/percpu design was chosen after some thought on how a user might want to test his system. And the sysfs files are grouped accordingly.
> To start with he might want to see if everything is fine with his system (all cores). The global interface is for this. He can do that with a single line command
>
> echo 1 > /sys/devices/system/cpu/ifs/run_test
>
> if "/sys/devices/system/cpu/ifs/status" says anything other than pass, he can identify the failed/untested cores using "/sys/devices/system/cpu/ifs/cpu_fail_list" and "cpu_untested_list".
>
> Now to understand why a particular core failed or retest that particular core, the percpu interface is present.
> (So global and percpu views are a simple and convenient way to expose the available functionality and we went with that … I agree that an alternative was to provide an input parameter for selecting cores)
>
> The whole testing can be done with simple interaction with sysfs file … without the need for any elaborate user space tool.
I have not asked for an "elaborate" user tool, sysfs events and
uevents allow for basic shell script tooling. Now, a more polished
tool is certainly welcome, but a sample script is better than nothing.
> Perhaps we can add recommended testing flow in the Documentation
Documentation is good for describing architecture internals and
motivation, a sample user tool / script is good for validating ABI fit
for purpose.