Re: [PATCH 3/3] resource: Introduce resource cache

From: Nadav Amit
Date: Wed Jun 19 2019 - 16:40:31 EST


> On Jun 19, 2019, at 6:00 AM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>> On Jun 17, 2019, at 10:33 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>
>>>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>>
>>>>> For efficient search of resources, as needed to determine the memory
>>>>> type for dax page-faults, introduce a cache of the most recently used
>>>>> top-level resource. Caching the top-level should be safe as ranges in
>>>>> that level do not overlap (unlike those of lower levels).
>>>>>
>>>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>>>> is added, removed or changed, invalidate all the resources. The
>>>>> invalidation takes place when the resource_lock is taken for write,
>>>>> preventing possible races.
>>>>>
>>>>> This patch provides relatively small performance improvements over the
>>>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>>>> resources.
>>>>
>>>>> --- a/kernel/resource.c
>>>>> +++ b/kernel/resource.c
>>>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>>>
>>>>> static DEFINE_RWLOCK(resource_lock);
>>>>>
>>>>> +/*
>>>>> + * Cache of the top-level resource that was most recently use by
>>>>> + * find_next_iomem_res().
>>>>> + */
>>>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>>>
>>>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>>>> bit odd - the latency getting at that rwlock will swamp the benefit of
>>>> isolating the CPUs from each other when accessing resource_cache.
>>>>
>>>> On the other hand, if we have multiple CPUs running
>>>> find_next_iomem_res() concurrently then yes, I see the benefit. Has
>>>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>>>> been quantified?
>>>
>>> No. I am not sure how easy it would be to measure it. On the other hander
>>> the lock is not supposed to be contended (at most cases). At the time I saw
>>> numbers that showed that stores to âexclusive" cache lines can be as
>>> expensive as atomic operations [1]. I am not sure how up to date these
>>> numbers are though. In the benchmark I ran, multiple CPUs ran
>>> find_next_iomem_res() concurrently.
>>>
>>> [1] https://nam04.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsigops.org%2Fs%2Fconferences%2Fsosp%2F2013%2Fpapers%2Fp33-david.pdf&amp;data=02%7C01%7Cnamit%40vmware.com%7Ca2706c5ab2c544283f3b08d6f4b6152b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C636965460234022371&amp;sdata=cD7Nhs4jcJGMD7Lav6D%2BC6E5Sei0DiWhKXL7vz2cVHA%3D&amp;reserved=0
>>
>> Just to clarify - the main motivation behind the per-cpu variable is not
>> about contention, but about the fact the different processes/threads that
>> run concurrently might use different resources.
>
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.

I donât know much about this path and whether it is painful. The path I was
regarding is during page-fault handling:

- handle_mm_fault
- __handle_mm_fault
- do_wp_page
- ext4_dax_fault
- ext4_dax_huge_fault
- dax_iomap_fault
- dax_iomap_pte_fault
- vmf_insert_mixed_mkwrite
- __vm_insert_mixed
- track_pfn_insert
- lookup_memtype
- pat_pagerange_is_ram

But indeed track_pfn_insert() in x86 specific. I guess the differences are
due to the page-table controlling the cachability in x86 (PAT), but I donât
know much about other architectures and whether they have similar
cachability controls in the page-tables.