Re: [PATCH] Documentation/page_tables: Add info about MMU/TLB and Page Faults
From: Fabio M. De Francesco
Date: Thu Aug 03 2023 - 13:08:45 EST
On Friday, 28 July 2023 13:53:01 CEST Fabio M. De Francesco wrote:
> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore, explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally, briefly explain how and why to disable
> the page fault handler.
Hello everyone,
I'd be grateful to anyone willing to comment on or formally review this
patch. So far I've only received comments from Jonathan Cameron on RFC v2
(https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@xxxxxxxxx/#t).
Does anybody else want to contribute?
Thanks in advance,
Fabio
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> Cc: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Linus Walleij <linus.walleij@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@xxxxxxxxx>
> ---
>
> This has been an RFC PATCH in its 2nd version for a week or so. I received
> comments and suggestions on it from Jonathan Cameron (thanks!), and so it has
> now been modified to a real patch. I hope that other people want to add their
> comments on this document in order to further improve and extend it.
>
> The thread with the RFC PATCH v2 and the messages between Jonathan and me
> starts at
> https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@xxxxxxxxx/#r
>
> Documentation/mm/page_tables.rst | 105 +++++++++++++++++++++++++++++++
> 1 file changed, 105 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..6ecfd6d2f1f3 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,108 @@ Page table handling code that wishes to be architecture-neutral, such as the
>  virtual memory manager, will need to be written so that it traverses all of the
>  currently five levels. This style should also be preferred for
>  architecture-specific code, so as to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The `Memory Management Unit (MMU)` is a hardware component that handles virtual
> +to physical address translations. It may use relatively small caches in hardware
> +called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
> +these translations.
> +
> +When a process wants to access a memory location, the CPU provides a virtual
> +address to the MMU, which then uses the TLBs to check access permissions and
> +dirty bits and, if possible, resolves the physical address and allows the
> +requested type of access to the corresponding physical address.
> +
> +If the TLBs do not yet hold a translation for the address, the MMU may use the
> +Page Walk Caches and complete or restart the page table walks until a physical
> +address can finally be resolved. Permissions and dirty bits are checked.
> +
> +In the context of a virtual memory system, like the one used by the Linux
> +kernel, each page of memory has associated permission and dirty bits.
> +
> +The dirty bit for a page is set (i.e., turned on) when the page is written
> +to. This indicates that the page has been modified since it was loaded into
> +memory. It probably needs to be written back to disk, or other cores may need
> +to be informed about previous changes before allowing further operations.
> +
> +If nothing prevents it, eventually the physical memory can be accessed and
> +the requested operation on the physical frame is performed.
> +
> +There are several reasons why the MMU can't find certain translations. It
> +could happen because the process is trying to access a range of memory that
> +it is not allowed to access, or because the data is not present in RAM.
> +
> +When these conditions happen, the MMU triggers page faults, which are types
> +of exceptions that signal the CPU to pause the current process and run a
> +special function to handle the mentioned page faults.
> +
> +One cause of page faults is bugs (or maliciously crafted addresses): a
> +process tries to access a range of memory that it doesn't have permission
> +to access. This could be because the memory is reserved for the kernel or
> +for another process, or because the process is trying to write to a
> +read-only section of memory. When this happens, the kernel sends a
> +Segmentation Fault (SIGSEGV) signal to the process, which usually causes
> +the process to terminate.
> +
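The read-only case described above can be observed from user space. What follows is only a minimal sketch, assuming Linux with glibc; `write_to_readonly_faults()`, `on_segv()` and `fault_env` are names made up for this illustration, not kernel or libc APIs. It installs a SIGSEGV handler, writes to a page mapped with PROT_READ, and reports whether the signal arrived.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper state: where to resume after the fault. */
static sigjmp_buf fault_env;

static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);	/* leave the handler, skip the faulting store */
}

/* Returns 1 if the write raised SIGSEGV, 0 if it did not, -1 on setup error. */
int write_to_readonly_faults(void)
{
	struct sigaction sa, old;
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	volatile int faulted = 0;
	volatile char *p;

	p = mmap(NULL, psz, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_segv;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0) {
		munmap((void *)p, psz);
		return -1;
	}

	if (sigsetjmp(fault_env, 1) == 0)
		p[0] = 1;		/* write to a read-only page: page fault */
	else
		faulted = 1;		/* we came back through the SIGSEGV handler */

	sigaction(SIGSEGV, &old, NULL);	/* restore the previous disposition */
	munmap((void *)p, psz);
	return faulted;
}
```

Without the handler, the default action for SIGSEGV would terminate the process, which is the usual outcome mentioned in the text.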
> +An expected and more common cause of page faults is an optimization called
> +"lazy allocation". This is a technique used by the kernel to improve memory
> +efficiency and reduce footprint. Instead of allocating physical memory to a
> +process as soon as it's requested, the kernel waits until the process
> +actually tries to use the memory. This can save a significant amount of
> +memory in cases where a process requests a large block but only uses a
> +small portion of it.
> +
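Lazy allocation is also visible from user space. The sketch below is a rough illustration, assuming Linux with glibc and that `mincore()` reports residency for anonymous mappings; `resident_pages()` and `lazy_alloc_demo()` are hypothetical helper names. It maps a block of anonymous memory, writes to only the first few pages, and counts resident pages before and after.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the resident pages in [addr, addr + npages * page size); -1 on error. */
long resident_pages(void *addr, size_t npages)
{
	unsigned char vec[256];
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	long n = 0;

	if (npages > sizeof(vec) || mincore(addr, npages * psz, vec) != 0)
		return -1;
	for (size_t i = 0; i < npages; i++)
		n += vec[i] & 1;	/* bit 0 set means the page is resident */
	return n;
}

/*
 * Map npages of anonymous memory, write to the first `touch` pages and
 * report residency before and after. Returns 0 on success, -1 on error.
 */
int lazy_alloc_demo(size_t npages, size_t touch, long *before, long *after)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	char *p;

	if (touch > npages)
		return -1;
	p = mmap(NULL, npages * psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	*before = resident_pages(p, npages);
	for (size_t i = 0; i < touch; i++)
		p[i * psz] = 1;		/* first write to a page faults it in */
	*after = resident_pages(p, npages);

	munmap(p, npages * psz);
	return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical system I would expect `before` to be 0 and `after` to be close to `touch`, but the exact numbers depend on the kernel configuration (huge pages, for instance), so the only safe expectation is that at least the touched pages are resident afterwards.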
> +A related technique is called "Copy-on-Write" (CoW), where the kernel allows
> +multiple processes to share the same physical memory as long as they're only
> +reading from it. If a process tries to write to the shared memory, the write
> +triggers a page fault and the kernel allocates a separate copy of the memory
> +for that process. This allows the kernel to save memory and avoid unnecessary
> +data copying and, by doing so, it reduces latency and memory usage.
> +
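The visible effect of CoW after `fork()` can be sketched as follows. This is an illustration only, assuming Linux with glibc; `cow_demo()` is a hypothetical name. It shows the semantics (the child's write does not reach the parent), not the sharing of physical frames itself, which cannot be observed this simply from user space.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Parent writes 'A' to a private page and forks. The child overwrites the
 * byte with 'B' and reports what it sees through its exit status.
 * Returns 0 on success, -1 on error.
 */
int cow_demo(char *parent_sees, char *child_sees)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	int status;
	pid_t pid;
	char *p;

	p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 'A';

	pid = fork();
	if (pid < 0) {
		munmap(p, psz);
		return -1;
	}
	if (pid == 0) {
		p[0] = 'B';	/* write fault: the child gets its own copy */
		_exit(p[0]);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
		munmap(p, psz);
		return -1;
	}
	*child_sees = (char)WEXITSTATUS(status);
	*parent_sees = p[0];	/* the parent's page is untouched */
	munmap(p, psz);
	return 0;
}
```

After a successful call, `*parent_sees` holds 'A' and `*child_sees` holds 'B': the two processes started from the same page contents and diverged only on the write.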
> +Now, let's see how the Linux kernel handles these page faults:
> +
> +1. For most architectures, `do_page_fault()` is the primary interrupt handler
> +   for page faults. It delegates the actual handling of the page fault to
> +   `handle_mm_fault()`. This function checks the cause of the page fault and
> +   takes the appropriate action, such as loading the required page into
> +   memory, granting the process the necessary permissions, or sending a
> +   SIGSEGV signal to the process.
> +
> +2. In the specific case of the x86 architecture, the interrupt handler is
> +   defined by the `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls
> +   `handle_page_fault()`. This function then calls either
> +   `do_user_addr_fault()` or `do_kern_addr_fault()`, depending on whether
> +   the faulting address lies in the user or the kernel address range.
> +   `do_user_addr_fault()` eventually leads to `handle_mm_fault()`, similar
> +   to the workflow in other architectures.
> +
> +`handle_mm_fault()` (likely) ends up calling `__handle_mm_fault()` to carry
> +out the actual work of allocating the page tables. It works by using
> +several functions to find the entry offsets in the 4 or 5 levels of tables
> +and to allocate the tables it needs. The functions that look up the offsets
> +have names like `*_offset()`, where the "*" stands for pgd, p4d, pud, pmd,
> +or pte; the functions that allocate the corresponding tables, level by level,
> +are named `*_alloc()`, following the same convention of naming them after
> +the corresponding types of tables in the hierarchy.
> +
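To make the idea of per-level offsets concrete, here is a user-space sketch of the index arithmetic. It assumes the x86-64 layout with 4 KiB pages and four levels (9 index bits per level, 12 offset bits); the `DEMO_*` macros, `table_index()` and `page_offset()` are hypothetical names. The kernel's real `*_offset()` helpers return pointers to table entries rather than bare indices, and with five levels a p4d index sits between pgd and pud.

```c
#include <stdint.h>

/* Assumed layout: x86-64, 4 KiB pages, 4 levels, 512 entries per table. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_PMD_SHIFT		21
#define DEMO_PUD_SHIFT		30
#define DEMO_PGD_SHIFT		39
#define DEMO_PTRS_PER_TABLE	512ULL

/* Index into the table whose entries each cover 2^shift bytes. */
unsigned int table_index(uint64_t vaddr, unsigned int shift)
{
	return (unsigned int)((vaddr >> shift) & (DEMO_PTRS_PER_TABLE - 1));
}

/* Byte offset inside the 4 KiB page. */
unsigned int page_offset(uint64_t vaddr)
{
	return (unsigned int)(vaddr & ((1ULL << DEMO_PAGE_SHIFT) - 1));
}
```

For the address 0x00007f1234567abc this gives pgd index 254, pud index 72, pmd index 418, pte index 359 and page offset 0xabc. A walk uses each index in turn to pick the entry that points to the next-level table.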
> +At the very end of the walk with allocations, if it didn't return errors,
> +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
> +`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`,
> +`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons
> +and the kind of fault it's handling.
> +
> +The actual implementation of the workflow is very complex. Its design allows
> +Linux to handle page faults in a way that is tailored to the specific
> +characteristics of each architecture, while still sharing a common overall
> +structure.
> +
> +To conclude this brief, high-level overview of how Linux handles page
> +faults, let's add that the page fault handler can be disabled and enabled
> +with `pagefault_disable()` and `pagefault_enable()`, respectively.
> +
> +Several code paths make use of the latter two functions because they need to
> +disable traps into the page fault handler, mostly to prevent deadlocks.[1]
> +
> +[1] mm/userfaultfd: Replace kmap/kmap_atomic() with kmap_local_page()
> +https://lore.kernel.org/all/20221025220136.2366143-1-ira.weiny@xxxxxxxxx/
> --
> 2.41.0