Category: Linux Kernel

Handling the Hardware Cache and the TLB

  • The hardware cache and the Translation Lookaside Buffers (TLBs) are two mechanisms that boost performance considerably by cutting the time needed to access memory and to translate linear addresses

 

Handling the Hardware Cache

  • Hardware cache is addressed by cache lines
    • L1_CACHE_BYTES macro yields the size of a cache line in bytes
      • Pentium 4: 128 bytes; earlier Pentium models: 32 bytes
  • Cache hits can be maximized in two ways (see the sketch after this list)
    1. Place the most frequently used fields of a data structure at low offsets, so they fall within the same cache line
    2. Lay out large data structures so that all cache lines are used uniformly
  • On x86, cache synchronization is performed entirely by the hardware, not by the kernel, so all cache flushes are done in hardware
  • On architectures whose hardware does not keep caches in sync, the kernel must perform cache synchronization itself
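A minimal sketch of rule 1, using a hypothetical counter structure: the hot fields are grouped at the front so a single cache line covers them, and the standard GCC alignment attribute pads the struct to a line boundary (128 here, i.e., L1_CACHE_BYTES on a Pentium 4).

struct hot_counter {
    /* frequently accessed fields first: one cache line covers both */
    unsigned long hits;
    unsigned long misses;
    /* rarely accessed configuration data follows, on later lines */
    char name[64];
    unsigned long flags;
} __attribute__((aligned(128)));  /* assume L1_CACHE_BYTES == 128 */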

 

Handling the TLB

  • Only the kernel decides when a mapping between a linear and a physical address is no longer valid, so the processor cannot synchronize its TLB automatically; TLB flushing is therefore the kernel's job

[Table: architecture-independent TLB-invalidating methods]

  • Processors usually offer only a limited set of TLB-flushing primitives (Intel provides two; a sketch of the second follows below)
    1. All Pentium models flush all TLB entries for non-global pages when cr3 is reloaded
    2. Pentium Pro and later models also provide the invlpg assembly instruction, which invalidates the single TLB entry mapping a given linear address
  • The following macros are built on the two primitives above
    • These macros are also the building blocks used to implement the architecture-independent TLB-flushing methods
[Table 2-12: TLB-invalidating macros for the 80x86 architecture]

The Linux kernel uses invlpg in these macros on Intel x86 processors.

  • The flush_tlb_pgtables method is missing from Table 2-12: on the 80x86 architecture nothing has to be done when a page table is unlinked from its parent table, so the function implementing this method is empty
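For illustration, a single-entry flush can be wrapped in inline assembly roughly as follows; this mirrors what the kernel's one-page flush does on i386, but treat it as a sketch rather than the exact source.

static inline void flush_one_tlb_entry(unsigned long addr)
{
    /* invalidate only the TLB entry that maps addr */
    asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}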

 

TLB Flushing

  • To flush remote TLBs, the CPU running the flushing function sends an interprocessor interrupt to all other CPUs, forcing them to run the TLB-invalidating function as well
  • In general, any process switch implies flushing the TLB entries that refer to the old process's page tables
  • When the kernel assigns a page frame to a User Mode process and stores its physical address into a Page Table entry
    • it must flush any local TLB entry that refers to the corresponding linear address
  • There are some exceptions where the kernel does not flush the TLB
    1. When switching between two User Mode processes that share the same page tables
    2. When switching between a User Mode process and a kernel thread
      • Ch. 9: kernel threads do not have their own page tables; they borrow those of the previously running process
      • A kernel thread never accesses the User Mode address space

 

Lazy TLB Mode

  • When CPUs share a set of page tables, lazy TLB mode defers TLB flushing on CPUs running kernel threads for as long as possible
  • Case: User (non-lazy) -> kernel thread (lazy) -> User with different page tables (non-lazy)
    1. When a CPU starts running a kernel thread, it enters lazy TLB mode
    2. When the CPU switches back to a regular process with a different set of page tables, the cr3 reload makes the hardware flush the TLB automatically
    3. The kernel then puts the CPU back in non-lazy TLB mode
  • Case: kernel thread (lazy) -> User with the same page tables (non-lazy)
    1. The new User Mode process uses the same page tables as the kernel thread being replaced, so no automatic flush occurs
    2. Any deferred TLB invalidation must therefore be performed by the kernel itself
    3. The kernel achieves this by invalidating all non-global TLB entries, i.e., by reloading cr3 (see the __flush_tlb() sketch below)
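A minimal sketch of the cr3-reload flush from step 3 (essentially what __flush_tlb() does on i386): writing cr3 back to itself flushes every TLB entry except those for global pages.

static inline void flush_nonglobal_tlb_entries(void)
{
    unsigned long cr3;
    /* read cr3 and write the same value back: the write flushes
       all TLB entries for non-global pages */
    asm volatile("movl %%cr3, %0\n\t"
                 "movl %0, %%cr3" : "=r" (cr3) : : "memory");
}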

 

Data structures for Lazy TLB Mode

  • cpu_tlbstate: a static array with NR_CPUS elements (NR_CPUS is the maximum number of CPUs, 32 by default)
    • Each element holds a state field and an active_mm field pointing to the memory descriptor of the current process (see the sketch below)
    • state is set to TLBSTATE_LAZY when the CPU enters lazy TLB mode
    • state is set to TLBSTATE_OK when it leaves
  • cpu_vm_mask: a field of the active memory descriptor
    • stores the indices of the CPUs that should receive TLB-flush interrupts, including any CPU that is entering lazy TLB mode
    • when a CPU wants to invalidate the TLB entries of every CPU using a given set of page tables, it sends an interprocessor interrupt to all CPUs listed in this field of the memory descriptor
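A sketch of the per-CPU bookkeeping, modeled on the i386 definitions (field names match the kernel; the layout is simplified):

struct mm_struct;                 /* the memory descriptor type (Ch. 9) */

struct tlb_state {
    struct mm_struct *active_mm;  /* memory descriptor whose page tables this CPU uses */
    int state;                    /* TLBSTATE_OK or TLBSTATE_LAZY */
};

struct tlb_state cpu_tlbstate[NR_CPUS];  /* one element per CPU */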

 

  • When a CPU receives an interprocessor interrupt for TLB flushing, it first verifies that the flush affects the set of page tables of its current process
  • It then checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY
    • If so, the kernel refuses to invalidate the TLB entries and instead removes the CPU index from the cpu_vm_mask field of the memory descriptor. This has two consequences:
      1. As long as the CPU remains in lazy TLB mode, it will not receive further interprocessor interrupts related to TLB flushing
      2. If the CPU switches to another process that is using the same set of page tables as the kernel thread that is being replaced, the kernel invokes __flush_tlb() to invalidate all non-global TLB entries of the CPU

Kernel Page Tables

  • Kernel maintains a set of page tables for its own use
    • It is rooted at the “master kernel Page Global Directory”
  • The highest entries in the master kernel Page Global Directory are used as templates of PGDs of all other processes
  • At this point the CPU is still in real mode and paging is not enabled
  • The kernel initializes its own address space in two steps
    1. Create a limited address space to store the kernel and data structures
      • Contains data and code segments, initial page tables, 128 kb dynamic data structures
    2. Takes all existing RAM and sets up all page tables (PAGE_OFFSET and above)

Provisional kernel Page Tables (Step 1: Still Real Mode no PAE)

  • Note: assume the kernel fits in 8 MB of RAM, mapped by two Page Tables, and that these 8 MB must be easily addressable both before and after the paging unit is enabled
  • The linear address space is set up as follows
    1. The provisional Page Global Directory is initialized statically at compile time
      • It is contained in the swapper_pg_dir variable
    2. The provisional Page Tables are initialized by startup_32()
      • They are stored starting at pg0, right after _end (the end of the kernel's uninitialized data segment)
  • The mapping is achieved by filling all entries of swapper_pg_dir with zeros, except for entries 0, 1, 768, and 769 (two pairs covering the two address ranges below); a C sketch of these four entries follows the assembly listing
    • The following flags are set in those entries: Present, Read/Write, and User/Supervisor
    • The following flags are cleared: Accessed, Dirty, PCD, PWT, and Page Size
  • Recall that before the paging unit is enabled (as in real mode), linear addresses coincide with physical addresses
    • Therefore the mappings to be made are
      • linear 0x00000000–0x007fffff → physical 0x00000000–0x007fffff (identity mapping, needed while paging is being turned on)
      • linear 0xc0000000–0xc07fffff → physical 0x00000000–0x007fffff (the kernel-space view of the same 8 MB)
  • First the two mappings are created; then all remaining page table entries are initialized to zero
  • Recall this is done by the startup_32() function
    • This function also enables the paging unit
    • It does so by loading the physical address of swapper_pg_dir into cr3 and setting the PG (paging) flag in cr0:
movl $swapper_pg_dir-0xc0000000,%eax
movl %eax,%cr3   /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0   /* ..and set paging (PG) bit */
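For illustration, the four non-zero PGD entries could be written in C roughly as follows (a sketch, assuming pg0 holds the two provisional Page Tables; 0x007 == Present | Read/Write | User/Supervisor):

extern unsigned long pg0[];      /* the two provisional Page Tables, right after _end */
unsigned long pt = (unsigned long) pg0 - 0xc0000000;   /* their physical address */

swapper_pg_dir[0] = swapper_pg_dir[768] = __pgd(pt | 0x007);
swapper_pg_dir[1] = swapper_pg_dir[769] = __pgd((pt + 4096) | 0x007);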

The highest 128 MB of linear addresses are left available for several kinds of mappings (see the sections “Fix-Mapped Linear Addresses” later in this chapter and “Linear Addresses of Noncontiguous Memory Areas” in Chapter 8). The kernel address space left for mapping the RAM is thus 1 GB – 128 MB = 896 MB.

Final Kernel Page Table when RAM is < 896 MB (Step 2)

  • The final mapping provided by the kernel page tables must transform linear addresses starting from 0xc0000000 into physical addresses starting from 0
    • Two macros perform this conversion (their definitions are sketched at the end of this subsection)
      • __pa: converts a linear address at PAGE_OFFSET or above into the corresponding physical address
      • __va: converts a physical address into the corresponding linear address above PAGE_OFFSET
  • Next, the remaining entries are initialized (the master kernel Page Global Directory is still stored in swapper_pg_dir); the paging_init() function does the following
    1. Invokes pagetable_init() to set up the Page Table entries properly
      • What pagetable_init() does depends on the system configuration: the amount of RAM and the CPU model
      • This case: less than 896 MB of RAM
    2. Writes the physical address of swapper_pg_dir into the cr3 control register
    3. If the CPU supports PAE and the kernel is compiled with PAE support, sets the PAE flag in the cr4 control register
    4. Invokes __flush_tlb_all() to invalidate all TLB entries
  • For this case, since PAE is not needed, swapper_pg_dir can be initialized as follows
pgd = swapper_pg_dir + pgd_index(PAGE_OFFSET);  /* entry 768 */
phys_addr = 0x00000000;                         /* start from physical address 0 */
while (phys_addr < (max_low_pfn * PAGE_SIZE)) { /* until the end of low memory */
    pmd = one_md_table_init(pgd);               /* returns pgd itself (no PAE) */
    set_pmd(pmd, __pmd(phys_addr | pgprot_val(__pgprot(0x1e3))));
    /* 0x1e3 == Present, Accessed, Dirty, Read/Write, Page Size, Global */
    phys_addr += PTRS_PER_PTE * PAGE_SIZE;      /* 0x400000: 4 MB per entry */
    /* PTRS_PER_x == number of entries at paging level x */
    ++pgd;
}
  • The identity mapping of the first megabytes of physical memory (set up by startup_32()) is required to complete the initialization phase of the kernel
  • When this mapping is no longer necessary, the kernel clears the corresponding page table entries by invoking the zap_low_mappings() function
  • Note: fix-mapped linear addresses have not been discussed yet (see the later section)
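The two conversion macros mentioned above reduce to simple offset arithmetic; on x86 their definitions are essentially:

#define __pa(x)  ((unsigned long) (x) - PAGE_OFFSET)            /* linear -> physical */
#define __va(x)  ((void *) ((unsigned long) (x) + PAGE_OFFSET)) /* physical -> linear */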

Final Kernel Page Table when RAM is between 896 MB and 4GB

  • RAM cannot be mapped entirely into the kernel linear address space
  • The solution is to map 896 MB into the kernel address space as before
    • When the kernel needs to address RAM beyond those 896 MB, some other linear address interval must be temporarily remapped onto the requested portion of RAM
      • This means changing the value of some page table entries
      • (dynamic remapping, Ch. 8)
  • The 896 MB that are mapped are initialized in the same way as above

Final Kernel Page Table when RAM is > 4GB

  • This implies three things
    1. The CPU supports PAE
    2. More than 4 GB of RAM is installed
    3. The kernel is compiled with PAE support
  • With PAE this becomes a three-level paging problem
    • And instead of relying on dynamic remapping for the lowest 896 MB, the kernel maps them directly:
pgd_idx = pgd_index(PAGE_OFFSET); /* 3 */
for (i = 0; i < pgd_idx; i++)
    set_pgd(swapper_pg_dir + i, __pgd(__pa(empty_zero_page) + 0x001));
    /* 0x001 == Present */
pgd = swapper_pg_dir + pgd_idx;
phys_addr = 0x00000000;
for (; i < PTRS_PER_PGD; ++i, ++pgd) {
    pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
    set_pgd(pgd, __pgd(__pa(pmd) | 0x001)); /* 0x001 == Present */
    if (phys_addr < max_low_pfn * PAGE_SIZE)
        for (j = 0; j < PTRS_PER_PMD /* 512 */
                && phys_addr < max_low_pfn * PAGE_SIZE; ++j) {
            set_pmd(pmd + j, __pmd(phys_addr |      /* j-th PMD entry */
                    pgprot_val(__pgprot(0x1e3))));
            /* 0x1e3 == Present, Accessed, Dirty, Read/Write, Page Size, Global */
            phys_addr += PTRS_PER_PTE * PAGE_SIZE;  /* 0x200000: 2 MB per entry */
        }
}
swapper_pg_dir[0] = swapper_pg_dir[pgd_idx];

  • The first three entries of the Page Global Directory, which correspond to the user linear address space, are initialized with the address of an empty page (empty_zero_page)
  • The fourth entry is initialized with the address of a Page Middle Directory allocated by invoking alloc_bootmem_low_pages()
  • The first 448 entries of that Page Middle Directory are filled with the physical addresses of the first 896 MB of RAM
    • (there are 512 entries, but the last 64 are reserved for noncontiguous memory allocation; see the section “Noncontiguous Memory Area Management” in Chapter 8)
  • PAE also supports large 2 MB pages and global pages. Whenever possible, Linux uses large pages to reduce the number of Page Tables, hence the 2 MB entries above
  • The fourth Page Global Directory entry is then copied into the first entry, so as to mirror the mapping of the low physical memory in the first 896 MB of the linear address space

    • This mirroring is required in order to complete the initialization of SMP (symmetric multiprocessing) systems: when it is no longer necessary, the kernel clears the corresponding page table entries by invoking the zap_low_mappings() function, as in the previous cases

Fix-Mapped Linear Addresses

  • The initial part of the fourth gigabyte of linear addresses maps the physical memory of the system (the same physical addresses that start at 0x00000000)
  • At least 128 MB of linear addresses are always left available for noncontiguous memory allocation and fix-mapped linear addresses
    • Noncontiguous memory allocation is a way to dynamically allocate and deallocate pages of memory (Ch. 8)
    • Fix-mapped linear address: a constant linear address (e.g., 0xffffc000) that can be made to map any arbitrary physical address; each fix-mapped linear address maps one page frame
      • Fix-mapped linear addresses are slightly more efficient than variable pointers: dereferencing a variable pointer requires one more memory access than dereferencing a constant address
      • Also, checking that a variable pointer is non-NULL before dereferencing it is good practice; no such check is ever needed for a constant linear address
  • Fix-mapped linear addresses are represented by the enum fixed_addresses data structure (listed below)
    • Each fix-mapped linear address corresponds to an integer index in this enumeration
  • The linear address corresponding to an index is computed by the fix_to_virt() inline function
    • Note: UL = unsigned long
    • Inline functions are expanded at their call sites during compilation, instead of being called in another object file
    • The computation places fix-mapped linear addresses at the very end of the fourth gigabyte of linear addresses
inline unsigned long fix_to_virt(const unsigned int idx)
{
    if (idx >= __end_of_fixed_addresses)
        __this_fixmap_does_not_exist();
    return (0xfffff000UL - (idx << PAGE_SHIFT));
}
  • For example, in the case of FIX_IO_APIC_BASE_0, the argument passed to fix_to_virt() is the constant 3
  • Because the argument is a compile-time constant, the if statement can be evaluated at compile time: if the index is greater than or equal to __end_of_fixed_addresses, the call to the undefined __this_fixmap_does_not_exist() makes the build fail at link time
  • The compiler computes 0xfffff000 - (3 << PAGE_SHIFT) and replaces the fix_to_virt() call with the constant linear address 0xffffc000
enum fixed_addresses {
    FIX_HOLE,
    FIX_VSYSCALL,
    FIX_APIC_BASE,
    FIX_IO_APIC_BASE_0,
    [...]
    __end_of_fixed_addresses
};
  • set_fixmap(idx, phys) and set_fixmap_nocache(idx, phys) are two macros that associate the fix-mapped linear address for index idx with the physical address phys
    • The second macro also sets the PCD flag of the entry, disabling the hardware cache when accessing the page frame
  • clear_fixmap(idx) removes the link between a fix-mapped linear address and its physical address
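A hypothetical usage sketch (the physical address apic_phys is invented for illustration):

unsigned long apic_phys = 0xfee00000;            /* hypothetical page frame address */
set_fixmap(FIX_APIC_BASE, apic_phys);            /* link the fix-mapped address to it */
char *p = (char *) fix_to_virt(FIX_APIC_BASE);   /* constant, resolved at compile time */
/* ... access the page frame through p ... */
clear_fixmap(FIX_APIC_BASE);                     /* drop the mapping when done */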

Process Page Tables

  • Linear addresses are divided into two parts
    1. User/Kernel Mode: these addresses can be addressed by processes in either User or Kernel Mode. They range from 0x00000000 to 0xbfffffff
      • The kernel may need to access User Mode linear addresses to retrieve or store data
    2. Kernel Mode only: these addresses can be addressed only by processes in Kernel Mode. They range from 0xc0000000 to 0xffffffff
  • The PAGE_OFFSET macro has the value 0xc0000000 and marks the start of the kernel part of the linear address space

 

  • The first entries of the Page Global Directory map linear addresses lower than 0xc0000000 (a quick arithmetic check of the entry counts follows this list)
    • PAE disabled: the first 768 of 1,024 entries (768 × 4 MB = 3 GB, the User Mode portion under the 3:1 user/kernel split)
      • A Page Table entry occupies 4 bytes, so the 1,024 entries covering 4 MB of linear address space fit in one page; covering the 3 GB user space therefore takes 768 Page Tables, one per PGD entry
    • PAE enabled: the first 3 of 4 entries (the Page Global Directory acts as the PDPT). Only 3 because of the same 3:1 user/kernel split
      • All three paging levels are present: the Page Global Directory contains only four entries, each covering 1 GB of linear address space, and it is indexed by the top two bits of the linear address
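A quick arithmetic check of those entry counts, as plain C:

#include <stdio.h>

int main(void)
{
    unsigned long page_offset = 0xc0000000UL;
    /* no PAE: a PGD entry covers 4 MB, i.e., linear address bits 31-22 */
    printf("first kernel PGD entry (no PAE): %lu\n", page_offset >> 22);  /* 768 */
    /* PAE: a PDPT entry covers 1 GB, i.e., linear address bits 31-30 */
    printf("first kernel PDPT entry (PAE):   %lu\n", page_offset >> 30);  /* 3 */
    return 0;
}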

Page Table Handling (Bunches of Func/Macro)

For a more detailed description please see the book.

 

Page Table Entry Macros

  • pte_t, pmd_t, pud_t, pgd_t: describe the format of an entry at each paging level; 64 bits wide with PAE, 32 bits without (see the sketch after this list)
    • These are often plain unsigned integers, but they are declared as structs so the compiler can prevent them from being used inappropriately (e.g., mixing up levels)
    • With PAE, addressing more than 4 GB of RAM requires wider physical addresses (36 bits on the address bus), which is why the entries grow to 64 bits
  • pgprot_t: represents the protection flags associated with a single entry; 64 bits with PAE, 32 bits without
  • __pte, __pmd, __pud, __pgd, __pgprot: these five macros convert an unsigned integer into the corresponding entry type
  • pte_val, pmd_val, pud_val, pgd_val, pgprot_val: convert an entry back into an unsigned integer
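A sketch of the no-PAE definitions, showing why the struct wrappers exist (modeled on the i386 headers; the PAE versions carry 64-bit values instead):

typedef struct { unsigned long pte_low; } pte_t;     /* Page Table entry */
typedef struct { unsigned long pgd; }     pgd_t;     /* Page Global Directory entry */
typedef struct { unsigned long pgprot; }  pgprot_t;  /* protection flags */

#define pte_val(x)  ((x).pte_low)
#define pgd_val(x)  ((x).pgd)
#define __pte(x)    ((pte_t) { (x) })   /* the wrapper makes "pte = pgd" a compile error */
#define __pgd(x)    ((pgd_t) { (x) })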

Page Table Entry Macros and Functions for Read/Modify

  • pte_none, pmd_none, pud_none, and pgd_none: yield 1 if the entry is 0, otherwise 0
  • pte_clear, pmd_clear, pud_clear, and pgd_clear: clear the entry in the corresponding page table, thereby stopping any process from using the linear addresses mapped by it
  • ptep_get_and_clear(): returns the previous value of a Page Table entry and clears it
  • set_pte, set_pmd, set_pud, and set_pgd: write a value into a specific page table entry
    • Note: when PAE is enabled, set_pte_atomic ensures the 64-bit entry is written atomically
  • pte_same(a, b): returns 1 if a and b refer to the same page and have the same access privileges, 0 otherwise
  • pmd_large(e): returns 1 if the Page Middle Directory entry e refers to a large page (2 MB or 4 MB), 0 otherwise
[Figure: how the paging levels are used with and without PAE]

The pmd level is real only with 32-bit PAE; with 32-bit non-PAE both the pud and pmd levels are folded away. pmd_bad is therefore the only *_bad macro whose result can vary between 0 and 1: the Page Middle Directory is the lowest directory level that points to Page Tables, so, depending on the setup, it is the only one whose entries can actually be bad.

  • pmd_bad: yields 1 if the entry points to a bad Page Table, 0 otherwise
    • What makes a Page Table bad?
      1. The page is not in main memory (Present flag cleared)
      2. The page allows only Read access (Read/Write flag cleared)
      3. Either Accessed or Dirty is cleared (note: Linux forces these flags to be set for every existing Page Table, so this case should never happen)
  • pud_bad, pgd_bad: always yield 0
  • pte_bad: is not defined, because it is legal for a Page Table entry to refer to a page that is not present in main memory, not writable, or not accessible at all
  • pte_present: yields 1 if either the Present flag or the Page Size flag of a Page Table entry is 1, 0 otherwise (a bit-level sketch follows this list)
    • The Page Size flag in a Page Table entry has no meaning for the paging unit; the kernel, however, sets Present to 0 and Page Size to 1 for pages that are present in main memory but have no read, write, or execute privileges
      • Access to such pages triggers a Page Fault exception because Present is cleared, and the kernel can detect that the fault is not due to a missing page by checking the value of Page Size
  • pmd_present: yields 1 if the Present flag of the corresponding entry is equal to 1, i.e., the Page Table is in main memory
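A bit-level sketch of the predicates described above (the flag values are the standard x86 ones; in the real i386 headers the reused Page Size bit of a Page Table entry is called _PAGE_PROTNONE):

#define _PAGE_PRESENT  0x001   /* Present flag: bit 0 */
#define _PAGE_PSE      0x080   /* Page Size flag: bit 7, reused in Page Table entries */

#define pte_none(x)     (!pte_val(x))                              /* entry is all zeros */
#define pte_present(x)  (pte_val(x) & (_PAGE_PRESENT | _PAGE_PSE)) /* either flag set */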

Creating and Deleting Page Directories

  • In two- or three-level paging, the Page Upper Directory (and, without PAE, the Page Middle Directory too) is folded into a single entry of the level above
  • Two-level paging (32-bit, no PAE)
    • Per the figure above, the only two real levels are the Page Global Directory (the hardware Page Directory) and the Page Table
    • Recall from the discussion of paging in Linux that the Upper/Middle directories collapse to single entries inside the PGD, so they are trivially allocated along with it
    • Page Table allocation is more complex
      • The Page Table may not exist yet: allocate a new page frame, fill it with zeros, and then add the entry (see the sketch after this list)
  • Three-level paging (32-bit, PAE enabled)
    • The Page Global Directory becomes the PDPT, and the Page Middle Directories that map the User Mode address space are allocated together with it (see pgd_alloc below)
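A minimal sketch of the allocate-on-demand step for a missing Page Table, assuming no PAE (get_zeroed_page() and _PAGE_TABLE are real kernel names; error handling is reduced to a NULL return):

if (pmd_none(*pmd)) {
    unsigned long page = get_zeroed_page(GFP_KERNEL);  /* new zero-filled page frame */
    if (!page)
        return NULL;
    /* link the new Page Table into the Page Middle Directory entry */
    set_pmd(pmd, __pmd(__pa(page) | _PAGE_TABLE));
}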

 

The following are some functions and macros whose definitions will be important later; for now, just know they exist.

Page Flag Reading Functions


pte_user();   // Reads the User/Supervisor flag

pte_read();   // Reads the User/Supervisor flag (pages on the 80x86 processor cannot be protected against reading)

pte_write();  // Reads the Read/Write flag

pte_exec();   // Reads the User/Supervisor flag (pages on the 80x86 processor cannot be protected against code execution)

pte_dirty();  // Reads the Dirty flag

pte_young();  // Reads the Accessed flag

pte_file();   // Reads the Dirty flag (when the Present flag is cleared and the Dirty flag is set, the page belongs to a non-linear disk file mapping; see Chapter 16)

Page Flag Setting Functions


mk_pte_huge();     // Sets the Page Size and Present flags of a Page Table entry

pte_wrprotect();   // Clears the Read/Write flag

pte_rdprotect();   // Clears the User/Supervisor flag

pte_exprotect();   // Clears the User/Supervisor flag

pte_mkwrite();     // Sets the Read/Write flag

pte_mkread();      // Sets the User/Supervisor flag

pte_mkexec();      // Sets the User/Supervisor flag

pte_mkclean();     // Clears the Dirty flag

pte_mkdirty();     // Sets the Dirty flag

pte_mkold();       // Clears the Accessed flag (makes the page old)

pte_mkyoung();     // Sets the Accessed flag (makes the page young)

pte_modify(p,v);   // Sets all access rights in a Page Table entry p to a specified value v

ptep_set_wrprotect();    // Like pte_wrprotect(), but acts on a pointer to a Page Table entry

ptep_set_access_flags(); // If the Dirty flag is set, sets the page's access rights to a specified value and invokes flush_tlb_page() (see the section "Translation Lookaside Buffers (TLB)" later in this chapter)

ptep_mkdirty();          // Like pte_mkdirty(), but acts on a pointer to a Page Table entry

ptep_test_and_clear_dirty(); // Like pte_mkclean(), but acts on a pointer to a Page Table entry and returns the old value of the flag

ptep_test_and_clear_young(); // Like pte_mkold(), but acts on a pointer to a Page Table entry and returns the old value of the flag
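A hypothetical snippet showing how these functions compose (ptep is assumed to point at a valid Page Table entry):

pte_t pte = *ptep;          /* read the entry */
pte = pte_wrprotect(pte);   /* clear Read/Write: future writes will fault */
pte = pte_mkold(pte);       /* clear Accessed: the page now looks unused */
set_pte(ptep, pte);         /* write the modified entry back */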

Macros Acting on Page Table Entries

  • These macros either combine a page address and a group of protection flags into a page table entry
  • Or do the reverse and extract the page address from a page table entry
pgd_index(addr); //Yields the index (relative position) of the entry in the Page Global Directory that maps the linear address addr.

pgd_offset(mm, addr); //Receives as parameters the address of a memory descriptor mm (see Chapter 9) and a linear address addr. The macro yields the linear address of the entry in a Page Global Directory that corresponds to the address addr; the Page Global Directory is found through a pointer within the memory descriptor.

pgd_offset_k(addr); //Yields the linear address of the entry in the master kernel Page Global Directory that corresponds to the address addr (see the later section “ Kernel Page Tables”).

pgd_page(pgd); //Yields the page descriptor address of the page frame containing the Page Upper Directory referred to by the Page Global Directory entry pgd. In a two- or three-level paging system, this macro is equivalent to pud_page() applied to the folded Page Upper Directory entry.

pud_offset(pgd, addr); //Receives as parameters a pointer pgd to a Page Global Directory entry and a linear address addr. The macro yields the linear address of the entry in a Page Upper Directory that corresponds to addr. In a two- or three-level paging system, this macro yields pgd, the address of a Page Global Directory entry.

pud_page(pud); //Yields the linear address of the Page Middle Directory referred to by the Page Upper Directory entry pud. In a two-level paging system, this macro is equivalent to pmd_page() applied to the folded Page Middle Directory entry.

pmd_index(addr); //Yields the index (relative position) of the entry in the Page Middle Directory that maps the linear address addr.

pmd_offset(pud, addr); //Receives as parameters a pointer pud to a Page Upper Directory entry and a linear address addr. The macro yields the address of the entry in a Page Middle Directory that corresponds to addr. In a two-level paging system, it yields pud, the address of a Page Global Directory entry.

pmd_page(pmd); //Yields the page descriptor address of the Page Table referred to by the Page Middle Directory entry pmd. In a two-level paging system, pmd is actually an entry of a Page Global Directory.

mk_pte(p,prot); //Receives as parameters the address of a page descriptor p and a group of access rights prot, and builds the corresponding Page Table entry.

pte_index(addr); //Yields the index (relative position) of the entry in the Page Table that maps the linear address addr.

pte_offset_kernel(dir, addr); //Yields the linear address of the Page Table that corresponds to the linear address addr mapped by the Page Middle Directory dir. Used only on the master kernel page tables (see the later section “Kernel Page Tables”).

pte_offset_map(dir, addr); //Receives as parameters a pointer dir to a Page Middle Directory entry and a linear address addr; it yields the linear address of the entry in the Page Table that corresponds to the linear address addr. If the Page Table is kept in high memory, the kernel establishes a temporary kernel mapping (see the section “Kernel Mappings of High-Memory Page Frames” in Chapter 8), to be released by means of pte_unmap. The macros pte_offset_map_nested and pte_unmap_nested are identical, but they use a different temporary kernel mapping.

pte_page(x); //Returns the page descriptor address of the page referenced by the Page Table entry x.

pte_to_pgoff(pte); //Extracts from the content pte of a Page Table entry the file offset corresponding to a page belonging to a non-linear file memory mapping (see the section “Non-Linear Memory Mappings” in Chapter 16).

pgoff_to_pte(offset); //Sets up the content of a Page Table entry for a page belonging to a non-linear file memory mapping.
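The offset macros above are typically chained to walk down to a Page Table entry; a sketch of the canonical walk (the pgd_none()/pmd_none() error checks are omitted for brevity):

/* locate the Page Table entry mapping linear address addr in process mm */
pgd_t *pgd = pgd_offset(mm, addr);
pud_t *pud = pud_offset(pgd, addr);      /* folds to pgd on two/three-level systems */
pmd_t *pmd = pmd_offset(pud, addr);
pte_t *pte = pte_offset_map(pmd, addr);  /* may create a temporary kernel mapping */
/* ... inspect or modify *pte ... */
pte_unmap(pte);                          /* release the temporary mapping */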

Page Allocation Functions


pgd_alloc(mm); //Allocates a new Page Global Directory; if PAE is enabled, it also allocates the three children Page Middle Directories that map the User Mode linear addresses. The argument mm (the address of a memory descriptor) is ignored on the 80x86 architecture.

pgd_free(pgd); //Releases the Page Global Directory at address pgd; if PAE is enabled, it also releases the three Page Middle Directories that map the User Mode linear addresses.

pud_alloc(mm, pgd, addr); //In a two- or three-level paging system, this function does nothing: it simply returns the linear address of the Page Global Directory entry pgd.

pud_free(x); //In a two- or three-level paging system, this macro does nothing.

pmd_alloc(mm, pud, addr); //Defined so generic three-level paging systems can allocate a new Page Middle Directory for the linear address addr. If PAE is not enabled, the function simply returns the input parameter pud, that is, the address of the entry in the Page Global Directory. If PAE is enabled, the function returns the linear address of the Page Middle Directory entry that maps the linear address addr. The argument mm is ignored.

pmd_free(x); //Does nothing, because Page Middle Directories are allocated and deallocated together with their parent Page Global Directory.

pte_alloc_map(mm, pmd, addr); //Receives as parameters the address of a Page Middle Directory entry pmd and a linear address addr, and returns the address of the Page Table entry corresponding to addr. If the Page Middle Directory entry is null, the function allocates a new Page Table by invoking pte_alloc_one( ). If a new Page Table is allocated, the entry corresponding to addr is initialized and the User/Supervisor flag is set. If the Page Table is kept in high memory, the kernel establishes a temporary kernel mapping (see the section “Kernel Mappings of High-Memory Page Frames” in Chapter 8), to be released by pte_unmap.

pte_alloc_kernel(mm, pmd, addr); //If the Page Middle Directory entry pmd associated with the address addr is null, the function allocates a new Page Table. It then returns the linear address of the Page Table entry associated with addr. Used only for master kernel page tables (see the later section “Kernel Page Tables”).

pte_free(pte); //Releases the Page Table associated with the pte page descriptor pointer.

pte_free_kernel(pte); //Equivalent to pte_free(), but used for master kernel page tables.

clear_page_range(mmu,start,end); //Clears the contents of the page tables of a process from linear address start to end by iteratively releasing its Page Tables and clearing the Page Middle Directory entries.

The Linear Address Fields

For more detailed descriptions please refer to the book.

  • Macros Used for Page Table Handling
    • PAGE_SHIFT
      • Length of the offset field: 12 bits for 4 KB pages
    • PMD_SHIFT
      • Logarithm of the size of the area a Page Middle Directory entry can map
      • Equals the total length of the offset and Table fields of a linear address: 12 + 10 = 22 bits with PAE off
      • 12 + 9 = 21 bits with PAE on
    • PMD_SIZE
      • Size of the area mapped by a single Page Middle Directory entry: 2^PMD_SHIFT, i.e., 4 MB or 2 MB
    • PMD_MASK
      • Masks out the offset and Table fields of a linear address
      • 0xffc00000 with PAE disabled
      • 0xffe00000 with PAE enabled
    • LARGE_PAGE_SIZE
      • Used when the system adopts large pages (one paging level is skipped)
      • Size of a large page = PMD_SIZE = 2^PMD_SHIFT (not 2*PMD_SHIFT)
    • LARGE_PAGE_MASK
      • Same idea as PMD_MASK, but masking the larger offset field of a large page
    • PUD_SHIFT (always equal to PMD_SHIFT on the 80x86)
      • Logarithm of the size of the area a Page Upper Directory entry can map
    • PUD_SIZE (always 4 MB or 2 MB: 2^22 or 2^21)
      • Size of the area mapped by a single entry in the Page Upper Directory
    • PGDIR_SHIFT
      • Logarithm of the size of the area a Page Global Directory entry can map
      • 22 with PAE disabled and two-level paging: 10 Table bits + 12 offset bits (4 MB per entry)
      • 30 with PAE enabled and three-level paging: 9 Table bits + 9 Middle Directory bits + 12 offset bits (1 GB per entry)
    • PGDIR_SIZE
      • Size of the area mapped by a single Page Global Directory entry
    • PGDIR_MASK
      • 0xffc00000 with PAE disabled
      • 0xc0000000 with PAE enabled
    • PTRS_PER_PTE, PTRS_PER_PMD, PTRS_PER_PUD, and PTRS_PER_PGD
      • The number of entries in the Page Table, Page Middle Directory, Page Upper Directory, and Page Global Directory. They yield the values 1,024, 1, 1, and 1,024, respectively, when PAE is disabled; and the values 512, 512, 1, and 4, respectively, when PAE is enabled
        • Recall: with PAE each entry is 8 bytes instead of 4, so a 4 KB page holds 512 entries, and the Page Global Directory shrinks to the four-entry PDPT
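Written out as a sketch, the no-PAE values of these macros are (matching the i386 headers):

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1UL << PAGE_SHIFT)    /* 4 KB */
#define PMD_SHIFT     22
#define PMD_SIZE      (1UL << PMD_SHIFT)     /* 4 MB */
#define PMD_MASK      (~(PMD_SIZE - 1))      /* 0xffc00000 */
#define PGDIR_SHIFT   22
#define PGDIR_SIZE    (1UL << PGDIR_SHIFT)   /* 4 MB */
#define PGDIR_MASK    (~(PGDIR_SIZE - 1))    /* 0xffc00000 */
#define PTRS_PER_PTE  1024
#define PTRS_PER_PGD  1024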

Paging in Linux

[Figure: the four-level paging scheme used by Linux (Page Global Directory -> Page Upper Directory -> Page Middle Directory -> Page Table)]

  • 32-bit architectures
    • 2 paging levels (no PAE)
      • The Page Upper Directory and Page Middle Directory are not really used: the number of bits reserved for them is zero, effectively eliminating those two levels
      • Their positions in the pointer chain are kept, however, so the same code can also run on 64-bit systems without changes
        • This is done by giving each of the two directories a single entry that maps onto the corresponding Page Global Directory entry as usual (see the sketch after this list)
    • 3 paging levels (PAE)
      • Page Global Directory == the PDPT (Page Directory Pointer Table)
      • Page Upper Directory: not used (folded)
      • Page Middle Directory == the hardware Page Directory
      • The Page Table remains the same
  • 64 bit arch
    • 4 paging levels
      • Page Global Directory->Page Upper Directory->Page Middle Directory->Page Table
  • Each process has its own Page Global Directory and its own set of Page Tables
    • When a process switch occurs, cr3 is loaded with the physical address of the new process's Page Global Directory, so subsequent memory accesses go through the proper page tables
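The folding shows up directly in the generic page-table helpers; a simplified sketch, modeled on include/asm-generic/pgtable-nopud.h and pgtable-nopmd.h:

/* with the PUD level folded, the "PUD entry" is the PGD entry itself */
static inline pud_t *pud_offset(pgd_t *pgd, unsigned long addr)
{
    return (pud_t *) pgd;
}

/* likewise for a folded PMD level */
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
{
    return (pmd_t *) pud;
}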

Why a 4-level scheme for 64-bit?

  1. Each process gets a private physical address space, protecting against addressing errors
  2. Pages (groups of data) are distinguished from page frames (physical addresses in main memory)
    • This allows a page of data stored in a page frame to be saved to disk and later reloaded into a different page frame, which is the basis of virtual memory (Ch. 17)