"It may be that Allah opens the door of obedience for you, but does not open the door of acceptance (of that obedience). It may be that Allah decrees that you commit a sin, yet it turns out to be the cause of reaching the goal (Him)."

Kernel 2.4 Memory Addressing

by danang.wijanarko@gmail.com

 

Logical Address

The logical address is the form of address that programmers commonly work with. It is composed of a segment and an offset.

Linear Address

A linear address is a single 32-bit unsigned integer, commonly known as a virtual address. It can address up to 4 GB (0x00000000 - 0xffffffff).

Physical Address

A physical address is used to address the real memory chips. It is also represented by a 32-bit unsigned integer. This is how the address translation works:

Logical Address → (Segmentation Unit) → Linear Address → (Paging Unit) → Physical Address

Multiprocessor systems share the same memory. Since reads and writes to a RAM chip must be performed serially, a unit called a memory arbiter is inserted between the bus and every RAM chip. Even uniprocessor systems use a memory arbiter, since they include a DMA processor that operates concurrently with the CPU. From the programming point of view the arbiter is hidden, since it is managed by hardware circuits.

Segmentation in hardware

Segmentation registers

A logical address consists of 2 parts:

  1. segment identifier (a 16-bit field called the segment selector)
  2. offset (a 32-bit field giving the relative address within the segment).

Segment registers hold segment selectors. There are 6 of them:

  1. cs: code segment register
  2. ss: stack segment register
  3. ds: data segment register
  4. es: extra segment register
  5. fs: extra segment register
  6. gs: extra segment register

Segmentation descriptors

We can have many segments. Each segment is represented by an 8-byte segment descriptor that describes the characteristics and properties of the segment it points to. Segment descriptors are stored either in the GDT (Global Descriptor Table) or in an LDT (Local Descriptor Table), which are located through the gdtr and ldtr processor registers. Only 1 GDT is needed, but each process is permitted to have its own LDT in addition.

Segment descriptors that are widely used:

  • Code Segment Descriptor: refers to a code segment; it may be included in either the GDT or an LDT.
  • Data Segment Descriptor: refers to a data segment; it may be included in either the GDT or an LDT.
  • Task State Segment Descriptor: refers to a Task State Segment (TSS); it can appear only in the GDT. The TSS is used to save the contents of the processor registers.
  • Local Descriptor Table Descriptor: refers to a segment containing an LDT; it can appear only in the GDT.

Fast access to segment descriptors

To speed up the translation

logical address → linear address

the processor provides an additional non-programmable register for each of the 6 programmable segmentation registers. Each non-programmable register contains the 8-byte segment descriptor (described previously) specified by the segment selector held in the corresponding segmentation register. Each time a segment selector is loaded into a segmentation register, the corresponding segment descriptor is loaded from memory into the matching non-programmable register. From then on, logical addresses referring to that segment can be translated without accessing the GDT or LDT stored in memory; the processor refers directly to the CPU register containing the segment descriptor. Each segment selector includes the following fields:

  • Index: identifies the segment descriptor entry in the GDT or LDT.
  • TI (Table Indicator): indicates whether the descriptor is in the GDT or in the LDT.
  • RPL (Requestor Privilege Level): gives the CPL (Current Privilege Level) of the CPU when the corresponding segment selector is loaded into the cs register.

Accesses to the GDT or LDT are necessary only when the contents of a segmentation register change.

[Figure 1-1]

Consecutive entries of the GDT and LDT are 8 bytes apart, since a segment descriptor is 8 bytes long. The first entry of the GDT is always set to 0 to ensure that a logical address with a null segment selector is considered invalid.
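As a rough illustration of the selector fields listed above, the sketch below pulls a 16-bit selector apart with shifts and masks. The bit positions (RPL in bits 0-1, TI in bit 2, index in bits 3-15) are the usual x86 layout and are an addition here, not something the text spells out; the names Selector, decode_selector and descriptor_offset are hypothetical helpers chosen for the example.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # 13-bit index into the GDT or LDT
    ti: int     # table indicator: 0 = GDT, 1 = LDT
    rpl: int    # requestor privilege level, 0..3


def decode_selector(selector: int) -> Selector:
    """Split a 16-bit segment selector into index, TI and RPL."""
    if not 0 <= selector <= 0xFFFF:
        raise ValueError("a segment selector is a 16-bit value")
    return Selector(index=selector >> 3, ti=(selector >> 2) & 1, rpl=selector & 3)


def descriptor_offset(selector: int) -> int:
    """Byte offset of the selected descriptor inside its table (index * 8)."""
    return decode_selector(selector).index * 8


if __name__ == "__main__":
    # Sample selector values chosen only for illustration.
    for value in (0x0000, 0x0010, 0x0023):
        print(hex(value), decode_selector(value), descriptor_offset(value))
```

Because each descriptor is 8 bytes long, the index field is also the descriptor's byte offset divided by 8, which is why the three low bits of the selector are free to carry TI and RPL.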

Segmentation unit

The segmentation unit performs the following operations:

  1. Examines the TI field of the segment selector to determine whether the segment descriptor is in the GDT or in the LDT.
  2. Computes the address of the segment descriptor from the index field of the segment selector. The index field is multiplied by 8 (the size of a segment descriptor), and the result is added to the content of the gdtr or ldtr register.
  3. Adds the offset of the logical address to the base field of the selected segment descriptor, obtaining the linear address.

[Figure 1-2]

Notice that, because of the non-programmable register associated with each segmentation register, the first 2 operations need to be performed only when a segmentation register has been changed.
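The operations of the segmentation unit can be mimicked in a few lines. This is only a sketch under simplifying assumptions: the descriptor tables are modelled as plain Python lists that hold nothing but the base field of each descriptor, limit and privilege checks are left out, and the function names are hypothetical.

```python
DESCRIPTOR_SIZE = 8  # bytes per segment descriptor


def descriptor_address(selector, gdtr, ldtr):
    """Memory address of the descriptor chosen by a selector.

    gdtr and ldtr stand for the table base addresses held in the
    registers of the same name.
    """
    uses_ldt = (selector >> 2) & 1            # examine the TI field
    table_base = ldtr if uses_ldt else gdtr
    return table_base + (selector >> 3) * DESCRIPTOR_SIZE  # index * 8


def to_linear(selector, offset, gdt_bases, ldt_bases=()):
    """Translate (selector, offset) into a 32-bit linear address.

    gdt_bases / ldt_bases are hypothetical lists holding only the
    base field of each descriptor.
    """
    index = selector >> 3
    uses_ldt = (selector >> 2) & 1
    if not uses_ldt and index == 0:
        # The first GDT entry is the null descriptor.
        raise ValueError("null segment selector")
    table = ldt_bases if uses_ldt else gdt_bases
    return (table[index] + offset) & 0xFFFFFFFF  # base + offset


if __name__ == "__main__":
    gdt = [0, 0x00000000, 0x00400000]
    print(hex(descriptor_address(0x0010, gdtr=0x1000, ldtr=0x2000)))
    print(hex(to_linear(0x0010, 0x1234, gdt)))
```

In real hardware the table lookup is skipped most of the time, because the descriptor is already cached in the non-programmable register; only the final addition is performed on every access.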

Segmentation in Linux

Implementation

LDTs are not used by the kernel, although a system call named modify_ldt() exists that allows processes to create their own LDTs. This is useful to applications (such as Wine) that execute segment-oriented Microsoft Windows applications. Here are the segments used by the kernel:

  • __KERNEL_CS: the segment selector macro for the segment descriptor of the kernel code segment.
  • __KERNEL_DS: the segment selector macro for the segment descriptor of the kernel data segment.
  • __USER_CS: the segment selector macro for the segment descriptor of the user code segment.
  • __USER_DS: the segment selector macro for the segment descriptor of the user data segment.
  • A TSS for each processor. All TSSs are stored sequentially in the init_tss array.
  • A default LDT, stored in the default_ldt variable. The default LDT includes a single entry consisting of a null segment descriptor. Each processor has its own LDT segment descriptor, which usually points to the common default LDT segment.
  • 4 segments related to APM support.

This is how it looks:

[Figure 1-3]

The CPL of the CPU indicates whether the processor is in user or kernel mode, and it is specified by the RPL field of the segment selector stored in the cs register. Whenever the CPL changes, some of the segmentation registers also need to be changed. For example, when the CPL is 0, the ds register must contain the segment selector of the kernel data segment, and ss must refer to the kernel mode stack inside the kernel data segment. Conversely, when the CPL is 3, the ds register must contain the segment selector of the user data segment, and ss must refer to the user mode stack inside the user data segment.

Paging in hardware

Overview

Besides translating

linear address → physical address

the paging unit also checks the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a Page Fault exception.

Linear addresses are grouped into fixed-length intervals called pages; contiguous linear addresses within a page are mapped into contiguous physical addresses. The paging unit views all of RAM as partitioned into fixed-length page frames (sometimes called physical pages), each of which can hold one page.

The data structures that map linear addresses to physical addresses are called page tables. They are stored in main memory and must be properly initialized before the paging unit is enabled. Paging is enabled by setting the PG flag of the cr0 register.

Regular paging

Starting with the 80386, the paging unit of Intel processors handles 4 KB pages.

[Figure 1-5]

A 2-level scheme is used to reduce the amount of RAM required for per-process page tables. With only 1 level, the table would need 2^20 entries; at 4 bytes per entry, that is 4 MB of RAM per process, even though a process does not use all of the addresses anyway. The 2-level scheme reduces the memory by requiring page tables only for those virtual memory regions actually used by a process.

So each active process must have a page directory assigned to it, but there is no need to allocate RAM for all of its page tables at once. RAM for a page table is allocated only when the process actually needs it.

The cr3 register holds the physical address of the page directory currently in use. The entries of page directories and page tables have the same structure.
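To make the 2-level scheme concrete, here is a minimal sketch that splits a 32-bit linear address into its directory, table and offset fields (10 + 10 + 12 bits, which follows from 4 KB pages and 1024-entry tables) and walks a toy page directory. The dictionaries standing in for the page directory and page tables, and the function names, are assumptions made for the example; entry flags are ignored.

```python
PAGE_SHIFT = 12   # 4 KB pages
TABLE_BITS = 10   # 1024 entries per page directory / page table
ENTRY_BYTES = 4

# Size of a hypothetical single-level table covering the whole 4 GB space.
ONE_LEVEL_TABLE_BYTES = (1 << (32 - PAGE_SHIFT)) * ENTRY_BYTES


def split_linear(addr):
    """Return (directory index, table index, offset) of a linear address."""
    directory = addr >> (PAGE_SHIFT + TABLE_BITS)
    table = (addr >> PAGE_SHIFT) & ((1 << TABLE_BITS) - 1)
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return directory, table, offset


def translate(addr, page_directory):
    """Walk a toy page directory.

    page_directory maps a directory index to a page table, which in turn
    maps a table index to the base address of a page frame. A missing
    entry plays the role of a Page Fault.
    """
    directory, table, offset = split_linear(addr)
    page_table = page_directory.get(directory)
    if page_table is None or table not in page_table:
        raise LookupError("page fault at 0x%08x" % addr)
    return page_table[table] + offset


if __name__ == "__main__":
    toy_directory = {0x300: {0x101: 0x00101000}}
    print(split_linear(0xC0101234))
    print(hex(translate(0xC0101234, toy_directory)))
    print(ONE_LEVEL_TABLE_BYTES // (1024 * 1024), "MB for a one-level table")
```

Only the page tables that actually appear in the directory cost memory, which is the saving the 2-level scheme is after.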

Extended paging

Starting with the Pentium model, Intel introduced extended paging, which allows page frames to be 4 MB instead of 4 KB in size. Extended paging is enabled by setting the PSE flag of the cr4 register and the Page Size flag of a page directory entry. Extended paging is used to translate large contiguous linear address ranges into the corresponding physical ones without intermediate page tables, thus saving memory and preserving TLB entries.
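With 4 MB page frames the linear address is split into only two fields. The sketch below assumes the usual layout for this case (the 10 most significant bits select the page directory entry, the remaining 22 bits are the offset inside the 4 MB frame); the function name is hypothetical.

```python
LARGE_PAGE_SHIFT = 22                 # 4 MB = 2**22 bytes
LARGE_PAGE_SIZE = 1 << LARGE_PAGE_SHIFT
SMALL_PAGE_SIZE = 4096

# Number of 4 KB page table entries that one 4 MB mapping replaces.
ENTRIES_SAVED = LARGE_PAGE_SIZE // SMALL_PAGE_SIZE


def split_extended(addr):
    """Return (directory index, offset) for a 4 MB extended-paging page."""
    return addr >> LARGE_PAGE_SHIFT, addr & (LARGE_PAGE_SIZE - 1)


if __name__ == "__main__":
    directory, offset = split_extended(0xC0101234)
    print(hex(directory), hex(offset), ENTRIES_SAVED)
```

One page directory entry with the Page Size flag set thus stands in for a whole page table of 1024 entries, and a single TLB entry covers the full 4 MB.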

Hardware protection scheme

The segmentation unit uses 4 privilege levels, while the paging unit uses only 2. Read and write access rights are controlled by flags in the page directory and page table entries.

3-level paging

On 64-bit architectures (HP's Alpha, Intel's Itanium, Sun's UltraSPARC), 2-level paging is no longer suitable. As a test case, take a 64-bit architecture with a page size of 16 KB. Since 16 KB is 2^14, the offset field of the linear address is 14 bits long. This leaves 50 bits for the page directory and page table fields. If we assign 25 bits to each, both the page directory and each page table need 2^25 entries (more than 32 million) for each process. Even though RAM is getting cheaper, this wastes too much space.

The solution adopted for HP's Alpha is the following:

  • Only the 43 least significant bits of an address are used, so the 21 most significant bits are always set to 0.
  • Page frames are 8 KB (2^13) in size, so the offset field is 13 bits long.
  • 3-level paging is introduced so that the remaining 30 bits can be split into three 10-bit fields. Thus, the page tables include 2^10 = 1024 entries, as in 2-level regular paging.
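The bit arithmetic in this section is easy to check mechanically. The sketch below first reproduces the 2^25 figure for the hypothetical 16 KB, 2-level layout and then splits a 43-bit address into the three 10-bit fields and the 13-bit offset of the Alpha scheme. The function name and the order in which the fields are returned are choices made for the example.

```python
# 2-level split of a 64-bit address with 16 KB pages, as in the test case.
OFFSET_BITS_16K = 14
FIELD_BITS_2LEVEL = (64 - OFFSET_BITS_16K) // 2       # 25 bits each
ENTRIES_2LEVEL = 1 << FIELD_BITS_2LEVEL               # entries per table

# Alpha-style 3-level split: 43 usable bits, 8 KB pages.
USED_BITS = 43
OFFSET_BITS = 13
FIELD_BITS = (USED_BITS - OFFSET_BITS) // 3           # 10 bits each
FIELD_MASK = (1 << FIELD_BITS) - 1


def split_alpha(addr):
    """Return (level 1, level 2, level 3, offset) for a 43-bit address."""
    if addr >> USED_BITS:
        raise ValueError("the 21 most significant bits must be 0")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    level3 = (addr >> OFFSET_BITS) & FIELD_MASK
    level2 = (addr >> (OFFSET_BITS + FIELD_BITS)) & FIELD_MASK
    level1 = (addr >> (OFFSET_BITS + 2 * FIELD_BITS)) & FIELD_MASK
    return level1, level2, level3, offset


if __name__ == "__main__":
    print(ENTRIES_2LEVEL)
    print(split_alpha((1 << 33) | (2 << 23) | (3 << 13) | 5))
```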

The physical address extension (PAE) paging

Beginning with the Pentium Pro processor, Intel introduced a mechanism called Physical Address Extension (PAE), which extends physical addresses to 36 bits. When PAE is enabled, page frames are either 4 KB or 2 MB in size (the 4 MB size belongs to extended paging without PAE). The main problem is that linear addresses are still 32 bits long, forcing programmers to reuse the same linear addresses to map different areas of RAM.

Hardware cache

The CPU and dynamic RAM (DRAM) chips run at different clock rates, which is why hardware cache memories were introduced to reduce the speed mismatch. The solution is a smaller and faster memory that contains the most recently used code and data. For this purpose, a new unit called the line was introduced. It consists of a number of contiguous bytes that are transferred in burst mode between the slow DRAM and the fast on-chip static RAM (SRAM) used to implement caches. Each cache line is an entry within the cache, and each one refers to a location in memory. There are 3 ways to map memory lines to cache lines:

  1. Direct mapped: a line in main memory is always stored at the exact same line in the cache.
  2. Fully associative: any line in memory can be stored at any location in the cache.
  3. N-way set associative: any line in memory can be stored in any one of N lines of the cache. For example, a line of memory can be stored in 2 different lines of a 2-way set associative cache.
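The three placement policies differ only in how many cache lines are candidates for a given memory line. The sketch below treats them as one scheme with a varying number of ways (1 way is direct mapped, as many ways as there are lines is fully associative). The modulo placement rule, the parameter names and the sample geometry (32-byte lines, 128 lines) are illustrative assumptions, not values taken from a specific processor.

```python
def candidate_slots(addr, line_size, num_lines, ways):
    """Cache line slots in which the memory line holding addr may be stored.

    ways == 1           -> direct mapped
    ways == num_lines   -> fully associative
    anything in between -> N-way set associative
    """
    if num_lines % ways:
        raise ValueError("num_lines must be a multiple of ways")
    num_sets = num_lines // ways
    memory_line = addr // line_size        # which line of RAM holds addr
    chosen_set = memory_line % num_sets    # simple modulo placement
    first_slot = chosen_set * ways
    return list(range(first_slot, first_slot + ways))


if __name__ == "__main__":
    for ways in (1, 2, 128):
        slots = candidate_slots(0x1234, line_size=32, num_lines=128, ways=ways)
        print(ways, "way(s):", len(slots), "candidate slot(s)")
```

More ways mean fewer conflicts between memory lines that happen to map to the same place, at the price of more tags to compare on every access.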

[Figure 1-7]

If the requested address is found in the cache, it is called a cache hit; otherwise it is called a cache miss. The cache controller decides which by comparing the tags of the candidate lines with the requested address. For write operations there are 2 strategies. With write-through, the controller always writes into both RAM and the cache line. With write-back, only the cache line is updated and the contents of RAM are left unchanged; RAM must eventually be updated afterwards.

This is hardware specific and will not be discussed any further.

 

Translation Lookaside Buffers (TLB)

The TLB is used to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the page tables in RAM. This physical address is then stored in a TLB entry so that further references to the same linear address can be translated quickly. When cr3 is modified, the hardware automatically invalidates all entries of the TLB.
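The behaviour described here can be modelled as a small cache sitting in front of the page table walk. In this sketch the Tlb class, its walk callback (which stands for the slow lookup in RAM and maps a page number to a frame number) and the write_cr3 method are all hypothetical; entries are kept per 4 KB page and there is no size limit or replacement policy.

```python
PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class Tlb:
    """Toy TLB: remembers page -> frame translations until cr3 changes."""

    def __init__(self, walk):
        self._walk = walk      # slow path: page number -> frame number
        self._entries = {}
        self.misses = 0

    def translate(self, linear):
        page = linear >> PAGE_SHIFT
        if page not in self._entries:
            # First use of this page: consult the page tables.
            self.misses += 1
            self._entries[page] = self._walk(page)
        return (self._entries[page] << PAGE_SHIFT) | (linear & OFFSET_MASK)

    def write_cr3(self, walk):
        """Switch to another set of page tables; all entries are dropped."""
        self._walk = walk
        self._entries.clear()


if __name__ == "__main__":
    tlb = Tlb(lambda page: page + 0x100)
    print(hex(tlb.translate(0x1234)), hex(tlb.translate(0x1FFF)), tlb.misses)
    tlb.write_cr3(lambda page: page)
    print(hex(tlb.translate(0x1234)), tlb.misses)
```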

Paging in Linux

Linux uses 3-level paging so that paging is feasible on 64-bit architectures. There are 3 types of paging tables:

  1. Page Global Directory
  2. Page Middle Directory
  3. Page Table

[Figure 1-8]

On 32-bit architectures, Linux essentially eliminates the Page Middle Directory field by saying that it contains zero bits. However, the position of the Page Middle Directory in the sequence of pointers is kept so that the same code can work on both 32-bit and 64-bit architectures. This is done by setting the number of entries in the Page Middle Directory to 1 and mapping this single entry into the proper entry of the Page Global Directory.

When Linux uses PAE on 32-bit architectures, the levels correspond as follows:

  1. Linux's Page Global Directory is the x86's Page Directory Pointer Table
  2. Linux's Page Middle Directory is the x86's Page Directory
  3. Linux's Page Table is the x86's Page Table

Reserved page frames

The kernel's code and data structures are stored in a group of reserved page frames. These page frames can never be dynamically assigned or swapped out to disk. The Linux kernel is installed in RAM starting from physical address 0x00100000. The reasons why it is not installed in the first megabyte of RAM are:

  • Page frame 0 is used by the BIOS to store the system hardware configuration detected during POST.
  • The physical address range 0x000a0000 - 0x000fffff is used for graphics.
  • Additional page frames within the first megabyte may be reserved by specific computer models.

Typical physical address map for a computer having 128 MB:

Start       End         Type                  Comment
0x00000000  0x0009ffff  Usable                -
0x000a0000  0x000effff  Graphic               Some BIOSes may not provide it
0x00100000  0x07feffff  Usable                -
0x07ff0000  0x07ff2fff  ACPI data / (Usable)  Initially stores information about the hardware devices of the system, written by the BIOS in the POST phase. The kernel then copies this information into a suitable kernel data structure and afterwards considers these frames usable.
0x07ff3000  0x07ffffff  ACPI NVS              Mapped on ROM chips of the hardware devices
0xffff0000  0xffffffff  Reserved              Mapped by hardware to the BIOS's ROM chip

To avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. This is how the first 2 MB of RAM are filled by Linux (see the System.map file):

[Figure 1-9]

Process page tables

The linear address space of a process is divided into 2 parts:

  1. 0x00000000 - 0xbfffffff (3 GB for User VM): can be addressed when the process is in either User or Kernel Mode. The content of the first entries of the Page Global Directory, which map linear addresses < 0xc0000000 (the first 768 entries with PAE disabled), is process specific: these entries differ from one process's Page Global Directory to another's.
  2. 0xc0000000 - 0xffffffff (1 GB for Kernel VM): can be addressed only when the process is in Kernel Mode. The remaining entries of the Page Global Directory, which map linear addresses >= 0xc0000000, should be the same for all processes and equal to the corresponding entries of the master kernel Page Global Directory.

In User Mode, a process issues linear addresses < 0xc0000000; when it runs in Kernel Mode, it is executing kernel code and the linear addresses issued are >= 0xc0000000. Even while running in Kernel Mode, the kernel sometimes needs to access the User Mode linear address space to retrieve or store data.

Kernel page tables

The kernel maintains a set of Page Tables for its own use, rooted at a Page Global Directory called the master kernel Page Global Directory. After initialization, this set of Page Tables is never directly used by any process or kernel thread. Instead, the highest entries of the master kernel Page Global Directory (those that map linear addresses >= 0xc0000000) are the reference model for the corresponding entries of the Page Global Directory of every regular process in the system.

Kernel initializes its own Page Tables in 2 steps:

1. The kernel creates an 8 MB address space (enough for it to install itself in RAM) using the provisional kernel page tables.

The provisional kernel page tables are initialized by the startup_32() function defined in arch/i386/kernel/head.S. At this stage the Page Middle Directory entries are equivalent to the Page Global Directory entries, and PAE support is not yet enabled.

swapper_pg_dir holds the Page Global Directory, while the two Page Tables that provide the 8 MB address space of RAM are contained in pg0 and pg1.

The aim of this step is to let the kernel address 8 MB of RAM both through linear addresses that are identical to the physical ones and through 8 MB worth of linear addresses starting from 0xc0000000. This is implemented by mapping both linear address ranges 0x00000000 - 0x007fffff (the range normally used in User Mode) and 0xc0000000 - 0xc07fffff (the range used in Kernel Mode) into physical addresses 0x00000000 - 0x007fffff.

The kernel creates the desired mapping by filling all swapper_pg_dir entries with zeroes, except for entries 0 and 1 (covering linear addresses 0x00000000 - 0x007fffff) and entries 0x300 (decimal 768) and 0x301 (decimal 769) (covering linear addresses 0xc0000000 - 0xc07fffff).

Entries 0 and 0x300 are set to the physical address of pg0. Entries 1 and 0x301 are set to the physical address of pg1.
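The provisional mapping is small enough to model completely. The sketch below reuses the names swapper_pg_dir, pg0 and pg1 from the text but represents them as Python lists: each page table entry holds just the physical base address of a 4 KB frame (flag bits are omitted), an empty directory entry is None instead of zero, and the translate function is a hypothetical helper.

```python
ENTRIES = 1024
PAGE_SIZE = 4096

# pg0 maps the first 4 MB of RAM, pg1 the second 4 MB.
pg0 = [i * PAGE_SIZE for i in range(ENTRIES)]
pg1 = [(ENTRIES + i) * PAGE_SIZE for i in range(ENTRIES)]

# All directory entries are empty except 0, 1, 0x300 and 0x301.
swapper_pg_dir = [None] * ENTRIES
swapper_pg_dir[0] = swapper_pg_dir[0x300] = pg0
swapper_pg_dir[1] = swapper_pg_dir[0x301] = pg1


def translate(linear):
    """Translate a linear address through the provisional page tables."""
    page_table = swapper_pg_dir[linear >> 22]
    if page_table is None:
        raise LookupError("0x%08x is not mapped yet" % linear)
    return page_table[(linear >> 12) & 0x3FF] + (linear & 0xFFF)


if __name__ == "__main__":
    # Both views reach the same physical byte.
    print(hex(translate(0x00123456)), hex(translate(0xC0123456)))
```

The index 0x300 is simply 0xc0000000 shifted right by 22 bits, which is why the kernel range starts at directory entry 768.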

2. The kernel takes advantage of all the RAM and sets up the page tables properly.

The final mapping provided by the kernel must transform linear addresses starting at 0xc0000000 into physical addresses starting at 0x00000000. The __pa macro converts a linear address starting from PAGE_OFFSET (0xc0000000) to the corresponding physical address; the reverse is the __va macro. The final kernel page table mapping is initialized by the paging_init() function, which does the following:

  1. Invokes pagetable_init() to set up the Page Table entries properly.
  2. Writes the physical address of swapper_pg_dir into the cr3 control register.
  3. Invokes flush_tlb_all() to invalidate all TLB entries.
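The __pa and __va conversions mentioned above amount to subtracting or adding PAGE_OFFSET. A minimal sketch of that arithmetic follows; pa and va are hypothetical Python stand-ins for the macros, and the range checks are an addition for the example that the real macros do not perform.

```python
PAGE_OFFSET = 0xC0000000
LINEAR_MAX = 0xFFFFFFFF


def pa(virtual):
    """Kernel linear address -> physical address (stand-in for __pa)."""
    if not PAGE_OFFSET <= virtual <= LINEAR_MAX:
        raise ValueError("not a kernel linear address")
    return virtual - PAGE_OFFSET


def va(physical):
    """Physical address -> kernel linear address (stand-in for __va)."""
    if not 0 <= physical <= LINEAR_MAX - PAGE_OFFSET:
        raise ValueError("physical address is outside the 1 GB window")
    return physical + PAGE_OFFSET


if __name__ == "__main__":
    # The kernel is loaded at physical address 0x00100000.
    print(hex(va(0x00100000)), hex(pa(0xC0100000)))
```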

The actions performed by pagetable_init() depend on both the amount of RAM and the CPU model.

The basic problem here is that the kernel can address only 1 GB of virtual addresses, which translates to a maximum of 1 GB of physical memory, because the kernel directly maps its virtual address space onto the available physical memory. Moreover, the Kernel VM reserves a special window of 128 MB for vmalloc() and dynamic remapping (explained below). So the kernel in fact has 1024 - 128 = 896 MB of RAM directly mapped.

This raises two questions:

What if the available RAM is > 896 MB?
How does the kernel address it, given that the Kernel VM can map at most 1 GB (and, even worse, in fact only 896 MB)?

Before the solution, some background. In Linux, the memory available from all banks is classified into "nodes". These nodes indicate how much memory each bank has.

Memory in each node is divided into "zones". The zones currently defined are ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM.

ZONE_DMA is used by some devices for data transfer and is mapped in the lower physical memory range (up to 16 MB).

Memory in the ZONE_NORMAL region is mapped by the kernel in the upper region of the linear address space. Most operations can only take place in ZONE_NORMAL, so this is the most performance critical zone. ZONE_NORMAL goes from 16 MB to 896 MB.

To address memory beyond ZONE_NORMAL, the kernel has to map pages from high memory into its own linear address space.

Part of the kernel's linear address space is reserved for several kernel data structures that store information about the memory map and page tables. On x86 this is 128 MB. Hence, of the 1 GB the kernel can address, 128 MB is reserved, which means that kernel virtual addresses in this 128 MB are not directly mapped to physical memory. This leaves a maximum of 896 MB for ZONE_NORMAL. So, even if one has 1 GB of physical RAM, just 896 MB is directly available.
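The zone boundaries given above can be summarised in a few lines. This sketch classifies a physical address using the 16 MB and 896 MB limits from the text; zone_of is a hypothetical helper that returns the zone name as a string.

```python
MB = 1024 * 1024

KERNEL_VM_MB = 1024     # 1 GB of kernel linear addresses
RESERVED_MB = 128       # window kept for vmalloc() and other remapping
DIRECT_MAPPED_MB = KERNEL_VM_MB - RESERVED_MB

DMA_LIMIT = 16 * MB
NORMAL_LIMIT = DIRECT_MAPPED_MB * MB


def zone_of(physical):
    """Name of the zone a physical address falls into."""
    if physical < DMA_LIMIT:
        return "ZONE_DMA"
    if physical < NORMAL_LIMIT:
        return "ZONE_NORMAL"
    return "ZONE_HIGHMEM"


if __name__ == "__main__":
    for address in (0x00100000, 64 * MB, 900 * MB):
        print(hex(address), zone_of(address))
```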

Final kernel Page Table (RAM < 896 MB)

In this case the address range of the Kernel VM and 32-bit physical addresses are sufficient to address all the available RAM, and there is no need to activate the PAE mechanism.

Final kernel Page Table (896 MB <= RAM < 4096 MB)

Now the address range of the Kernel VM is not sufficient anymore to map all the available RAM. This means that the pages in ZONE_HIGHMEM have to be mapped in ZONE_NORMAL before they can be accessed.

The reserved space discussed earlier (128 MB in the case of x86) contains an area in which pages from high memory are mapped into the kernel address space.

To create a permanent mapping, the kmap() function is used. Since this function may sleep, it may not be used in interrupt context. Since the number of permanent mappings is limited (if it were not, we could have directly mapped all of high memory into the address space), pages mapped this way should be unmapped with kunmap() when no longer needed.

Temporary mappings can be created via kmap_atomic(). This function does not block, so it can be used in interrupt context. kunmap_atomic() unmaps the mapped high memory page. A temporary mapping is valid only until the next temporary mapping is created. However, since the mapping and unmapping functions also disable and enable preemption, it is a bug not to call kunmap_atomic() on a page mapped via kmap_atomic().

Final kernel Page Table (4096 MB <= RAM )

Although PAE handles 36-bit physical addresses, the linear addresses are still 32-bit addresses. So besides enabling the PAE, the 3-level paging model is used.

The kernel initializes the first 3 entries in the Page Global Directory, corresponding to the user linear address space, with the address of an empty page (empty_zero_page). The 4th entry is initialized with the address of a Page Middle Directory (pmd). The first 448 entries in the Page Middle Directory (out of 512 total entries; the last 64 are reserved for noncontiguous memory allocation) are filled with the physical addresses of the first 896 MB of RAM.

Note that, whenever possible, Linux uses large pages to reduce the number of Page Tables.
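The numbers in this paragraph fit together as simple arithmetic, sketched below. The field widths used (2 bits for the Page Global Directory, 9 bits for the Page Middle Directory, and a 21-bit offset when each Page Middle Directory entry maps a 2 MB large page) are the usual PAE layout and are an assumption added here; split_pae_large is a hypothetical helper.

```python
MB = 1024 * 1024

PGD_ENTRIES = 4            # 2 bits
PMD_ENTRIES = 512          # 9 bits
LARGE_PAGE = 2 * MB        # 21-bit offset

DIRECT_PMD_ENTRIES = 448
RESERVED_PMD_ENTRIES = PMD_ENTRIES - DIRECT_PMD_ENTRIES
DIRECT_MAPPED_MB = DIRECT_PMD_ENTRIES * LARGE_PAGE // MB


def split_pae_large(linear):
    """Return (PGD index, PMD index, offset) assuming 2 MB large pages."""
    pgd = linear >> 30
    pmd = (linear >> 21) & (PMD_ENTRIES - 1)
    offset = linear & (LARGE_PAGE - 1)
    return pgd, pmd, offset


if __name__ == "__main__":
    print(split_pae_large(0xC0000000))   # start of the kernel range
    print(DIRECT_MAPPED_MB, RESERVED_PMD_ENTRIES)
```

The kernel range starting at 0xc0000000 lands in Page Global Directory entry 3, that is, the 4th entry, and the last directly mapped byte (0xc0000000 + 896 MB - 1) falls in Page Middle Directory entry 447.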

Fix-mapped linear addresses

As described earlier, 128 MB of linear addresses are always left available for the kernel to implement noncontiguous memory allocation and fix-mapped linear addresses. Noncontiguous memory allocation will be discussed later.

A fix-mapped linear address can map any physical address, while the linear addresses in the initial portion of the 4th GB map physical addresses linearly (linear address X maps physical address X - PAGE_OFFSET).

Handling the hardware cache and the TLB

These play an important role in boosting the performance of modern architectures. The goal in handling them is to reduce the number of cache and TLB misses.

Handling the hardware cache

L1_CACHE_BYTES holds the size of a cache line in bytes. To optimize the cache hit rate, the kernel considers the architecture in making the following decisions:

  • The most frequently used fields of a data structure are placed at the low offsets within the data structure so they can be cached in the same line.
  • When allocating a large set of data structures, the kernel tries to store each of them in memory so that all cache lines are used uniformly.
  • When performing a process switch, the kernel has a small preference for processes that use the same set of Page Tables as the previously running process.

Handling the TLB

When a process switch happens, the Page Tables change, and the local TLB entries relative to the old Page Tables must be flushed (this is done automatically when the kernel writes the address of the new Page Global Directory into cr3). In some cases, the kernel succeeds in avoiding TLB flushes:

  • A process switch between 2 regular processes that use the same set of Page Tables.
  • A process switch between a regular process and a kernel thread. In fact, kernel threads do not have their own set of Page Tables; rather, they use the set of Page Tables owned by the regular process that was last scheduled for execution on the CPU.

Other cases in which the kernel needs to flush some entries in a TLB:

  • When the kernel assigns a page frame to a User Mode process and stores its physical address into a Page Table entry, it must flush any local TLB entry that refers to the corresponding linear address.
  • On multiprocessor systems, the kernel must also flush the same TLB entry on the CPUs that are using the same set of Page Tables, if any.

To avoid useless TLB flushing in multiprocessor systems, the kernel uses a technique called lazy TLB mode. The basic idea is that if several CPUs are using the same Page Tables and a TLB entry must be flushed on all of them, then TLB flushing may, in some cases, be delayed on CPUs running kernel threads.

When a CPU starts running a kernel thread, the kernel sets it into lazy TLB mode. When requests (from other CPUs) are issued to clear some TLB entries, each CPU in lazy TLB mode does not flush the corresponding entries; however, the CPU remembers that its current process is running on a set of Page Tables whose TLB entries for the User Mode addresses are invalid. As soon as the CPU in lazy TLB mode switches to a regular process with a different set of Page Tables, the hardware automatically flushes the TLB entries, and the kernel sets the CPU back in non-lazy TLB mode. However, if a CPU in lazy TLB mode switches to a regular process that owns the same set of Page Tables used by the previously running kernel thread, then any deferred TLB invalidation must be effectively applied by the kernel. This "lazy" invalidation is achieved by flushing all nonglobal TLB entries of the CPU. Some extra data structures are needed to implement the lazy TLB mode.

When a CPU starts executing a kernel thread, the kernel sets the state field of its cpu_tlbstate element to TLBSTATE_LAZY, moreover, the cpu_vm_mask field of the active memory descriptor stores the indices of all CPUs in the system, including the one that is entering in lazy TLB mode. When another CPU wants to invalidate the TLB entries of all CPUs relative to a given set of Page Tables, it delivers an Interprocessor Interrupt to all CPUs whose indices are included in the cpu_vm_mask field of the corresponding memory descriptor.

When a CPU receives an Interprocessor Interrupt related to TLB flushing and verifies that it affects the set of Page Tables of its current process, it checks whether the state field of its cpu_tlbstate element is equal to TLBSTATE_LAZY. If so, the kernel refuses to invalidate the TLB entries and removes the CPU index from the cpu_vm_mask field of the memory descriptor. This has two consequences:

  1. As long as the CPU remains in lazy TLB mode, it will not receive other Interprocessor Interrupts related to TLB flushing.
  2. If the CPU switches to another process that is using the same set of Page Tables as the kernel thread that is being replaced, the kernel invokes local_flush_tlb to invalidate all nonglobal TLB entries of the CPU.
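The bookkeeping behind lazy TLB mode can be followed with a toy simulation. The sketch below borrows the names cpu_vm_mask and TLBSTATE_LAZY from the text, but everything else (the MemoryDescriptor and Cpu classes, the send_flush_ipi function, the string values of the states, and counting flushes instead of performing them) is a simplification invented for the example and does not mirror the kernel's actual code.

```python
TLBSTATE_OK = "ok"
TLBSTATE_LAZY = "lazy"


class MemoryDescriptor:
    """Stands for a set of Page Tables shared by some CPUs."""

    def __init__(self):
        self.cpu_vm_mask = set()   # indices of CPUs that must receive IPIs


class Cpu:
    def __init__(self, index, mm):
        self.index = index
        self.mm = mm
        self.state = TLBSTATE_OK
        self.flushes = 0           # how many TLB flushes this CPU performed
        mm.cpu_vm_mask.add(index)

    def start_kernel_thread(self):
        # The kernel thread keeps using the previous Page Tables.
        self.state = TLBSTATE_LAZY

    def handle_flush_ipi(self, mm):
        if mm is not self.mm:
            return
        if self.state == TLBSTATE_LAZY:
            # Refuse to flush and stop receiving further IPIs.
            mm.cpu_vm_mask.discard(self.index)
        else:
            self.flushes += 1

    def resume_same_page_tables(self):
        """Switch from the kernel thread back to a process using self.mm."""
        self.state = TLBSTATE_OK
        if self.index not in self.mm.cpu_vm_mask:
            # A flush was skipped while lazy: apply it now.
            self.mm.cpu_vm_mask.add(self.index)
            self.flushes += 1


def send_flush_ipi(mm, cpus, sender):
    """Deliver a flush request to every other CPU listed in cpu_vm_mask."""
    targets = [c for c in cpus if c is not sender and c.index in mm.cpu_vm_mask]
    for cpu in targets:
        cpu.handle_flush_ipi(mm)
    return [cpu.index for cpu in targets]


if __name__ == "__main__":
    mm = MemoryDescriptor()
    cpu0, cpu1 = Cpu(0, mm), Cpu(1, mm)
    cpu1.start_kernel_thread()
    print(send_flush_ipi(mm, [cpu0, cpu1], cpu0), cpu1.flushes)
    print(send_flush_ipi(mm, [cpu0, cpu1], cpu0), cpu1.flushes)
    cpu1.resume_same_page_tables()
    print(cpu1.flushes, sorted(mm.cpu_vm_mask))
```

The lazy CPU is interrupted once, drops out of the mask, and pays for a single full flush only if it later returns to the same Page Tables.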

 

- d

