Storage Allocation
Information stored in memory is used in many
different ways. Some possible classifications are:
- Role in Programming Language:
- Instructions (specify the operations to be performed and
the operands to use in the operations).
- Variables (the information
that changes as the program runs: locals, owns, globals, parameters, dynamic
storage).
- Constants (information that is used as operands, but that
never changes: pi for example).
- Changeability:
- Read-
only: (code, constants).
- Read & write: (variables).
Why is identifying non-changing memory useful or
important?
- Initialized:
- Code,
constants, some variables: yes.
- Most variables:
no.
- Addresses vs. Data: Why is this distinction useful or important?
- Binding time:
- Static:
arrangement determined once and for all, before the program starts running. May
happen at compile-time, link-time, or load-time.
- Dynamic:
arrangement cannot be determined until runtime, and may change.
Note that the classifications overlap: variables may
be static or dynamic, code may be read-only or read&write, etc.
The compiler, linker, operating system, and run-time
library all must cooperate to manage this information and perform allocation.
When a process is running, what does its memory look
like? It is divided up into areas of stuff that the OS treats similarly, called
segments. In Unix, each process has three segments:
- Code (called "text" in Unix terminology)
- Initialized data
- Uninitialized data
- User's dynamically linked libraries (shared objects (.so) or
dynamically linked libraries (.dll))
- Shared libraries (system dynamically linked libraries)
- Mapped files
- Stack(s)
In some systems, can have many different kinds of
segments.
One of the steps in creating a process is to load its
information into main memory, creating the necessary segments. Information comes
from a file that gives the size and contents of each segment (e.g. a.out in
Unix). The file is called an object file. See man 5 a.out for format of Unix
object files.
Division of responsibility between various portions
of system:
- Compiler: generates one object file for each source code file
containing information for that file. Information is incomplete, since each
source file generally uses some things defined in other source files.
- Linker: combines all of the object files for one program into a
single object file, which is complete and self-sufficient.
- Operating system: loads object files into memory, allows several
different processes to share memory at once, provides facilities for processes
to get more memory after they have started running.
- Run-time library: provides dynamic allocation routines, such as
calloc and free in C.
Dynamic Memory Allocation
Why is not static allocation sufficient for
everything? Unpredictability: cannot predict ahead of time how much memory, or
in what form, will be needed:
- Recursive procedures. Even regular procedures are hard to predict
(data dependencies).
- OS does not know how many jobs there will be or which programs
will be run.
- Complex data structures, e.g. linker symbol table. If all storage
must be reserved in advance (statically), then it will be used inefficiently
(enough will be reserved to handle the worst possible case).
Need dynamic memory allocation both for main memory
and for file space on disk.
Two basic operations in dynamic storage management:
Dynamic allocation can be handled in one of two
general ways:
- Stack allocation (hierarchical): restricted, but simple and
efficient.
- Heap allocation: more general, but less efficient, more difficult
to implement.
Stack organization: memory allocation and freeing are
partially predictable (as usual, we do better when we can predict the future).
Allocation is hierarchical: memory is freed in opposite order from allocation.
If alloc(A) then alloc(B) then alloc(C), then it must be free(C) then free(B)
then free(A).
- Example: procedure call. Program calls Y, which calls X. Each call
pushes another stack frame on top of the stack. Each stack frame has space for
variable, parameters, and return addresses.
- Stacks are also useful for lots of other things: tree traversal,
expression evaluation, top-down recursive descent parsers, etc.
A stack-based organization keeps all the free space
together in one place.
Heap organization: allocation and release are
unpredictable. Heaps are used for arbitrary list structures, complex data
organizations. Example: payroll system. Do not know when employees will join and
leave the company, must be able to keep track of all them using the least
possible amount of storage.
- Inevitably end up with lots of holes. Goal: reuse the space in
holes to keep the number of holes small, their size large.
- Fragmentation: inefficient use of memory due to holes that are too
small to be useful. In stack allocation, all the holes are together in one big
chunk.
- Refer to Knuth volume 1 for detailed treatment of what
follows.
- Typically, heap allocation schemes use a free list to keep track
of the storage that is not in use. Algorithms differ in how they manage the free
list.
- Best fit: keep linked list of
free blocks, search the whole list on each allocation, choose block that comes
closest to matching the needs of the allocation, save the excess for later.
During release operations, merge adjacent free blocks.
- First fit:
just scan list for the first hole that is large enough. Free excess. Also merge
on releases. Most first fit implementations are rotating first
fit.
- Bit Map: used for allocation of storage that comes in fixed-size
chunks (e.g. disk blocks, or 32-byte chunks). Keep a large array of bits, one
for each chunk. If bit is 0 it means chunk is in use, if bit is 1 it means chunk
is free. Will be discussed more when talking about file systems.
Pools: keep a separate allocation pool for each
popular size. Allocation is fast, no fragmentation.
Reclamation Methods: how do we know when memory can
be freed?
- It is easy when a chunk is only used in one place.
- Reclamation is hard when information is shared: it cannot be
recycled until all of the sharers are finished. Sharing is indicated by the
presence of pointers to the data (show example). Without a pointer, cannot
access (cannot find it).
Two problems in reclamation:
- Dangling pointers: better not recycle storage while it is still
being used.
- Core leaks: Better not "lose" storage by forgetting to free it
even when it cannot ever be used again.
Reference Counts: keep track of the number of
outstanding pointers to each chunk of memory. When this goes to zero, free the
memory. Example: Smalltalk, file descriptors in Unix. Works fine for
hierarchical structures. The reference counts must be managed automatically (by
the system) so no mistakes are made in incrementing and decrementing them.
Garbage Collection: storage is not freed explicitly
(using free operation), but rather implicitly: just delete pointers. When the
system needs storage, it searches through all of the pointers (must be able to
find them all!) and collects things that are not used. If structures are
circular then this is the only way to reclaim space. Makes life easier on the
application programmer, but garbage collectors are incredibly difficult to
program and debug, especially if compaction is also done. Examples: Lisp,
capability systems.
How does garbage collection work?
- Must be able to find all objects.
- Must be able to find all pointers to objects.
- Pass 1: mark. Go through all pointers that are known to be in use:
local variables, global variables. Mark each object pointed to, and recursively
mark all objects it points to.
- Pass 2: sweep. Go through all objects, free up those that are not
marked.
Garbage collection is often expensive: 20% or more of
all CPU time in systems that use it.
Sharing Main Memory
Issues:
- Want to let several processes coexist in main memory.
- No process should need to be aware of the fact that memory is
shared. Each must run regardless of the number and/or locations of processes.
- Processes must not be able to corrupt each other.
- Efficiency (both of CPU and memory) should not be degraded badly
by sharing. After all, the purpose of sharing is to increase overall efficiency.
Relocation: draw a simple picture of memory with some
processes in it.
- Because several processes share memory, we cannot predict in
advance where a process will be loaded in memory. This is similar to a
compiler's inability to predict where a subroutine will be after linking.
- Relocation adjusts a program to run in a different area of memory.
Linker is an example of static relocation used to combine modules into programs.
We now look at relocation techniques that allow several programs to share one
main memory.
Static software relocation, no protection:
- Lowest memory holds OS.
- Processes are allocated memory above the OS.
- When a process is loaded, relocate it so that it can run in its
allocated memory area (just like linker: linker combines several modules into
one program, OS loader combines several processes to fit into one memory; only
difference is that there are no cross-references between processes).
- Problem: any process can destroy any other process and/or the
operating system.
- Examples: early batch monitors where only one job ran at a time
and all it could do was wreck the OS, which would be rebooted by an operator.
Many of today's personal computers also operate in a similar fashion.
Static relocation with protection keys (IBM S/360
approach):
- Protection Key = a small integer stored with each chunk of memory.
The chunks are likely to be 1k-4k bytes.
- Keep an extra hardware register to identify the current process.
This is called the process id, or PID. 0 is reserved for the operating system's
process id.
- On every memory reference, check the PID of the current process
against the key of the memory chunk being accessed. PID 0 is allowed to touch
anything, but any other mismatch results in an error trap.
- Additional control: who is allowed to set the PID? How does OS
regain control once it has given it up?
- This is the scheme used for the IBM S/360 family. It is safe but
inconvenient:
- Programs have to be
relocated before loading. In some systems (e.g. MPS) this requires complete
relinking. Expensive.
- Cannot share information between two
processes very easily
- Cannot swap a process out to secondary
storage and bring it back to a different location
Dynamic memory relocation: instead of changing the
addresses of a program before it is loaded, we change the address dynamically
during every reference.
- Under dynamic relocation, each program-generated address (called a
logical or virtual address) is translated in hardware to a physical, or real
address. This happens as part of each memory reference.
- Show how dynamic relocation leads to two views of memory, called
address spaces. With static relocation we force the views to coincide so that
there can be several levels of mapping.
Base and Bounds, Segmentation
Base & bounds relocation:
- Two hardware registers: base address for process, bounds register
that indicates the last valid address the process may generate.
Each process must be allocated contiguously in real
memory.
- On each memory reference, the virtual address is compared to the
bounds register, then added to the base register. A bounds violation results in
an error trap.
- Each process appears to have a completely private memory of size
equal to the bounds register plus 1. Processes are protected from each other. No
address relocation is necessary when a process is loaded.
- Typically, the OS runs with relocation turned off, and there are
special instructions to branch to and from the OS while at the same time turning
relocation on and off. Modification of the base and bounds registers must also
be controlled.
- Base & bounds is cheap -- only 2 registers -- and fast -- the
add and compare can be done in parallel.
- Explain how swapping can work.
- Examples: CRAY-1.
Problem with base&bound relocation:
- Only one segment. How can two processes share code while keeping
private data areas (e.g. shared editor)? Draw a picture to show that it cannot
be done safely with a single-segment scheme.
Multiple segments.
- Permit process to be split between several areas of memory. Each
area is called a segment and contains a collection of logically-related
information, e.g. code or data for a module.
- Use a separate base and bound for each segment, and also add a
protection bit (read/write).
- Each memory reference indicates a segment and offset in one or
more of three ways:
- Top bits of
address select segment, low bits the offset. This is the most common, and the
best.
- Or, segment is selected implicitly by the operation being
performed (e.g. code vs. data, stack vs. data).
- Or, segment is
selected by fields in the instruction (as in Intel x86 prefixes).
(The last two alternatives are kludges used for
machines with such small addresses that there is not room for both a segment
number and an offset)
Segment table holds the bases and bounds for all
the segments of a process.
Show memory mapping procedure, involving table
lookup + add + compare. Example: PDP-10 with high and low segments selected by
high-order address bit.
Segmentation example: 8-bit segment number, 16-bit
offset.
- Segment table (use above picture -- all numbers in hexadecimal):
- Code in segment 0 (addresses are virtual):
- 0x00242:mov 0x60100,%r1
- 0x00246:st %r1,0x30107
- 0x0024A:b 0x20360
- Code in segment 2:
- 0x20360:ld [%r1+2],%r2
- 0x20364:ld [%r2],%r3
- ...
- 0x203C0:ret
Advantage of segmentation: segments can be swapped
and assigned to storage independently.
Problems:
- External fragmentation: segments of many different sizes.
- Segments may be large, have to be allocated contiguously.
- (These problems also apply to base and bound schemes)
Example: in PDP-10's when a segment gets larger, it
may have to be shuffled to make room. If things get really bad it may be
necessary to compact memory.
Paging
Goal is to make allocation and swapping easier, and
to reduce memory fragmentation.
- Make all chunks of memory the same size, call them pages. Typical
sizes range from 512-8k bytes.
- For each process, a page table defines the base address of each of
that process' pages along with read/only and existence bits.
- Page number always comes directly from the address. Since page
size is a power of two, no comparison or addition is necessary. Just do table
lookup and bit substitution.
- Easy to allocate: keep a free list of available pages and grab the
first one. Easy to swap since everything is the same size, which is usually the
same size as disk blocks to and from which pages are swapped.
- Problems:
- Internal
fragmentation: page size does not match up with information size. The larger the
page, the worse this is.
- Table space: if pages are small, the table
space could be substantial. In fact, this is a problem even for normal page
sizes: consider a 32-bit address space with 1k pages. What if the whole table
has to be present at once? Partial solution: keep base and bounds for page
table, so only large processes have to have large tables.
- Efficiency of access: it may take one overhead reference for every
real memory reference (page table is so big it has to be kept in memory).
Two-Level (Multi-Level) Paging
Use two levels of mapping to make tables
manageable.
Segmentation and Paging
Use two levels of mapping, with logical sizes for
objects, to make tables manageable.
- Each segment contains one or more pages.
- Segment correspond to logical units: code, data, stack. Segments
vary in size and are often large. Pages are for the use of the OS; they are
fixed size to make it easy to manage memory.
- Going from paging to P+S is like going from single segment to
multiple segments, except at a higher level. Instead of having a single page
table, have many page tables with a base and bound for each. Call the material
associated with each page table a segment.
System 370 example: 24-bit virtual address space, 4
bits of segment number, 8 bits of page number, and 12 bits of offset. Segment
table contains real address of page table along with the length of the page
table (a sort of bounds register for the segment). Page table entries are only
12 bits, real addresses are 24 bits.
- If a segment is not used, then there is no need to even have a
page table for it.
- Can share at two levels: single page, or single segment (whole
page table).
Pages eliminate external fragmentation, and make it
possible for segments to grow without any reshuffling.
If page size is small compared to most segments,
then internal fragmentation is not too bad.
The user is not given access to the paging tables.
If translation tables are kept in main memory,
overheads could be very high: 1 or 2 overhead references for every real
reference.
Another example: VAX.
- Address is 32 bits, top two select segment. Three base-bound
pairs define page tables (system, P0, P1).
- Pages are 512 bytes long.
- Read-write protection information is contained in the page table
entries, not in the segment table.
- One segment contains operating system stuff, two contain stuff
of current user process.
- Potential problem: page tables can get big. Do not want to have
to allocate them contiguously, especially for large user processes. Solution:
- System base-bounds pairs are physical
addresses, system tables must be contiguous.
- User base-bounds pairs
are virtual addresses in the system space. This allows the user page tables to
be scattered in non-contiguous pages of physical memory.
- The result
is a two-level scheme.
In current systems, you will see three and even
four-level schemes to handle 64-bit address spaces.
Translation Buffers and Inverted Page Tables
Problem with segmentation and paging: extra memory
references to access translation tables can slow programs down by a factor of
two or three. Too many entries in translation tables to keep them all loaded in
fast processor memory.
We will re-introduce fundamental concept of locality:
at any given time a process is only using a few pages or segments.
Translation Lookaside Buffer
Solution: Translation Lookaside Buffer (TLB). A
translation buffer is used to store a few of the translation table entries. It
is very fast, but only remembers a small number of entries. On each memory
reference:
- First ask TLB if it knows about the page. If so, the reference
proceeds fast.
- If TLB has no info for page, must go through page and segment
tables to get info. Reference takes a long time, but give the info for this page
to TLB so it will know it for next reference (TLB must forget one of its current
entries in order to record new one).
TLB Organization: Show picture of black box.
Virtual page number goes in, physical page location comes out. Similar to a
cache, usually direct mapped.
TLB is just a memory with some comparators. Typical
size of memory: 128 entries. Each entry holds a virtual page number and the
corresponding physical page number. How can memory be organized to find an entry
quickly?
- One possibility: search whole table from start on every
reference.
- A better possibility: restrict the info for any given virtual
page to fall in exactly one location in the memory. Then only need to check that
one location. E.g. use the low-order bits of the virtual page number as the
index into the memory. This is the way real TLB's work.
Disadvantage of TLB scheme: if two pages use the
same entry of the memory, only one of them can be remembered at once. If process
is referencing both pages at same time, TLB does not work very well.
Example: TLB with 64 (100 octal) slots. Suppose the
following virtual pages are referenced (octal): 621, 2145, 621, 2145, ... 321,
2145, 321, 621.
TLBs are a lot like hash tables except simpler
(must be to be implemented in hardware). Some hash functions are better than
others.
- Is it better to use low page number bits than high ones?
- Is there any way to improve on the TLB hashing function?
Another approach: let any given virtual page use
either of two slots in the TLB. Make memory wider, use two comparators to check
both slots at once.
- This is as fast as the simple scheme, but a bit more expensive
(two comparators instead of one, also have to decide which old entry to replace
when bringing in a new entry).
- Advantage: less likely that there will be conflicts that degrade
performance (takes three pages falling in the same place, instead of two).
- Explain names:
- Direct
mapped.
- Set associative.
- Fully associative.
Must be careful to flush TLB during each context
swap. Why?
In practice, TLB's have been extremely successful
with 95% or great hit rates for relatively small sizes.
Inverted Page Tables
As address spaces have grown to 64 bits, the side of
traditional page tables becomes a problem. Even with two-level (or even three or
four!) page tables, the tables themselves can become too large.
A solution (used on the IBM Power4 and others) to
this problem has two parts:
- A physical page table instead of a logical one. The physical page
table is often called an inverted page table. This table contains one entry per
page frame. An inverted page table is very good at mapping from physical page to
logical page number (as is done by the operating system during a page fault),
but not very good at mapping from virtual page number to physical page number
(as is done on every memory reference by the processor).
- A TLB fixes the above problem. Since there is no other hardware or
registers dedicated to memory mapping, the TLB can be quite a bit larger so that
missing-entry faults are rare.
With an inverted page table, most address
translations are handled by the TLB. When there is a miss in the TLB, the
operating is notified (via an interrupt) and TLB miss-handler is invoked.
Shadow Tables
The operating system can sometimes be thought of as
an extension of the abstractions provided by the hardware. However, when the
table format is defined by the hardware (such as for a page table entry), you
cannot change that format. So, what do you do if you wanted to store additional
information, such as last reference time or sharing pointer, in each entry?
The most common solution is a technique that is
sometimes called a shadow table. The idea of a shadow is simple (and familiar to
Fortran programmers!):
- Consider the hardware defined data structure as an array.
- For the new information that you want to add, define a new
(shadow) array.
- There is one entry in the shadow array for each entry in the
hardware array.
- For each new item you want to add to the data structure, you add
a new data member to the shadow array.
For example, consider the hardware defined page
table to be an array of structures:
struct Page_Entry {
unsigned PageFrame_hi : 10; // 42-bit page frame number
unsigned PageFrame_mid : 16;
unsigned PageFrame_low : 16;
unsigned UserRead : 1;
unsigned UserWrite : 1;
unsigned KernelRead : 1;
unsigned KernelWrite : 1;
unsigned Reference : 1;
unsigned Dirty : 1;
unsigned Valid : 1;
}
struct Page_Entry pageTable[TABLESIZE];
If you wanted to added a couple of data members,
you cannot simply change it to the following:
struct Page_Entry {
unsigned PageFrame_hi : 10;
unsigned PageFrame_mid : 16;
unsigned PageFrame_low : 16;
unsigned UserRead : 1;
unsigned UserWrite : 1;
unsigned KernelRead : 1;
unsigned KernelWrite : 1;
unsigned Reference : 1;
unsigned Dirty : 1;
unsigned Valid : 1;
Time_t lastRefTime;
PageList *shared;
}
Instead, you would define a a second array based on
this type:
struct Page_Entry { struct PE_Shadow {
unsigned PageFrame_hi : 10; Time_t lastRefTime;
unsigned PageFrame_mid : 16; PageList *shared;
unsigned PageFrame_low : 16; }
unsigned UserRead : 1;
unsigned UserWrite : 1;
unsigned KernelRead : 1;
unsigned KernelWrite : 1;
unsigned Reference : 1;
unsigned Dirty : 1;
unsigned Valid : 1;
}
struct Page_Entry pageTable[TABLESIZE];
struct PE_Shadow pageShadow[TABLESIZE];
Virtual Memory, Page Faults
Problem: how does the operating system get
information from user memory? E.g. I/O buffers, parameter blocks. Note that the
user passes the OS a virtual address.
- In some cases the OS just runs unmapped. Then all it has to do is
read the tables and translate user addresses in software. However, addresses
that are contiguous in the virtual address space may not be contiguous
physically. Thus I/O operations may have to be split up into multiple blocks.
Draw an example.
- Suppose the operating system also runs mapped. Then it must
generate a page table entry for the user area. Some machines provide special
instructions to get at user stuff. Note that under no circumstances should users
be given access to mapping tables.
- A few machines, most notably the VAX, make both system information
and user information visible at once (but the user cannot touch system stuff
unless the program is running with special kernel protection bit set). This
makes life easy for the kernel, although it does not solve the I/O problem.
So far we have disentangled the programmer's view of
memory from the system's view using a mapping mechanism. Each sees a different
organization. This makes it easier for the OS to shuffle users around and
simplifies memory sharing between users.
However, until now a user process had to be
completely loaded into memory before it could run. This is wasteful since a
process only needs a small amount of its total memory at any one time
(locality). Virtual memory permits a process to run with only some of its
virtual address space loaded into physical memory.
The Memory Hierarchy
The idea is to produce the illusion of a memory
with the size of the disk and the speed of main memory.
Data can be in registers (very fast), caches
(fast), main memory (not so fast, or disk (slow). Keep the things that you use
frequently as close to you (and as fast to access) as possible.
The reason that this works is that most programs
spend most of their time in only a small piece of the code. Give Knuth's
estimate of 90% of the time in 10% of the code. Introduce again the principle of
locality.
Page Faults
If not all of process is loaded when it is running,
what happens when it references a byte that is only in the backing store?
Hardware and software cooperate to make things work anyway.
- First, extend the page tables with an extra bit "present". If
present is not set then a reference to the page results in a trap. This trap is
given a special name, page fault.
- Any page not in main memory right now has the "present" bit
cleared in its page table entry.
- When page fault occurs:
- Operating system brings page into memory.
- Page
table is updated, "present" bit is set.
- The process is
continued.
Continuing process is very tricky, since it may
have been aborted in the middle of an instruction. Do not want user process to
be aware that the page fault even happened.
- Can the instruction just be skipped?
- Suppose the instruction is restarted from the beginning. How is
the "beginning" located?
- Even if the beginning is found, what about instructions with
side effects, like:
ld [%r2], %r2
- Without additional information from the hardware, it may be
impossible to restart a process after a page fault. Machines that permit
restarting must have hardware support to keep track of all the side effects so
that they can be undone before restarting.
- Forest Baskett's approach for the old Zilog Z8000 (two
processors, one just for handling page faults)
- IBM 370 solution (execute long instructions twice).
- If you think about this when designing the instruction set, it
is not too hard to make a machine virtualizable. It is much harder to do after
the fact. VAX is example of doing it right.
Effective Access Time Calculation
We can calculate the estimated cost of page faults
by performing an effective access time calculation. The basic idea is that
sometimes you access a location quickly (there is no page fault) and sometimes
more slowly (you have to wait for a page to come into memory). We use the cost
of each type of access and the percentage of time that it occurs to compute the
average time to access a word.
Let:
- h = fraction of time that a reference does not require a page
fault.
- tmem = time it takes to read a word from memory.
- tdisk = time it takes to read a page from disk.
then
- EAT = h * tmem + (1 - h) * tdisk.
If there a multiple classes of memory accesses,
such as no disk access, one disk access, and two disk access, then you would
have a fraction (h) and access time (t) for each class of access.
Note that this calculation is the same type that
computer architects use to calculate memory performance. In that case, their
access classes might be (1) cached in L1, (2) cached in L2, and (3) RAM.
Page Selection and Replacement
Once the hardware has provided basic capabilities
for virtual memory, the OS must make two kinds of scheduling decisions:
- Page selection: when to bring pages into memory.
- Page replacement: which page(s) should be thrown out, and when.
Page selection Algorithms:
- Demand paging: start up process with no pages loaded, load a
page when a page fault for it occurs, i.e. until it absolutely MUST be in
memory. Almost all paging systems are like this.
- Request paging: let user say which pages are needed. The trouble
is, users do not always know best, and are not always impartial. They will
overestimate needs.
- Prepaging: bring a page into memory before it is referenced
(e.g. when one page is referenced, bring in the next one, just in case). Hard to
do effectively without a prophet, may spend a lot of time doing wasted work.
Page Replacement Algorithms:
- Random: pick any page at random (works surprisingly well!).
- FIFO: throw out the page that has been in memory the longest.
The idea is to be fair, give all pages equal residency.
- MIN: naturally, the best algorithm arises if we can predict the
future.
- LFU: use the frequency of past references to predict the future.
- LRU: use the order of past references to predict the future.
Example: Try the reference string A B C A B D A D B
C B, assume there are three page frames of physical memory. Show the memory
allocation state after each memory reference.
Note that MIN is optimal (cannot be beaten), but
that the principle of locality states that past behavior predicts future
behavior, thus LRU should do just about as well.
Implementing LRU: need some form of hardware
support, in order to keep track of which pages have been used recently.
- Perfect LRU? Keep a register for each page, and store the system
clock into that register on each memory reference. To replace a page, scan
through all of them to find the one with the oldest clock. This is expensive if
there are a lot of memory pages.
- In practice, nobody implements perfect LRU. Instead, we settle
for an approximation which is efficient. Just find an old page, not necessarily
the oldest. LRU is just an approximation anyway (why not approximate a little
more?).
Clock Algorithm, Thrashing
This is an efficient way to approximate LRU.
Clock algorithm: keep "use" bit for each page frame,
hardware sets the appropriate bit on every memory reference. The operating
system clears the bits from time to time in order to figure out how often pages
are being referenced. Introduce clock algorithm where to find a page to throw
out the OS circulates through the physical frames clearing use bits until one is
found that is zero. Use that one. Show clock analogy.
Fancier algorithm: give pages a second (third?
fourth?) chance. Store (in software) a counter for each page frame, and
increment the counter if use bit is zero. Only throw the page out if the counter
passes a certain limit value. Limit = 0 corresponds to the previous case. What
happens when limit is small? large?
Some systems also use a "dirty" bit to give
preference to dirty pages. This is because it is more expensive to throw out
dirty pages: clean ones need not be written to disk.
What does it mean if the clock hand is sweeping very
slowly?
What does it mean if the clock hand is sweeping very
fast?
If all pages from all processes are lumped together
by the replacement algorithm, then it is said to be a global replacement
algorithm. Under this scheme, each process competes with all of the other
processes for page frames. A per process replacement algorithm allocates page
frames to individual processes: a page fault in one process can only replace one
of that process' frames. This relieves interference from other processes. A per
job replacement algorithm has a similar effect (e.g. if you run vi it may cause
your shell to lose pages, but will not affect other users). In per-process and
per-job allocation, the allocations may change, but only slowly.
Thrashing: consider what happens when memory gets
overcommitted.
- Suppose there are many users, and that between them their
processes are making frequent references to 50 pages, but memory has 40 pages.
- Each time one page is brought in, another page, whose contents
will soon be referenced, is thrown out.
- Compute average memory access time.
- The system will spend all of its time reading and writing pages.
It will be working very hard but not getting anything done.
- Thrashing was a severe problem in early demand paging systems.
Thrashing occurs because the system does not know
when it has taken on more work than it can handle. LRU mechanisms order pages in
terms of last access, but do not give absolute numbers indicating pages that
must not be thrown out.
What can be done?
- If a single process is too large for memory, there is nothing the
OS can do. That process will simply thrash.
- If the problem arises because of the sum of several processes:
- Figure out how much memory each
process needs.
- Change scheduling priorities to run processes in
groups whose memory needs can be satisfied.
Working Sets
Working Sets are a solution proposed by Peter
Denning. An informal definition is "the collection of pages that a process is
working with, and which must thus be resident if the process is to avoid
thrashing." The idea is to use the recent needs of a process to predict its
future needs.
- Choose tau, the working set parameter. At any given time, all
pages referenced by a process in its last tau seconds of execution are
considered to comprise its working set.
- A process will never be executed unless its working set is
resident in main memory. Pages outside the working set may be discarded at any
time.
Working sets are not enough by themselves to make
sure memory does not get overcommitted. We must also introduce the idea of a
balance set:
- If the sum of the working sets of all runnable processes is
greater than the size of memory, then refuse to run some of the processes (for a
while).
- Divide runnable processes up into two groups: active and inactive.
When a process is made active its working set is loaded, when it is made
inactive its working set is allowed to migrate back to disk. The collection of
active processes is called the balance set.
- Some algorithm must be provided for moving processes into and out
of the balance set. What happens if the balance set changes too frequently?
As working sets change, corresponding changes will
have to be made in the balance set.
Problem with the working set: must constantly be
updating working set information.
- One of the initial plans was to store some sort of a capacitor
with each memory page. The capacitor would be charged on each reference, then
would discharge slowly if the page was not referenced. Tau would be determined
by the size of the capacitor. This was not actually implemented. One problem is
that we want separate working sets for each process, so the capacitor should
only be allowed to discharge when a particular process executes. What if a page
is shared?
- Actual solution: take advantage of use bits
- OS maintains idle time value for each page: amount of CPU
time received by process since last access to page.
- Every once in a
while, scan all pages of a process. For each use bit on, clear page's idle time.
For use bit off, add process' CPU time (since last scan) to idle time. Turn all
use bits off during scan.
- Scans happen on order of every few
seconds (in Unix, tau is on the order of a minute or more).
Other questions about working sets and memory
management in general:
- What should tau be?
- What if it is too large?
- What if it is too
small?
- What algorithms should be used to determine which processes are in
the balance set?
- How do we compute working sets if pages are shared?
- How much memory is needed in order to keep the CPU busy? Note than
under working set methods the CPU may occasionally sit idle even though there
are runnable processes.