5.1 Introduction
- We aim to introduce a way of creating the illusion of a large amount of fast memory using a much smaller fast memory
- This is based on the observation that a program does not access all of its code or data at once with equal probability
- This is known as the principle of locality
- Programs access a relatively small portion of their address space at any instant of time
- Temporal Locality: Locality in time
- Something that has been referenced will probably be referenced again soon
- Spatial Locality: Locality in space
- If something is accessed, there is a high chance that things near it will be accessed soon as well
- Loops show high temporal locality: the variables used in the loop will show up again in the next iteration
- Arrays show high spatial locality, since array elements are stored sequentially in memory
- Locality is taken advantage of by implementing a memory hierarchy
- Faster memory is stored closer to the processor
- Upper levels of memory are faster; lower levels are slower
- Data is also hierarchical: a level closer to the processor is generally a subset of any level further away
- Data is only copied between two adjacent levels at a time, regardless of how many levels there are
- The minimum unit of information that can be present in (or absent from) a level of the hierarchy is known as a block or a line
- Looking at two adjacent levels of memory, a hit is when the data is found in the upper level; conversely, a miss is when the upper level does not contain the data
- On a miss, the lower level is accessed to retrieve the data
- Hit rate: % of memory accesses found in upper memory
- Miss rate: 1-hit rate
- Hit time: Time required to access upper level of the memory hierarchy (time required to determine hit or miss)
- Miss Penalty: The amount of time to replace a block in the upper level with the corresponding block from the lower level
- Most of the miss penalty comes from accessing lower memory
5.2 Memory Technologies
- Four primary technologies for memory
- DRAM (Dynamic Random Access Memory): Main Memory
- SRAM (Static Random Access Memory): For caches
- Cost per bit: $$ SRAM \gg DRAM $$, because DRAM takes less area per bit of memory, giving a larger memory for the same amount of silicon
- Flash Memory: Non volatile memory
- Magnetic Disk: ………
- SRAM Technology
- ICs that are memory arrays with (usually) a single access port that can provide either a read or a write
- Access time is very close to cycle time
- No refresh time (uses transistors not capacitors)
- SRAM is integrated directly onto the processor
- 6-8 transistors per bit
- DRAM Technology
- Built with capacitors and thus must be refreshed (to keep the charge on the capacitor). Accessed with 1 transistor per bit
- Refresh is performed by reading the contents and writing them back
- Charge only lasts several milliseconds
- Refresh is performed by row to save time
- DRAM buffers rows (like SRAM) to improve access times
- SDRAM (Synchronous DRAM) adds a clock, which eliminates the time the memory and processor need to synchronize and removes the need for additional address bits to access successive bits in a row one at a time
- The clock transfers the successive bits in a burst
- DDR (Double Data Rate) SDRAM is the fastest variant
- It transfers twice the bandwidth you would expect from the clock rate and data width
- Data transfer happens on both rise and fall of the clock
- Can read/write from multiple banks at once, using one address
- Known as address interleaving
- Flash Memory
- Disk Memory
5.3 The Basics of Caches
- Cache: Extra level of hierarchy between processor and main memory
- If a value is not in the cache when the CPU asks for it, it is retrieved from memory and then stored in the cache
- The cache is always smaller than main memory, so we need a scheme to map the larger memory onto the smaller one
- Direct Mapped Cache: The simplest type of cache
- Each memory location is mapped directly to exactly one location in the cache
- Almost all direct mapped caches use the following mapping to find a block
- Block Idx = (Block address) modulo (Number of blocks in the cache)
- If the number of blocks is a power of two, the modulo can be computed by simply taking the low-order log2(number of blocks) bits of the block address
- Example using an 8 block cache
- The 3 lowest bits of the block address are used as the index, as sketched below
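- A minimal sketch of this index calculation in C (the 8-block cache size and the example block addresses are assumptions taken from the example above):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 8u  /* assumed 8-block direct-mapped cache */

/* Block address -> cache index. Because NUM_BLOCKS is a power of two,
 * "modulo" reduces to keeping the low-order log2(NUM_BLOCKS) = 3 bits. */
static uint32_t cache_index(uint32_t block_addr) {
    return block_addr % NUM_BLOCKS;   /* same as block_addr & (NUM_BLOCKS - 1) */
}

int main(void) {
    /* Block addresses 1, 9, 17, ... all map to index 1 in an 8-block cache. */
    printf("%u %u %u\n", cache_index(1), cache_index(9), cache_index(17));
    return 0;
}
```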
- But how do we know whether a requested word is in the cache or not?
- This is done by adding tags to the cache
- Tags contain address information required to identify whether a word in cache corresponds to the requested word
- The tag is the upper portion of the address, i.e., the bits that are not used to index into the cache
- In the example above, only the upper 2 bits of the 5-bit block address are needed as the tag, since the bottom three are used to index into the cache
- Note: I believe the tag bits should always be chosen last, after the index (modulo) bits have been determined
- Marking cache blocks as to whether they hold valid information
- E.g., on startup (or after only a few instructions) not all cache entries have been filled, yet they still contain bits that could be misinterpreted as valid data
- Thus we add a valid bit to mark whether the cache entry holds a valid block
- Accessing a Cache
- Below is an example of an 8-entry cache
- Just as a reminder: the low-order bits select the entry in the cache and the high bits form the tag used to check whether the cached data is the requested data
- In MIPS, addresses are aligned to multiples of four bytes, so the least significant two bits of every address specify a byte within a word
- For a 32-bit address and a direct-mapped cache with 2^n blocks and a block size of 2^m words
- We use n bits for the index, m bits for the word within the block, and 2 bits for the byte within the word (so the byte offset within a block is m + 2 bits)
- This gives us the size of the tag field as the following size
- 32-(n+m+2)
- The total number of bits in a direct mapped cache is
- 2^n * (block size + tag size + valid field size)
- Accounting for the valid bit
- 2^n * (2^m * 32 + (32-n-m-2) +1) = total number of bits
- Note the 32 accounts for the fact that there are 32 bits in a word
- The convention is to exclude the tag and valid field and quote only the data size as the size of the cache
- This gives the example cache (1024 one-word blocks) a quoted size of 1024 × 32 / 8 = 4096 bytes; the full bit count is worked out below
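- Plugging the example into the formula above (the example's 1024 one-word blocks give n = 10 and m = 0), the true total is

$$2^{10} \times \left(2^{0} \times 32 + (32 - 10 - 0 - 2) + 1\right) = 1024 \times (32 + 20 + 1) = 54{,}272 \text{ bits}$$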
- Increasing the block size will decrease the miss rate, up to a point
- This exploits spatial locality, since you are loading data near the requested address
- Once a critical point is reached, the number of blocks in the cache becomes too small and the miss rate increases again
- Cache Miss Penalty
- Latency to the first word
- Transfer time for the rest of the block
- Handling Cache Misses
- The control unit needs to detect the miss and then fetch the requested data from lower-level memory
- Cache misses differ from interrupts: an interrupt only needs to save the register values when performing a context switch
- In the case of a cache miss, the entire processor must be stalled while the read from lower-level memory is performed and the result is stored in the cache
- Only afterwards is the processor allowed to continue execution
- Instruction and data cache misses can be handled almost identically
- Send the original PC value (current PC − 4 for MIPS, − 2 for LC3) to the memory
- The reason we need to do this is that the fetch1 state automatically increments the PC; to perform step 4, where processor execution resumes, we need to continue as if the fetch1 state of the missed instruction never happened
- Instruct main memory to perform a read and wait for the memory to complete its access
- Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on
- Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache
- Handling Writes
- The slow Write-Through Scheme
- On a store instruction, the value is updated in the cache and then immediately written to main memory as well, using the full address
- This is slow because we go to main memory on 100% of stores
- If 10% of a program's instructions are stores, the CPI on a cache hit is 1.0, and writing to memory takes 100 cycles, the effective CPI becomes
- 1.0 + 10% × 100 = 11 cycles per instruction
- One solution is a write buffer
- The buffer holds the data waiting to be written to memory; the processor writes into the cache and into the buffer and then continues execution
- The buffer is drained to main memory in the background; once a buffered write completes, its entry is freed, and both the cache and memory then reflect the current state of that location
- However, even if the rate of writes is less than the rate at which memory can accept them, processor stalls can still occur
- This happens when writes come in bursts that fill the buffer, forcing the processor to wait and introducing large amounts of delay
- The better Write-Back Scheme
- When a write occurs it is only written to the highest level cache
- The only time in which a memory write to a lower level memory needs to occur is if the entry in the cache needs to be replaced
- Write buffers can also be used in a write-back scheme
- An Example Cache: The Intrinsity FastMATH Processor
- Summary
5.4 Measuring and Improving Cache Performance
- Calculating Stall Cycles
- CPU time = (CPU execution clock cycles + Memory-stall clock cycles) * Clock cycle time
- These are the two most significant contributors to execution time; we also assume that cache hit time is included in the normal CPU execution cycles
- Memory-stall Cycles
- Total Memory-stall clock cycles = (Read stall cycles + Write stall cycles)
- Write-Through Stall Cycles
- There are two major contributors to stalls
- Write misses
- Buffer stalls
- Write-buffer stalls depend on both the proximity and the frequency of writes, making them hard to express as a simple function
- Conditions for ignoring write-buffer stalls
- Reasonable write buffer depth (4 or more words)
- Memory capable of accepting writes at least twice the average write frequency of programs
- Write-Back Stall Cycles
- Potential stalls occur during the replacement of a cache entry
- Under the assumption of negligible write-buffer stalls
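- With write-buffer stalls negligible and read and write miss penalties treated as the same, the stalls combine into the standard simplification

$$\text{Memory-stall clock cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$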
- Incorporating Hit Time Into the Equation
- So far we have been ignoring the time that a cache hit takes
- In reality this value is nonzero and can be nontrivial due to certain factors
- One thing that can increase the hit time is increasing the cache size
- It is possible for the hit time to grow to several cycles and even dominate the AMAT term if dealt with incorrectly
- Therefore we define a new metric, the average memory access time (AMAT)
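- The usual definition:

$$\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$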
- Fully Associative Cache
- Any block can be placed in any location in the cache
- There is no index into a fully associative cache, so the tag of the requested address must be compared against the tags of all entries in the cache to check for a hit
- Set Associative Cache
- In the middle of fully associative and direct map
- An n-way set associative cache has a number of sets, each containing n blocks
- Each block in memory maps to a unique set and can be placed in any of the n entries of that set
- This results in the set index being calculated as (Block address) modulo (Number of sets in the cache)
- This contrasts with the direct-mapped cache, where the index is (Block address) modulo (Number of blocks in the cache)
- The advantage of increasing associativity is that it usually decreases the miss rate
- Refer to page 405 for a good example of how misses and associativity in caches work
- Locating a Block in the Cache For Set Associative Caches
- Let an address be decomposed as follows
- Tag: Used to check every cache block within the appropriate set to see if it matches the block address from the processor
- Index: Select the correct set
- Block offset: Access data within block
- Increasing the associativity by a factor of 2 doubles the number of blocks per set and halves the number of sets
- Therefore the index shrinks by one bit and the tag grows by one bit
- An example of a 4-way set associative cache (a sketch of the address split follows below)
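- A minimal sketch of how an address splits into tag, index, and block offset (the 32-bit byte address, 4-way associativity, 256 sets, and 16-byte blocks are illustrative assumptions, not a specific textbook configuration):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 4-way set associative, 256 sets, 16-byte blocks
 * => 4 offset bits, 8 index bits, 20 tag bits of a 32-bit byte address. */
#define BLOCK_OFFSET_BITS 4u
#define INDEX_BITS        8u
#define NUM_SETS          (1u << INDEX_BITS)

static uint32_t block_offset(uint32_t addr) { return addr & ((1u << BLOCK_OFFSET_BITS) - 1); }
static uint32_t set_index(uint32_t addr)    { return (addr >> BLOCK_OFFSET_BITS) & (NUM_SETS - 1); }
static uint32_t tag(uint32_t addr)          { return addr >> (BLOCK_OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x12345678u;
    /* On a lookup, set_index() selects the set, the tag is compared against
     * all 4 ways in that set, and block_offset() picks the byte within the block. */
    printf("tag=0x%x index=%u offset=%u\n", tag(addr), set_index(addr), block_offset(addr));
    return 0;
}
```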
- Content Addressable Memory (CAM): Combines comparison and storage into a single device
- Instead of passing an address, the data (the tag) is supplied directly and the memory performs the comparison internally
- Used for high associativity caches
- Choosing Which Block to Replace
- LRU (least recently used) is the most commonly used replacement policy; it replaces the block that has gone unused for the longest time with the new one
- As associativity increases, this policy becomes harder to implement because more usage state must be tracked per set (a sketch of one approach follows below)
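- A minimal sketch of one way to track LRU within a set using age counters (the 4-way set size and the counter approach are my own illustrative assumptions; real hardware often uses approximations such as pseudo-LRU):

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 4  /* assumed associativity for this sketch */

typedef struct {
    uint32_t tag[WAYS];
    uint64_t last_used[WAYS];  /* one counter per way: this state grows with associativity */
    int      valid[WAYS];
} cache_set_t;

/* Record that `way` was touched at logical time `now`. */
void touch(cache_set_t *set, int way, uint64_t now) { set->last_used[way] = now; }

/* Pick the victim: an invalid way if one exists, otherwise the least recently used. */
int choose_victim(const cache_set_t *set) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set->valid[w]) return w;                        /* prefer an empty way */
        if (set->last_used[w] < set->last_used[victim]) victim = w;
    }
    return victim;
}

int main(void) {
    cache_set_t set = {0};
    for (int w = 0; w < WAYS; w++) { set.valid[w] = 1; touch(&set, w, (uint64_t)w + 1); }
    touch(&set, 0, 10);                                  /* way 0 becomes most recently used */
    printf("victim = way %d\n", choose_victim(&set));    /* prints way 1: least recently used */
    return 0;
}
```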
- Reducing the Miss Penalty Using Multilevel Caches
- The goal is to reduce miss penalty when one has a miss on the primary cache
- This is done by introducing a secondary cache, which acts as a second level before main memory
- Thus the primary cache focuses on minimizing hit time while the secondary cache focuses on minimizing misses
- The design of the primary and secondary cache will differ
- Primary: Smaller, lower associativity, and possibly smaller block size to decrease hit times
- Secondary: Larger, higher associativity, and possibly larger block sizes to decrease cache misses
- The miss penalty is now a function of misses in both the primary and the secondary cache, giving us the following
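- One common way to write it, using the secondary cache's local miss rate:

$$\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$$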
- Refer to page 410 for an example
- Software Optimization via Blocking
- The idea behind blocking is that if all the data a loop will touch fits completely in the cache, we no longer suffer cache misses on every iteration; data is only re-fetched when we move on to the next block
- This reduces the total memory access time
- An example of this is blocked matrix multiplication, i.e., working on sub-matrices one at a time instead of performing the complete multiplication in one pass (a C sketch follows at the end of this section)
- In the case of C=AB where A and B are 8×8 matrices and block_size=2
- A and C are accessed in block_size slivers and B is accessed in block_size x block_size blocks
- The products of A and B will accumulate into a sliver of C
- We will iterate through all the rows/columns of A and C before we throw out B and reload the L1 cache
- $C_{11} = A_{11}B_{11} + A_{12}B_{21}$
- $C_{12} = A_{11}B_{12} + A_{12}B_{22}$
- $C_{21} = A_{21}B_{11} + A_{22}B_{21}$
- $C_{22} = A_{21}B_{12} + A_{22}B_{22}$
- A: Enjoys spatial locality, since the next index of the sliver is loaded as the multiplication proceeds. A also enjoys temporal locality because the same row is used block_size times
- C: Enjoys spatial locality for the same reason as A
- B: Enjoys temporal locality: the same block_size x block_size block is accessed block_size^2 times in succession
- For blocked matrix multiplication, if the matrices are small enough to fit entirely in the cache, the ordinary method will be faster; once the matrices are large enough, the blocked version wins
- Block size also affects performance: if a block fits directly in the registers, the loads and stores in the program are minimized
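- A minimal C sketch of the blocked version (the matrix size N, BLOCK_SIZE, and row-major double arrays are illustrative assumptions matching the 8×8, block_size = 2 example, not the textbook's exact code):

```c
#include <stdio.h>

#define N 8            /* assumed matrix dimension for the example */
#define BLOCK_SIZE 2   /* assumed blocking factor */

/* C += A * B, all N x N row-major. The jj/kk loops walk block boundaries so
 * each BLOCK_SIZE x BLOCK_SIZE tile of B is reused for every row i while it
 * is still resident in the cache; A and C are streamed in row slivers. */
void dgemm_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int jj = 0; jj < N; jj += BLOCK_SIZE)
        for (int kk = 0; kk < N; kk += BLOCK_SIZE)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + BLOCK_SIZE; j++) {
                    double sum = C[i][j];
                    for (int k = kk; k < kk + BLOCK_SIZE; k++)
                        sum += A[i][k] * B[k][j];   /* reuse the resident B tile */
                    C[i][j] = sum;
                }
}

int main(void) {
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    dgemm_blocked(A, B, C);
    printf("C[0][0] = %.1f\n", C[0][0]);   /* 8.0 for all-ones 8x8 inputs */
    return 0;
}
```

- The jj/kk loop order here is one reasonable choice; the key property is only that the B tile stays resident while every row sliver of A and C streams past it, which is what reduces misses once the full matrices no longer fit in the cache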