Category: Computer Organization

Ch 5.1-5.4 (HP1)

5.1 Introduction

  • We aim to introduce a way of creating the illusion of a large amount of fast memory while using only a small amount of fast memory
  • This is based on the observation that a program does not access all of its code or data at once with equal probability
  • This is known as the principle of locality
    • Programs access a relatively small portion of their address space at any instant of time
    • Temporal Locality: Locality in time
      • Something referenced will probably be referenced soon again
    • Spatial Locality: Locality in space
      • If something is accessed there is a high chance things around it will be soon accessed as well
    • Loops show high temporal locality: the variables used in the loop will be used again on the next iteration
    • Arrays show high spatial locality because arrays are stored sequentially in memory
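    • As a quick illustration (a sketch of mine, not from the text), summing an array exhibits both kinds of locality:
         /* sum and i are reused on every iteration (temporal locality);   */
         /* a[i] walks sequentially through memory (spatial locality).     */
         long sum_array(const int *a, int n) {
             long sum = 0;
             for (int i = 0; i < n; i++)
                 sum += a[i];
             return sum;
         }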
  • Locality is taken advantage of by implementing a memory hierarchy
    • [Figure: the memory hierarchy]
    • Faster memory is stored closer to the processor
    • Upper levels of the hierarchy are faster; lower levels are slower
    • Data is also hierarchical: a level closer to the processor is generally a subset of any level further away
  • Data is copied only between two adjacent levels at a time, regardless of how many levels there are
  • The minimum unit of information that can be present in (or absent from) a level is known as a block or a line
  • Looking at two levels of memory, a hit is when the requested data is found in the upper level; conversely, a miss is when the upper level does not contain the data
    • On a miss lower memory is accessed to retrieve the data
    • Hit rate: % of memory accesses found in upper memory
    • Miss rate: 1-hit rate
    • Hit time: Time required to access upper level of the memory hierarchy (time required to determine hit or miss)
    • Miss Penalty: The amount of time to replace a block in the upper level with the corresponding block from the lower level
      • Most of the miss penalty comes from accessing lower memory

5.2 Memory Technologies

  • Four primary technologies for memory
    • DRAM (Dynamic Random Access Memory): Main Memory
    • SRAM (Static Random Access Memory): For caches
      • Cost per bit: SRAM >> DRAM, because DRAM takes less area per bit, so the same amount of silicon yields a larger memory
    • Flash Memory: Non volatile memory
    • Magnetic Disk: ………
  • SRAM Technology
    • ICs that are memory arrays with (usually) a single access port that can provide either a read or a write
    • Access time is very close to cycle time
      • No refresh time (uses transistors not capacitors)
    • SRAM is typically integrated directly on the processor chip (as caches)
    • 6-8 transistors per bit
  • DRAM Technology
    • Built with capacitors and thus needs to be refreshed (to keep the charge on the capacitor).  Accessed with 1 transistor per bit
    • Refresh is performed by reading contents and writing it back directly
      • Charge only lasts several milliseconds
      • Refresh is performed a row at a time to save time
    • DRAM buffers rows (like SRAM) to improve access times
    • S(ynchronous)DRAM adds a clock, eliminating the time the memory and processor need to synchronize and removing the need to supply successive addresses to transfer all the bits in a row sequentially
      • The clock transfers the successive bits in a burst
    • DDR SDRAM is the fastest; DDR stands for Double Data Rate
      • Transfers twice the bandwidth one would expect from the clock rate and data width
      • Data transfer happens on both rise and fall of the clock
      • Can read/write from multiple banks at once, using one address
      • Known as address interleaving
  • Flash Memory
  • Disk Memory

5.3 The Basics of Caches

  • Cache: Extra level of hierarchy between processor and main memory
  • If a value is not in the cache when the CPU requests it, it is retrieved from memory and then stored in the cache
  • The cache is always smaller than main memory, so we need a scheme to map the larger memory onto the smaller one
  • Direct Mapped Cache: The simplest type of cache
    • Each memory location is mapped directly to exactly one location in the cache
    • Almost all direct mapped caches use the following mapping to find a block
      • Block Idx = (Block address) modulo (Number of blocks in the cache)
      • If the number of blocks is a power of two, the modulo can be computed using the low-order log2(#blocks) bits of the block address
  • Example using an 8 block cache
    • 3 lowest bits are used as the modulo
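  • A minimal sketch (mine, not from the text) of the index computation, assuming the number of blocks is a power of two:
       /* Block index in a direct-mapped cache: keep the low-order
          log2(num_blocks) bits of the block address. */
       unsigned cache_index(unsigned block_addr, unsigned num_blocks) {
           return block_addr & (num_blocks - 1);   /* e.g. 8 blocks -> low 3 bits */
       }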

[Figure: mapping of memory addresses to an 8-block direct-mapped cache]

  • But how do we know whether a requested word is in the cache or not?
    • This is done by adding tags to the cache
    • Tags contain address information required to identify whether a word in cache corresponds to the requested word
    • The tag holds the upper bits of the address, i.e., the bits that were not used as an index into the cache
    • In the figure above, we only need 2 of the upper bits in the 5 bit memory address to use as a tag as the bottom three were used to index into the cache
    • Note: I believe the tags should always be chosen last after determining the modulo
  • Marking cache blocks as to whether they hold valid information
    • Eg. Upon startup, or after only a few instructions, not all cache entries have been filled, yet they still contain bits that could be misinterpreted as data
    • Thus we add a valid bit to mark whether a cache entry contains valid data
  • Accessing a Cache
    • Below is an example sequence of accesses to an 8-block cache
    • [Figures: contents of the 8-block cache after each access in the example sequence]
    • Just as a reminder, low order bits -> entry in cache and the high bits -> tag used to identify if the cached data is correct
    • In MIPS the addresses are aligned to multiples of four bytes thus the least significant two bits of every address specify a byte within a word
    • For a 32 bit address, direct mapped cache, 2^n blocks in the cache and block size is 2^m words
      • We use n bits for the index, m bits for the word within the block, and 2 bits for the byte within the word (so the byte offset within a block is m+2 bits)
      • This gives us the size of the tag field as the following size
        • 32-(n+m+2)
      • The total number of bits in a direct mapped cache is
        • 2^n * (block size + tag size + valid field size)
      • Accounting for the valid bit
        • 2^n * (2^m * 32 + (32-n-m-2) +1) = total number of bits
        • Note the 32 accounts for the fact that there are 32 bits in a word
      • [Figure: example direct-mapped cache with 1024 one-word blocks]
      • The convention is to exclude the size of the tag and valid field in the size of the cache
        • This makes the cache pictured in the example a 1024 * 32 / 8 = 4096 byte (4 KiB) cache
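      • A sketch (mine) that plugs in the formula above; for the pictured example (n = 10, m = 0) it gives a 20-bit tag and 2^10 * (32 + 20 + 1) = 54,272 total bits:
           /* Total bits in a direct-mapped cache with 2^n blocks of 2^m 32-bit words:
              data + tag + valid bit per block, per the formula in the notes. */
           unsigned long cache_total_bits(unsigned n, unsigned m) {
               unsigned long blocks    = 1UL << n;
               unsigned long data_bits = (1UL << m) * 32;
               unsigned long tag_bits  = 32 - n - m - 2;
               return blocks * (data_bits + tag_bits + 1);   /* n=10, m=0 -> 54,272 */
           }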
    • Increasing the block size will decrease the miss rate, up to a point
      • This exploits spatial locality as you are loading data near the specified address
      • Once a critical point is reached the number of entries in a cache becomes too small and miss rate will increase again
    • Cache Miss Penalty
      1. Latency to the first word
      2. Transfer time for the rest of the block
  • Handling Cache Misses
    • The control unit needs to detect the miss and then fetch the data from lower-level memory
    • Interrupts differ from cache misses because the interrupt only needs to save the register values when performing a context switch
    • In the case of a cache miss, the entire processor is stalled while lower-level memory is read and the cache is filled
      • Only afterwards is the processor allowed to continue execution
    • Instruction and data cache misses can almost be handled identically
      1. Send the original PC value (current PC − 4 for MIPS, current PC − 2 for LC-3) to the memory
        • We need to do this because the fetch1 state of the processor automatically increments the PC; to restart in step 4 and resume execution, we must continue as if the fetch1 state of the missed instruction had never happened
      2. Instruct main memory to perform a read and wait for the memory to complete its access
      3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on
      4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache
  • Handling Writes
    • The slow Write-Through Scheme
      • On a store instruction, the value inside the cache is updated and then immediately written back to main memory using the full address as well
      • This is slow because we do this 100% of the time
        • If a program has 10% stores, the CPI on a cache hit is 1.0, and writing to memory takes 100 cycles, the average number of cycles per instruction becomes
        • 1.0+10%*100=11 cycles
      • One solution is a write buffer
        • The data is written into the cache and into the buffer; the processor can then continue while the buffer writes the data to main memory in the background
        • When the write to main memory completes, the buffer entry is freed; if the buffer is full when a store occurs, the processor must stall until an entry frees up
      • However, even if the rate of writes is less than the rate at which memory can accept them, the processor can still stall
        • This happens when writes come in bursts: the buffer fills up and the processor must wait, introducing delay
    • The better Write-Back Scheme
      • When a write occurs it is only written to the highest level cache
      • A write to lower-level memory only needs to occur when a modified (dirty) block in the cache is replaced
      • Write buffers can also be used in a write-back scheme
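    • A rough sketch of my own contrasting the two schemes on a store hit (single-level cache; names and layout are hypothetical):
         #define WORDS_PER_BLOCK 4
         struct line { int valid, dirty; unsigned tag; unsigned data[WORDS_PER_BLOCK]; };

         void write_buffer_enqueue(unsigned addr, unsigned val);   /* hypothetical helper */

         /* Write-through: update the cache line and queue the write to main memory. */
         void store_write_through(struct line *l, unsigned addr, unsigned word, unsigned val) {
             l->data[word] = val;
             write_buffer_enqueue(addr, val);   /* memory updated in the background */
         }

         /* Write-back: update only the cache line and mark it dirty; main memory is
            written only when this dirty line is later evicted. */
         void store_write_back(struct line *l, unsigned word, unsigned val) {
             l->data[word] = val;
             l->dirty = 1;
         }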
  • An Example Cache: The Intrinsity FastMATH Processor
  • Summary

5.4 Measuring and Improving Cache Performance

  • Calculating Stall Cycles
    • CPU time = (CPU execution clock cycles + Memory-stall clock cycles) * Clock cycle time
      • These are the two most significant contributors to execution time; we also assume that cache access (hit) time is part of the normal CPU execution cycles
    • Memory-stall Cycles
      • Total Memory-stall clock cycles = (Read stall cycles + Write stall cycles)
      • Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
      • Write-Through Stall Cycles
        • There are two major contributors to stalls
          1. Write misses
          2. Buffer stalls
        • Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls
        • Write-buffer stalls depend on both the timing (proximity) and the frequency of writes, and are therefore hard to capture in a simple formula
        • Conditions for ignoring write-buffer stalls
          • Reasonable write buffer depth (4 or more words)
          • Memory capable of accepting writes at least twice the average write frequency of programs
      • Write-Back Stall Cycles
        • Potential stalls occur during the replacement of a cache entry
        • Under the assumption of negligible write-buffer stalls
        • Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
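      • For instance (numbers assumed for illustration): with an instruction-cache miss rate of 2%, a data-cache miss rate of 4%, a 100-cycle miss penalty, and loads/stores making up 36% of instructions, memory-stall cycles per instruction = 0.02 × 100 + 0.36 × 0.04 × 100 = 2 + 1.44 = 3.44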
    • Incorporating Hit Time Into the Equation
      • So far we have been ignoring the time that a cache hit takes
      • In reality this value is nonzero and can be nontrivial due to certain factors
        • One thing that can increase the hit time is increasing the cache size
        • It is possible for the hit time to grow to several cycles and even dominate AMAT if it is not managed carefully
      • Therefore we define a new average memory access time (AMAT)
        • AMAT = Time for a hit + Miss rate × Miss penalty
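        • For instance (assumed numbers): with a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, AMAT = 1 + 0.05 × 20 = 2 cycles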
    • Fully Associative Cache
      • Any block can be placed in any location in the cache
      • There is no index into a fully associative cache, so the tags of all entries must be searched (in parallel, in practice) to check for a hit
    • Set Associative Cache
      • A middle ground between direct mapped and fully associative
      • An n-way associative cache has a number of sets each with n blocks
      • Each block in memory maps to a unique set and can be placed in any of the n blocks of that set
      • This results in the calculation of the set value being
        • Set = (Block address) modulo (Number of sets in the cache)
      • This contrasts with a direct-mapped cache, where the index is (Block address) modulo (Number of blocks in the cache)
      • The advantage of increasing associativity is that it usually decreases the miss rate
      • Refer to page 405 for a good example of how misses and associativity in caches work
  • Locating a Block in the Cache For Set Associative Caches
    • Let an address be decomposed as follows
      • [ Tag | Index | Block offset ]
      • Tag: Used to check every cache block within the appropriate set to see if it matches the block address from the processor
      • Index: Select the correct set
      • Block offset: Access data within block
    • Increasing the set associativity by a factor of 2 doubles the number of blocks per set and halves the number of sets
    • Therefore the index shrinks by 1 bit and the tag grows by 1 bit (see the lookup sketch after this list)
    • An example of a 4 way associative cache
    • [Figure: a four-way set-associative cache]
    • Content Addressable Memory (CAM): Combines comparison and storage into a single device
      • Instead of passing the address pass the data directly and perform a comparison using that
      • Used for high associativity caches
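    • Below is a sketch of my own of a set-associative lookup that ties the three fields together; the sizes and struct layout are assumed, not from the text:
         #include <stdint.h>

         #define WAYS        4            /* associativity (assumed)       */
         #define SETS        128          /* number of sets (assumed)      */
         #define BLOCK_BYTES 16           /* block size in bytes (assumed) */

         struct way { int valid; uint32_t tag; uint8_t data[BLOCK_BYTES]; };
         struct set { struct way ways[WAYS]; };

         /* Returns a pointer to the requested byte on a hit, 0 on a miss. */
         uint8_t *cache_lookup(struct set cache[SETS], uint32_t addr) {
             uint32_t offset = addr % BLOCK_BYTES;           /* block offset      */
             uint32_t index  = (addr / BLOCK_BYTES) % SETS;  /* selects the set   */
             uint32_t tag    = addr / BLOCK_BYTES / SETS;    /* remaining bits    */
             for (int w = 0; w < WAYS; w++) {                /* search every way  */
                 struct way *cand = &cache[index].ways[w];
                 if (cand->valid && cand->tag == tag)
                     return &cand->data[offset];             /* hit               */
             }
             return 0;                                       /* miss              */
         }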
  • Choosing Which Block to Replace
    • LRU (least recently used) is the most commonly used replacement policy; it replaces the block in the set that has gone unused for the longest time
    • As associativity increases this algorithm becomes harder to implement
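    • A minimal sketch (mine) of LRU for a two-way set, where a single bit per set is enough to track which way to evict:
         struct set2 { int lru_way; };        /* index of the least recently used way */

         void touch(struct set2 *s, int way_used) {
             s->lru_way = 1 - way_used;       /* the other way is now least recent */
         }

         int victim(const struct set2 *s) {
             return s->lru_way;               /* way to replace on a miss */
         }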
  • Reducing the Miss Penalty Using Multilevel Caches
    • The goal is to reduce miss penalty when one has a miss on the primary cache
    • This is done by introducing a secondary cache which acts as a second level before main memory
    • Thus the primary cache focuses on hit times while the secondary cache focuses on cache misses
      • The design of the primary and secondary cache will differ
        • Primary: Smaller, lower associativity, and possibly smaller block size to decrease hit times
        • Secondary: Larger, higher associativity, and possibly larger block sizes to decrease cache misses
    • The miss penalty is now a function of the miss in the primary and secondary cache giving us the following
      • Miss penalty for a primary miss ≈ L2 access time + L2 (local) miss rate × Main-memory access time
      • Refer to page 410 for an example
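      • For instance (numbers assumed for illustration, in the spirit of the page-410 example): base CPI 1.0 at 4 GHz (0.25 ns/cycle), 100 ns main memory (400 cycles), 2% primary misses per instruction, a 5 ns (20-cycle) L2, and 0.5% of instructions going all the way to main memory gives total CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4, versus 1 + 0.02 × 400 = 9.0 with only one cache level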
  • Software Optimization via Blocking
    • The idea behind blocking is that if all the data a loop uses fits completely in the cache, we avoid cache misses on every iteration until the entire block of data must be replaced
    • This will reduce the amount of memory access time
    • An example of this is blocked matrix multiplication, i.e., working on submatrices one at a time instead of performing the complete matrix multiplication (see the sketch after this list)
    • In the case of C=AB where A and B are 8×8 matrices and block_size=2
      • A and C are accessed in block_size slivers and B is accessed in block_size x block_size blocks
      • The products of A and B will accumulate into a sliver of C
      • We will iterate through all the rows/columns of A and C before we throw out B and reload the L1 cache
      • C11=A11B11+A12B21
      • C12=A11B12+A12B22
      • C21=A21B11+A22B21
      • C22=A21B12+A22B22
      • A: Enjoys spatial locality, as we load the next index of the sliver with each multiplication.  A also enjoys temporal locality because the same row is used block_size times
      • C: Enjoys spatial locality for the same reason as A
      • B: Enjoys temporal locality; the same block_size x block_size block is accessed block_size^2 times in succession
      • If the matrices being multiplied are small enough to fit in the cache, the normal method is faster; once the matrices are big enough, the blocked version wins
    • Block size also affects performance: if a block fits directly in registers, the loads and stores in the program are minimized
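    • Below is a sketch of my own (not the book's exact code) of a blocked matrix multiply for N x N matrices, following the loop structure described above; N and BS are assumed example values:
         #define N  8          /* matrix dimension (assumed)            */
         #define BS 2          /* block size, must divide N (assumed)   */

         /* C += A * B, computed one BS-wide sliver of C at a time so that the
            BS x BS block of B currently in use stays resident in the cache. */
         void blocked_matmul(double C[N][N], const double A[N][N], const double B[N][N]) {
             for (int jj = 0; jj < N; jj += BS)           /* column block of B and C   */
                 for (int kk = 0; kk < N; kk += BS)       /* row block of B            */
                     for (int i = 0; i < N; i++)          /* sweep all rows of A and C */
                         for (int j = jj; j < jj + BS; j++) {
                             double sum = C[i][j];
                             for (int k = kk; k < kk + BS; k++)
                                 sum += A[i][k] * B[k][j];
                             C[i][j] = sum;
                         }
         }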

Ch 1.8 (HP2)

1.8 Measuring, Reporting, and Summarizing Performance

  • Wall-clock time, response time, elapsed time: Latency to complete a task including disk accesses, memory accesses, I/O, OS overhead and more
  • CPU Time: Time the processor is computing
    • A processor is typically shared among several running programs, so the time the CPU spends computing is spread across all of them, not just one
    • Therefore the response time seen by the user for one specific program is not the same as CPU time
  • Benchmarks
    • Should evaluate true performance on real life usage rather than synthetic benchmarks
      • Do not use kernels, toy programs, synthetic benchmarks
    • Benchmark suites are often used so that the weaknesses of any single benchmark, or optimizations targeted at it, do not distort the overall result
  • Reporting Performance Results
    • Guiding principle for performance measurements should be reproducibility
  • Summarizing Performance Results

Ch 1.6-1.7 (HP1)

1.6 Performance

  • Defining Performance
    • Two different measures of performance
      • Execution/Response Time: Time between the start and completion of a task
        • Performance_X = 1 / Execution time_X
      • Throughput/Bandwidth: Total amount of work done in a given time
  • Measuring Performance
    • Program execution time is measured in seconds per program
    • CPU Execution Time: Time the CPU spends computing for a specific task and does not include time spent waiting for I/O or running other programs as a CPU may work on more than one task at once
      • User CPU Time: Time spent in the program
      • System CPU Time: Amount of time running system calls
      • The book uses system performance to mean elapsed time on an unloaded system, and CPU performance to mean user CPU time
    • Clock Cycle/Period: Discrete time intervals a CPU works with
      • Clock rate = 1 / Clock cycle time
  • CPU Performance and Its Factors
    • CPU execution time = CPU clock cycles for the program × Clock cycle time = CPU clock cycles for the program / Clock rate
    • Improve performance via decreasing # clock cycles
    • Improve performance via decreasing clock cycle time
  • Instruction Performance
    • Another metric measures the performance in the amount of clock cycles required to run the program
    • CPU clock cycles = Instructions for the program × Average clock cycles per instruction (CPI)
    • Average Clock Cycles per Instruction: CPI
  • The Classic CPU Performance Equation
    • CPU time = Instruction count × CPI × Clock cycle time = (Instruction count × CPI) / Clock rate

 

    • Breaks CPU time into three factors: instruction count, CPI, and clock cycle time
    • CPI can be obtained via software or hardware counters
    • Instruction count depends on the architecture and not the implementation
    • CPI depends on many things including memory design and processor structure and types of instructions that dominate execution
    • Some CPUs vary the clock rate to save power
    • Some processors achieve an average CPI < 1 because instruction execution can be parallelized
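    • For instance (assumed numbers): a program of 10^9 instructions with CPI 2.0 on a 2 GHz clock takes (10^9 × 2.0) / (2 × 10^9) = 1 second of CPU time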
1.7 The Power Wall
  • Recently we have reached the practical power limit for cooling commodity microprocessors
  • Therefore the clock rate has been increasing slowly and companies are focusing more on the decrease of power consumption
  • Energy usage is better measured in joules than in watts (a watt is one joule per second, i.e., power)
  • In CMOS transistors, primary energy consumption is dynamic energy
    • Dynamic Energy: Energy consumed when transistors switch between 0 and 1
    • Dynamic energy depends on the capacitive loading of each transistor and voltage applied
    • Energy is proportional to Capacitive Load * Voltage^2
      • This equation is the energy of a pulse: 0->1->0 or 1->0->1
    • The energy of a single transition is
      • 1/2 Capacitive Load * Voltage^2
    • Power = 1/2 × Capacitive Load × Voltage^2 × Frequency switched
      • Frequency switched is a function of clock rate
      • Capacitive load per transistor is a function of both # transistors connected to output (fanout) and the technology used
    • The reason processor speeds used to grow much faster than power consumption is that power scales with voltage^2, and voltage was reduced by about 15% per generation
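    • For instance (assumed numbers): a new design with 85% of the capacitive load, 85% of the voltage, and 85% of the switching frequency of the old one consumes roughly 0.85 × 0.85^2 × 0.85 ≈ 0.52 of the power, i.e., about half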

Appendix A (HP2)

A.1 Introduction

A.2 Classifying Instruction Set Architectures

  • Type of internal storage in a processor
    • Stack: Operands are implicitly on top of the stack
      • A stack computer must evaluate an expression's computations in one fixed order
    • Accumulator: One operand is implicitly in the accumulator
    • Registers: Only explicit  operands either in memory or registers
      • Operations can be done in multiple orders, e.g. parenthesized subexpressions
        • (A*B)-(B*C)-(A*D) multiplications can be done in any order
      • Registers are general purpose, faster than memory, and reduce memory traffic, speeding things up
      • Code density improves, as naming a register takes fewer bits than naming a memory address
  • [Figure: classes of general-purpose register architectures]
    • Register-memory: Can access memory as part of any instruction
    • Load-store: Can only access memory only with load/store instructions
  • Operations required for: C=A+B
    • [Figure: code sequences for C = A + B on stack, accumulator, register-memory, and load-store machines]
  • Two major instruction set characteristics for GPR architectures for ALU stuff
    • 2 vs 3 operands (source/dest, source or dest, source, source)
    • How many operands may be memory addresses
    • Figure A.3  and A.4 have common combinations of register/memory/stuff

A.3 Memory Addressing

  • Interpreting Memory Addresses
    • Little Endian vs Big Endian
      • Little endian: the least significant byte is stored at the lowest address
      • Big endian: the most significant byte is stored at the lowest address

    • Memory Alignment: This is done to optimize access speed; if misaligned, a memory access may take multiple aligned memory references
      • Aligned if: addr mod size = 0
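      • A tiny sketch (mine) of the alignment check, using the power-of-two shortcut:
           /* An access of `size` bytes at `addr` is aligned iff addr mod size == 0.
              For power-of-two sizes this reduces to a mask test. */
           int is_aligned(unsigned long addr, unsigned long size) {
               return (addr & (size - 1)) == 0;
           }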
  • Addressing Modes: https://en.wikipedia.org/wiki/Addressing_mode
    • Addressing modes specify constants and registers in addition to locations in memory
    • Effective Address: When a memory location is used, the actual memory address specified by the addressing mode is called the effective address
    • Figure A.6 has a list of addressing modes and NAMES
    • PC-Relative Addressing: Addressing modes that depend on the program counter
      • Mostly used for specifying code addresses or control transfer instructions
    • Richer addressing modes can significantly reduce instruction counts, but they add complexity and can increase the average clock cycles per instruction (CPI)
  • Displacement Addressing Mode
    •    +------+-----+-----+----------------+
         | load | reg | base|     offset     |  reg := RAM[base + offset]
         +------+-----+-----+----------------+
      
         (Effective address = offset + contents of specified base register)
  • Immediate or Literal Addressing Mode
    •    +------+-----+-----+----------------+
         | add  | reg1| reg2|    constant    |    reg1 := reg2 + constant;
         +------+-----+-----+----------------+
  • Summary: Memory Addressing
    • New architectures should support the minimum of
      • Displacement, immediate, and register indirect addressing modes

A.4 Type and Size of Operands

A.5 Operations in the Instruction Set

  • The most commonly executed instructions are the simple operations of an instruction set

A.6 Instructions for Control Flow

There are 4 types of control flow change: conditional branches, jumps, procedure calls, and procedure returns.

  • Addressing Modes for Control Flow Instructions
    • Destination address of control flow instruction must always be specified
    • Most common way is to offset PC
      • Allows for position independence: the code can run anywhere, since the branch target does not depend on where the code is loaded
    • If target is not known at compile time need to specify target dynamically
      • One way is to name register that will contain address
        • This way of jumping allows for important features
          • Case or switch statements
          • Virtual functions or methods
          • High-order functions/function pointers, where functions are passed as arguments
          • Dynamically shared libraries where libraries are loaded and linked
  • Conditional Branch Options
    • There are different ways to decide when to branch
      • Condition Code: Test special bits
        • Advantage: Sometimes condition is set for free
        • Disadvantage: CC is extra state.  Can constrain the ordering of instructions, since they pass information from one instruction to a branch
      • Condition Register: Test arbitrary register with result of comparison
        • Advantage: Simple
        • Disadvantage: Uses a register
      • Compare and Branch: Compare is part of branch, often compare is limited to subset
        • Advantage: One instruction rather than two for a branch
        • Disadvantage: May be too much work per instruction for pipelined execution
  • Procedure Invocation Options
    • Callee Saving: Called procedure must save registers it wants to use
    • Caller Saving: Calling procedure must save registers that it wants preserved
      • Must be used at times when global variables are accessed by both callee and caller

A.7 Encoding an Instruction Set

The number of registers and addressing modes impacts the encoding of instructions, since bits must be allocated to them in the instruction.  One also needs to take into account how easily pipelined implementations can handle the resulting instruction sizes.

  • Reduced Code Size in RISCs
    • Because its instructions are less complex, a RISC processor needs more of them, so RISC programs have larger code size
    • Since RISC processors are used in embedded systems, where memory is limited, this began to be a problem
    • Now some hybrids offer both 16- and 32-bit instructions (roughly a 40% code-size reduction)
    • IBM compresses standard instruction set and adds hardware to decompress instructions when fetched from memory on a cache miss

A.8 Crosscutting Issues: The Role of Compilers

  • The Structure of Recent Compilers
    • [Figure: pass structure of a modern optimizing compiler]
    • Complexity of writing a correct compiler limits optimization
    • Multiple pass structure helps reduce complexity
      • But the passes must transform the code in certain ways before others can
      • This is known as the phase-ordering problem
    • One optimization is the global common subexpression elimination
      • If an assignment or result occurs multiple times save the result and reuse it later
      • It is better to save it in a register, because a later load from memory may be slow
      • However, if registers are scarce, tying one up for the saved value might even make things slower
    • Styles of transformation
      • High Level Optimization: Done on source output is sent to later optimization passes
      • Local Optimizations: Optimize code within a straight-line code fragment (basic block)
      • Global Optimizations: Extend local optimizations across branches aimed at optimizing loops
      • Register Allocation: Associates registers with operands
      • Processor Dependent Optimizations: Specific to the Architecture
  • Register Allocation
    • Important both in speeding up code and in making other optimizations useful; it is thus one of the most important optimizations
    • For 16+ registers and additional FP registers use Graph coloring
      • NP Complete problem and takes exponential time
      • The idea is to build a graph of variables that are live at the same time and color it so that no two adjacent nodes (variables) get the same color (register)
    • Because exact graph coloring is slow, near-linear-time heuristic algorithms are used, especially when the number of registers is small
  • Impact of Optimizations on Performance
    • [Figure: effect of compiler optimization levels on performance; N.M. = Not Measured]
  • The Impact of Compiler Technology on the Architect’s Decisions
    • How are variables allocated and addressed?
    • How many registers are needed to allocate variables appropriately?
    • Look at where high-level programs allocate data
      • Stack: For local variables, addressed relative to stack pointer, grown or shrunk on procedure call or return, used for activation record. Mostly scalars or single variables
      • Global data area: Allocate statically declared objects.  Usually arrays or other aggregate data structures.
      • Heap: Allocate dynamic objects that do not adhere to stack discipline.  Accessed with pointers and usually not scalars
    • Register allocation is much more effective for stack allocated objects than for global variables
      • And is essentially impossible for heap allocated objects because of pointer access
    • Aliased variables: Multiple ways to refer to the address of a variable
      • This causes them to be illegal to put in register
      • Most heap variables are aliased
      •  p = &a    -- gets address of a in p
            a = ...   -- assigns to a directly
            *p = ...  -- uses p to assign to a
            ...a...   -- accesses a
    • Keeping a in a register across the assignment through *p would be illegal and generate incorrect code
    • Aliasing makes it difficult  to decide what a pointer points to and some compilers will not allocate local variables to a register if pointers are present
  • How the Architect Can Help the Compiler Writer
    • Large, complex programs are hard to compile well because compilers make decisions one step at a time about which code sequence is best
    • The best advice is “Make the frequent cases fast and rare cases correct”
    • Some guidelines on designing an architecture
      • Provide Regularity: Operations, data types, and addressing modes should be orthogonal
        • Two aspects of an architecture are orthogonal if they are independent
        • Eg. Every operation can be used with any addressing mode
        • The opposite would be certain registers can be used only for certain instructions
      • Provide Primitives, not solutions: Don't specialize your architecture for a specific high-level language.  This solution may not work for other languages
      • Simplify trade-offs among alternatives: Help compiler writers decide between alternative solutions.
        • Eg. In register-memory architectures: the problem of deciding how many times a variable must be referenced before it is moved to a register
      • Provide instructions that bind the quantities known at compile time as constants

A.9 Putting It All Together: The MIPS Architecture

GOOD FOR PRACTICE READING SKIP FOR NOW

A.10 Fallacies and Pitfalls

  • Erroneous beliefs
  • Pitfalls
    • Designing a “high-level” instruction set feature specifically oriented to supporting a high-level language structure
      • These instructions often do more work than is required and may not match the requirements of some languages
    • Innovating at the instruction set architecture to reduce code size without accounting for the compiler
      • Often compilers can do a much better job at compressing code than ISA innovations can
      • If any optimizations are to be made consider optimization on the output of compilers
  • Fallacies
    • There is such a thing as a typical program
      • No two programs are alike, and real programs do not behave like synthetic benchmarks
      • Optimizing or providing instructions for the “typical program” may cause the instruction to be underused
    • An architecture with flaws cannot be successful
      • Appendix K
      • x86… nuff said
    • You can design a flawless architecture
      • Tradeoffs exist and are very real

Ch 2.16-2.18 (HP1)

2.16 Real Stuff: ARMv7 (32-bit) Instructions

ARM is a RISC instruction set similar to MIPS. The principal difference between MIPS and ARM is that MIPS has more registers.

  • Addressing Modes
    • Unlike MIPS, ARM does not reserve a register to contain 0
    • 9 addressing modes
      • [Figure: ARM addressing modes]
    • Refer to figures 2.32 for a list of possible instructions
  • Compare and Conditional Branch
    • MIPS uses contents of registers to evaluate conditional branches
    • ARM uses 4 condition codes
      • negative, zero, carry, overflow
      • Set on any arithmetic or logical instruction
      • Setting CC is optional
    • CMP subtracts one operand from the other and sets condition codes based on the result
    • CMN (Compare negative) adds one operand to the other and sets condition codes based on the result
    • TST performs logical AND on two operands to set all condition codes BUT overflow
    • TEQ uses XOR to set all condition codes BUT overflow
    • Refer to figure 2.34 for instruction formats of ARM and MIPS
  • Unique Features of ARM
    • As mentioned above, ARM has no dedicated zero register (MIPS's $zero), so it needs separate opcodes for operations MIPS performs using $zero
    • ARM has support for multiword arithmetic
    • ARM has 12 bit immediate field
      • 8 LSB are ZEXTed to 32 bit then rotated right the number of bits specified in the first four bits of the field multiplied by two
      • Allows representation of all powers of two in a 32-bit word (see the decoding sketch after this list)
    • Operand shifting is not limited to immediates
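    • A sketch of my own of how the 12-bit immediate field expands (rot = top 4 bits, imm8 = low 8 bits):
         #include <stdint.h>

         /* ARM data-processing immediate: imm8 is zero-extended to 32 bits,
            then rotated right by 2 * rot bit positions. */
         uint32_t arm_expand_imm(uint16_t field12) {
             uint32_t imm8 = field12 & 0xFF;         /* low 8 bits             */
             uint32_t rot  = (field12 >> 8) & 0xF;   /* top 4 bits             */
             uint32_t sh   = 2 * rot;                /* rotate amount, 0..30   */
             return sh ? ((imm8 >> sh) | (imm8 << (32 - sh))) : imm8;
         }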

2.17 Real Stuff: x86 Instructions

Unlike RISC, x86 is a CISC (complex) instruction set, designed to reduce the number of instructions executed by a program.  This can make individual instructions slower, due to slower clocks or to requiring more clock cycles for simple tasks.

  • Evolution of the Intel x86
  • x86 Registers and Data Addressing Modes
    • Instructions have one operand that serves as both a source and the destination
    • Unlike MIPS/ARM one of the operands can be in memory
    • Not all general purpose registers can be used for everything for example ESP and EBP have special purposes
  • x86 Integer Operations
    • Byte: 8 bits, Word: 16 bits, Double Word: 32 bits, Quad Word: 64 bits
    • 4 Major classes of instructions
      • Data movement instructions (mov, push, pop)
      • Arithmetic and logic (test, integer, decimal ops)
      • Control flow (conditional branches, unconditional jumps, calls returns)
        • x86 uses condition codes or flags like ARMv7
        • Condition codes are set as a side effect of an operation
      • String instructions including string move and string compare
        • Not used most of the time; often slower than the software equivalent
  • x86 Instruction Encoding
    • Complex encoding of instructions with many different instruction formats
      • Lengths vary from 1 byte (no operands) up to 15 bytes
    • Opcode byte usually contains a bit saying whether operand is 8 or 32 bits
      • Sometimes opcode may include addressing mode and register (often register=register op immediate)
    • A postbyte or other opcode byte may contain addressing information
      • Labeled as: mod, reg, r/m
      • Used for many instructions that address memory
        • Base plus scaled index mode
  • x86 Conclusion
    • x86 makes up for its complexity, and the difficulty of building hardware for it, with its large market share
    • Also the most frequently used x86 architectural components are not too difficult to implement
    • AMD and Intel have been rapidly improving performance of integer programs

2.18 Real Stuff: ARMv8 (64-bit Instructions)

  • 64 bit upgrade in x86 was essentially just “cosmetic changes” to make registers 64 bits wide
  • ARMv8 did a complete overhaul (similar to MIPS)
  • See textbook for a full list of stuff