Lecture 9
project will be published soon and is due at the end of next week
- will be small; you will be asked to do things similar to what we did in class
- however, please try to type the code yourselves - don't just copy; understand how the class code works and create your own code from it
- most techniques will be the same, but don't copy/paste
- will be asked to write a function and do some measurements
- will take an array of numbers
- function replaces every number with the next prime number greater than or equal to it
- example:
[4, 3, 8, 5, 20]
should go to [5, 3, 11, 5, 23]
- may start thinking about smart approaches, but advice - don't be smart
- e.g., don't try to memoize prime numbers - want to have a lot of computational work, compute them every time from the ground up
- may write a function isPrime (don't use a built-in/library function) - the idea is to do parallelization similar to what we've done before, where different portions of the array are handled by different threads (see the sketch after this list)
- the difference is that the workload should be more intense, so we can see the difference that we weren't able to see in class because the example work was too simple
- if the measurements do not show a performance benefit, it's not the end of the world - not necessarily looking for a specific performance benefit
- main purpose is to practice tools and do some measurements
- reminder: when doing parallel programming, try to avoid race conditions
- the vector that we use is guaranteed to be shared between multiple threads
- may ask self: can we modify vector simultaneously from multiple threads?
- vector is shared among multiple threads
- however, each piece of the vector is handled by different threads
- since different portions are being modified, it's ok
- kind of goes against intuition at surface level but we never deal simultaneously with the same piece of the shared vector
- may be asked to (inefficiently) synchronize access to vector across threads just to gain experience with it
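A minimal sketch of what this could look like, assuming a deliberately naive trial-division isPrime and std::thread for splitting the array into chunks (the names nextPrime, replaceWithPrimes, numThreads are illustrative, not the assignment's required interface):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Deliberately naive primality test: no memoization, so every element
// produces a noticeable amount of computational work.
bool isPrime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime >= n.
std::uint64_t nextPrime(std::uint64_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Replace every element with the next prime >= it.  Each thread owns a
// disjoint [begin, end) range of the shared vector, so no two threads
// ever touch the same element and no synchronization is required.
void replaceWithPrimes(std::vector<std::uint64_t>& data, unsigned numThreads) {
    if (numThreads == 0) numThreads = 1;
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] = nextPrime(data[i]);
        });
    }
    for (auto& th : threads) th.join();
}
```

For example, replaceWithPrimes(v, std::thread::hardware_concurrency()) turns [4, 3, 8, 5, 20] into [5, 3, 11, 5, 23]; timing the call with 1, 2, 4, ... threads gives the measurements.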
cache
last time, discussed memory and memory speed
- slow in terms of both latency and throughput
- latency: takes many more cycles to get data from memory than from registers
- throughput: memory bus may be limited in amount of data that can flow through at a time
to limit latency, can use cache:
- device that increases the average speed of cpu/ram interaction
- small and fast memory unit that stores recently used pieces of data
- idea:
- cpu wants to get a piece of data, sends request to ram memory
- (this is what CPU imagines)
- in reality, the cpu is connected to the cache, and the cache is connected to ram
- when cpu asks for data from that address, cache checks whether it has it
- if it has it, it is a cache hit - no ram memory read; the cache immediately returns to the cpu the data from the required address (fast)
- if it does not, it is a cache miss - the cache has to read memory, finds the piece of data at the address, copies it into the cache, and delivers it to the cpu (slow)
- programs:
- tend to deal with the same areas of ram memory and the same values for some time
- often perform multiple actions with the same piece of data
- often move to the next location in memory
- if something like this happens, the value is typically already in cache: this is "cache locality" - "temporal locality" (reusing the same data) and "spatial locality" (touching nearby data)
- different types of cache misses
- cold miss, conflict miss, etc.
- however, meaning is the same - something we want is not in cache for one reason or another
- when we read data from ram, typically we actually read a larger block at a time
- something like 64 bytes, called a cache line or cache block
- ask for memory at some address, but cache actually receives neighbors as well
- oses/programs may try to align on cache line boundaries
- sample numbers for cache hits vs. misses:
- full cache miss: 200 cycles
- cache hit in L1 cache: 3 cycles
- example: on computer, create 2d array, 1000x1000 elements, then sum all elements
- will end up with nested loop
- can do things in a natural way, summing elements row by row
- or could do things in an unconventional way, column by column
- amount of operations will be the same, but unnatural way will be slower due to cache misses, e.g. 5x difference
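A rough sketch of the two traversal orders (1000x1000 and the ~5x figure are just the lecture's example numbers; the actual ratio depends on the machine). Using the sample numbers above, a hypothetical access pattern that hits L1 90% of the time averages roughly 0.9 * 3 + 0.1 * 200 ≈ 23 cycles per access, which is why the traversal order matters:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1000;

// Natural order: walk each row left to right, so consecutive accesses hit
// neighboring addresses that arrive together in one cache line.
long long sumRowByRow(const std::vector<std::vector<int>>& a) {
    long long sum = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            sum += a[i][j];
    return sum;
}

// Unnatural order: walk down each column, so every access jumps to a
// different row (a far-away address) and misses the cache much more often,
// even though the number of additions is exactly the same.
long long sumColumnByColumn(const std::vector<std::vector<int>>& a) {
    long long sum = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            sum += a[i][j];
    return sum;
}
```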
caches can be multilevel:
- instead of one cache device, have multiple of them
- torn between two extremes:
- want larger caches - larger cache means higher hit rate - will need to throw things out less frequently
- however, larger caches are slower - searching takes longer
- so, have a large (slow) cache, plus a smaller cache (that will be fast) sitting between it and the cpu
- the cpu interacts with the smaller/faster cache; if it's a miss in the small cache, consult the larger/slower cache (see the sketch after this list)
- sometimes, may even have 3-level cache, where the smallest level can be split into two caches - one for data and one for instructions
- instruction cache is connected to instruction fetching unit, and data cache is connected to memory operations component of cpu that orders loads and stores
- levels of cache: L3 (slowest), L2 (faster), L1 (fastest)
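A purely conceptual sketch of that lookup cascade, assuming toy CacheLevel/readRam stand-ins (in reality this is done by hardware, not by software):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model of one cache level: just an address -> value map.
struct CacheLevel {
    std::unordered_map<std::uint64_t, int> lines;
    std::optional<int> lookup(std::uint64_t addr) const {
        auto it = lines.find(addr);
        if (it == lines.end()) return std::nullopt;   // miss at this level
        return it->second;                            // hit at this level
    }
    void fill(std::uint64_t addr, int value) { lines[addr] = value; }
};

CacheLevel l1, l2, l3;                      // L1 fastest/smallest ... L3 slowest/largest
int readRam(std::uint64_t) { return 0; }    // stand-in for the actual (slow) RAM read

// Check the levels in order; every miss falls through to the next, slower
// level, and on the way back the data is copied into the faster levels.
int load(std::uint64_t addr) {
    if (auto v = l1.lookup(addr)) return *v;                           // L1 hit
    if (auto v = l2.lookup(addr)) { l1.fill(addr, *v); return *v; }    // L2 hit
    if (auto v = l3.lookup(addr)) { l2.fill(addr, *v); l1.fill(addr, *v); return *v; }
    int v = readRam(addr);                                             // full miss
    l3.fill(addr, v); l2.fill(addr, v); l1.fill(addr, v);
    return v;
}
```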
how does this differ for multi-core cpus?
- may have single L3, L2 caches for all cores, but each cpu core may have its own L1 cache
- cores may even each have their own L2 cache, depends on the cpu - but typically fastest cache is per-core, and slowest cache is shared among all cores
where in cache is memory placed, when it is retrieved by cache from a specific address in ram?
- intuitive way: anywhere that we have empty space
- name of approach: "fully associative"
- however, when this happens, cache should be ready to find address anywhere
- this makes finding/searching more complicated, so circuitry becomes more complicated, so cache is slower
- this approach would still be acceptable with a larger cache, as long as the average ram memory access stays as fast as possible
- ideally, would be big and fast - but that's not the case
- if we make bigger cache fully associative, becomes slower
- solution: for every cache line (the line starting from a specific address), there will only be so many places in the cache where it can be found
- this is called an n-way associative cache - in an "8-way associative" cache, there are only 8 positions where a specific line can be found
- the cache may be big, but there are only 8 places where a specific line can be found
- as soon as we introduce such a limitation, our search circuitry is much smaller and faster
- however, can result in imperfect cache utilization
- if reading data multiple times, typically not common that will fall into the same "bank" of (e.g.) 8 cache lines, but if this does happen then previously read items in the bank may be evicted despite much of the cache space not being used
- often does not happen naturally - only really seen in purposefully adversarial case
- the "bank" (usually called a set) for a cache line is typically taken from some bits of the memory address
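A small sketch of how those address bits could select the set, assuming an illustrative geometry of 64-byte lines, 8 ways, and 32 KiB total (the numbers are made up for the example):

```cpp
#include <cstdint>
#include <cstdio>

// Example geometry: 32 KiB cache, 64-byte lines, 8 ways => 64 sets.
constexpr std::uint64_t kLineSize = 64;
constexpr std::uint64_t kWays     = 8;
constexpr std::uint64_t kNumSets  = (32 * 1024) / (kLineSize * kWays);

// Which set an address maps to: drop the offset bits inside the line,
// then take the next bits of the address (modulo the number of sets).
std::uint64_t setIndex(std::uint64_t address) {
    return (address / kLineSize) % kNumSets;
}

int main() {
    // Addresses exactly kLineSize * kNumSets bytes apart land in the same
    // set and therefore compete for the same 8 ways - the adversarial case.
    std::printf("%llu %llu\n",
                static_cast<unsigned long long>(setIndex(0x1000)),
                static_cast<unsigned long long>(setIndex(0x1000 + kLineSize * kNumSets)));
}
```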
certain data structures are more cache-friendly:
- arrays are the most efficient, since they are literally contiguous blocks of memory
- linked lists are typically worse:
- std::list, the doubly-linked list in C++, is less efficient
- std::deque is not as inefficient, because it at least uses some contiguous memory blocks, even if they are chained together in a linked-list manner
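A rough way to observe this, assuming a simple walltime measurement around a traversal of equally sized containers (the exact ratio depends on the machine and allocator):

```cpp
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

// Sum a container and report how long the traversal took.
template <typename Container>
void timedSum(const Container& c, const char* name) {
    auto start = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::printf("%s: sum=%lld in %lld us\n", name, sum,
                static_cast<long long>(us.count()));
}

int main() {
    constexpr int N = 1'000'000;
    std::vector<int> v(N, 1);                 // one contiguous block: cache-friendly
    std::list<int> l(v.begin(), v.end());     // nodes scattered around the heap
    timedSum(v, "vector");                    // typically traverses much faster
    timedSum(l, "list");
}
```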
caches can also be divided into two different categories:
- two approaches to constructing caches
- deal with cache behavior when we write to cache memory
- example: take something from ram memory into CPU, and increment it, and issue store instruction
- we know that cache receives instruction first
- cache has two options
- first category: write-through
- as soon as cache receives write (store) instruction, cache immediately does it, sending instruction further to ram to modify this piece - good and simple
- however, if we are (e.g.) incrementing more elements sequentially, will need to write to ram continuously every time the write instruction is issued
- program logic issues separate loading instructions for each consecutive element in the array, and separate storing instructions for each one of them
- this is not efficient in terms of ram memory interactions
- second category: write-back
- recognize that it is more efficient to delay writes to ram, and transfer changes in bulk - amortize write operations over time
- accumulate changes to cache line, and do writing whenever it is convenient
- cache remembers which line was updated
- if we look at hardware, will see that cache consists of entries - each entry has:
- data
- address (maybe only a portion of the address - the "tag")
- utility information
- "dirty" bit - single bit that designates whether cache entry has been changed
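A toy struct mirroring those fields (names are illustrative; real hardware keeps this in SRAM, not in a C++ struct):

```cpp
#include <cstdint>

// One cache entry (one 64-byte cache line) with the fields listed above.
struct CacheEntry {
    std::uint8_t  data[64];  // the cached bytes of the line
    std::uint64_t tag;       // the (partial) address identifying the line
    bool          valid;     // utility information: does this entry hold real data?
    bool          dirty;     // write-back caches: modified but not yet written to ram
};
```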
note: for atomic operations, nobody will use cache or anything, because whole memory address bus is locked for everyone except the code that issued the atomic operation
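For reference, this is what an atomic operation looks like from C++ (a sketch; whether the hardware implements it by locking the whole bus or just the cache line depends on the CPU):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> counter{0};

int main() {
    // Two threads increment the same counter; fetch_add is an atomic
    // read-modify-write, so no increments are lost.
    std::thread a([] { for (int i = 0; i < 100000; ++i) counter.fetch_add(1); });
    std::thread b([] { for (int i = 0; i < 100000; ++i) counter.fetch_add(1); });
    a.join();
    b.join();
    std::printf("%ld\n", counter.load());   // always prints 200000
}
```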
(to be continued next time)