Lecture 8

(continuing with multithreading program from last time - finding max of large array)

#include <thread>
#include <iostream>
#include <vector>
#include <limits>
#include <chrono>
#include <algorithm>

constexpr int VECTOR_SIZE = 10'000'000;
constexpr int NUM_THREADS = 4;

void FindMax(const std::vector<int>& nums, int from, int upto, int& result) {
    int best = std::numeric_limits<int>::min();
    for (int i = from; i < upto; i++) {
        if (nums[i] > best) {
            best = nums[i];
        }
    }
    result = best;
}

int main() {
    std::vector<int> nums(VECTOR_SIZE);

    // pretend this is unsorted
    for (int i = 0; i < VECTOR_SIZE; i++)
        nums[i] = i + 1;

    // linear scan
    std::chrono::steady_clock::time_point start, finish;

    int result = 0;
    start = std::chrono::steady_clock::now();
    FindMax(nums, 0, VECTOR_SIZE, result);
    finish = std::chrono::steady_clock::now();
    
    std::cout << result << std::endl;

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
        finish - start
    ).count() << std::endl;

    std::vector<std::thread> threads;

    int threadPortion = VECTOR_SIZE / NUM_THREADS;
    std::vector<int> results(NUM_THREADS, 0);

    start = std::chrono::steady_clock::now();
    for (int i = 0; i < NUM_THREADS; i++) {
        // std::thread copies its arguments by default, so a plain
        // results[i] would not bind to FindMax's int& parameter
        // (implicit by-reference passing would be bug-prone).
        // wrapping it in std::ref passes a real reference instead.
        // note: nums is copied into each thread by the same rule;
        // std::cref(nums) would avoid those copies
        threads.emplace_back(
            FindMax, nums, threadPortion * i,
            threadPortion * (i + 1), std::ref(results[i])
        );
    }

    // need to wait for other threads to finish their work
    for (int i = 0; i < NUM_THREADS; i++) {
        // block until threads[i] is finished
        threads[i].join();
    }

    // if we just do this immediately, threads may not be done yet,
    // so we had to wait above using join function
    result = *std::max_element(results.begin(), results.end());
    finish = std::chrono::steady_clock::now();

    std::cout << result << std::endl;

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
        finish - start
    ).count() << std::endl;

    return 0;
}

reminder - threads can be moved, but not copied.

a thread object represents a single thread of execution
so, copying would mean creating a new thread - doesn't make sense, should just create the thread explicitly
moving means the new object takes over the thread from the previous object, which is left without a thread
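
sketch of the move/copy rule above (not from lecture; `MoveTransfersOwnership` is a made-up helper name). the static_asserts check the rule at compile time, and the function shows what a move does at run time:

```cpp
#include <thread>
#include <type_traits>
#include <utility>

// std::thread is move-only: these are compile-time facts about the type
static_assert(!std::is_copy_constructible<std::thread>::value,
              "threads cannot be copied");
static_assert(std::is_move_constructible<std::thread>::value,
              "threads can be moved");

// hypothetical helper: starts a thread, moves it into a second object,
// and reports whether ownership transferred as expected
bool MoveTransfersOwnership() {
    int value = 0;
    std::thread first([&value] { value = 42; });

    // std::thread second = first;          // does not compile: copy is deleted
    std::thread second = std::move(first);  // second now owns the thread

    bool firstEmpty = !first.joinable();    // moved-from object has no thread
    bool secondOwns = second.joinable();

    second.join();                          // join through the new owner
    return firstEmpty && secondOwns && value == 42;
}
```

calling `MoveTransfersOwnership()` from main should return true: after the move only `second` is joinable.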

notes on above program:

this measures wall time (elapsed real time), not cpu time, so the numbers may be inaccurate
there are other tools for measuring performance/runtime - visual studio has some built in (not exam material) - but in general this can be complicated
takeaway: measuring time is hard. usually we will just measure elapsed time (wall clock time), but keep in mind that it can't fully represent the cpu time and performance of a program - alternative tools may be used instead
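
sketch of the wall time vs cpu time difference (not from lecture; `Timing`, `MeasureBoth` and `MeasureSleep` are made-up names). it times the same work with std::chrono::steady_clock (wall time) and std::clock (processor time in standard C++). caveat: with the microsoft runtime, std::clock reports elapsed wall time instead, so on visual studio the two numbers will roughly agree:

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wallMs;  // elapsed real time (steady_clock)
    double cpuMs;   // processor time as reported by std::clock
};

// hypothetical helper: runs `work` once and times it with both clocks
template <typename Work>
Timing MeasureBoth(Work work) {
    auto wallStart = std::chrono::steady_clock::now();
    std::clock_t cpuStart = std::clock();

    work();

    std::clock_t cpuEnd = std::clock();
    auto wallEnd = std::chrono::steady_clock::now();

    Timing t;
    t.wallMs = std::chrono::duration<double, std::milli>(
        wallEnd - wallStart).count();
    t.cpuMs = 1000.0 * (cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return t;
}

// sleeping takes wall time but needs almost no processor time
// (note: the microsoft runtime's std::clock counts wall time here too)
Timing MeasureSleep() {
    return MeasureBoth([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    });
}
```

for a sleeping thread, wallMs is at least 200 while cpuMs is typically close to zero on platforms where std::clock counts processor time. the same helper can wrap the FindMax call to compare both clocks on real work.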

on our computers, this program is actually slower with multithreading:

reason is that thread creation is expensive
note: should remember trick where we create breakpoint at thread creation and measure time to create a thread
- will be useful when measuring difference between multithreaded and singlethreaded solution
- on project - may be asked to measure amount of time to create a thread on the computer
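
a different way to estimate the same thing in code, instead of the breakpoint trick (sketch, not from lecture; `AverageThreadCreateJoinMicros` is a made-up name). it creates and joins threads that do no work and averages the elapsed wall time, so the number includes the join as well as the creation:

```cpp
#include <chrono>
#include <thread>

// hypothetical helper: average wall time in microseconds to create and
// join one thread that does nothing
double AverageThreadCreateJoinMicros(int repetitions) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; i++) {
        std::thread t([] {});  // empty thread body
        t.join();              // wait for it before starting the next one
    }
    auto finish = std::chrono::steady_clock::now();

    double totalMicros =
        std::chrono::duration<double, std::micro>(finish - start).count();
    return totalMicros / repetitions;
}
```

e.g. `AverageThreadCreateJoinMicros(1000)` in a release build gives a rough per-thread overhead to compare against the single-threaded FindMax time. the value depends on the machine and operating system.
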
amdahl's law does not take this additional time to create threads into account
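
for reference, amdahl's law in its usual form, with p the fraction of the work that can be parallelized and n the number of threads. the second formula is only a rough assumed model (not from lecture) that adds a creation cost t_create per thread, assuming threads are created one after another:

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
\qquad
T(n) \approx T_1 \left( (1 - p) + \frac{p}{n} \right) + n \, t_{\text{create}}
```

with the extra term, T(n) can end up larger than the single-threaded time T_1 when the work per thread is small compared to t_create.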

other thing that may affect program performance: debug vs. release build

debug builds are larger and much slower because optimizations are off and many checks are introduced to assist with debugging
release build is for when development is finished - build for speed and efficiency - fast and small
profiling should be done in release build, not debug build
- no sense introducing multiple threads unless you ask your compiler to do things fast

note - for home project, don't just reproduce code from class 1:1 - but can use as a reference

consideration: what if we replace the best variable definition with just directly modifying result?

// current version: accumulate in a local variable `best`
int best = std::numeric_limits<int>::min();
// alternative: use `result` in place (also inside the loop)
result = std::numeric_limits<int>::min();

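this ends up being slower
reason: worse cache efficiency
- result refers to memory (an element of results), so each update goes through memory instead of a local variable that the compiler can keep in a cpu register
- the results elements of all threads sit next to each other, so they likely share a cache line, and each thread's write forces the other cores to reload it (false sharing)
- ram is much slower than cpu registers in terms of latency: the cpu waits on each read or write (memory stall)
- ram is also slow in terms of throughput: the memory bus can be saturated
  - in a regular sequential program, this throughput limit is not typically encountered
  - example - nvidia gpu - each streaming multiprocessor can do around 7k operations on 32-bit floating point numbers per clock cycle
  - so, each can do around 39 trillion floating point ops per second; if every op needed even one new byte from memory, that would be 39 terabytes per second
  - however, memory bandwidth is far lower, around 2000 GB/s (2 terabytes per second), which is not remotely sufficient
  - so most operations are limited by memory throughput
caches make the latency problem less critical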
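
sketch for trying the `best` vs `result` comparison yourself (not from lecture; `FindMaxLocal`, `FindMaxInPlace` and `TimeThreadedMicros` are made-up names). both variants return the same maximum, and only the place where the running maximum is stored differs. std::cref is used so that nums is not copied into each thread:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <thread>
#include <vector>

// variant from the program above: keep the running max in a local,
// write to result once at the end
void FindMaxLocal(const std::vector<int>& nums, int from, int upto,
                  int& result) {
    int best = std::numeric_limits<int>::min();
    for (int i = from; i < upto; i++) {
        if (nums[i] > best) best = nums[i];
    }
    result = best;
}

// variant under consideration: update result (caller's memory) in the loop
void FindMaxInPlace(const std::vector<int>& nums, int from, int upto,
                    int& result) {
    result = std::numeric_limits<int>::min();
    for (int i = from; i < upto; i++) {
        if (nums[i] > result) result = nums[i];
    }
}

using FindMaxFn = void (*)(const std::vector<int>&, int, int, int&);

// hypothetical helper: runs one variant on numThreads threads and returns
// the elapsed wall time in microseconds; the maximum is stored in maxOut.
// assumes nums.size() is divisible by numThreads (any remainder is skipped)
long long TimeThreadedMicros(FindMaxFn findMax, const std::vector<int>& nums,
                             int numThreads, int& maxOut) {
    std::vector<int> results(numThreads, 0);  // adjacent ints, one per thread
    std::vector<std::thread> threads;
    int portion = static_cast<int>(nums.size()) / numThreads;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < numThreads; i++) {
        threads.emplace_back(findMax, std::cref(nums), portion * i,
                             portion * (i + 1), std::ref(results[i]));
    }
    for (auto& t : threads) t.join();
    maxOut = *std::max_element(results.begin(), results.end());
    auto finish = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::microseconds>(
        finish - start).count();
}
```

calling `TimeThreadedMicros(FindMaxLocal, nums, 4, max)` and then the same with `FindMaxInPlace` on the 10'000'000 element vector gives two times to compare. how large the gap is depends on the compiler, the build type and the machine, so measure in a release build.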