I have a simple function which computes the sum of "n" numbers.
I am attempting to use threads to implement the sum in parallel. The code is as follows:
#include <iostream>
#include <thread>
#include <functional>

void Add(double &sum, const int startIndex, const int endIndex)
{
    sum = 0.0;
    for (int i = startIndex; i < endIndex; i++)
    {
        sum = sum + 0.1;
    }
}

int main()
{
    int n = 100'000'000;
    double sum1;
    double sum2;
    std::thread t1(Add, std::ref(sum1), 0, n / 2);
    std::thread t2(Add, std::ref(sum2), n / 2, n);
    t1.join();
    t2.join();
    std::cout << "sum: " << sum1 + sum2 << std::endl;
    // double serialSum;
    // Add(serialSum, 0, n);
    // std::cout << "sum: " << serialSum << std::endl;
    return 0;
}
However, the code runs much slower than the serial version. If I modify the function such that it does not take in the sum variable, then I obtain the desired speed-up (nearly 2x).
I read several resources online, but they all seem to warn about the same variable being accessed by multiple threads. Here each thread writes to its own variable, so I do not understand why that would be the case for this example.
Could someone please clarify my mistake?
The problem here is hardware.
You probably know that CPUs have caches to speed up operations. These caches are many times faster than memory, but they work in units called cachelines, probably 64 bytes on your system. Your two doubles are 8 bytes each and will almost certainly end up in the same 64-byte region on the stack. Each core in a CPU generally has its own L1 cache, while larger caches may be shared between cores.
Now when one thread accesses sum1, its core loads the relevant cacheline into its cache. When the second thread accesses sum2, the other core attempts to load the same cacheline into its own cache. The x86 architecture, helpfully keeping the caches coherent, asks the first core's cache to hand over the cacheline so that both threads always see the same data.
So while you have two separate variables, they live in the same cacheline, and on every access that cacheline bounces from one core to the other and back, which is a rather slow operation. This is called false sharing.
So you need to put some separation between sum1 and sum2 to make this work fast. See std::hardware_destructive_interference_size for what distance you need to achieve.
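As a sketch of what that separation might look like (the PaddedDouble wrapper and the 64-byte fallback are illustrative assumptions, not part of the original code):
#include <new>        // std::hardware_destructive_interference_size (C++17)
#include <cstddef>
#include <thread>
#include <iostream>
#include <functional>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t cachelineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t cachelineSize = 64;   // assumed cacheline size on the target machine
#endif

// Each sum gets its own cacheline, so the two threads no longer share one.
struct alignas(cachelineSize) PaddedDouble
{
    double value;
};

void Add(double &sum, const int startIndex, const int endIndex)
{
    sum = 0.0;
    for (int i = startIndex; i < endIndex; i++)
        sum = sum + 0.1;
}

int main()
{
    int n = 100'000'000;
    PaddedDouble sum1, sum2;
    std::thread t1(Add, std::ref(sum1.value), 0, n / 2);
    std::thread t2(Add, std::ref(sum2.value), n / 2, n);
    t1.join();
    t2.join();
    std::cout << "sum: " << sum1.value + sum2.value << std::endl;
}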
Another, and probably much simpler, approach is to modify the worker function to use a local accumulator:
void Add(double &sum, const int startIndex, const int endIndex)
{
    double t = 0.0;
    for (int i = startIndex; i < endIndex; i++)
    {
        t = t + 0.1;
    }
    sum = t;
}
You still have false sharing, and the two threads will fight over access to sum1 and sum2, but now that happens only once per thread (at the final write) and becomes irrelevant.
Related
So I want to optimize the sum of a really big array and in order to do that I have written a multi-threaded code. The problem is that with this code I'm getting better timing results using only one thread instead of 2 or 3 or 4 threads...
Can someone explain to me why this happens?
(Also I've only started coding in C++ this semester, until then I only knew C, so I'm sorry for possible dumb mistakes)
This is the thread code
void threadsum(int threadID, int numThreads, double *v, double *localSum, size_t stop)
{
    *localSum = 0.0;
    for (size_t i = 0; i < stop; i++)
        *localSum += v[i];
}
Main process code
int main(int argc, char *argv[])
{
    int numThreads = atoi(argv[1]);
    int N = 100000000;

    // create the input vector v and put some values in v
    vector<double> v(N);
    for (int i = 0; i < N; i++)
        v[i] = i;

    // this vector will contain the partial sum for each thread
    vector<double> localSum(numThreads, 0);

    // create threads. Each thread will compute part of the sum and store
    // its result in localSum[threadID] (threadID = 0, 1, ... numThread-1)
    startChrono();

    vector<thread> myThreads(numThreads);
    for (int i = 0; i < numThreads; i++){
        int start = i * v.size() / numThreads;
        myThreads[i] = thread(threadsum, i, numThreads, &v[start], &localSum[i], v.size()/numThreads);
    }
    for_each(myThreads.begin(), myThreads.end(), mem_fn(&thread::join));

    // calculate global sum
    double globalSum = 0.0;
    for (int i = 0; i < numThreads; i++)
        globalSum += localSum[i];

    cout.precision(12);
    cout << "Sum = " << globalSum << endl;
    cout << "Runtime: " << stopChrono() << endl;
    exit(EXIT_SUCCESS);
}
There are a few things:
1- The array just isn't big enough. A vectorized streaming add will be really hard to beat. You need a more complex function than add to really see results, or a very large array.
2- Related, the overhead of all the thread creation and joining is going to swamp any performance gains from the threading. Adding is really fast, and you can easily saturate the CPU's functional units. For a second thread to help, it can't even be a hyperthread on the same core; it would need to be on a different core entirely (the hyperthreads on one core compete for the same floating-point units).
To test this, you can try to create all the threads before you start the timer and stop them all after you stop the timer (have them set a done flag instead of waiting on the join).
3- All your localSum variables share the same cache line. It would be better to accumulate into a local variable on the stack and only write the result into the array at the end, instead of adding directly into the array (a sketch of this follows below): https://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
If, for some reason, you need to keep the sum observable to other threads in that array, pad the localSum vector entries like this so they don't share a cache line:
struct localsumentry {
    double sum;
    char pad[56];   // pads the struct to 64 bytes, one full cache line
};
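For point 3, a minimal sketch of the worker using a stack-local accumulator (the threadsum signature is assumed from the call site in the question; the parameter names are illustrative):
void threadsum(int threadID, int numThreads, double *v, double *localSum, size_t stop)
{
    // Accumulate in a local variable; the shared array is written only once at the end.
    double sum = 0.0;
    for (size_t i = 0; i < stop; i++)
        sum += v[i];
    *localSum = sum;
}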
I tried to write an answer to the "How to get 100% CPU usage from a C program" question using the thread class. Here is my code:
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>

using namespace std;

static int primes = 0;
void prime(int a, int b);
mutex mtx;

int main()
{
    unsigned int nthreads = thread::hardware_concurrency();
    vector<thread> threads;
    int limit = 1000000;
    int intrvl = (int) limit / nthreads;
    for (int i = 0; i < nthreads; i++)
    {
        threads.emplace_back(prime, i*intrvl+1, i*intrvl+intrvl);
    }
    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";
    for (thread & t : threads) {
        t.join();
    }
    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
    return 0;
}

void prime(int a, int b)
{
    for (; a <= b; a++) {
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            mtx.lock();
            primes++;
            mtx.unlock();
        }
    }
}
But when I run it, the CPU-usage graph I get is a sinusoid. When I run @Mysticial's answer, which uses OpenMP, the usage stays steady at full load instead.
I checked both programs with ps -eLf and both of them use 8 threads. Why do I get this unsteady graph, and how can I get the same result with thread that OpenMP gets?
There are some fundamental differences between Mysticial's answer and your code.
Difference #1
Your code creates a chunk of work for each CPU and lets it run to completion. This means that once a thread has finished, there will be a sharp drop in CPU usage, since that CPU will be idle while the other threads run to completion. This happens partly because scheduling is not always fair and partly because the chunks themselves are not equally expensive (testing larger numbers for primality takes longer), so one thread may progress, and finish, much faster than the others.
The OpenMP solution solves this by declaring schedule(dynamic) which tells OpenMP to, internally, create a work queue that all the threads will consume work from. When a chunk of work is finished, the thread that would have then exited in your code consumes another chunk of work and gets busy with it.
Eventually, this becomes a balancing act of picking adequately sized chunks. Too large, and the CPUs may not be maxed out toward the end of the task. Too small, and there can be significant overhead.
Difference #2
You are writing to a variable, primes, that is shared between all of the threads.
This has two consequences:
It requires synchronization to prevent a data race.
It makes the cache on a modern CPU very unhappy, since the cacheline holding primes has to be passed between cores before a write by one thread becomes visible to another.
The OpenMP solution solves this by reducing, via operator+(), the individual per-thread copies of primes into the final result. This is what reduction(+ : primes) does.
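For reference, a rough sketch of what such an OpenMP loop can look like -- this is not the original answer's code, just an illustration of schedule(dynamic) and reduction working together:
#include <iostream>
// compile with -fopenmp (or your compiler's equivalent)

int main()
{
    const int limit = 1000000;
    int primes = 0;

    // Each thread works on dynamically handed-out chunks and keeps a private
    // copy of 'primes'; OpenMP adds the copies together at the end.
    #pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (int a = 2; a < limit; a++) {
        int i = 2;
        while (i * i <= a && a % i != 0)
            i++;
        if (i * i > a)
            primes++;
    }

    std::cout << primes << " primes less than " << limit << "\n";
}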
With this knowledge of how OpenMP is splitting up, scheduling the work, and combining the results, we can modify your code to behave similarly.
#include <iostream>
#include <thread>
#include <vector>
#include <utility>
#include <algorithm>
#include <functional>
#include <mutex>
#include <future>

using namespace std;

int prime(int a, int b)
{
    int primes = 0;
    for (; a < b; a++) {   //< Half-open range [a, b) so adjacent chunks don't double-count the boundary.
        int i = 2;
        while (i <= a) {
            if (a % i == 0)
                break;
            i++;
        }
        if (i == a) {
            primes++;
        }
    }
    return primes;
}

int workConsumingPrime(vector<pair<int, int>>& workQueue, mutex& workMutex)
{
    int primes = 0;
    unique_lock<mutex> workLock(workMutex);
    while (!workQueue.empty()) {
        pair<int, int> work = workQueue.back();
        workQueue.pop_back();

        workLock.unlock(); //< Don't hold the mutex while we do our work.
        primes += prime(work.first, work.second);
        workLock.lock();
    }
    return primes;
}

int main()
{
    int nthreads = thread::hardware_concurrency();
    int limit = 1000000;

    // A place to put work to be consumed, and a synchronisation object to protect it.
    vector<pair<int, int>> workQueue;
    mutex workMutex;

    // Put all of the ranges into a queue for the threads to consume.
    int chunkSize = max(limit / (nthreads*16), 10); //< Handwaving: 16 chunks per thread seemed a reasonable factor.
    for (int i = 0; i < limit; i += chunkSize) {
        workQueue.push_back(make_pair(i, min(limit, i + chunkSize)));
    }

    // Start the threads.
    vector<future<int>> futures;
    for (int i = 0; i < nthreads; ++i) {
        packaged_task<int()> task(bind(workConsumingPrime, ref(workQueue), ref(workMutex)));
        futures.push_back(task.get_future());
        thread(move(task)).detach();
    }

    cout << "Number of logical cores: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    // Sum up all the results.
    int primes = 0;
    for (future<int>& f : futures) {
        primes += f.get();
    }

    cout << "There are " << primes << " prime numbers less than " << limit << ".\n";
}
This is still not a perfect reproduction of how the OpenMP example behaves. For example, this is closer to OpenMP's static schedule since chunks of work are a fixed size. Also, OpenMP does not use a work queue at all. So I may have lied a little bit -- call it a white lie since I wanted to be more explicit about showing the work being split up. What it is likely doing behind the scenes is storing the iteration that the next thread should start at when it becomes available, along with a heuristic for the next chunk size.
Even with these differences, I'm able to max out all my CPUs for an extended period of time.
Looking to the future...
You probably noticed that the OpenMP version is a lot more readable. This is because it's meant to solve problems just like this. So, when we try to solve them without a library or compiler extension, we end up reinventing the wheel. Luckily, there is a lot of work being done to bring this sort of functionality directly into C++. Specifically, the Parallelism TS can help us out if we can represent this as a standard C++ algorithm. Then we could tell the library to distribute the algorithm across all CPUs as it sees fit so it does all the heavy lifting for us.
In C++11, with a little bit of help from Boost, this algorithm could be written as:
#include <iostream>
#include <iterator>
#include <algorithm>
#include <boost/range/irange.hpp>

using namespace std;

bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    auto range = boost::irange(0, 1000001);
    auto numPrimes = count_if(begin(range), end(range), isPrime);
    cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}
And to parallelise the algorithm, you just need to #include <execution_policy> (as proposed by the Parallelism TS; standard C++17 later settled on <execution>) and pass std::par (std::execution::par in C++17) as the first parameter to count_if.
auto numPrimes = count_if(par, begin(range), end(range), isPrime);
And that's the kind of code that makes me happy to read.
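For comparison, a hedged sketch of the same idea using the facilities that eventually shipped in C++17; note that the parallel execution policy needs standard library support (with libstdc++, for instance, that typically means linking against TBB):
#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>
#include <execution>

bool isPrime(int n)
{
    if (n < 2)
        return false;
    for (int i = 2; i < n; ++i) {
        if (n % i == 0)
            return false;
    }
    return true;
}

int main()
{
    // Materialise the range into a vector so the parallel algorithm gets random-access iterators.
    std::vector<int> range(1000001);
    std::iota(range.begin(), range.end(), 0);

    auto numPrimes = std::count_if(std::execution::par, range.begin(), range.end(), isPrime);
    std::cout << "There are " << numPrimes << " prime numbers less than " << range.back() << ".\n";
}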
Note: Absolutely no time was spent optimising this algorithm at all. If we were to do any optimisation, I'd look into something like the Sieve of Eratosthenes, which uses previous prime computations to help with future ones.
First, you need to realize that OpenMP usually has a fairly sophisticated thread pool under the covers, so matching it (exactly) will probably be at least somewhat difficult.
Second, it seems to me that before optimizing the threading, you should attempt to start with at least a halfway decent basic algorithm. In this case, the basic algorithm you're implementing is basically pretty awful. It's checking whether numbers are prime, but doing a lot of work that doesn't accomplish anything useful.
It's checking whether even numbers are prime. Other than 2, they're not. Ever.
It's checking whether odd numbers are divisible by even numbers. Again, they're not. Ever.
It's checking whether numbers are divisible by numbers larger than their square root. If there's no divisor smaller than the square root, there can't be one larger than the square root either.
Although it probably doesn't affect speed, I also find it a lot easier to have a function that checks whether a single number is prime, and just returns true/false to indicate the result, than to have somewhat elaborate code to figure out whether a preceding loop ran to completion or exited early.
You can optimize the algorithm by eliminating more than that, but that much doesn't strike me as "optimization" nearly so much as simply avoiding completely unnecessary pessimization.
At least in my opinion, it's also a bit easier (in this case) to use std::async to launch the threads. This lets us return a value from our thread (the count we want back) pretty easily.
So, let's start by fixing prime based on those observations:
#include <iostream>
#include <vector>
#include <thread>
#include <future>
#include <chrono>

using namespace std;

int prime(int a, int b)
{
    int count = 0;
    if (a == 2)
        ++count;
    if (a % 2 == 0)
        ++a;

    auto check = [](int i) -> bool {
        for (int j = 3; j*j <= i; j += 2)
            if (i % j == 0)
                return false;
        return true;
    };

    for (; a <= b; a += 2) {
        if (check(a))
            ++count;
    }
    return count;
}
Now, let me point out that this is already enough faster (even single-threaded) that, if all we wanted was the roughly 4x speed-up we'd get from perfect thread-scaling, we'd be done without using threading at all. For the limit you gave, this finishes in well under 1 second.
For the sake of argument, however, let's assume we want to get more, and make use of multiple cores too. One thing to realize here is that we generally want at least a few more threads than cores. The problem is fairly simple: with only one thread per core, we have nothing to make up for the fact that we haven't really distributed the load evenly between the threads--the thread processing the largest numbers has quite a bit more work to do than the thread processing the smallest numbers--but if we have (for example) a 4-core machine, as soon as one thread finishes, we can only use 75% of the CPU. Then when another thread finishes, it drops to 50%. Then 25%, and finally it finishes, using only one core.
We could probably do some computation to attempt to distribute the load more evenly, but it's a lot easier to just split the load into, say, six or eight times as many threads as cores. This way the computation can continue using all the cores until there are only three threads remaining.¹
Putting all that into code, we can end up with something like this:
int main() {
    using namespace chrono;
    int limit = 50000000;
    unsigned int nthreads = 8 * thread::hardware_concurrency();

    cout << "\nComputing multi-threaded:\n";
    cout << "Number of threads: " << nthreads << "\n";
    cout << "Calculating number of primes less than " << limit << "... \n";

    auto start2 = high_resolution_clock::now();

    vector<future<int>> threads;
    int intrvl = limit / nthreads;
    for (int i = 0; i < nthreads; i++)
        threads.emplace_back(std::async(std::launch::async, prime, i*intrvl + 1, (i + 1)*intrvl));

    int primes = 0;
    for (auto &t : threads)
        primes += t.get();

    auto end2 = high_resolution_clock::now();
    cout << "Primes: " << primes << ", Time: " << duration_cast<milliseconds>(end2 - start2).count() << "\n";
}
Note a couple of points:
This runs enough faster that I've increased the upper limit by a fairly large factor, so it'll run long enough that we can at least see it use 100% of the CPU time for a few seconds before it's done.²
I've added some timing code to get a little more accurate idea of how long it runs for.
At least when I run this, it seems to act about as we'd expect/hope: it uses 100% of the CPU time until it gets very close to the end, when it starts to drop just before finishing (i.e., when we have fewer threads to execute than we have cores to execute them).
¹ In case you wonder how OpenMP avoids this: it usually uses a thread pool, so some number of iterations of the loop is dispatched to the thread pool as a task. This lets it produce a large number of tasks without having a huge number of threads contending for CPU time simultaneously.
² With the upper limit you used, it finished on my machine in about 90 milliseconds, which isn't long enough for it to even make a noticeable blip on the CPU usage graph.
The OpenMP example is using "reduction" on the sum variable primes which means that each task sums its own local primes variable.
OpenMP adds the thread local copies of primes together at the end of the parallel part to get the grand total.
That means it does not need to lock.
As @Sam says, a thread will get put to sleep if it cannot acquire the mutex lock.
So in your case, the threads will spend a fair amount of time asleep.
If you don't want to use OpenMP, try static std::atomic<int> primes = 0; then you don't need the mutex lock and unlock.
Or you could simulate OpenMP reduction by using an array primes[numThreads] where thread i sums into primes[i], then sum primes[] at the end (see the sketch below).
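A rough sketch of that last suggestion; countPrimes here is a hypothetical stand-in for whatever prime-counting routine you use:
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

// Hypothetical helper: counts primes in [a, b]; any correct implementation will do.
int countPrimes(int a, int b)
{
    int count = 0;
    for (; a <= b; a++) {
        if (a < 2) continue;
        bool isPrime = true;
        for (int j = 2; j * j <= a; j++)
            if (a % j == 0) { isPrime = false; break; }
        if (isPrime) count++;
    }
    return count;
}

int main()
{
    unsigned int nthreads = thread::hardware_concurrency();
    int limit = 1000000;
    int intrvl = limit / nthreads;

    // One slot per thread: thread i writes only primes[i], so no mutex is needed.
    vector<int> primes(nthreads, 0);
    vector<thread> threads;
    for (unsigned int i = 0; i < nthreads; i++)
        threads.emplace_back([&, i] { primes[i] = countPrimes(i * intrvl + 1, (i + 1) * intrvl); });

    for (thread &t : threads)
        t.join();

    // "Reduction" step: add the per-thread partial counts, like OpenMP does at the end.
    int total = 0;
    for (int p : primes)
        total += p;
    cout << "There are " << total << " prime numbers less than " << limit << ".\n";
}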
I am studying pthreads but am confused about how to use them to synchronize the functions.
For example, I have a simple code to do some operations on an array, like the following:
#include <iostream>
using namespace std;

float add(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + numbers[i] + 5;
    }
    return sum/5;
}

float subtract(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + numbers[i] - 10;
    }
    return sum/5;
}

float mul(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + (float)numbers[i] * 1.5;
    }
    return sum/5;
}

float div(int numbers[5]){
    float sum;
    for(int i = 0; i < 5; i++){
        sum = sum + (float)numbers[i] / 2;
    }
    return sum/5;
}

int main(){
    int numbers [5] = { 34, 2, 77, 40, 12 };
    float addition = add(numbers);
    float subtraction = subtract(numbers);
    float multiplication = mul(numbers);
    float division = div(numbers);
    cout << addition + subtraction + multiplication + division << endl;
    return -1;
}
Since all four functions are independent of each other and use the same input, how can I put each operation into its own thread and let the functions (or threads) run at the same time?
I think that if one day I have a very large array and run a program like the above, it will take a lot of time, but if I can make the functions run simultaneously, it will save a lot of time.
First of all, I suspect you are not clear on how arrays are passed into functions. float subtract(int numbers[5]) says nothing about the size of the passed array. It is equivalent to float subtract(int numbers[]) (no size), which, in turn, is equivalent to float subtract(int* numbers) (pointer to int). You also have a bug in your function, since you do not initialize sum before its first use (the same applies to your other functions).
With this in mind, the whole subtract function is better written like this:
float subtract(int* numbers, const size_t size) {
    float sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum = sum + numbers[i] - 10;
    }
    return sum / size;   // divide by however many elements were actually passed in
}
Now, once we have cleared up the function itself, we can tackle multithreading. I really suggest ditching pthreads and using C++11's threading facilities instead. That is especially true when you need to get the result back as the return value of the function; doing it with pthreads would require too much typing. The relevant code in C++ would look similar to this:
int numbers[] = {34, 2, 77, 40, 12}; // No need to provide the array size when it is initialised
auto sub_result = std::async(std::launch::async, &subtract, numbers, sizeof(numbers) / sizeof(*numbers));
auto div_result = ....
// rest of functions
std::cout << "Result of subtraction: " << sub_result.get();
Now, this is a lot to grasp :) std::async is a way to run the function asynchronously without worrying about multithreading at all. The task of threading is delegated to the standard library. It is a much cleaner way than using pthreads - see, the invocation is not much different from a normal function invocation! The only thing to keep in mind is that it returns a so-called std::future object: a special object on which you can wait until the function that was run completes execution. It also has a get function, which waits until the function is completed and returns its result. Nice, eh?
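Putting it together, a minimal sketch of running all four operations concurrently and combining the results. The functions are assumed to be adapted to the pointer-plus-size form shown above, and div is renamed to divide so it cannot clash with the C library's div():
#include <iostream>
#include <future>
#include <cstddef>

float add(const int* numbers, std::size_t size) {
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) sum += numbers[i] + 5;
    return sum / size;
}

float subtract(const int* numbers, std::size_t size) {
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) sum += numbers[i] - 10;
    return sum / size;
}

float mul(const int* numbers, std::size_t size) {
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) sum += (float)numbers[i] * 1.5f;
    return sum / size;
}

float divide(const int* numbers, std::size_t size) {   // renamed from div to avoid the C library's div()
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) sum += (float)numbers[i] / 2;
    return sum / size;
}

int main() {
    int numbers[] = {34, 2, 77, 40, 12};
    const std::size_t size = sizeof(numbers) / sizeof(*numbers);

    // Launch each independent operation on its own thread.
    auto add_result = std::async(std::launch::async, &add,      numbers, size);
    auto sub_result = std::async(std::launch::async, &subtract, numbers, size);
    auto mul_result = std::async(std::launch::async, &mul,      numbers, size);
    auto div_result = std::async(std::launch::async, &divide,   numbers, size);

    // get() blocks until the corresponding function has finished.
    std::cout << add_result.get() + sub_result.get() + mul_result.get() + div_result.get() << std::endl;
}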
I have a C++ code which calculates the factorial of an int, the addition of two floats, and the execution time of each function, as follows:
long Sample_C::factorial(int n)
{
    int counter;
    long fact = 1;
    for (int counter = 1; counter <= n; counter++)
    {
        fact = fact * counter;
    }
    Sleep(100);
    return fact;
}

float Sample_C::add(float a, float b)
{
    return a+b;
}

int main(){
    Sample_C object;

    clock_t start = clock();
    object.factorial(6);
    clock_t end = clock();
    double time = (double)(end - start); // finding execution time of factorial()
    cout << time;

    clock_t starts = clock();
    object.add(1.1, 5.5);
    clock_t ends = clock();
    double total_time = (double)(ends - starts); // finding execution time of add()
    cout << total_time;

    return 0;
}
Now, I want to measure the GFLOPS for the "add" function. Kindly suggest how I can calculate it. As I am completely new to GFLOPS, kindly also tell me whether GFLOPS can be calculated for functions having only float data types, and whether the GFLOPS value varies between different functions?
If I was interested in estimating the execution time of the addition operation I might start with the following program. However, I would still only trust the number this program produced to within a factor of 10 to 100 at best (i.e. I don't really trust the output of this program).
#include <iostream>
#include <ctime>

int main (int argc, char** argv)
{
    // Declare these as volatile so the compiler (hopefully) doesn't
    // optimise them away.
    volatile float a = 1.0;
    volatile float b = 2.0;
    volatile float c;

    // Perform the calculation multiple times to account for a clock()
    // implementation that doesn't have a sufficient timing resolution to
    // measure the execution time of a single addition.
    const int iter = 1000;

    // Estimate the execution time of adding a and b and storing the
    // result in the variable c.
    // Depending on the compiler we might need to count this as 2 additions
    // if we count the loop variable.
    clock_t start = clock();
    for (unsigned i = 0; i < iter; ++i)
    {
        c = a + b;
    }
    clock_t end = clock();

    // Write the time for the user
    std::cout << (end - start) / ((double) CLOCKS_PER_SEC * iter)
              << " seconds" << std::endl;

    return 0;
}
If you knew how your particular architecture was executing this code you could then try and estimate FLOPS from the execution time but the estimate for FLOPS (on this type of operation) probably wouldn't be very accurate.
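As a rough sketch of that last step, continuing from the program above (it assumes the loop really performs exactly one float addition per iteration, which the compiler may well change):
// Continuing from the program above: turn the measured time into a FLOPS estimate.
// With only 1000 iterations clock() may well report 0 ticks, so increase iter
// until 'seconds' is comfortably non-zero before trusting the number.
double seconds = (end - start) / (double) CLOCKS_PER_SEC;
if (seconds > 0.0)
{
    double flops  = iter / seconds;   // assumes one float addition per iteration
    double gflops = flops / 1.0e9;
    std::cout << gflops << " GFLOPS (very rough estimate)" << std::endl;
}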
An improvement to this program might be to replace the for loop with a macro implementation or ensure your compiler expands for loops inline. Otherwise you may also be including the addition operation for the loop index in your measurement.
I think it is likely that the error wouldn't scale linearly with problem size. For example if the operation you were trying to time took 1e9 to 1e15 times longer you might be able to get a decent estimate for GFLOPS. But, unless you know exactly what your compiler and architecture are doing with your code I wouldn't feel confident trying to estimate GFLOPS in a high level language like C++, perhaps assembly might work better (just a hunch).
I'm not saying it can't be done, but for accurate estimates there are a lot of things you might need to consider.
This might be some weird Linux quirk, but I'm observing very strange behavior.
The following code should compare a synchronous (serial) version of summing numbers with an async version. The thing is that I'm seeing a performance increase (it's not caching; it happens even when I split the code into two separate programs), while still observing the program as single-threaded (only one core is used).
strace does show some thread activity, but monitoring tools like top and its clones still show only one core in use.
The second problem I'm observing is that if I increase the number of spawned threads (by lowering the per-task boundary), the memory usage just explodes. What is the memory overhead of a thread? With 5000 threads I get ~10 GB of memory usage.
#include <iostream>
#include <random>
#include <chrono>
#include <future>
#include <vector>

using namespace std;
long long sum2(const vector<int>& v, size_t from, size_t to)
{
    const size_t boundary = 5*1000*1000;

    if (to - from <= boundary)
    {
        long long rsum = 0;
        for (; from < to; from++)
        {
            rsum += v[from];
        }
        return rsum;
    }
    else
    {
        size_t mid = from + (to - from)/2;
        auto s2 = async(launch::async, sum2, cref(v), mid, to);
        long long rsum = sum2(v, from, mid);
        rsum += s2.get();
        return rsum;
    }
}

long long sum2(const vector<int>& v)
{
    return sum2(v, 0, v.size());
}

long long sum(const vector<int>& v)
{
    long long rsum = 0;
    for (auto i : v)
    {
        rsum += i;
    }
    return rsum;
}

int main()
{
    const size_t vsize = 100*1000*1000;

    vector<int> x;
    x.reserve(vsize);

    mt19937 rng;
    rng.seed(chrono::system_clock::to_time_t(chrono::system_clock::now()));
    uniform_int_distribution<uint32_t> dist(0, 10);

    for (auto i = 0; i < vsize; i++)
    {
        x.push_back(dist(rng));
    }

    auto start = chrono::high_resolution_clock::now();
    long long suma = sum(x);
    auto end = chrono::high_resolution_clock::now();

    cout << "Sum is " << suma << endl;
    cout << "Duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    start = chrono::high_resolution_clock::now();
    suma = sum2(x);
    end = chrono::high_resolution_clock::now();

    cout << "Async sum is " << suma << endl;
    cout << "Async duration " << chrono::duration_cast<chrono::nanoseconds>(end - start).count() << " nanoseconds." << endl;

    return 0;
}
Maybe you observe one core being used because the overlap between threads doing work simultaneously is too short to be noticeable. Summing 5 million values from a contiguous area of memory should be very fast on modern hardware, so by the time the parent finishes summing, the child may have barely started, and the parent may spend most or all of its time waiting for the result from the child. Have you tried increasing the work unit to see if the overlap becomes noticeable?
Regarding the increased performance: even if there is zero overlap between threads because the work unit is too small, the multithreaded version can still benefit from the additional L1 cache memory. For such a test, memory is likely to be the bottleneck, and the sequential version will use only one L1 cache while the multithreaded version will use as many as there are cores.
Have you checked the times that are being printed? On my machine, the serial time is under 1s at -O2, whilst the parallel sum is several times faster. It's therefore entirely possible that the CPU usage is not high for long enough for things like "top" to register it, since they typically only refresh once per second.
If you increase the number of threads by reducing the count-per-thread, then you effectively increase the overhead of the thread management. If you have 5000 threads active, then your task will take 5000 * min-thread-stack-size in additional memory. On my machine that's 20 GB!
Why don't you try increasing the size of the source container? If you make the parallel section take long enough, you'll see the corresponding parallel CPU usage. However, be prepared: summing integers is fast, and generating the random numbers can take an order of magnitude or two longer than adding the numbers together.