Below is a small program meant to parallelize the approximation of the 1/(n^2) series. Note the global parameter NUM_THREADS.
My issue is that increasing the number of threads from 1 to 4 (the number of processors my computer has is 4) does not significantly affect the outcomes of timing experiments. Do you see a logical flaw in the ThreadFunction? Is there false sharing or misplaced blocking that ends up serializing the execution?
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <future>
#include <chrono>
std::mutex sum_mutex; // This mutex is for the sum vector
std::vector<double> sum_vec; // This is the sum vector
int NUM_THREADS = 1;
int UPPER_BD = 1000000;
/* Thread function */
void ThreadFunction(std::vector<double> &l, int beg, int end, int thread_num)
{
double sum = 0;
for(int i = beg; i < end; i++) sum += (1 / ( l[i] * l[i]) );
std::unique_lock<std::mutex> lock1 (sum_mutex, std::defer_lock);
lock1.lock();
sum_vec.push_back(sum);
lock1.unlock();
}
void ListFill(std::vector<double> &l, int z)
{
for(int i = 0; i < z; ++i) l.push_back(i);
}
int main()
{
std::vector<double> l;
std::vector<std::thread> thread_vec;
ListFill(l, UPPER_BD);
int len = l.size();
int lower_bd = 1;
int increment = (UPPER_BD - lower_bd) / NUM_THREADS;
for (int j = 0; j < NUM_THREADS; ++j)
{
thread_vec.push_back(std::thread(ThreadFunction, std::ref(l), lower_bd, lower_bd + increment, j));
lower_bd += increment;
}
for (auto &t : thread_vec) t.join();
double big_sum = 0; // must be initialized before accumulating
for (double z : sum_vec) big_sum += z;
std::cout << big_sum << std::endl;
return 0;
}
From looking at your code, I suspect that ListFill is taking longer than ThreadFunction. Why pass a list of values to the thread instead of the bounds each thread should loop over? Something like:
void ThreadFunction( int beg, int end ) {
double sum = 0.0;
for(double i = beg; i < end; i++)
sum += (1.0 / ( i * i) );
std::unique_lock<std::mutex> lock1 (sum_mutex);
sum_vec.push_back(sum);
}
To maximize parallelism, you need to push as much work as possible onto the threads; see Amdahl's Law.
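For reference, Amdahl's Law bounds the speedup S on N threads when only a fraction p of the run time can be parallelized:

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
```

As a purely illustrative figure (an assumption, not a measurement of the program above): if the serial part, such as ListFill and thread startup, took half of the run time (p = 0.5), four threads could give at most S(4) = 1 / (0.5 + 0.125) = 1.6.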
In addition to dohashi's nice improvement, you can remove the need for the mutex by populating the sum_vec in advance in the main thread:
sum_vec.resize(NUM_THREADS);
then writing directly to it in ThreadFunction:
sum_vec[thread_num] = sum;
Since each thread writes to a distinct element and doesn't modify the vector itself, there is no need to lock anything.
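Putting both suggestions together might look like the sketch below. The function name ParallelInverseSquareSum and its parameters are made up for illustration; it assumes num_threads >= 1 and gives the remainder of the integer division to the last thread.

```cpp
#include <thread>
#include <vector>

// Sums 1/n^2 for n in [1, upper) using num_threads threads (assumed >= 1).
// Each thread writes only to its own pre-allocated slot, so no mutex is needed.
double ParallelInverseSquareSum(int upper, int num_threads)
{
    std::vector<double> partial(num_threads, 0.0); // one slot per thread
    std::vector<std::thread> threads;
    const int total = upper - 1;                   // terms n = 1 .. upper-1
    const int chunk = total / num_threads;

    for (int t = 0; t < num_threads; ++t) {
        const int beg = 1 + t * chunk;
        // The last thread also takes the remainder left by integer division.
        const int end = (t == num_threads - 1) ? upper : beg + chunk;
        // Bounds and index are captured by value; only the result vector is shared.
        threads.emplace_back([&partial, beg, end, t] {
            double sum = 0.0;                      // accumulate locally
            for (int i = beg; i < end; ++i) {
                const double n = i;
                sum += 1.0 / (n * n);
            }
            partial[t] = sum;                      // one write to a distinct element
        });
    }
    for (auto &th : threads) th.join();

    double total_sum = 0.0;
    for (double s : partial) total_sum += s;
    return total_sum;
}
```

Calling ParallelInverseSquareSum(1000000, 4) should give a value close to pi^2/6, and timing it for different thread counts measures only the summation rather than the list fill.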
Related
I am trying to write a program to solve the producer consumer problem with threads in C++, and from what I can tell the program works fine until the very end when the threads are supposed to exit with the join() function. (The Product object is a simple data container).
#include <iostream>
#include <random>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <sstream>
#include <vector>
#include <stack>
#include <thread>
#include <mutex>
#include <atomic>
#include <condition_variable>
#include <Product.h>
using namespace std;
const int max_items = 100;
atomic<int> itemNum(0);
atomic<int> numProducersWorking(0);
stack<Product> items;
int maxBuffer;
float storeSales[10];
float monthSales[12];
float totalSales;
mutex xmutex;
condition_variable isNotFull;
condition_variable isNotEmpty;
int intRand(const int & min, const int & max) {
static thread_local mt19937 generator(time(0));
uniform_int_distribution<int> distribution(min,max);
return distribution(generator);
}
float floatRand(const float & min, const float & max) {
static thread_local mt19937 generator(time(0));
uniform_real_distribution<float> distribution(min,max);
return distribution(generator);
}
void produce(int pId)
{
unique_lock<mutex> lock(xmutex);
int day, month, year, id, regNum;
float saleAmnt;
Product item;
id = pId;
day = intRand(1, 30);
month = intRand(1, 12);
year = 20;
regNum = intRand(1, 6);
saleAmnt = floatRand(0.50, 999.99);
item = Product(day, month, year, id, regNum, saleAmnt);
isNotFull.wait(lock, [] { return items.size() != maxBuffer; });
if(itemNum < max_items)
{
items.push(item);
itemNum++;
}
isNotEmpty.notify_all();
}
void consume(int cId)
{
unique_lock<mutex> lock(xmutex);
Product item;
isNotEmpty.wait(lock, [] { return items.size() > 0; });
item = items.top();
items.pop();
storeSales[item.getStoreID()-1] += item.getSaleAmnt();
monthSales[item.getMonth()-1] += item.getSaleAmnt();
totalSales += item.getSaleAmnt();
isNotFull.notify_all();
}
void producer(int id)
{
++numProducersWorking;
while(itemNum < max_items)
{
produce(id);
this_thread::sleep_for(chrono::milliseconds(intRand(5, 40)));
}
--numProducersWorking;
}
void consumer(int id)
{
while(numProducersWorking != 0 || items.size() > 0 )
{
consume(id);
}
}
int main()
{
int p, c, b;
p = 5;
c = 5;
b = 5;
maxBuffer = b;
vector<thread> prodsCons;
auto start = chrono::high_resolution_clock::now();
//create producers
for(int i = 1; i <= p; i++)
{
prodsCons.push_back(thread(producer, i));
}
//create consumers
for(int i = 0; i < c; i++)
{
prodsCons.push_back(thread(consumer, i));
}
int x = 0;
//wait for consumers and producers to finish
for(auto& th : prodsCons)
{
th.join();
cout<<"thread "<<x<<" joined"<<endl;
x++;
}
auto stop = chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::microseconds>(stop - start);
cout<<"Store-wide total sales: "<<endl;
for(int x = 1; x <= p; x++)
{
cout<<" store "<<x<<" sales: $"<<storeSales[x-1]<<endl;
}
cout<<"Month-wise total sales: "<<endl;
for(int x = 1; x <= 12; x++)
{
cout<<" month "<<x<<" sales: $"<<monthSales[x-1]<<endl;
}
cout<<"Total sales: $"<<totalSales<<endl;
cout<<"Simulation time: "<<duration.count()<<" microseconds"<<endl;
}
The output looks like this:
thread 0 joined
thread 1 joined
thread 2 joined
thread 3 joined
thread 4 joined
indicating that 5 out of the 10 threads aren't exiting (most likely the consumers), and so the program never reaches the end. Is there a condition that isn't being fulfilled, or did I implement the mutexes incorrectly?
Once a consumer thread reaches the condition_variable::wait call inside consume(), it will not return until it is signaled and its predicate is true.
I typically add a shutdown flag, protected by the same mutex as the queue, and make the wait condition depend on both the shutdown flag and the queue size.
When it's time for the consumers to stop, I acquire the mutex, set the shutdown flag, and notify all waiters. On return from the wait, the consumer either exits as soon as shutdown is set, or exits only if shutdown is set and the queue is also empty. The former is an immediate shutdown, while the latter is a shutdown once all work is complete.
Also, all access to the items stack must be protected by the mutex. You've done that in some places, but not others.
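A minimal sketch of the "shutdown once work is complete" variant follows. WorkQueue, ConsumeAll, and RunDemo are hypothetical names, the queue holds plain ints instead of Product objects, and it is unbounded (no isNotFull condition) to keep the example short.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical container for illustration; not from the original program.
struct WorkQueue {
    std::mutex m;
    std::condition_variable not_empty;
    std::queue<int> items;
    bool shutdown = false; // protected by m, just like items
};

// Consumer loop: exits only when shutdown is set AND the queue is empty.
int ConsumeAll(WorkQueue &q)
{
    int total = 0;
    for (;;) {
        std::unique_lock<std::mutex> lock(q.m);
        q.not_empty.wait(lock, [&q] { return q.shutdown || !q.items.empty(); });
        if (q.items.empty()) return total; // woken by shutdown with nothing left
        total += q.items.front();
        q.items.pop();
    }
}

// Pushes 1..n, then requests shutdown; returns the sum seen by all consumers.
int RunDemo(int n, int consumers)
{
    WorkQueue q;
    std::vector<int> totals(consumers, 0);
    std::vector<std::thread> threads;
    for (int c = 0; c < consumers; ++c)
        threads.emplace_back([&q, &totals, c] { totals[c] = ConsumeAll(q); });

    for (int i = 1; i <= n; ++i) {
        {
            std::lock_guard<std::mutex> lock(q.m);
            q.items.push(i);
        }
        q.not_empty.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(q.m);
        q.shutdown = true;            // set under the same mutex as the queue
    }
    q.not_empty.notify_all();         // wake every waiting consumer

    for (auto &t : threads) t.join();
    int sum = 0;
    for (int t : totals) sum += t;
    return sum;
}
```

Because the flag is checked inside the wait predicate, a consumer that is already blocked wakes up on notify_all and returns instead of waiting forever, which is what lets every join() complete.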
Why does only the last thread do any work every time? I'm trying to divide a grid among N workers, but half of the grid is never touched and the other part is always processed by the last created thread. Should I use an array instead of a vector? Locks do not help to resolve this problem either.
#include <iostream>
#include <unistd.h>
#include <vector>
#include <stdio.h>
#include <cstring>
#include <future>
#include <thread>
#include <pthread.h>
#include <mutex>
using namespace std;
std::mutex m;
int main(int argc, char * argv[]) {
int iterations = atoi(argv[1]), workers = atoi(argv[2]), x = atoi(argv[3]), y = atoi(argv[4]);
vector<vector<int> > grid( x , vector<int> (y, 0));
std::vector<thread> threads(workers);
int start, end, lastworker, nwork;
int chunkSize = y/workers;
for(int t = 0; t < workers; t++){
start = t * chunkSize;
end = start + chunkSize;
nwork = t;
lastworker = workers - 1;
if(lastworker == t){
end = y; nwork = workers - 1;
}
threads[nwork] = thread([&start, &end, &x, &grid, &t, &nwork, &threads] {
cout << " ENTER TO THREAD -> " << threads[nwork].get_id() << endl;
for (int i = start; i < end; ++i)
{
for (int j = 0; j < x; ++j)
{
grid[i][j] = t;
}
}
sleep(2);
});
cout << threads[nwork].get_id() << endl;
}
for(auto& th : threads){
th.join();
}
for (int i = 0; i < y; ++i)
{
for (int j = 0; j < x; ++j)
{
cout << grid[i][j];
}
cout << endl;
}
return(0);
}
[&start, &end, &x, &grid, &t, &nwork, &threads]
This line is the root of the problem. You are capturing all the variables by reference, which is not what you want to do.
As a consequence, each thread uses the same variables, which is also not what you want.
You should only capture grid and threads by reference; the other variables should be captured by value ('copied' into the lambda):
[start, end, x, &grid, t, nwork, &threads]
Also, the grid indices are swapped everywhere: grid is allocated as x rows of y elements, but i runs over y and j runs over x, so change grid[i][j] to grid[j][i].
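Both fixes can be sketched as follows. FillGrid and its parameter names are invented for this example, and it assumes workers >= 1; the per-thread values are copied into the closure while only grid is shared.

```cpp
#include <thread>
#include <vector>

// Fills a rows x cols grid so that each chunk of columns holds the index
// of the worker that wrote it.
std::vector<std::vector<int>> FillGrid(int rows, int cols, int workers)
{
    std::vector<std::vector<int>> grid(rows, std::vector<int>(cols, -1));
    std::vector<std::thread> threads;
    const int chunk = cols / workers;

    for (int t = 0; t < workers; ++t) {
        const int start = t * chunk;
        // The last worker also takes the columns left over by integer division.
        const int end = (t == workers - 1) ? cols : start + chunk;
        // start, end, rows, and t are captured by value; grid by reference.
        threads.emplace_back([start, end, rows, t, &grid] {
            for (int col = start; col < end; ++col)
                for (int row = 0; row < rows; ++row)
                    grid[row][col] = t; // first index is the row, second the column
        });
    }
    for (auto &th : threads) th.join();
    return grid;
}
```

Each thread now keeps the start, end, and t values it was created with, no matter how far the loop in the calling thread has advanced.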
thread([&start, &end, &x, &grid, &t, &nwork, &threads] {
The lambda closure executed by every thread captures nwork by reference.
This means that as the for loop iterates and starts each thread, every closure refers to whatever value nwork holds at the moment the closure reads it, not the value it had when the thread was created.
The outer loop probably finishes creating all the thread objects before the threads actually start running the lambda, so each closure sees the same value of nwork, namely the last thread's index.
You need to capture nwork by value instead of by reference.
You're passing all the thread parameters as references to the thread lambda. When the loop continues in the main thread, those variables change, which changes the values seen by the threads as well and breaks all the previously created threads.
I'm using a for loop to create a given number of threads. Each one approximates part of my integral, and I want them to write their results back to an array so I can sum them later (if I understand correctly, I can't just do sum += in each thread because the updates would collide). Everything worked until I tried to retrieve the data from each thread; now I get this error:
calka.cpp:49:33: error: request for member 'get_future' in 'X', which is of non-class type 'std::promise<float>[(N + -1)]'
code:
#include <iostream> //cout
#include <thread> //thread
#include <future> //future , promise
#include <stdlib.h> //atof
#include <string> //string
#include <sstream> //stringstream
using namespace std;
// function: 4x^3 + (x^2)/3 - x + 3
// integral: x^4 + (x^3)/9 - (x^2)/2 + 3x
void thd(float begin, float width, promise<float> & giveback)
{
float x = begin + 0.5f * width; // 1/2 is integer division and evaluates to 0
float height = x*x*x*x + (x*x*x)/9 - (x*x)/2 + 3*x ;
float outcome = height * width;
giveback.set_value(outcome);
stringstream ss;
ss << this_thread::get_id();
string output = "thread #id: " + ss.str() + " outcome" + to_string(outcome);
cout << output << endl;
}
int main(int argc, char* argv[])
{
int sum = 0;
float begin = atof(argv[1]);
float size = atof(argv[2]);
int N = atoi(argv[3]);
float end = begin + N*size;
promise<float> X[N-1];
thread t[N];
for(int i=0; i<N; i++){
t[i] = thread(&thd, begin, size, ref(X[i]));
begin += size;
}
future<float> wynik_ftr = X.get_future();
float wyniki[N-1];
for(int i=0; i<N; i++){
t[i].join();
wyniki[i] = wynik_ftr.get();
}
//place for loop adding outcome from threads to sum
cout << N;
return 0;
}
Don't use a VLA (promise<float> X[N-1]). It is an extension provided by some compilers, so your code is not portable. Use std::vector instead.
It seems you want to split the calculation of the integral across N threads. Create N-1 background threads and execute one invocation of thd from the main thread. In main you combine all the results, so
you don't need wyniki to be an array that stores one result per thread,
because you gather the results serially inside the for loop in main.
A single float wyniki variable is therefore sufficient.
The steps are:
prepare N promises
start N-1 threads
call thd from main
join and add the results from the N-1 threads in a for loop
add the main thread's result
Code:
std::vector<promise<float>> partialResults(N);
std::vector<thread> t(N-1);
for (int i = 0; i<N-1; i++) {
t[i] = thread(&thd, begin, size, ref(partialResults[i]));
begin += size;
}
thd(begin,size,ref(partialResults[N-1]));
float wyniki = 0.0f;
for (int i = 0; i<N-1; i++) {
t[i].join();
std::future<float> res = partialResults[i].get_future();
wyniki += res.get();
}
std::future<float> res = partialResults[N-1].get_future(); // get res from main
wyniki += res.get();
cout << wyniki << endl;
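As an alternative sketch (not the code above): each future can be taken from its promise before the thread starts, and the promise moved into the thread. MidpointSlice, Integrand, and Integrate are hypothetical names, and this version assumes the goal is the midpoint rule applied to the function 4x^3 + (x^2)/3 - x + 3 from the question's comment, rather than the antiderivative that thd evaluates.

```cpp
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Integrand taken from the question's comment: 4x^3 + x^2/3 - x + 3.
float Integrand(float x) { return 4 * x * x * x + (x * x) / 3 - x + 3; }

// Computes one midpoint-rule slice and publishes it through the promise.
void MidpointSlice(float begin, float width, std::promise<float> result)
{
    const float x = begin + 0.5f * width;   // midpoint of the slice
    result.set_value(Integrand(x) * width);
}

// Integrates over n slices of the given width, one thread per slice.
float Integrate(float begin, float width, int n)
{
    std::vector<std::future<float>> futures;
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        std::promise<float> p;
        futures.push_back(p.get_future());  // one future per promise
        threads.emplace_back(MidpointSlice, begin + i * width, width, std::move(p));
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += futures[i].get();            // blocks until slice i is ready
        threads[i].join();
    }
    return sum;
}
```

The point relevant to the original error is that get_future() is called on one promise object at a time, never on the container that holds them.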
I am looking for the fastest way to have multiple threads reading from the same small vector (one which is not static but will only ever be changed by the main thread and only ever when the child threads are not reading from it) of pointers.
I've tried using a shared std::vector of pointers which is somewhat faster than a shared array of pointers but still slower per thread... I thought that the reason for that is the threads reading so close together in memory causing false sharing, but I am unsure.
I'm hoping there is either a way around that, since the data is read-only while the threads are accessing it, or an entirely different approach that is faster. Below is a minimal example:
#include <thread>
#include <iostream>
#include <iomanip>
#include <vector>
#include <atomic>
#include <chrono>
namespace chrono=std::chrono;
class A {
public:
A(int n=1) {
a=n;
}
int a;
};
void tfunc();
int nelements=10;
int nthreads=1;
std::vector<A*> elements;
std::atomic<int> complete;
std::atomic<int> remaining;
std::atomic<int> next;
std::atomic<int> tnow;
int tend=1000000;
int main() {
complete=false;
remaining=0;
next=0;
tnow=0;
for (int i=0; i < nelements; i++) {
A* a=new A();
elements.push_back(a);
}
std::thread threads[nthreads];
for (int i=0; i < nthreads; i++) {
threads[i]=std::thread(tfunc);
}
auto begin=chrono::high_resolution_clock::now();
while (tnow < tend) {
remaining=nthreads;
next=0;
tnow += 1;
while (remaining > 0) {}
// if {elements} is changed it is changed here
}
complete=true;
for (int i=0; i < nthreads; i++) {
threads[i].join();
}
auto complete=chrono::high_resolution_clock::now();
auto elapsed=chrono::duration_cast<chrono::microseconds>(complete-begin).count();
std::cout << std::setw(2) << nthreads << "Time - " << elapsed << std::endl;
}
void tfunc() {
int sum=0;
int tpre=0;
int curr=0;
while (tnow == 0) {}
while (!complete) {
if (tnow-tpre > 0) {
tpre=tnow;
while (remaining > 0) {
curr=next++;
if (curr > nelements) break;
for (int i=0; i < nelements; i++) {
if (i != curr) {
sum += elements[i] -> a;
}
}
remaining--;
}
}
}
}
For nthreads between 1 and 10, this outputs the following on my system (times are in microseconds):
1 Time - 281548
2 Time - 404926
3 Time - 546826
4 Time - 641898
5 Time - 714259
6 Time - 812776
7 Time - 922391
8 Time - 994909
9 Time - 1147579
10 Time - 1199838
I am wondering if there is a faster way to do this or if such a parallel operation will always be slower than serial due to the smallness of the vector.
I am trying to implement a procedure in parallel with OpenMP. It contains four levels of nested (dependent) for loops and a variable sum_p that is updated in the innermost loop. In short, my question concerns the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
bin_i = test_map.at(i); //test_map is a "STL map of vectors"
len_rank_bin_i = bin_i.size(); // bin_i is a vector
for (int j = (i - 1); j >= 0; --j) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i]; //node_u is a scalar
for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_p += 1;
}
}
}
}
The full program is given below:
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <time.h>
int main(int argc, char* argv[]){
double time_temp;
int test_map_size = 5000;
std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);
// Fill the test map with random integers ----------------------------------
std::random_device rd;
std::mt19937 gen1(rd());
std::uniform_int_distribution<int> dist(1, 5);
auto gen = std::bind(dist, gen1);
for(int i = 0; i < test_map_size; i++)
{
int vector_len = dist(gen1);
std::vector<unsigned int> tt(vector_len);
std::generate(begin(tt), end(tt), gen);
test_map.insert({i,tt});
}
// Sequential implementation -----------------------------------------------
time_temp = omp_get_wtime();
std::vector<unsigned int> bin_i, bin_j;
unsigned int node_v, node_u;
unsigned int len_rank_bin_i;
unsigned int len_rank_bin_j;
int sum_s = 0;
for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
bin_i = test_map.at(i);
len_rank_bin_i = bin_i.size();
for (unsigned int j = i; j-- > 0; ) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i];
for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_s += 1;
}
}
}
}
std::cout<<"Estimated sum (seq): "<<sum_s<<std::endl;
time_temp = omp_get_wtime() - time_temp;
printf("Time taken for sequential implementation: %.2fs\n", time_temp);
// Parallel implementation -----------------------------------------------
time_temp = omp_get_wtime();
int sum_p = 0;
omp_set_num_threads(4);
#pragma omp parallel
{
std::vector<unsigned int> bin_i, bin_j;
unsigned int node_v, node_u;
unsigned int len_rank_bin_i;
unsigned int len_rank_bin_j;
unsigned int i, u_i, v_i;
int j;
#pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
for (i = (test_map_size - 1); i >= 1; --i) {
bin_i = test_map.at(i);
len_rank_bin_i = bin_i.size();
#pragma omp parallel for private(u_i,v_i)
for (j = (i - 1); j >= 0; --j) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
#pragma omp parallel for private(v_i)
for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i];
#pragma omp parallel for
for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_p += 1;
}
}
}
}
}
std::cout<<"Estimated sum (parallel): "<<sum_p<<std::endl;
time_temp = omp_get_wtime() - time_temp;
printf("Time taken for parallel implementation: %.2fs\n", time_temp);
return 0;
}
Compiling the code with the command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp on macOS 10.13.3 (i5 processor with four logical cores) and running it gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The parallel implementation takes many times longer than the sequential one. Do you think the code or logic can be adapted to a parallel implementation? I have spent a few days trying to improve the terrible performance of my code, but to no avail. Any help is greatly appreciated.
Update
With the changes suggested by JimCownie, i.e., "using omp for, not omp parallel for" and removing the parallelism of the inner loops, the performance is greatly improved.
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
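For readers following along, the changed structure might look roughly like the sketch below. CountPairs is a made-up name, the bins are passed as a vector of vectors instead of the unordered_map, and schedule(dynamic) is an added assumption to balance the triangular workload; none of these come from the original program.

```cpp
#include <vector>

// Counts pairs (u in bins[i], v in bins[j]) with j < i and u > v.
// Without -fopenmp the pragmas are ignored and the loop simply runs serially.
long long CountPairs(const std::vector<std::vector<unsigned int>> &bins)
{
    long long sum = 0;
    const int n = static_cast<int>(bins.size());
    #pragma omp parallel
    {
        // One worksharing loop inside one parallel region; inner loops stay serial.
        #pragma omp for schedule(dynamic) reduction(+:sum)
        for (int i = 1; i < n; ++i) {
            const std::vector<unsigned int> &bin_i = bins[i]; // reference, no copy
            for (int j = 0; j < i; ++j) {
                const std::vector<unsigned int> &bin_j = bins[j];
                for (unsigned int u : bin_i)
                    for (unsigned int v : bin_j)
                        if (u > v) sum += 1;
            }
        }
    }
    return sum;
}
```

The nested "omp parallel for" directives in the first version each opened a new parallel region inside an existing one, whereas here the threads are created once and only the outer iterations are divided among them.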
My CPU has four logical cores (and I am using four threads), so now I am wondering whether there is any way to get four times better performance than the sequential implementation.
I see a different problem when my map of vectors test_map is short but fat at each level, i.e., the map size is small but the vector at each key is very large. In that case the sequential and parallel implementations perform about the same. It seems we need to parallelize the inner loops too. Do you know how to achieve that in this context?