I am trying to parallelize a for-loop that scans a std::map. Below is my toy program:
#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <cassert>
#include <omp.h>
#define NUM 100000
using namespace std;
int main()
{
omp_set_num_threads(16);
int realThreads = 0;
string arr[] = {"0", "1", "2"};
std::map<int, string> myMap;
for(int i=0; i<NUM; ++i)
myMap[i] = arr[i % 3];
string is[NUM];
#pragma omp parallel for
for(map<int, string>::iterator it = myMap.begin(); it != myMap.end(); it++)
{
is[it->first] = it->second;
if(omp_get_thread_num() == 0)
realThreads = omp_get_num_threads();
}
printf("First for-loop with %d threads\n", realThreads);
realThreads = 0;
#pragma omp parallel for
for(int i=0; i<NUM; ++i)
{
assert(is[i] == arr[i % 3]);
if(omp_get_thread_num() == 0)
realThreads = omp_get_num_threads();
}
printf("Second for-loop with %d threads\n", realThreads);
return 0;
}
Compilation command:
icc -fopenmp foo.cpp
The output of the above code block is:
First for-loop with 1 threads
Second for-loop with 16 threads
Why am I not able to parallelize the first for-loop?
std::map does not provide random-access iterators, only the usual bidirectional ones, and OpenMP requires that the iterators used in parallel loops be of random-access type. With other kinds of iterators, explicit tasks should be used instead:
#pragma omp parallel
{
#pragma omp master
realThreads = omp_get_num_threads();
#pragma omp single
for(map<int, string>::iterator it = myMap.begin(); it != myMap.end(); it++)
{
#pragma omp task
is[it->first] = it->second;
}
}
Note that in this case a separate task is created for each element of the map. Since the task body is computationally trivial, the OpenMP tasking overhead will be relatively high in this particular case.
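If the per-element work really is that small, one way to amortize the tasking overhead is to create one task per block of elements rather than one per element. A minimal sketch (not part of the original answer; the chunk size of 1024 is an arbitrary choice):
#include <map>
#include <string>
#include <omp.h>

// Sketch: one task per chunk of 'chunkSize' map elements instead of one per element.
// 'myMap' and 'is' are assumed to match the example above.
void copyMapChunked(std::map<int, std::string> &myMap, std::string *is, int chunkSize = 1024)
{
    #pragma omp parallel
    #pragma omp single
    {
        auto it = myMap.begin();
        while (it != myMap.end()) {
            // Advance a second iterator by at most chunkSize positions.
            auto chunkEnd = it;
            for (int n = 0; n < chunkSize && chunkEnd != myMap.end(); ++n)
                ++chunkEnd;
            // Each task copies one contiguous range of the map.
            #pragma omp task firstprivate(it, chunkEnd)
            {
                for (auto cur = it; cur != chunkEnd; ++cur)
                    is[cur->first] = cur->second;
            }
            it = chunkEnd;
        }
    }
}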
Related
At some point in my code I have to perform an operation on every element of an unordered_map. To speed this up I want to use OpenMP, but the naive approach does not work:
std::unordered_map<size_t, double> hashTable;
#pragma omp for
for(auto it = hashTable.begin(); it != hashTable.end(); ++it){
//do something
}
The reason is that the iterator of an unordered_map is not a random-access iterator.
As an alternative I have tried the __gnu_parallel version of for_each. But the following code
#include <parallel/algorithm>
#include <omp.h>
__gnu_parallel::for_each (hashTable.begin(), hashTable.end(),[](std::pair<const size_t, double> & item)
{
//do something with item.second
});
compiled (gcc 4.8.2) with
g++ -fopenmp -march=native -std=c++11
does not run in parallel. Replacing the unordered_map with a vector and using the same __gnu_parallel call does run in parallel.
Why does it not run in parallel in case of the unordered map? Are there workarounds?
In the following I give you some simple code, which reproduces my problem.
#include <unordered_map>
#include <parallel/algorithm>
#include <omp.h>
int main(){
//unordered_map
std::unordered_map<size_t, double> hashTable;
double val = 1.;
for(size_t i = 0; i<100000000; i++){
hashTable.emplace(i, val);
val += 1.;
}
__gnu_parallel::for_each (hashTable.begin(), hashTable.end(),[](std::pair<const size_t, double> & item)
{
item.second *= 2.;
});
//vector
std::vector<double> simpleVector;
val = 1.;
for(size_t i = 0; i<100000000; i++){
simpleVector.push_back(val);
val += 1.;
}
__gnu_parallel::for_each (simpleVector.begin(), simpleVector.end(),[](double & item)
{
item *= 2.;
});
}
I am looking forward to your answers.
The canonical approach with containers that do not support random-access iterators is to use explicit OpenMP tasks:
std::unordered_map<size_t, double> hashTable;
#pragma omp parallel
{
#pragma omp single
{
for(auto it = hashTable.begin(); it != hashTable.end(); it++) {
#pragma omp task
{
//do something
}
}
}
}
This creates a separate task for each iteration, which brings some overhead and is therefore only worthwhile when //do something actually means //do quite a bit of work.
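Another workaround worth sketching (it assumes the table is not modified or rehashed while you iterate): collect pointers to the elements into a vector once, sequentially, and then run an ordinary parallel for over that vector, which does have random-access iterators:
#include <unordered_map>
#include <vector>
#include <omp.h>

int main() {
    std::unordered_map<size_t, double> hashTable;
    for (size_t i = 0; i < 1000000; ++i)
        hashTable.emplace(i, double(i) + 1.0);

    // Gather pointers to the elements once; the table must not be modified
    // or rehashed while the parallel loop runs.
    std::vector<std::pair<const size_t, double>*> items;
    items.reserve(hashTable.size());
    for (auto &kv : hashTable)
        items.push_back(&kv);

    // std::vector has random-access iterators, so a plain parallel for works.
    #pragma omp parallel for
    for (long i = 0; i < (long)items.size(); ++i)
        items[i]->second *= 2.;   // do something with each element
}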
You could split the loop over ranges of bucket indices and then use an intra-bucket iterator to handle the elements. unordered_map has bucket_count() and the bucket-specific iterator-yielding begin(bucket_number), end(bucket_number) that allow this. Assuming you haven't changed the default max_load_factor() from 1.0 and have a reasonable hash function, you will average one element per bucket and shouldn't waste too much time on empty buckets.
You can do this by iterating over the buckets of the unordered_map, like so:
#include <cmath>
#include <iostream>
#include <unordered_map>
int main(){
const int N = 10000000;
std::unordered_map<int, double> mymap(1.5*N);
//Load up a hash table
for(int i=0;i<N;i++)
mymap[i] = i+1;
#pragma omp parallel for default(none) shared(mymap)
for(size_t b=0;b<mymap.bucket_count();b++)
for(auto bi=mymap.begin(b);bi!=mymap.end(b);bi++){
for(int i=0;i<20;i++)
bi->second += std::sqrt(std::log(bi->second) + 1);
}
std::cout<<mymap.begin()->first<<" "<<mymap.begin()->second<<std::endl;
return 0;
}
I am trying to implement a procedure in parallel with OpenMP. It contains four levels of nested (dependent) for loops and a variable sum_p that is updated in the innermost loop. In short, my question concerns the parallel implementation of the following code snippet:
for (int i = (test_map.size() - 1); i >= 1; --i) {
bin_i = test_map.at(i); //test_map is a "STL map of vectors"
len_rank_bin_i = bin_i.size(); // bin_i is a vector
for (int j = (i - 1); j >= 0; --j) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
for (int u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i]; //node_u is a scalar
for (int v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_p += 1;
}
}
}
}
The full program is given below:
#include <iostream>
#include <vector>
#include <omp.h>
#include <random>
#include <unordered_map>
#include <algorithm>
#include <functional>
#include <time.h>
int main(int argc, char* argv[]){
double time_temp;
int test_map_size = 5000;
std::unordered_map<unsigned int, std::vector<unsigned int> > test_map(test_map_size);
// Fill the test map with random integers ---------------------------------
std::random_device rd;
std::mt19937 gen1(rd());
std::uniform_int_distribution<int> dist(1, 5);
auto gen = std::bind(dist, gen1);
for(int i = 0; i < test_map_size; i++)
{
int vector_len = dist(gen1);
std::vector<unsigned int> tt(vector_len);
std::generate(begin(tt), end(tt), gen);
test_map.insert({i,tt});
}
// Sequential implementation -----------------------------------------------
time_temp = omp_get_wtime();
std::vector<unsigned int> bin_i, bin_j;
unsigned int node_v, node_u;
unsigned int len_rank_bin_i;
unsigned int len_rank_bin_j;
int sum_s = 0;
for (unsigned int i = (test_map_size - 1); i >= 1; --i) {
bin_i = test_map.at(i);
len_rank_bin_i = bin_i.size();
for (unsigned int j = i; j-- > 0; ) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
for (unsigned int u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i];
for (unsigned int v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_s += 1;
}
}
}
}
std::cout<<"Estimated sum (seq): "<<sum_s<<std::endl;
time_temp = omp_get_wtime() - time_temp;
printf("Time taken for sequential implementation: %.2fs\n", time_temp);
// Parallel implementation -----------------------------------------------
time_temp = omp_get_wtime();
int sum_p = 0;
omp_set_num_threads(4);
#pragma omp parallel
{
std::vector<unsigned int> bin_i, bin_j;
unsigned int node_v, node_u;
unsigned int len_rank_bin_i;
unsigned int len_rank_bin_j;
unsigned int i, u_i, v_i;
int j;
#pragma omp parallel for private(j,u_i,v_i) reduction(+:sum_p)
for (i = (test_map_size - 1); i >= 1; --i) {
bin_i = test_map.at(i);
len_rank_bin_i = bin_i.size();
#pragma omp parallel for private(u_i,v_i)
for (j = (i - 1); j >= 0; --j) {
bin_j = test_map.at(j);
len_rank_bin_j = bin_j.size();
#pragma omp parallel for private(v_i)
for (u_i = 0; u_i < len_rank_bin_i; u_i++) {
node_u = bin_i[u_i];
#pragma omp parallel for
for (v_i = 0; v_i < len_rank_bin_j; v_i++) {
node_v = bin_j[v_i];
if (node_u> node_v)
sum_p += 1;
}
}
}
}
}
std::cout<<"Estimated sum (parallel): "<<sum_p<<std::endl;
time_temp = omp_get_wtime() - time_temp;
printf("Time taken for parallel implementation: %.2fs\n", time_temp);
return 0;
}
Running the code with the command g++-7 -fopenmp -std=c++11 -O3 -Wall -o so_qn so_qn.cpp on macOS 10.13.3 (i5 processor with four logical cores) gives the following output:
Estimated sum (seq): 38445750
Time taken for sequential implementation: 0.49s
Estimated sum (parallel): 38445750
Time taken for parallel implementation: 50.54s
The time taken by the parallel implementation is several times higher than that of the sequential implementation. Can the code or logic be restructured into an efficient parallel implementation? I have spent a few days trying to improve the terrible performance of my code, but to no avail. Any help is greatly appreciated.
Update
With the changes suggested by JimCownie, i.e., using omp for instead of omp parallel for and removing the parallelism of the inner loops, the performance is greatly improved.
Estimated sum (seq): 42392944
Time taken for sequential implementation: 0.48s
Estimated sum (parallel): 42392944
Time taken for parallel implementation: 0.27s
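A sketch of the improved region after those changes (a reconstruction along the lines of that suggestion, not necessarily the exact code used for the timings above; it reuses test_map and test_map_size from the program above):
// Single parallel region; worksharing only on the outer loop.
int sum_p = 0;
#pragma omp parallel num_threads(4)
{
    // 'omp for' (not 'omp parallel for'), so the existing team is reused and
    // no nested regions are created; schedule(dynamic) because larger i means
    // more inner-loop work in this triangular loop nest.
    #pragma omp for reduction(+:sum_p) schedule(dynamic)
    for (int i = test_map_size - 1; i >= 1; --i) {
        const auto &bin_i = test_map.at(i);   // reference, avoids copying the vector
        for (int j = i - 1; j >= 0; --j) {
            const auto &bin_j = test_map.at(j);
            for (std::size_t u_i = 0; u_i < bin_i.size(); ++u_i)
                for (std::size_t v_i = 0; v_i < bin_j.size(); ++v_i)
                    if (bin_i[u_i] > bin_j[v_i])
                        ++sum_p;
        }
    }
}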
My CPU has four logical cores (and I am using four threads); is there any way to get performance four times better than the sequential implementation?
I also see a different problem when my map of vectors test_map is short but fat at each level, i.e., the map size is small but the vector at each key is very large. In that case the performance of the sequential and parallel implementations is comparable, without much difference. It seems we need to parallelize the inner loops too. Do you know how to achieve that in this context?
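One thing I have been considering for that case (just a sketch, untested on real data) is to keep the key loops sequential and move the worksharing to the two element loops; the u_i/v_i loops form a rectangular iteration space, so they can be collapsed:
// Sketch for the "few keys, very long vectors" case. Reuses test_map and
// test_map_size from the program above.
int sum_p = 0;
for (int i = test_map_size - 1; i >= 1; --i) {
    const auto &bin_i = test_map.at(i);
    for (int j = i - 1; j >= 0; --j) {
        const auto &bin_j = test_map.at(j);
        // Inner loop bound does not depend on u_i, so collapse(2) is valid.
        #pragma omp parallel for collapse(2) reduction(+:sum_p)
        for (long u_i = 0; u_i < (long)bin_i.size(); ++u_i)
            for (long v_i = 0; v_i < (long)bin_j.size(); ++v_i)
                if (bin_i[u_i] > bin_j[v_i])
                    ++sum_p;
    }
}
This would only pay off if the bins are long enough to outweigh the cost of entering a parallel region for every (i, j) pair.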
I wish to iterate through all the elements in a std::multimap in parallel using OpenMP.
I tried to compile the following code using g++-7 (7.2.0) and icpc (18.0.0 20170811), but both failed.
Is this possible? If so, how can I parallelize a for-loop through a C++ std::multimap using OpenMP?
#include <map>
#include <omp.h>
int main() {
std::multimap<int,int> myMultimap;
for (int i = 0; i < 10; ++i){
myMultimap.insert(std::make_pair(i,i+1)); // create dummy contents
}
std::multimap<int, int>::const_iterator itr;
#pragma omp parallel for private (itr)
for (itr = myMultimap.cbegin(); itr != myMultimap.cend(); ++itr) {
// do something here
}
return 0;
}
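The task-based approach shown earlier for std::map applies here as well, since std::multimap also provides only bidirectional iterators. A minimal sketch:
#include <map>
#include <omp.h>

int main() {
    std::multimap<int, int> myMultimap;
    for (int i = 0; i < 10; ++i)
        myMultimap.insert(std::make_pair(i, i + 1));  // dummy contents

    // One explicit task per element; the iterator value is captured into the
    // task via firstprivate.
    #pragma omp parallel
    #pragma omp single
    {
        for (auto itr = myMultimap.cbegin(); itr != myMultimap.cend(); ++itr) {
            #pragma omp task firstprivate(itr)
            {
                // do something with itr->first / itr->second here
            }
        }
    }
    return 0;
}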
What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How can I avoid calling omp_get_thread_num() many times in an OpenMP simd loop?
I can use #pragma omp parallel, but will that give me a simd loop?
#include <vector>
#include <omp.h>
int main() {
std::vector<int> a(100);
auto a_size = a.size();
#pragma omp for simd
for (int i = 0; i < a_size; ++i) {
a[i] = omp_get_thread_num();
}
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>
int main() {
std::vector<int> a(100);
auto a_size = a.size();
#pragma omp parallel
{
const auto threadId = omp_get_thread_num();
#pragma omp for
for (int i = 0; i < a_size; ++i) {
a[i] = threadId;
}
}
}
As long as you use #pragma omp for (and don't put an extra parallel in there! Otherwise each of your n threads will spawn n more threads, and that's bad), it will ensure that, inside your parallel region, the for loop is split up amongst the n threads. Make sure the OpenMP compiler flag is turned on.
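If you also want the simd behaviour asked about in the question, the same hoisting combines with a for simd inside the parallel region. A sketch (whether the compiler actually vectorizes a store of a loop-invariant value is another matter):
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    const long a_size = (long)a.size();
    #pragma omp parallel
    {
        const int threadId = omp_get_thread_num();  // looked up once per thread
        // Combined worksharing + simd: each thread gets a contiguous chunk and
        // may vectorize the stores within it.
        #pragma omp for simd
        for (long i = 0; i < a_size; ++i)
            a[i] = threadId;
    }
}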
In the following example the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only 5 seconds. Any ideas why? (I can assure you this still holds true if you do real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;
void doNothing() {}
int run(int algorithmToRun)
{
auto startTime = std::chrono::system_clock::now();
for(int j=1; j<100000; ++j)
{
if(algorithmToRun == 1)
{
vector<thread> threads;
for(int i=0; i<16; i++)
{
threads.push_back(thread(doNothing));
}
for(auto& thread : threads) thread.join();
}
else if(algorithmToRun == 2)
{
#pragma omp parallel for num_threads(16)
for(unsigned i=0; i<16; i++)
{
doNothing();
}
}
}
auto endTime = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = endTime - startTime;
return elapsed_seconds.count();
}
int main()
{
int cppt = run(1);
int ompt = run(2);
cout<<cppt<<endl;
cout<<ompt<<endl;
return 0;
}
OpenMP maintains a thread pool for its pragmas. Spinning up and tearing down threads is expensive; OpenMP avoids this overhead, so all it is doing is the actual work plus the minimal shared-memory shuttling of execution state. In your std::thread code you are spinning up and tearing down a new set of 16 threads on every iteration.
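To make that concrete, here is a sketch (not your code, just an illustration) of the std::thread variant restructured so that the 16 threads are created once and each thread runs all of the iterations itself:
#include <thread>
#include <vector>

void doNothing() {}

// Sketch: create the 16 threads once and let each one run all of the
// iterations, instead of creating and joining 16 threads per iteration
// as in algorithm 1 above.
void runThreadsOnce()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 16; ++i) {
        threads.emplace_back([] {
            for (int j = 1; j < 100000; ++j)
                doNothing();
        });
    }
    for (auto &t : threads)
        t.join();
}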
I tried the code from Choosing the right threading framework with a loop of 100 iterations, and it took:
OpenMP 0.0727, Intel TBB 0.6759 and the C++ thread library 0.5962 milliseconds.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
for (int i = 0; i < max_i; i++)
{
doNothing(band);
}
}
...
else if (algorithmToRun == 5)
{
thread bristle(nested_loop, max_i, band);
bristle.join();
}
This code seems to take less time than your original C++11 thread section.