I've got a loop which I parallelize using OpenMP. In this loop, I read a triangle from a file and perform some operations on this data. These operations are independent from one triangle to the next, so I thought this would be easy to parallelize, as long as I kept the actual reading of files in a critical section.
Order in which triangles are read is not important
Some triangles are read and get discarded pretty quickly, some need some more algorithmic work (bbox construction, ...)
I'm doing binary I/O
Using C++ ifstream *tri_data*
I'm testing this on an SSD
readTriangle calls file.read() and reads 12 floats from the ifstream.
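For reference, readTriangle looks roughly like this (a minimal sketch; the vec3 constructor taking three floats is an assumption):
void readTriangle(std::ifstream &file, vec3 &v0, vec3 &v1, vec3 &v2, vec3 &normal)
{
    // One binary read of 12 floats (48 bytes): three vertices plus the normal.
    float buf[12];
    file.read(reinterpret_cast<char *>(buf), sizeof(buf));
    v0     = vec3(buf[0], buf[1],  buf[2]);
    v1     = vec3(buf[3], buf[4],  buf[5]);
    v2     = vec3(buf[6], buf[7],  buf[8]);
    normal = vec3(buf[9], buf[10], buf[11]);
}
The loop itself: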
#pragma omp parallel for shared(tri_data)
for (int i = 0; i < ntriangles; i++) {
    vec3 v0, v1, v2, normal;
    #pragma omp critical
    {
        readTriangle(tri_data, v0, v1, v2, normal);
    }
    // ... work with the triangle here ...
}
Now, the behaviour I'm observing is that with OpenMP enabled, the whole process is slower.
I've added some timers to my code to track time spent in the I/O method, and time spent in the loop itself.
Without OpenMP:
Total IO IN time : 41.836 s.
Total algorithm time : 15.495 s.
With OpenMP:
Total IO IN time : 48.959 s.
Total algorithm time : 44.61 s.
My guess is that since the reading is in a critical section, the threads are just waiting for each other to finish using the file handle, resulting in a longer waiting time.
Any pointers on how to resolve this? My program would really benefit from being able to process the triangles it reads on multiple threads. I've tried toying with thread scheduling and related settings, but that doesn't seem to help much in this instance.
Since I'm working on an out-of-core algorithm, introducing a buffer to hold a multitude of triangles is not really an option.
So, the solution I would propose is based on a master/slave strategy, where:
the master (thread 0) performs all the I/O
the slaves do some work on the retrieved data
The pseudo-code would read something like the following:
#include <omp.h>

vector<vec3> v0;
vector<vec3> v1;
vector<vec3> v2;
vector<vec3> normal;
vector<int> tdone;
int nthreads;
int triangles_read = 0;
/* ... */
#pragma omp parallel shared(tri_data)
{
    int id = omp_get_thread_num();
    /*
     * Initialize all the buffers in the master thread.
     * Notice that the size in memory is similar to your example.
     */
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        v0.resize(nthreads);
        v1.resize(nthreads);
        v2.resize(nthreads);
        normal.resize(nthreads);
        tdone.resize(nthreads, 1);
    }
    if (id == 0) { // Producer thread
        int next = 1;
        while (triangles_read != ntriangles) {
            if (tdone[next]) { // If the next thread is free
                // Read data and fill that thread's buffer
                readTriangle(tri_data, v0[next], v1[next], v2[next], normal[next]);
                triangles_read++;
                tdone[next] = 0; // Flag thread 'next' to start working
                #pragma omp flush // Make the flag and the counter visible
            }
            next = next % (nthreads - 1) + 1; // Move on to the next consumer
        } // while
    } else { // Consumer threads
        while (true) { // Wait for work
            #pragma omp flush // Pick up the producer's latest writes
            if (tdone[id] == 0) {
                /* ... do work here on v0[id], v1[id], v2[id], normal[id] ... */
                tdone[id] = 1; // Mark this thread as free again
                #pragma omp flush
            }
            if (tdone[id] == 1 && triangles_read == ntriangles) break; // Work finished for all
        }
    }
    #pragma omp barrier
}
I am not sure if this is still valuable to you but that was a nice teaser anyhow!
Related
I'm trying to parallelize the below for loop with OpenMP, however only one thread seems to be running at a time. I can tell this based on the below observations:
Normally when I have prints inside the loop, the output is jumbled and lines are mixed together; here, however, all my output is printed cleanly, suggesting that only one thread is executing at a time.
There is some heavy dynamic programming computation going on inside the loop, however I only see CPU usage on one core in htop.
If I print the current thread number with omp_get_thread_num(), I only see one active thread at a time, e.g. I see some iterations all from thread 4, then some iterations all from thread 3, and so on.
This only happens after a while. For the first few iterations, things seem to run in parallel.
I'm not sure if there is anything wrong with the code that prevents OpenMP from running two threads in parallel. Below is the for loop and the templates for the functions called inside it. The functions only work with what's passed into them and don't modify any other data structures.
I suspect this may have something to do with the fact that I'm passing const references to things around. Could that be the case?
// variables
string ref ; // read-only access
vector<vector<Cluster>> _clusters(24) ;
vector<Cluster> position_clusters = some_function() ;
#pragma omp parallel for num_threads(24) schedule(dynamic, 10)
for (int i = 0; i < position_clusters.size(); i++) {
auto& pc = position_clusters[i] ;
if (pc.size() < 2) {
continue ;
}
vector<Cluster> type_clusters = type_cluster(pc);
for (Cluster &tc : type_clusters) {
if (tc.size() < 2) {
continue ;
}
auto clusters = cluster_breakpoints(tc, 0.7) ; // dynamic programming
for (const Cluster &c : clusters) {
auto result = dynamic_programming(c, ref) ; // dynamic programming
_clusters[omp_get_thread_num()].push_back(result);
}
}
}
// Templates:
vector<Cluster> type_cluster(const Cluster &c) ;
vector<Cluster> cluster_breakpoints(Cluster& cluster, float ratio) ;
vector<Cluster> dynamic_programming(const Cluster& cluster, const string& ref) ;
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) in the end.
So I come up with below approach:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a bunch of std::async tasks that run fill
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result is immediately in the array; performant.
Reading on various approaches, it seems atomic is a good candidate, but I am not sure what settings will be most performant in my case? And not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties, and give each one a different random seed. Then you could pool these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    #pragma omp parallel for
    for (long long i = 0; i < (long long)N_runs; i++)
    {
        auto sim = Simulator(seed + i); // one generator per run, each with its own seed
        results[i] = sim.GetResult();
    }
    return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
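Applied to the sketch above, only the pragma changes (same assumed Simulator interface as before):
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    // Threads grab 16 simulations at a time, so a few slow runs
    // don't leave the other threads idle.
    #pragma omp parallel for schedule(dynamic, 16)
    for (long long i = 0; i < (long long)N_runs; i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}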
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version at the end.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or you can pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
The problem with this is that synchronisation is quite expensive. But it's probably cheap compared to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
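Since the question asks whether atomics would cut it: yes, for handing out single indices a std::atomic counter can replace the mutex entirely. A sketch, using the same globals vec and Max as in the question and keeping fill's original signature:
#include <atomic>

std::atomic<int> NextIndex{0}; // takes the role of LastAdded

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while(true)
    {
        // fetch_add hands out every index exactly once, lock-free
        int workingIndex = NextIndex.fetch_add(1, std::memory_order_relaxed);
        if(workingIndex >= Max)
            break;
        vec[workingIndex] = sim.GetResult();//Produces SimResult.
    }
}
The relaxed ordering is enough here because each thread writes to distinct elements of vec, and the futures' get() calls synchronize with the completed tasks.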
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a std::async task per partition
auto fut1 = std::async(fill, 1, 0, Max / 2);
auto fut2 = std::async(fill, 2, Max / 2, Max);
// ...
}
I have a program structured similarly to this:
ssize_t remain = nsamp;
while (!nsamp || remain > 0) {
#pragma omp parallel for num_threads(nthread)
for (ssize_t ii=0; ii < nthread; ii++) {
<generate noise>
}
// write noise
out.write(data, nthread*PERITER);
remain -= nthread*PERITER;
}
The problem is that when I benchmark this, if I run with e.g. two threads, sometimes it takes about the same time as a single thread, and sometimes I get a 2x speedup. It feels like there's some sort of synchronization race condition that I'm running into; sometimes I hit it and things go smoothly, and sometimes (often) not.
Does anyone know what might be causing this and what the right way to parallelize a section inside of an outer while loop is?
Edit: Using strace, I see a lot of calls to sched_yield(). This is probably making it look like I'm doing a lot on the CPU while I'm actually fighting the scheduler for a good scheduling pattern.
You are creating a new set of threads each time the while loop is entered. After the parallel loop, the threads are destroyed. Because of the nature of a while loop, this might happen irregularly (depending on the condition).
So if your loop gets executed only a few times, the thread creation process might outweigh the actual workload, and you get at most sequential performance, if not less. However, the OpenMP runtime may detect that the loop is entered many times and keep threads alive.
Nothing is guaranteed though.
I'd suggest something like this.
For nsamp == 0 you'll need some more reasonable handling. For proper signal handling with OpenMP, please refer to this answer.
ssize_t remain = nsamp;
#pragma omp parallel num_threads(nthread) shared(out, remain, data)
while (remain > 0) {
#pragma omp for
for (ssize_t ii=0; ii < nthread; ii++) {
/* generate noise */
}
#pragma omp single
{
// write noise
out.write(data, nthread*PERITER);
remain -= nthread*PERITER;
}
}
I am using OpenMP to parallelize a for loop, like so:
std::string stringType = "somevalue";
#pragma omp parallel for reduction(+ : stringType)
//a for loop here, in which every iteration appends a string to stringType
The only way I can think to do this is to convert to an int representation in some way first and then convert back at the end, but this has obvious overhead. Are there any better ways to perform this style of operation?
As mentioned in comments, reduction assumes that the operation is associative and commutative. The values may be computed in any order and be "accumulated" through any kind of partial results and the final result will be the same.
There is no guarantee that an OpenMP for loop will distribute contiguous iterations to each thread unless the loop schedule explicitly requests that. There is no guarantee either that contiguous blocks will be distributed in increasing thread order (i.e. thread #0 might go through iterations 1000-1999 while thread #1 goes through 0-999). If you need that behavior, then you should define your own schedule.
Something like:
int N=1000;
std::string globalString("initial value");
#pragma omp parallel shared(N,globalString)
{
std::string localString; //Empty string
// Set schedule
int iterTo, iterFrom;
iterFrom = omp_get_thread_num() * (N / omp_get_num_threads());
if (omp_get_num_threads() == omp_get_thread_num()+1)
iterTo = N;
else
iterTo = (1+omp_get_thread_num()) * (N / omp_get_num_threads());
// Loop - concatenate a number of neighboring values in the right order
// No #pragma omp for: each thread goes through the loop, but loop
// boundaries change according to the thread ID
for (int ii=iterFrom; ii<iterTo ; ii++){
localString += get_some_string(ii);
}
// Dirty trick to concatenate strings from all threads in the good order
for (int ii=0;ii<omp_get_num_threads();ii++){
#pragma omp barrier
if (ii==omp_get_thread_num())
globalString += localString;
}
}
A better way would be to have a shared array of std::string, each thread using one as a local accumulator. At the end, a single thread can run the concatenation part (and avoid the dirty trick and all its overhead-heavy barrier calls).
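A minimal sketch of that variant (get_some_string stands in for whatever produces each piece, as above):
int N = 1000;
std::string globalString("initial value");
std::vector<std::string> localStrings;

#pragma omp parallel shared(N, globalString, localStrings)
{
    #pragma omp single
    localStrings.resize(omp_get_num_threads());
    // schedule(static) without a chunk size gives each thread one
    // contiguous block of iterations, in increasing thread order.
    #pragma omp for schedule(static)
    for (int ii = 0; ii < N; ii++)
        localStrings[omp_get_thread_num()] += get_some_string(ii);
    // The implicit barrier of the for loop has completed all the
    // partial strings; one thread concatenates them in thread order.
    #pragma omp single
    for (size_t tt = 0; tt < localStrings.size(); tt++)
        globalString += localStrings[tt];
}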
I am using a node with 16 cores, but when the code is run in parallel it runs hundreds of times slower than the serial version. I am unable to understand the reason. The parallel region is given below:
int Vector_mult_Matrix(vector<double> & vec, CTMC_matrix & ctmc_um)
{
vector<double> res_vec(vec.size(),0);
omp_set_num_threads(16);
#pragma omp parallel num_threads(16)
{
#pragma omp for schedule(static) nowait
for(size_t i=0; i<ctmc_um.trans_num; i++)
{
double temp = 0;
temp = res_vec[ctmc_um.to_index[i]]+vec[ctmc_um.from_index[i]]*ctmc_um.rate[i];
#pragma omp critical
res_vec[ctmc_um.to_index[i]] = temp;
}
}
vec.swap(res_vec);
return 0;
}
I am not sure why it is 100 times slower, but it is slower because multiple threads are reading from and writing to the same memory regions, and those regions have to be locked or you will see race conditions. (If you were only reading, no lock would be required.)
You are writing through res_vec[ctmc_um.to_index[i]], so even though OpenMP has split the iteration range across threads, the indices you actually write to (the values of ctmc_um.to_index[i]) can be arbitrarily tangled. Thus every thread may have to wait for another thread to finish its job, and there are 16 of them.
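A common way out (a sketch along those lines, keeping the question's interface): let every thread accumulate into a private vector and merge the copies afterwards, so the hot loop needs no critical section at all.
int Vector_mult_Matrix(vector<double> &vec, CTMC_matrix &ctmc_um)
{
    vector<double> res_vec(vec.size(), 0);
    #pragma omp parallel
    {
        // Each thread scatters into its own private copy, so the
        // tangled to_index values cannot collide across threads.
        vector<double> local(vec.size(), 0);
        #pragma omp for schedule(static) nowait
        for (size_t i = 0; i < ctmc_um.trans_num; i++)
            local[ctmc_um.to_index[i]] += vec[ctmc_um.from_index[i]] * ctmc_um.rate[i];
        // Merge the private copies one thread at a time.
        #pragma omp critical
        for (size_t j = 0; j < local.size(); j++)
            res_vec[j] += local[j];
    }
    vec.swap(res_vec);
    return 0;
}
This trades one temporary vector per thread for the removal of all locking inside the loop; with 16 threads the serial merge is usually far cheaper than 16 threads contending on one critical section per element.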