using #pragma omp parallel for makes the program slower - c++

My C++ program takes about 300 s to run.
Inside my program I need to component-wise divide my vectors. The VS profiler shows this takes about 15% of the running time. Here is the code:
template <class T> myVector<T> cWisDivide(myVector<T> &vec1,
                                          myVector<T> &vec2)
{
    try
    {
        if (vec1._rows == vec2._rows)
        {
            myVector<T> result(vec1._rows);
            //#pragma omp parallel for
            for (int r = 1; r <= vec1._rows; r++)
            {
                if (vec2(r) != 0)
                {
                    result(r) = vec1(r) / vec2(r);
                }
                else
                {
                    throw std::runtime_error("");
                }
            }
            return result;
        }
    }
    catch (const std::exception &e)
    {
        // ...
    }
}
This function is called many times.
If I use the #pragma ... before the loop, CPU usage sticks at 100% for about 350 s, which is more than the time taken to run the program sequentially.
I would appreciate it if anyone could help me with this issue.

This can go wrong in a number of ways:
• Without knowing the type of result, it's possible that barriers have to be built in to avoid a race condition when modifying it; you could avoid that by having per-thread result vectors that you merge afterwards.
• The copy overhead for the vec1 and vec2 vectors might be bigger than the performance reward.
• All in all, this is a question about parallelizable vector types; refer to your OpenMP documentation of choice to learn more about types that can be accessed in parallel.

Anyway, I just looked it up, and from the OpenMP specification:
• A throw executed inside a loop region must cause execution to resume within the same iteration of the loop region, and the same thread that threw the exception must catch it.
I knew I didn't like the look of the exception.
OpenMP API V4.0 page 59.
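Since the specification forbids an exception from escaping a loop iteration, the usual workaround is to keep the throw out of the parallel region entirely. Below is a minimal sketch of the asker's function reworked that way; it assumes the 1-based operator() and _rows interface shown in the question:

#include <stdexcept>

template <class T> myVector<T> cWisDivide(myVector<T> &vec1,
                                          myVector<T> &vec2)
{
    if (vec1._rows != vec2._rows)
        throw std::runtime_error("size mismatch");

    myVector<T> result(vec1._rows);
    int divByZero = 0;

    #pragma omp parallel for
    for (int r = 1; r <= vec1._rows; r++)
    {
        if (vec2(r) != 0)
        {
            result(r) = vec1(r) / vec2(r);
        }
        else
        {
            #pragma omp atomic write
            divByZero = 1;   // remember the error; do not throw here
        }
    }

    if (divByZero)           // safe: we are outside the parallel region
        throw std::runtime_error("division by zero");
    return result;
}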

Related

Multithreaded concurrent file reading/writing, managing container of processes

Wholly new to multithreading.
I am writing a program which takes as input a vector of objects and an integer for the number of threads to dedicate. The nature of the objects isn't important, only that each has several members that are file paths to large text files. Here's a simplified version:
// Not very important. Reads file, writes new version omitting
// some lines
void proc_file(OBJ obj) {
    std::string inFileStr(obj.get_path().c_str());
    std::string outFileStr(std::string(obj.get_path().replace_extension("new.txt").c_str()));
    std::ifstream inFile(inFileStr);
    std::ofstream outFile(outFileStr);
    std::string currLine;
    while (getline(inFile, currLine)) {
        if (currLine.size() == 1 ||
            currLine.compare(currLine.length()-5, 5, "thing") != 0) {
            outFile << currLine << '\n';
        }
        else {
            for (int i = 0; i < 3; i++) {
                getline(inFile, currLine);
            }
        }
    }
    inFile.close();
    outFile.close();
}
// Processes n files concurrently, working way through
// all OBJ in objs
void multi_file_proc(std::vector<OBJ> objs, int n) {
    std::vector<std::thread> procVec;
    for (int i = 0; i < objs.size(); i++) {
        /*
        Ensure that n files are always being processed.
        Upon completion of one, initiate another, until
        all OBJ in objs have had their text files changed.
        */
    }
}
I want to loop through each OBJ and write altered versions of their text files concurrently, with the limit on simultaneous file reads/writes being the thread value (n). Ultimately, all the objects' text files must be changed, but in such a way that there are always n files being processed, to maximize concurrency.
Note the vector of threads, procVec. I originally approached this by managing a vector of threads, with a file being processed for each thread in procVec. From my reading, it seems a vector for managing these tasks is logical. But how do I always ensure there are n files open until all have been processed, without exiting with an open thread?
Edit: Apologies, my intention was not to ask others to write code for me. I just didn't want my approach to bias anyone's answer if the approach was bad to begin with.
These are some things I've tried (this code would go into the block comment in my function):
1. First approach. The idea is to add to procVec until the thread limit n is reached, then join and remove the thread at the front of the vector upon its completion. This is a summary of several similar iterations, none of which worked:
if (i >= n) {
    procVec.front().join();
    procVec.erase(procVec.begin());
}
procVec.push_back(std::thread(proc_file, sra[i]));
Problems with this:
Incorrectly assumes front of vector will always finish first
(Possibly?) Invalidates all iterators in procVec after first is erased
2. Using mutexes, I attempted to write a lambda function where the thread would be removed upon its completion. This is my current approach. I'm unsure why it isn't working, or whether it even suits my needs:
// remThread() and lamb() defined above main function; procVec and threadMutex
// are global variables
void remThread(std::thread::id id) {
    std::lock_guard<std::mutex> lock(threadMutex);
    auto iter = std::find_if(procVec.begin(), procVec.end(),
                             [=](std::thread &t) { return (t.get_id() == id); });
    if (iter != procVec.end()) {
        iter->join();
        procVec.erase(iter);
    }
}
void lamb(SRA sra, std::thread::id id) {
    proc_file(sra);
    remThread(id);
}
// This is the code contained in the main for loop: call the lambda to process
// the file and then remove the thread
std::lock_guard<std::mutex> lock(threadMutex);
procVec.push_back(std::thread([sras, i]() {
    std::thread(lamb, sras[i], std::this_thread::get_id()).detach();
}));
Problems with this:
Program terminates, likely because a joinable thread is still active when it leaves scope
Given that the example you show is fairly simple (a for loop of fixed size, no strange dependencies), a very simple solution could be to use OpenMP, which would allow you to do what you describe (provided I understood correctly) by adding a single line
void multi_file_proc(std::vector<OBJ> objs, int n) {
    std::vector<std::thread> procVec;
    #pragma omp parallel for num_threads(n) schedule(dynamic, 1)
    for (int i = 0; i < objs.size(); i++) {
        /*
        ...
        */
    }
}
in front of the for loop. Of course you then have to modify your compile command to add OpenMP support, the precise flag naturally being different from compiler to compiler, e.g. -fopenmp for g++, -qopenmp for icpc, etc.
The line above basically instructs the compiler to create code that executes the for loop below in parallel. The important bit here is the last clause, where we set the schedule. Dynamic simply means that the order is not predetermined; instead, threads grab their next iteration when they finish their last. The integer 1 there defines the number of iterations a thread takes at a time; given that each file is large, we want something fine grained, since we don't expect too much overhead from the scheduling.
A word of caution, OpenMP, like most of C++, will not even try to stop you from shooting yourself in the foot. And with concurrency there are whole new ways to do just that.
Finally, this is by no means guaranteed to be the absolute best solution outright. For instance, if your files are of varying lengths, then you would probably want to sort the objects from longest to shortest before the loop (a sketch follows below). This way, once the last object is being processed (at some point only a single thread will be working on the final object), it won't take too long.
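A possible way to do that pre-sort, assuming OBJ::get_path() returns a std::filesystem::path (or something convertible to one) and can be called on a const OBJ, as the question's code suggests:

#include <algorithm>
#include <filesystem>

// Hypothetical pre-sort: process the largest files first so no single huge
// file is left for the very end of the parallel loop.
std::sort(objs.begin(), objs.end(), [](const OBJ &a, const OBJ &b) {
    return std::filesystem::file_size(a.get_path()) >
           std::filesystem::file_size(b.get_path());
});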

Sampling conditional distribution OpenMP

I have a function which draws a random sample:
Sample sample();
I have a function which checks whether a sample is valid:
bool is_valid(Sample s);
This simulates a conditional distribution. Now I want a lot of valid samples (most samples will not be valid).
So I want to parallelize this code with OpenMP:
vector<Sample> valid_samples;
while (valid_samples.size() < n) {
    Sample s = sample();
    if (is_valid(s)) {
        valid_samples.push_back(s);
    }
}
How would I do this? Most of the OpenMP code I found consisted of simple for loops where the number of iterations is known at the start.
The sample() function has a
thread_local std::mt19937_64 gen([](){
std::random_device d;
std::uniform_int_distribution<int> dist(0,10000);
return dist(d);
}());
as a random number generator. Is this valid and thread-safe if I assume that my device has a source of randomness? Are there better solutions?
You may employ OpenMP task parallelism. The simplest solution would be to define a task as a single sample insertion:
vector<Sample> valid_samples(n); // need to be resized to allow access in parallel

void insert_ith(size_t i)
{
    do {
        valid_samples[i] = sample();
    } while (!is_valid(valid_samples[i]));
}

#pragma omp parallel
{
    #pragma omp single
    {
        for (size_t i = 0; i < n; i++)
        {
            #pragma omp task
            insert_ith(i);
        }
    }
}
Note that there might be performance issues with such a single-task-single-insertion mapping. First, there would be false sharing involved, but likely worse, task management has some overhead, which can be significant for very small tasks. In such a case the remedy is simple: instead of a single insertion per task, insert multiple items at once, such as 100 (see the sketch below). The number is a trade-off: lower values create more tasks and thus more overhead, higher values may result in worse load balancing.
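A sketch of that chunked variant, reusing the names from the snippet above; chunk is just an illustrative tuning knob:

#include <algorithm> // std::min

const size_t chunk = 100; // illustrative: items inserted per task

void insert_range(size_t begin, size_t end) // fills slots [begin, end)
{
    for (size_t i = begin; i < end; i++)
    {
        do {
            valid_samples[i] = sample();
        } while (!is_valid(valid_samples[i]));
    }
}

#pragma omp parallel
#pragma omp single
for (size_t i = 0; i < n; i += chunk)
{
    #pragma omp task firstprivate(i)
    insert_range(i, std::min(i + chunk, n));
}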
You need to take care of the critical section in your code, which is the insertion into the result vector.
Something like this should work (I haven't compiled it, because the functions and types are not given):
// create vector before parallel section because it shall be shared
vector<Sample> valid_samples;
valid_samples.reserve(n); // reserve capacity up front to avoid reallocation
int reached_count = 0;
#pragma omp parallel shared(valid_samples, n, reached_count)
{
    while (reached_count < n) { // changed this, see comments for discussion
        Sample s = sample(); // I assume this to be thread independent
        if (is_valid(s)) {
            #pragma omp critical
            {
                // check condition again, another thread might have already
                // reached maximum number
                if (reached_count < n) {
                    valid_samples.push_back(s);
                    reached_count = valid_samples.size();
                }
            }
        }
    }
}
Note that neither sample() nor is_valid(s) is inside the critical section, because I assume these functions to be far more expensive than the vector insertion, or the acceptance to be very rare.
If that is not the case, you could work with independent thread-local vectors and merge them at the end, but that would only gain a significant benefit if you reduce the number of synchronizations in some way, e.g. by giving each thread a fixed number of iterations (at least for a large part of the work). A sketch of that variant follows.
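A minimal sketch, assuming the names from the question: each thread draws its share of the n samples into a private vector and enters the critical section exactly once, for the final merge:

std::vector<Sample> valid_samples;
valid_samples.reserve(n);

#pragma omp parallel
{
    std::vector<Sample> local; // private to each thread, no locking needed

    #pragma omp for schedule(static)
    for (long i = 0; i < (long)n; i++) // each iteration yields one valid sample
    {
        Sample s;
        do { s = sample(); } while (!is_valid(s));
        local.push_back(s);
    }

    #pragma omp critical // one merge per thread
    valid_samples.insert(valid_samples.end(), local.begin(), local.end());
}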

OpenMP parallelization inside for loops takes too long

I am preparing a program which must use OpenMP parallelization. The program compares two frames, block by block, and OpenMP must be applied in two ways: one where the work is split across threads at the frame level, and another where the work is split between threads at the block level, finding the minimum cost of each comparison.
The skeleton of the code looks as follows:
int main() {
    // code
    for () {
        for () {
            searchBlocks();
        }
    }
    // code
}

searchBlocks() {
    for () {
        for () {
            getCost()
        }
    }
}

getCost() {
    for () {
        for () {
            // operations
        }
    }
}
Then, considering parallelization at the frame level, I can simply change the main nested loop to this:
int main() {
    // code
    omp_set_num_threads(threadNo);
    #pragma omp parallel for collapse(2) if (isFrame)
    for () {
        for () {
            searchBlocks();
        }
    }
    // code
}
Here threadNo is specified at run time, and isFrame is a parameter that specifies whether frame-level parallelization is wanted. This works, and the program's execution time shrinks as the number of threads grows. However, for block-level parallelization I attempted the following:
getCost() {
    #pragma omp parallel for collapse(2) if (isFrame)
    for () {
        for () {
            // operations
        }
    }
}
I'm doing this in getCost() since it is the innermost function, where the comparison of each corresponding block happens. But when I do this, the program takes extremely long to execute: if I run it without OpenMP support (a single thread) against OpenMP support with 10 threads, the former finishes first.
Is there something I'm not declaring right here? I'm setting the number of threads right before the nested loops of the main function, just as in the frame-level parallelization.
Please let me know if I need to explain this better, or what I could change to make this parallelization work. Thanks to anyone who can help.
Remember that every time your program executes a #pragma omp parallel directive, it spawns new threads. Spawning threads is very costly, and since you call getCost() many, many times, and each call is not that computationally heavy, you end up wasting all the time on spawning and joining threads (which essentially means making costly system calls).
On the other hand, when a #pragma omp for directive is executed, it doesn't spawn any threads; it lets all the existing threads (spawned by the earlier parallel directive) execute in parallel on separate pieces of data.
So what you want is to spawn threads on the top level of your computation by doing: (notice no for)
int main() {
    // code
    omp_set_num_threads(threadNo);
    #pragma omp parallel
    for () {
        for () {
            searchBlocks();
        }
    }
    // code
}
and then later to split loops by doing (notice no parallel)
getCost() {
    // note: the if clause is only valid on the parallel directive,
    // so it is omitted here
    #pragma omp for collapse(2)
    for () {
        for () {
            // operations
        }
    }
}
You get cascading parallelism. Take the loop limits in main as I and J, and those in getCost as K and L: you are asking for on the order of I * J * K * L threads. Any operating system will go crazy under that; it is not far from a fork bomb.
It is also unclear why you use collapse: there are still two loops inside, and I don't see much point in that clause.
Try removing the parallelism in getCost.

Multithreading recursive program c++

I am working on a recursive algorithm which we want to parallelize to improve performance.
I implemented multithreading using Visual C++ 12.0 and the <thread> library. However, I don't see any performance improvement: the time taken is either a few milliseconds less than, or even more than, the single-threaded time.
Kindly let me know if I am doing something wrong and what corrections I should make to the code.
Here is my code:
void nonRecursiveFoo(<className> &data, int first, int last)
{
    // process the data between first and last index and set its value to true
    // based on some condition; no threads are created here
}

void recursiveFoo(<className> &data, int first, int last)
{
    int partitionIndex = -1;
    data[first] = true;
    data[last] = true;
    for (int i = first + 1; i < last; i++)
    {
        // some logic setting the index
        if (some condition is true)
            partitionIndex = i;
    }
    // no dependency of partitions on one another and so can be parallelized
    if (partitionIndex != -1)
    {
        data[partitionIndex] = true;
        // assume some thread limit
        if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
        {
            std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
            Commons::IncrementCurrentThreadCount();
            recursiveFoo(data, partitionIndex, last);
            t1.join();
        }
        else
        {
            nonRecursiveFoo(data, first, partitionIndex);
            nonRecursiveFoo(data, partitionIndex, last);
        }
    }
}

// main
int main()
{
    recursiveFoo(data, 0, data.size() - 1);
}
// commons
std::mutex threadCountMutex;

void Commons::IncrementCurrentThreadCount()
{
    threadCountMutex.lock();
    CurrentThreadCount++;
    threadCountMutex.unlock();
}

int Commons::GetCurrentThreadCount()
{
    return CurrentThreadCount;
}

void Commons::SetThreadLimit(int count)
{
    ThreadLimit = count;
}

int Commons::GetThreadLimit()
{
    return ThreadLimit;
}

int Commons::GetMinPointsPerThread()
{
    return MinimumPointsPerThread;
}
Without further information (see comments) this is mostly guesswork, but there are a few things you should watch out for:
First of all, make sure that your partitioning logic is very short and fast compared to the processing; otherwise you are creating more work than you gain in processing power.
Make sure there is enough work to begin with, or the speedup might not be enough to pay for the additional overhead of thread creation.
Check that your work gets evenly distributed among the different threads, and don't spawn more threads than you have cores on your computer (print the total number of threads at the end; don't rely on your ThreadLimit).
Don't let your partitions get too small (especially not less than 64 bytes), or you end up with false sharing.
It would be MUCH more efficient to implement CurrentThreadCount as a std::atomic<int>, in which case you don't need a mutex (see the sketch after this list).
Put the increment of the counter before the creation of the thread. Otherwise the newly created thread might read the counter before it is incremented and spawn a new thread in turn, even if the maximum number of threads is already reached. (This is still not a perfect solution, but I would only invest more time in this once you have verified that overcommitting is your actual problem.)
If you really must use a mutex (for reasons outside of the example code), you have to use it for every access to CurrentThreadCount (read and write access). Otherwise this is, strictly speaking, a race condition and thus UB.
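A small sketch combining the last three points, reusing the names from the question (the check-then-increment pair is still not atomic as a whole, which is the acknowledged imperfection):

#include <atomic>
#include <thread>

std::atomic<int> CurrentThreadCount{0}; // replaces the mutex-protected counter

// inside recursiveFoo:
if (CurrentThreadCount.load() < Commons::GetThreadLimit())
{
    CurrentThreadCount++; // increment BEFORE spawning, so the new thread sees it
    std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
    recursiveFoo(data, partitionIndex, last);
    t1.join();
    CurrentThreadCount--; // the spawned thread has finished
}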
By using t1.join you're basically waiting for the other thread to finish, i.e. not doing anything in parallel.
Looking at your algorithm, I don't see how it can be parallelized (and thus improved) by using threads: you have to wait for a single recursive call to end.
First of all, you are not doing anything in parallel, as every thread creation blocks until the created thread has finished. Hence your multithreaded code will always be slower than the non-multithreaded version.
In order to parallelize, you could spawn threads for the part where the non-recursive function is called, put each thread into a vector, and join at the highest level of the recursion by walking through the vector. (There are more elegant ways to do that, but as a first shot this should be OK, I think.)
Thus all non-recursive calls will run in parallel. But you should use the size of the problem rather than the maximum number of threads as the condition, e.g. last - first < threshold. A sketch along those lines follows.
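A hedged sketch of that idea, with std::vector<char> standing in for the question's <className>, a placeholder midpoint partition, and half-open [first, last) ranges so the worker threads never touch the same element:

#include <thread>
#include <vector>

void nonRecursiveFoo(std::vector<char> &data, int first, int last)
{
    for (int i = first; i < last; i++)
        data[i] = 1; // stand-in for the real leaf processing
}

void recursiveFoo(std::vector<char> &data, int first, int last,
                  std::vector<std::thread> &workers, int threshold)
{
    if (last - first < threshold)
    {
        // leaf: hand the block to a worker thread and return immediately
        workers.emplace_back(nonRecursiveFoo, std::ref(data), first, last);
        return;
    }
    int partitionIndex = first + (last - first) / 2; // placeholder partition logic
    recursiveFoo(data, first, partitionIndex, workers, threshold);
    recursiveFoo(data, partitionIndex, last, workers, threshold);
}

int main()
{
    std::vector<char> data(1 << 20);
    std::vector<std::thread> workers;
    recursiveFoo(data, 0, (int)data.size(), workers, 1 << 16);
    for (auto &t : workers)
        t.join(); // single join pass at the top level
}

The recursion itself stays single-threaded and cheap; only the leaf work runs in parallel, and the join happens once, at the top.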

c++ OpenMP critical: "one-way" locking?

Consider the following serial function. When I parallelize my code, every thread will call this function from within the parallel region (not shown). I am trying to make this thread-safe and efficient (fast).
float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    std::map<int, float>::iterator it_find = my_map.find(A); // many threads do this often.
    bool found_A = it_find != my_map.end();
    if (found_A)
    {
        return it_find->second;
    }
    else
    {
        float result_for_A = calculate_value(A); // should only be done once, really.
        my_map[A] = result_for_A;
        return result_for_A;
    }
}
Almost every single time this function is called, the threads will successfully "find" the stored value for their "A" (whatever it is). Every once in a while, when a "new A" is called, a value will have to be calculated and stored.
So where should I put the #pragma omp critical?
Though easy, it is very inefficient to put a #pragma omp critical around all of this, since each thread will be doing this constantly and it will often be the read-only case.
Is there any way to implement a "one-way" critical, or a "one-way" lock routine? That is, the above operations involving the iterator should only be "locked" when writing to my_map in the else statement. But multiple threads should be able to execute the .find call simultaneously.
I hope I make sense.
Thank you.
According to this link on Stack Overflow, inserting into an std::map doesn't invalidate iterators. The same goes for the end() iterator. Here's a supporting link.
Unfortunately, insertion can happen multiple times if you don't use a critical section. Also, since your calculate_value routine might be computationally expensive, you will have to lock to avoid the else clause being executed twice for the same value of A, followed by two insertions.
Here's a sample function where you can replicate this incorrect multiple insertion:
void testFunc(std::map<int, float> &theMap, int i)
{
    std::map<int, float>::iterator ite = theMap.find(i);
    if (ite == theMap.end())
    {
        theMap[i] = 3.14 * i * i;
    }
}
Then called like this:
std::map<int, float> myMap;
int i;
#pragma omp parallel for
for (i = 1; i <= 100000; ++i)
{
    testFunc(myMap, i % 100);
}
if (myMap.size() != 100)
{
    std::cout << "Problem!" << std::endl;
}
Edit: edited to correct an error in the earlier version.
OpenMP is a compiler "tool" for automatic loop parallelization, not a thread communication or synchronization library, so it doesn't have sophisticated mutexes like a read/write mutex (acquire the lock for writing, but not for reading).
Here's an implementation example.
Anyway Chris A.'s answer is better than mine though :)
While #ChrisA's answer may solve your problem, I'll leave my answer here in case any future searchers find it useful.
If you'd like, you can give #pragma omp critical sections a name. Any section with that name is then considered the same critical section. If this is what you would like to do, you can easily make only small portions of your method critical.
#pragma omp critical (map_protect)
{
    std::map<int, float>::iterator it_find = my_map.find(A); // many threads do this often.
    bool found_A = it_find != my_map.end();
}
...
#pragma omp critical (map_protect)
{
    float result_for_A = calculate_value(A); // should only be done once, really.
    my_map[A] = result_for_A;
}
The #pragma omp atomic and #pragma omp flush directives may also be useful.
atomic causes a write to a memory location (the lvalue in the expression following the directive) to always be performed atomically.
flush ensures that any memory expected to be visible to all threads is actually written back to memory, rather than sitting in a processor cache where it isn't available to the threads that need it.
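For reference, here is a hedged sketch of the "one-way" locking the question asks for, using C++17's std::shared_mutex rather than OpenMP primitives (which, as noted above, offer no reader/writer flavor). Readers proceed concurrently under a shared lock; only a miss takes the exclusive lock, and it re-checks the map before inserting, so calculate_value runs only once per key:

#include <map>
#include <mutex>
#include <shared_mutex>

float calculate_value(int A); // assumed expensive, as in the question

float get_stored_value__or__calculate_if_does_not_yet_exist(int A)
{
    static std::map<int, float> my_map;
    static std::shared_mutex m;

    { // fast path: many readers may hold the shared lock at once
        std::shared_lock<std::shared_mutex> read_lock(m);
        auto it = my_map.find(A);
        if (it != my_map.end())
            return it->second;
    }

    // slow path: exclusive lock, re-check, then calculate and insert
    std::unique_lock<std::shared_mutex> write_lock(m);
    auto it = my_map.find(A);
    if (it == my_map.end())
        it = my_map.emplace(A, calculate_value(A)).first;
    return it->second;
}

Under the read-mostly workload the question describes, almost every call finishes on the shared-lock fast path.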