Trouble understanding some unexpected behaviour in code with OpenMP - C++

This is a question about OpenMP parallelization. I have included a stripped-down version of my function below. The problem is that the body of the for loop is sometimes not evaluated for all values of uiIndex.
I use the buffer vec_succ_status to check whether all values of uiIndex were evaluated, and it turns out that some are not.
My code does not crash; it simply returns from compute_Lagr_shortest_paths_from_source without hitting any of the exit(-1) statements in the function definition below.
I am using g++ 7.4.0 on Ubuntu 14, and every time it has failed there was exactly one value of uiIndex that was skipped. There is no consistency in which uiIndex gets skipped.
For the programs I have been testing, the size of vec_group is always 1, so only the first if branch inside the for loop is taken.
In my main function, I included the line omp_set_num_threads(4). Apart from that, I did not change any other OpenMP settings (such as the schedule type).
Also, I can assure you that no two values of uiIndex lead to the same uiRobot value, so no two threads will ever access the same vec_cf_graphs[uiRobot] entry during the lifetime of the function.
I wonder if I am making some wrong assumptions about OpenMP. I need objects such as vec_cf_graphs and vec_succ_status to be shared across all threads. I am wondering if I need to mark them explicitly as shared, as is usually recommended, although I thought the way I have implemented it should also suffice. Either way, it seems strange to me that certain uiIndex values can get skipped altogether. I should point out that I call the function shown repeatedly, and only sometimes do certain uiIndex values get skipped. If someone can point me to potential issues with my approach, that would be great. I am happy to provide additional information. Thanks.
bool compute_Lagr_shortest_paths_from_source(std::vector<Robot_CF_Graph>& vec_cf_graphs, const std::vector<std::vector<size_t>>& vec_robot_groups)
{
    size_t uiIndex;
    std::vector<bool> vec_succ_status(vec_robot_groups.size(), false);
    #pragma omp parallel for default(shared) private(uiIndex)
    for(uiIndex = 0; uiIndex < vec_robot_groups.size(); uiIndex++)
    {
        vec_succ_status[uiIndex] = false;
        const auto& vec_group = vec_robot_groups[uiIndex];
        if(1 == vec_group.size())
        {
            size_t uiRobot = vec_group[0];
            vec_cf_graphs[uiRobot].compute_shortest_path("ABC");
            vec_succ_status[uiIndex] = true;
        }
        else
        {
            std::cout << "Tag: Code should not have entered this block" << std::endl;
            exit(-1);
        }
        if(false == vec_succ_status[uiIndex])
        {
            std::cout << "It is not possible for this to happen \n";
            exit(-1);
        }
    }
    return true;
}

You concurrently write to a vector<bool>, which is not a 'normal' vector: it uses an internal memory optimization that packs several elements into each byte, so writes to different indices can touch the same underlying byte. This is undefined behaviour.
See detailed reasoning here:
Write concurrently vector<bool>
How vector<bool> is different from other vectors can be found here:
https://en.cppreference.com/w/cpp/container/vector_bool
The easiest fix is to use a vector<char>, with 0 and 1 representing false and true. Other options are discussed here, if you want more elegant code:
Alternative to vector<bool>
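For illustration, here is a minimal sketch of the question's loop with the status buffer switched to std::vector<char>; the identifiers are the ones from the question and the surrounding function is assumed unchanged:

// Sketch only: elements of a std::vector<char> are distinct objects, so
// concurrent writes to different indices are safe, unlike vector<bool>.
std::vector<char> vec_succ_status(vec_robot_groups.size(), 0);

#pragma omp parallel for default(shared)
for (size_t uiIndex = 0; uiIndex < vec_robot_groups.size(); uiIndex++)
{
    const auto& vec_group = vec_robot_groups[uiIndex];
    if (1 == vec_group.size())
    {
        size_t uiRobot = vec_group[0];
        vec_cf_graphs[uiRobot].compute_shortest_path("ABC");
        vec_succ_status[uiIndex] = 1;   // 1 marks success for this index
    }
}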

Related

In the subset sum problem, when I use memo[sum][size] I have to add a sum<0 check, but not in the case of memo[size][sum]. I do not know why. Please explain.

#include<bits/stdc++.h>
using namespace std;

int issubseset(vector<int> subset, int size, int sum, vector<vector<int>>& memo){
    // if(sum<0) return 0;
    if(sum==0) return 1;
    if(size<0) return 0;
    if(subset[size]>sum) issubseset(subset,size-1,sum,memo);
    if(memo[size][sum]>=0) return memo[size][sum];
    memo[size][sum] = issubseset(subset,size-1,sum-subset[size],memo)||issubseset(subset,size-1,sum,memo);
    return memo[size][sum];
}

int main(){
    vector<int> subset{3, 34, 4, 12, 5, 2};
    int sum=9;
    std::cout << subset.size() << std::endl;
    vector<vector<int>> memo(subset.size(),vector<int>(sum+1,INT_MIN));
    printf("%s",issubseset(subset,subset.size()-1,sum,memo)?"true":"false");
}
Question:
Given a set of non-negative integers, and a value sum, determine if there is a subset of the given set with sum equal to given sum.
When I interchange the memo 2D array from memo[size][sum] to memo[sum][size], I have to uncomment the first line in the issubseset function. If I am just changing the shape of memo, it should not have any effect, since the array will be filled according to the recursion and I am already covering the base cases. If memo[size][sum] can work without the if(sum<0) line, why can't memo[sum][size]?
Your code exhibits undefined behaviour thanks to sum being used as an index even though it is sometimes negative. This is true of the code you posted as well as the equivalent with the shape of memo changed.
To find out why this happens, we'll have to look closely at your code. I'll reproduce it here with a couple of helpful labels:
#include<bits/stdc++.h>
using namespace std;

int issubseset(vector<int> subset, int size, int sum, vector<vector<int>>& memo){
    // (1)
    if(sum==0) return 1;
    if(size<0) return 0;
    // (2)
    if(subset[size]>sum) issubseset(subset,size-1,sum,memo);
    // (3)
    if(memo[size][sum]>=0) return memo[size][sum];
    memo[size][sum] = issubseset(subset,size-1,sum-subset[size],memo)||issubseset(subset,size-1,sum,memo);
    return memo[size][sum];
}
Now let's walk through the code, assuming sum is negative and size is non-negative. If we get to (3), we've encountered undefined behaviour.
The checks for base cases at (1) do not trigger in this case, so execution carries on.
Now we're at (2), which is a very important line. It is the last line before the potentially troublesome (3), so there's a lot riding on it. We had better be sure it doesn't let execution go to (3). Unfortunately, even without looking deeply, we can tell that it's not up to the task: there isn't any control flow in this line (aside from the branching for the if of course). There's no question about it now: execution will definitely go ahead to (3), resulting in undefined behaviour.
Thankfully the fix is easy. Add a return for the recursive call in (2):
// (2)
if(subset[size]>sum) return issubseset(subset,size-1,sum,memo);
This will prevent execution from continuing to (3) whenever sum is negative: since subset[size] is non-negative and sum is negative, subset[size] > sum will be true and the return path will be taken. I'll leave it to you to determine whether this is the correct thing to do for your given problem.
The same analysis holds when the shape of memo is changed. The fact that you only noticed a problem with one shape and not the other is luck of the draw, really. There is no "why"; it just happens to be that way. Either version of the code could literally have done (or not done) anything else (we don't call it undefined behaviour for nothing). I'll avoid going on a tangent about best practices, but I will give one piece of advice: use .at() instead of [], at least until you've proven the code correct (and even then, keeping .at() around may not be a bad idea). .at() will check each index and will scream at you (throw an exception) if it is invalid. Unlike [], .at() will not silently break your code when given a bad index, making it much nicer from a debugging standpoint.
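For reference, here is a sketch of the function with both suggestions applied: the missing return at (2) and .at() instead of []. Everything else is kept in the question's style (bits/stdc++.h and using namespace std are assumed).

// Sketch: same memoized recursion as in the question, with the missing
// return added at (2) and bounds-checked access via .at().
int issubseset(vector<int> subset, int size, int sum, vector<vector<int>>& memo){
    if(sum == 0) return 1;
    if(size < 0) return 0;
    // (2): returning here keeps sum from ever going negative below
    if(subset[size] > sum) return issubseset(subset, size-1, sum, memo);
    // (3): indices are now always valid, and .at() would throw if not
    if(memo.at(size).at(sum) >= 0) return memo.at(size).at(sum);
    memo.at(size).at(sum) = issubseset(subset, size-1, sum-subset[size], memo)
                         || issubseset(subset, size-1, sum, memo);
    return memo.at(size).at(sum);
}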

Branching when mixing template parameters and variables in C++

I'm trying to carry out some loop optimization as described here: Optimizing a Loop vs Code Duplication
I have the additional complication that some code inside the loop only needs to be executed depending on a combination of run-time-known variables external to the loop (which can be replaced with template parameters for optimization, as discussed in the link above) and a run-time-known variable that is only known inside the loop.
Here is the completely un-optimized version of the code:
for (int i = 0; i < 100000; i++){
    if (external_condition_1 || (external_condition_2 && internal_condition[i])){
        run_some_code;
    }
    else{
        run_some_other_code;
    }
    run_lots_of_other_code;
}
This is my attempt at wrapping the loop in a templated function as suggested in the question linked above to optimize performance and avoid code duplication by writing multiple versions of the loop:
template<bool external_condition_1, bool external_condition_2>
void myloop(){
    for (int i = 0; i < 100000; i++){
        if (external_condition_1 || (external_condition_2 && internal_condition[i])){
            run_some_code;
        }
        else{
            run_some_other_code;
        }
        run_lots_of_other_code;
    }
}
My question is: how can the code be written to avoid branching and code duplication?
Note that the code is sufficiently complex that the function probably can't be inlined, and compiler optimization also likely wouldn't sort this out in general.
My question is: how can the code be written to avoid branching and code duplication?
Well, you already wrote your template to avoid code duplication, right? So let's look at what branching is left. To do this, we should look at each function that is generated from your template (there are four of them). We should also apply the expected compiler optimizations based upon the template parameters.
First up, set condition 1 to true. This should produce two functions that are essentially (using a bit of pseudo-syntax) the following:
myloop<true, bool external_condition_2>() {
    for (int i = 0; i < 100000; i++){
        // if ( true || whatever ) <-- optimized out
        run_some_code;
        run_lots_of_other_code;
    }
}
No branching there. Good. Moving on to the first condition being false and the second condition being true.
myloop<false, true>(){
    for (int i = 0; i < 100000; i++){
        if ( internal_condition[i] ){ // simplified from (false || (true && i_c[i]))
            run_some_code;
        }
        else{
            run_some_other_code;
        }
        run_lots_of_other_code;
    }
}
OK, there is some branching going on here, since each i needs to be analyzed to see which code should execute. I think there is nothing more that can be done here without more information about internal_condition. I'll give some thoughts on that later, but let's move on to the fourth function for now.
myloop<false, false>() {
    for (int i = 0; i < 100000; i++){
        // if ( false || (false && whatever) ) <-- optimized out
        run_some_other_code;
        run_lots_of_other_code;
    }
}
No branching here. You already have done a good job avoiding branching and code duplication.
OK, let's go back to myloop<false,true>, where there is branching. The branching is largely unavoidable simply because of how your situation is set up. You are going to iterate many times. Some iterations you want to do one thing while other iterations should do another. To get around this, you would need to re-envision your setup so that you can do the same thing each iteration. (The optimization you are working from is based upon doing the same thing each iteration, even though it might be a different thing the next time the loop starts.)
The simplest, yet unlikely, scenario would be where internal_condition[i] is equivalent to something like i < 5000. It would also be convenient if you could do all of the "some code" before any of the "lots of other code". Then you could loop from 0 to 4999, running "some code" each iteration. Then loop from 5000 to 99999, running "other code". Then a third loop to run "lots of other code".
Any solution I can think of would involve adapting your situation to make it more like the unlikely simple scenario. Can you calculate how many times internal_condition[i] is true? Can you iterate that many times and map your (new) loop control variable to the appropriate value of i (the old loop control variable)? (Or maybe the exact value of i is not important?) Then do a second loop to cover the remaining cases? In some scenarios, this might be trivial. In others, far from it.
There might be other tricks that could be done, but they depend on more details about what you are doing, what you need to do, and what you think you need to do but don't really. (It's possible that the required level of detail would overwhelm StackOverflow.) Is the order important? Is the exact value of i important?
In the end, I would opt for profiling the code. Profile the code without code duplication but with branching. Profile the code with minimal branching but with code duplication. Is there a measurable change? If so, think about how you can re-arrange your internal condition so that i can cover large ranges without changing the value of the internal condition. Then divide your loop into smaller pieces.
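As a rough illustration of the "split the range" idea, here is a sketch of the <false, true> case, keeping the answer's pseudo-syntax placeholders and assuming internal_condition[i] happens to be equivalent to (i < split_point) and that the order of "some code" versus "other code" does not matter:

// Sketch only: splitting the loop into homogeneous ranges so that each
// sub-loop runs branch-free; split_point is a hypothetical runtime value.
void myloop_false_true_split(int split_point) {
    for (int i = 0; i < split_point; i++) {
        run_some_code;            // internal condition true for this range
    }
    for (int i = split_point; i < 100000; i++) {
        run_some_other_code;      // internal condition false for this range
    }
    for (int i = 0; i < 100000; i++) {
        run_lots_of_other_code;   // common work, done for every i
    }
}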
In C++17, to guarantee that no extra branches are evaluated, you might do:
template <bool external_condition_1, bool external_condition_2>
void myloop()
{
    for (int i = 0; i < 100000; i++){
        if constexpr (external_condition_1) {
            run_some_code;
        } else if constexpr (external_condition_2){
            if (internal_condition[i]) {
                run_some_code;
            } else {
                run_some_other_code;
            }
        } else {
            run_some_other_code;
        }
        run_lots_of_other_code;
    }
}
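A call site then needs a single runtime dispatch from the two run-time flags to the four instantiations; a sketch, assuming the flags are plain bools:

// Sketch: one runtime branch outside the loop selects the right
// compile-time specialization of myloop.
void run(bool external_condition_1, bool external_condition_2) {
    if (external_condition_1 && external_condition_2)        myloop<true,  true >();
    else if (external_condition_1 && !external_condition_2)  myloop<true,  false>();
    else if (external_condition_2)                            myloop<false, true >();
    else                                                      myloop<false, false>();
}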

Using elses in boolean functions - C++

Let's say I have a simple function that checks a condition and returns true if the condition is true and false if the condition is false.
Is it better to use this type of code:
bool myfunction( /*parameters*/ ) {
    if ( /*conditional statement*/ ) {
        return true;
    }
    return false;
}
Or this type:
bool myfunction( /*parameters*/ ) {
    if ( /*conditional statement*/ ) {
        return true;
    }
    else return false;
}
Or does it just really not make a difference? Also, what considerations should I bear in mind when deciding whether to "if...else if" vs. "if...else" vs. "switch"?
You can also write this without any conditional at all:
bool myfunction( /*parameters*/ ) {
    return /*conditional statement*/;
}
This way there is no branch to write or read in the first place.
Of course, if you are dealing with a different function where you need the conditional, it shouldn't make a difference. Modern compilers work well either way.
As for switch vs. if-else: a switch can add efficiency when you have many cases, because the compiler can often turn it into a jump table and go straight to the matching case instead of testing every condition. At a low (hardware/compiler) level, a switch statement can then boil down to a single check and jump, whereas a long chain of if statements needs many checks and jumps.
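For example, a dense switch like the sketch below is a good candidate for a jump table, while the equivalent if/else-if chain is written as a sequence of comparisons; whether a jump table is actually emitted depends on the compiler and the case values:

// Sketch: dense, contiguous cases can be lowered to a single indexed jump.
int category(int code) {
    switch (code) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        default: return -1;
    }
}

// The equivalent chain tests each condition in turn.
int category_chain(int code) {
    if (code == 0) return 10;
    else if (code == 1) return 20;
    else if (code == 2) return 30;
    else if (code == 3) return 40;
    else return -1;
}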
It is the same. Remember that whenever you write
return boolean;
the function ends and control returns to the calling line.
Therefore, putting the second return inside an else or simply writing it after the if is the same.
Say we want to check whether a number is prime:
bool isPrime (int n){
    for (int i = 2; i <= sqrt(n); i++){
        if (n % i == 0)
            return false;
    }
    return true;
}
If you look at the function closely, you will see that if the number is evenly divisible by any value up to sqrt(n), it returns false immediately, because the number is not prime.
If it cannot be divided, the loop ends without any early return and the final return true declares the number prime. Hence the function works properly without any else.
Since neither of the two given answers hits the nail on the head, I will give you another one.
From the code's (or compiler's) point of view, assuming a recent compiler, both versions are identical: the compiler will optimise the if version into the plain return version just fine. The difference is in debugging. The debugger you are using might not allow you to set a breakpoint on a return value (for example, if you want to break only when true is returned), while the if version gives you two return statements on different lines, and any sane debugger will set a breakpoint on a line just fine.
Both functions are identical, regardless of any optimizations applied by the compiler, because the "else" in the second function doesn't have any effect. If you leave the function as soon as the condition is met, you will never enter the other branch in this case, so the "else" is implicit in the first version.
Hence I'd prefer the first version, because the "else" in the other one is misleading.
However, I agree with others here that this kind of function (both variants) doesn't make sense anyway, because you can simply use the plain boolean condition instead of this function, which is just a needless wrapper.
In terms of compilation the specific form you choose for if-else syntax won't make a big difference. The optimization path will usually erase any differences. Your decision should be made based on visual form instead.
As others have pointed out already, if you have a simple condition like this it's best to just return the calculation directly and avoid the if statement.
Returning directly only works if you have a boolean calculation. You might instead need to return a different type:
int foo(/*args*/) {
    if(/*condition*/) {
        return bar();
    }
    return 0;
}
Alternatively, you could use the ternary operator ?:, but if the expressions are long or complex, it may not be as clear.
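For instance, the snippet above could also be written with ?:; this is a sketch using the same placeholder comments as the code above:

// Sketch: the same early-return logic expressed with the ternary operator.
int foo(/*args*/) {
    return /*condition*/ ? bar() : 0;
}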
By using short returns (evaluation doesn't reach the end of the function) you can also sequence several conditions and evaluations.
int foo(/*args*/) {
    if(/*condition1*/) {
        return 0;
    }
    if(/*condition2*/) {
        return 3;
    }
    int common = bar(/*args*/);
    if(/*condition3*/) {
        return 1-common;
    }
    return common;
}
Pick the form based on what makes the most logical sense, just ignore how this might compile. Then consider massaging the form to have the least visual complexity (avoids too much indentation or deep branching).

Are while loops more efficient than for loops

I was told that a while loop is more efficient than a for loop (C/C++).
This seemed reasonable, but I wanted to find a way to prove or disprove it.
I have tried three tests using analogous snippets of code, each containing nothing but a for or while loop with the same output:
Compile time - roughly the same
Run time - Same
Compiled to intel assembly code and compared - Same number of lines and virtually the same code
Should I have tried anything else, or can anyone confirm one way or the other?
All loops follow the same template:
{
    // Initialize
LOOP:
    if(!(/* Condition */) ) {
        goto END;
    }
    // Loop body
    // Loop increment/decrement
    goto LOOP;
}
END:
Therefore the two loops are the same:
// A
for(int i=0; i<10; i++) {
    // Do stuff
}

// B
int i=0;
while(i < 10) {
    // Do stuff
    i++;
}

// Or even
int i=0;
while(true) {
    if(!(i < 10) ) {
        break;
    }
    // Do stuff
    i++;
}
Both are converted to something similar to:
{
    int i=0;
LOOP:
    if(!(i < 10) ) {
        goto END;
    }
    // Do stuff
    i++;
    goto LOOP;
}
END:
Unused/unreachable code will be removed from the final executable/library.
Do-while loops skip the first conditional check and are left as an exercise for the reader. :)
Certainly LLVM will convert ALL types of loops to a consistent form (to the extent possible, of course). So as long as you have the same functionality, it doesn't really matter whether you use for, while, do-while or goto to form the loop: if it has the same initialization, exit condition, update statement and body, it will produce the exact same machine code.
This is not terribly hard to do in a compiler if it's done early enough during optimisation (so the compiler still understands what is actually being written). The purpose of such a "make all loops equal" pass is that you then only need one way to optimise loops, rather than one for while-loops, one for for-loops, one for do-while loops and one for "any other loops".
It's not guaranteed for ALL compilers, but I know that gcc/g++ will also generate nearly identical code whatever loop construct you use, and from what I've seen Microsoft also does the same.
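If you want to verify this yourself, one quick check is to compile two equivalent loops and compare the output, for example with g++ -O2 -S; the function names below are only placeholders:

// Sketch: two equivalent loops. Compiling this file with "g++ -O2 -S"
// and diffing the generated assembly typically shows the same code for both.
int sum_for(const int* a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

int sum_while(const int* a, int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}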
C and C++ compilers convert high-level C or C++ code to assembly, and in assembly there are no while or for loops; you can only check a condition and jump to another location.
So the performance of a for or while loop depends heavily on how well the compiler optimizes the code.
This is a good paper on code optimizations:
http://www.linux-kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf

concurrent_vector invalid data

Using: VC++ 2013
concurrency::concurrent_vector<datanode*> dtnodelst;
Occasionally, when I do dtnodelst->at(i) ... I get an invalid address (0xCDCD..., of course), which shouldn't be the case, because after I push_back I never delete or remove any of the items (even if I did delete, it should have returned the old deleted address... but I never delete, so that is not even the case).
datanode* itm = new datanode();
....
dtnodelst->push_back(itm);
Any ideas on what might be happening?
P.S. I am using the Windows thread pool. Sometimes I can do 8 million inserts and finds and everything goes fine... but sometimes even 200 inserts and finds will fail. I am kind of lost. Any help would be awesomely appreciated!!
Thanks and best regards.
Actual code as an FYI:
P.S. Am I missing something, or is it a pain to paste code with proper formatting? I remember it being auto-aligned before... -_-
struct datanode {
    volatile int nodeval;
    T val;
};

concurrency::concurrent_vector<datanode*> lst;

inline T find(UINT32 key)
{
    for (int i = 0; i < lst->size(); i++)
    {
        datanode* nd = lst->at(i);
        // nd is invalid sometimes
        if (nd)
            if (nd->nodeval == key)
            {
                return (nd->val);
            }
    }
    return NULL;
}

inline T insert_nonunique(UINT32 key, T val){
    datanode* itm = new datanode();
    itm->val = val;
    itm->nodeval = key;
    lst->push_back(itm);
    _updated(lst);
    return val;
}
The problem is the use of concurrent_vector::size(), which is not fully thread-safe: you can get a reference to elements that are not yet constructed (where the memory still contains garbage). The Microsoft PPL library (which provides it in the concurrency:: namespace) uses the Intel TBB implementation of concurrent_vector, and the TBB Reference says:
size_type size() const
Returns: Number of elements in the vector. The result may include elements that are allocated but still under construction by concurrent calls to any of the growth methods.
Please see my blog for more explanation and possible solutions.
In TBB, the most reasonable solution is to use tbb::zero_allocator as the underlying allocator of concurrent_vector, so that newly allocated memory is filled with zeroes before size() counts it.
concurrent_vector<datanode*, tbb::zero_allocator<datanode*> > lst;
Then, the condition if (nd) will filter out not-yet-ready elements.
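Putting it together, a minimal sketch in the context of the question's code, assuming the classic TBB headers (tbb/concurrent_vector.h, tbb/tbb_allocator.h) are available and keeping the question's T, datanode, UINT32 and NULL return as-is:

// Sketch: zero-filled growth means a slot counted by size() but not yet
// published reads as nullptr, so the null check skips it safely.
tbb::concurrent_vector<datanode*, tbb::zero_allocator<datanode*>> lst;

inline T find(UINT32 key)
{
    for (size_t i = 0; i < lst.size(); i++)
    {
        datanode* nd = lst[i];              // may still be nullptr
        if (nd && nd->nodeval == key)
            return nd->val;
    }
    return NULL;
}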
volatile is no substitute for atomic<T>. Do not use volatile in some attempt to provide synchronization.
The whole idea of your find call doesn't make sense in a concurrent context. As soon as the function iterates past one value, it could be mutated by another thread to be the value you're looking for. Or it could be the value you want, but get mutated to some other value. Or as soon as the search comes back empty-handed, the value you're seeking is added. The return value of such a function would be totally meaningless. size() has all the same problems, which is a good part of why your implementation would never work.
Inspecting the state of concurrent data structures is a very bad idea, because the information becomes invalid the moment you have it. You should design operations that do not require knowing the state of the structure to execute correctly, or, block all mutations whilst you operate.
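On the volatile point, a minimal sketch of the member with std::atomic instead, assuming the surrounding class is templated on T as the question implies:

#include <atomic>

// Sketch: std::atomic provides the inter-thread visibility and ordering
// guarantees that volatile does not.
template <typename T>
struct datanode {
    std::atomic<int> nodeval;
    T val;
};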