Best way to parallelize this for loop with multiple threads

I currently have a code block like this
UINT8* u = getResult();
for (UINT64 counter = 0; counter < MaxCount; counter++)
{
for (UINT64 index = 0; index < c_uOneMB; ++index)
{
*u++ = genValue();
}
}
To make this run faster, I am trying something like the code below: basically splitting the inner loop into a function that runs on its own thread. However, I have two concerns that I am not sure how to tackle:
*u++: how do I handle that?
Before calling doSomethingElse(), all the threads need to .join().
Any suggestions on how to accomplish that?
void doSomething(UINT8* u)
{
for (UINT64 index = 0; index < c_uOneMB; ++index)
{
*u++ = genValue();
}
}
UINT8* u = getResult();
for (UINT64 counter = 0; counter < MaxCount; counter++)
{
std::thread t(doSomething,u);
}
doSomethingElse();

With the few details you have provided, I can only suggest this:
std::generate_n(std::execution::par, getResult(), MaxCount * c_uOneMB, genValue);
https://en.cppreference.com/w/cpp/algorithm/generate_n
https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag

Best way to parallelize this for loop with multiple threads
Best way depends on many factors and is subjective. In fact, sometimes (perhaps most of the time) non-parallelised code is faster. If speed is most important, then the best way is whatever you have measured to be fastest.
Using the standard library algorithms is usually straightforward:
std::generate_n(
std::execution::par_unseq,
u,
MaxCount * c_uOneMB,
genValue);
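If you would rather manage the threads yourself, the two concerns from the question (the shared *u++ and the join before doSomethingElse()) can be handled roughly as follows. This is only a sketch under assumptions: gen_value(i) is a hypothetical, thread-safe stand-in for genValue(), and fill_chunk and fill_parallel are names made up for illustration. Each thread gets its own index range instead of sharing one moving pointer, and every thread is joined before the function returns.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for genValue(); it must be safe to call concurrently.
std::uint8_t gen_value(std::size_t i) {
    return static_cast<std::uint8_t>(i * 31u);
}

// Each worker writes only to its own range [begin, end), so no pointer is shared.
void fill_chunk(std::uint8_t* out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        out[i] = gen_value(i);
    }
}

// Splits [0, total) into at most num_threads contiguous chunks and waits for all of them.
void fill_parallel(std::uint8_t* out, std::size_t total, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (total + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(total, t * chunk);
        const std::size_t end = std::min(total, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back(fill_chunk, out, begin, end);
    }
    // Join every thread before the caller moves on (e.g. to doSomethingElse()).
    for (auto& w : workers) {
        w.join();
    }
}
```

A call such as fill_parallel(u, MaxCount * c_uOneMB, std::thread::hardware_concurrency()) would then replace both loops. Using a handful of threads rather than one thread per megabyte keeps the thread-creation overhead small, but as said above, measure before assuming it is faster.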

Parallelization of bin packing problem by OpenMp

I am learning OpenMP, and I want to parallelize the well-known bin packing problem. The problem is that whatever I try, I can't get the correct solution (the one I get with the sequential version).
So far, I have tried several different versions (including reduction, tasks, schedule) but didn't get anything useful.
Below is my most recent attempt.
int binPackingParallel(std::vector<int> weight, int n, int c)
{
int result = 0;
int bin_rem[n];
#pragma omp parallel for schedule(dynamic) reduction(+:result)
for (int i = 0; i < n; i++) {
bool done = false;
int j;
for (j = 0; j < result && !done; j++) {
int b ;
#pragma omp atomic
b = bin_rem[j] - weight[i];
if ( b >= 0) {
bin_rem[j] = bin_rem[j] - weight[i];
done = true;
}
}
if (!done) {
#pragma omp critical
bin_rem[result] = c - weight[i];
result++;
}
}
return result;
}
Edit: I modified the original problem, so now a number of bins N is given and we need to check whether all elements can be put into N bins. I implemented this using recursion, but my parallel version is still slower.
bool can_fit_parallel(std::vector<int> arr, std::vector<int> bins, int n) {
// base case: if the array is empty, we can fit the elements
if (arr.empty()) {
return true;
}
bool found = false;
#pragma omp parallel for schedule (dynamic,10)
for (int i = 0; i < n; i++) {
if (bins[i] >= arr[0]) {
bins[i] -= arr[0];
if (can_fit_parallel(std::vector<int>(arr.begin() + 1, arr.end()), bins, n)) {
found = true;
#pragma omp cancel for
}
// if the element doesn't fit or if the recursion fails,
// restore the bin's capacity and try the next bin
bins[i] += arr[0];
}
}
// if the element doesn't fit in any of the bins, return false
return found;
}
Any help would be great

You do not need parallelization to make your code significantly faster. You have implemented the First Fit method (its complexity is O(n^2)), but it can be significantly faster if you use binary search trees (O(n log n)). To do so, you just have to use the standard library (std::multiset). In this example I have implemented the Best Fit algorithm:
#include <set>
#include <vector>

// Best Fit. The parameter n is unused and kept only to match the original signature.
int binPackingSTL(const std::vector<int>& weight, const int n, const int c)
{
    std::multiset<int> bins; // remaining capacity of every open bin
    for (const auto x : weight) {
        const auto it = bins.lower_bound(x); // tightest bin that can still accommodate x
        if (it == bins.end()) {
            bins.insert(c - x); // no suitable bin found: open a new one
        } else {
            // suitable bin found: replace it with its reduced capacity
            const auto value = *it; // store its value
            bins.erase(it); // erase the old value
            bins.insert(value - x); // insert the new value
        }
    }
    return static_cast<int>(bins.size()); // number of bins
}
In my measurements, it is about 100 times faster than your code for n=50000.
EDIT: Both algorithms mentioned above (First-Fit and Best-Fit) are approximations to the bin packing problem. To answer your revised question, you have to use an algorithm that finds the optimal solution. So, you need to find an algorithm for the exact solution, not an approximation. Instead of trying to reinvent the wheel, you can consider using already available libraries such as BPPLIB – A Bin Packing Problem Library.
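For the revised question (do all items fit into a fixed set of bins?), it may help to first get a correct sequential version before parallelizing anything. The following is a minimal backtracking sketch, not a tuned exact solver; the name can_fit and its signature are assumptions made for illustration. Unlike the recursive version in the question, it passes the vectors by reference and advances an index, which avoids copying a vector on every call.

```cpp
#include <cstddef>
#include <vector>

// Returns true if items[next..] can all be placed into the remaining
// capacities in bins. bins is modified during the search but restored
// to its original contents before the function returns.
bool can_fit(const std::vector<int>& items, std::size_t next, std::vector<int>& bins) {
    if (next == items.size()) {
        return true;  // every item has been placed
    }
    for (std::size_t b = 0; b < bins.size(); ++b) {
        if (bins[b] < items[next]) {
            continue;  // item does not fit into this bin
        }
        bins[b] -= items[next];                         // place the item
        const bool ok = can_fit(items, next + 1, bins); // try to place the rest
        bins[b] += items[next];                         // undo the placement
        if (ok) {
            return true;
        }
    }
    return false;  // no bin works for this item
}
```

The worst case is still exponential, so for large inputs a dedicated library such as the one mentioned above is likely the better choice.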

This is not a reduction: that would cause each thread to have its own partial result, and you want result to be global. I think that putting a critical section around the two statements might work. The atomic statement is meaningless since it is not on a shared variable.
But there is a deeper problem: each i iteration can write a result, which affects how far the search of the other iterations goes. That means that the outer iteration has to be sequential. (You really need to think hard about whether iterations are independent before you slap a parallel directive on them!) Maybe you can make the inner iteration parallel: it's a search, which would be a reduction on j. However, that loop would have to be pretty dang long before you'd see a performance improvement.
This looks to me like the sort of algorithm that you'd have to reformulate before you can make it parallel.

Multiple statements in a ranged for loop

I'd like to know if it's possible to convert this expression
vector<Mesh>::iterator vIter;
for(int count = 0, vIter = meshList.begin(); vIter < meshList.end(); vIter++, count++)
{
...
}
into something along the lines of C++ 11
I'd like to get something like this:
for(auto count = 0, auto mesh : meshList; ; count++)
{
...
}
Is there a way to do this?

No, it is not possible. The best you can do is the following:
int count = 0;
for(auto &mesh : meshList)
{
...
++count;
}
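For what it's worth, C++20 (not C++11) added an init-statement to the range-based for loop, which comes close to what the question asks for. A small sketch, assuming a C++20 compiler; the Mesh struct and the function weighted_sum are made up for illustration:

```cpp
#include <cstddef>
#include <vector>

struct Mesh {
    int id;  // placeholder member, only here so the loop has something to read
};

// Sums count * mesh.id, with count declared in the loop header (requires C++20).
long long weighted_sum(const std::vector<Mesh>& meshList) {
    long long sum = 0;
    for (std::size_t count = 0; const auto& mesh : meshList) {
        sum += static_cast<long long>(count) * mesh.id;
        ++count;  // the increment still has to live in the loop body
    }
    return sum;
}
```

The counter is scoped to the loop, but there is still no increment clause in the header, so ++count remains in the body.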

For completeness's sake alone, I just want to point out that you can define those two in the for loop's init-list (if you really want to) by (cheating and) aggregating them:
for(struct { int count; decltype(meshList)::iterator vIter; } _{0, meshList.begin()} ;
_.vIter < meshList.end(); _.vIter++, _.count++)
{
// ...
}
But as you may have noticed it's verbose, ugly, and totally not worth it. The solution in Remy's answer is better by a factor of 100 at least.

How to avoid use of goto and break nested loops efficiently

I'd say that it's a fact that using goto is considered a bad practice when it comes to programming in C/C++.
However, given the following code
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j++)
{
for (k = 0; k < N; ++k)
{
...
if (condition)
goto out;
...
}
}
}
out:
...
I wonder how to achieve the same behavior efficiently without using goto. What I mean is that we could do something like checking the condition at the end of every loop, for example, but AFAIK goto will generate just one assembly instruction, which will be a jmp. So this is the most efficient way of doing this I can think of.
Is there any other that is considered a good practice? Am I wrong when I say it is considered a bad practice to use goto? If I am, would this be one of those cases where it's good to use it?
Thank you

The (imo) best non-goto version would look something like this:
void calculateStuff()
{
// Please use better names than this.
doSomeStuff();
doLoopyStuff();
doMoreStuff();
}
void doLoopyStuff()
{
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j++)
{
for (k = 0; k < N; ++k)
{
/* do something */
if (/*condition*/)
return; // Intuitive control flow without goto
/* do something */
}
}
}
}
Splitting this up is also probably a good idea because it helps you keep your functions short, your code readable (if you name the functions better than I did) and dependencies low.

If you have deeply-nested loops like that and you must break out, I believe that goto is the best solution. Some languages (not C) have a break(N) statement that will break out of more than one loop. The reason C doesn't have it is that it's even worse than a goto: you have to count the nested loops to figure out what it does, and it's vulnerable to someone coming along later and adding or removing a level of nesting, without noticing that the break count needs to be adjusted.
Yes, gotos are generally frowned upon. Using a goto here is not a good solution; it's merely the least of several evils.
In most cases, the reason you have to break out of a deeply-nested loop is because you're searching for something, and you've found it. In that case (and as several other comments and answers have suggested), I prefer to move the nested loop to its own function. In that case, a return out of the inner loop accomplishes your task very cleanly.
(There are those who say that functions must always return at the end, not from the middle. Those people would say that the easy break-it-out-to-a-function solution is therefore invalid, and they'd force the use of the same awkward break-out-of-the-inner-loop technique(s) even when the search was split off to its own function. Personally, I believe those people are wrong, but your mileage may vary.)
If you insist on not using a goto, and if you insist on not using a separate function with an early return, then yes, you can do something like maintaining extra Boolean control variables and testing them redundantly in the control condition of each nested loop, but that's just a nuisance and a mess. (It's one of the greater evils that I was saying using a simple goto is lesser than.)

I think that goto is a perfectly sane thing to do here, and this is one of its exceptional use cases per the C++ Core Guidelines.
However, perhaps another solution to be considered is an IIFE lambda. In my opinion this is slightly more elegant than declaring a separate function!
[&] {
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; ++k)
if (condition)
return;
}();
Thanks to JohnMcPineapple on reddit for this suggestion!

In this case you don't want to avoid using goto.
In general the use of goto should be avoided, however there are exceptions to this rule, and your case is a good example of one of them.
Let's look at the alternatives:
for (i = 0; i < N; ++i) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; ++k) {
...
if (condition)
break;
...
}
if (condition)
break;
}
if (condition)
break;
}
Or:
int flag = 0;
for (i = 0; (i < N) && !flag; ++i) {
for (j = 0; (j < N) && !flag; j++) {
for (k = 0; (k < N) && !flag; ++k) {
...
if (condition) {
flag = 1;
break;
}
...
}
}
}
Neither of these is as concise or as readable as the goto version.
Using a goto is considered acceptable in cases where you're only jumping ahead (not backward) and doing so makes your code more readable and understandable.
If on the other hand you use goto to jump in both directions, or to jump into a scope which could potentially bypass variable initialization, that would be bad.
Here's a bad example of goto:
int x;
scanf("%d", &x);
if (x==4) goto bad_jump;
{
int y=9;
// jumping here skips the initialization of y
bad_jump:
printf("y=%d\n", y);
}
A C++ compiler will throw an error here because the goto jumps over the initialization of y. C compilers however will compile this, and the above code will invoke undefined behavior when attempting to print y which will be uninitialized if the goto occurs.
Another example of proper use of goto is in error handling:
void f()
{
char *p1 = malloc(10);
if (!p1) {
goto end1;
}
char *p2 = malloc(10);
if (!p2) {
goto end2;
}
char *p3 = malloc(10);
if (!p3) {
goto end3;
}
// do something with p1, p2, and p3
end3:
free(p3);
end2:
free(p2);
end1:
free(p1);
}
This performs all of the cleanup at the end of the function. Compare this to the alternative:
void f()
{
char *p1 = malloc(10);
if (!p1) {
return;
}
char *p2 = malloc(10);
if (!p2) {
free(p1);
return;
}
char *p3 = malloc(10);
if (!p3) {
free(p2);
free(p1);
return;
}
// do something with p1, p2, and p3
free(p3);
free(p2);
free(p1);
}
Where the cleanup is done in multiple places. If you later add more resources that need to be cleaned up, you have to remember to add the cleanup in all of these places plus the cleanup of any resources that were obtained earlier.
The above example is more relevant to C than C++ since in the latter case you can use classes with proper destructors and smart pointers to avoid manual cleanup.
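To illustrate that last point, here is a rough C++ counterpart of the function above using std::unique_ptr. It is a sketch rather than a drop-in replacement: the name use_three_buffers and the bool return value are assumptions added so the outcome can be observed, and new (std::nothrow) is used so that a failed allocation yields a null pointer like malloc does.

```cpp
#include <memory>
#include <new>

// Each buffer is released automatically when its unique_ptr goes out of
// scope, so every early return cleans up whatever was already allocated.
bool use_three_buffers() {
    std::unique_ptr<char[]> p1(new (std::nothrow) char[10]);
    if (!p1) {
        return false;
    }
    std::unique_ptr<char[]> p2(new (std::nothrow) char[10]);
    if (!p2) {
        return false;  // p1 is freed here without any explicit cleanup
    }
    std::unique_ptr<char[]> p3(new (std::nothrow) char[10]);
    if (!p3) {
        return false;  // p2 and p1 are freed here
    }
    // do something with p1, p2, and p3
    p1[0] = 'a';
    p2[0] = 'b';
    p3[0] = 'c';
    return p1[0] == 'a' && p2[0] == 'b' && p3[0] == 'c';
}
```

There are no labels and no duplicated free calls, and adding a fourth resource only adds one declaration.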

Lambdas let you create local scopes:
[&]{
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j++)
{
for (k = 0; k < N; ++k)
{
...
if (condition)
return;
...
}
}
}
}();
If you also want the ability to return out of that scope:
if (auto r = [&]()->boost::optional<RetType>{
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j++)
{
for (k = 0; k < N; ++k)
{
...
if (condition)
return {};
...
}
}
}
return {}; // loops finished without a result: nothing to return
}()) {
return *r;
}
where returning {} or boost::nullopt is a "break", and returning a value returns a value from the enclosing scope.
Another approach is:
for( auto idx : cube( {0,N}, {0,N}, {0,N} ) ) {
auto i = std::get<0>(idx);
auto j = std::get<1>(idx);
auto k = std::get<2>(idx);
}
where we generate an iterable over all 3 dimensions and make it a 1 deep nested loop. Now break works fine. You do have to write cube.
In C++17 this becomes
for( auto[i,j,k] : cube( {0,N}, {0,N}, {0,N} ) ) {
}
which is nice.
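Since cube is left as an exercise, here is one minimal way it could look. This is a sketch under assumptions, not the helper the snippets above rely on: it is a struct named Cube that takes three extents starting at 0 instead of {begin, end} pairs, and count_until_match is a made-up usage example. It walks a single flat index and decodes it into (i, j, k), so a plain break leaves the whole iteration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: iterates (i, j, k) over [0, ni) x [0, nj) x [0, nk).
struct Cube {
    std::size_t ni, nj, nk;

    struct Iterator {
        std::size_t flat;    // position in the flattened index space
        std::size_t nj, nk;  // extents needed to decode flat
        std::array<std::size_t, 3> operator*() const {
            return {flat / (nj * nk), (flat / nk) % nj, flat % nk};
        }
        Iterator& operator++() {
            ++flat;
            return *this;
        }
        bool operator!=(const Iterator& other) const { return flat != other.flat; }
    };

    Iterator begin() const { return {0, nj, nk}; }
    Iterator end() const { return {ni * nj * nk, nj, nk}; }
};

// Usage: counts the iterations performed before (1, 2, 3) is reached.
std::size_t count_until_match(std::size_t n) {
    std::size_t steps = 0;
    for (auto [i, j, k] : Cube{n, n, n}) {
        if (i == 1u && j == 2u && k == 3u) {
            break;  // a single break exits all three "levels"
        }
        ++steps;
    }
    return steps;
}
```

The divisions in operator* cost something per iteration; whether that matters depends on how heavy the loop body is.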
Now, in an application where you are supposed to be responsive, looping over a large 3 dimensional range at the primary control flow level is often a bad idea. You can thread it off, but even then you end up with the problem that the thread runs too long. And most large 3 dimensional iterations I've played with can benefit from using sub-task threading themselves.
To that end, you'll end up wanting to categorize your operation based on what kind of data it accesses, then pass your operation to something that schedules the iteration for you.
auto work = do_per_voxel( volume,
[&]( auto&& voxel ) {
// do work on the voxel
if (condition)
return Worker::abort;
else
return Worker::success;
}
);
then the control flow involved goes into the do_per_voxel function.
do_per_voxel isn't going to be a simple naked loop, but rather a system to rewrite the per-voxel tasks into per-scanline tasks (or even per-plane tasks depending on how large the planes/scanlines are at runtime (!)) then dispatch them in turn to a thread pool managed task scheduler, stitch up the resulting task handles, and return a future-like work that can be awaited on or used as a continuation trigger for when the work is done.
And sometimes you just use goto. Or you manually break out functions for subloops. Or you use flags to break out of deep recursion. Or you put the entire 3 layer loop in its own function. Or you compose the looping operators using a monad library. Or you can throw an exception (!) and catch it.
The answer to almost every question in c++ is "it depends". The scope of problem and the number of techniques you have available is large, and the details of the problem change the details of the solution.

Alternative - 1
You can do something like follows:
Set a bool variable at the beginning: isOkay = true.
To all of your for loop conditions, add an extra condition: isOkay == true.
When your custom condition is satisfied (or fails), set isOkay = false.
This will make your loops stop. An extra bool variable can sometimes be handy anyway.
bool isOkay = true;
for (int i = 0; isOkay && i < N; ++i)
{
for (int j = 0; isOkay && j < N; j++)
{
for (int k = 0; isOkay && k < N; ++k)
{
// some code
if (/*your condition*/)
isOkay = false;
}
}
}
Alternative - 2
Second, if the above loops are in a function, the best choice is to return the result as soon as the custom condition is satisfied.
bool loop_fun(/* pass the array and other arguments */)
{
for (int i = 0; i < N ; ++i)
{
for (int j = 0; j < N ; j++)
{
for (int k = 0; k < N ; ++k)
{
// some code
if (/* your condition*/)
return false;
}
}
}
return true;
}

Break your for loops out into functions.
It makes things significantly easier to understand because now you can see what each loop is actually doing.
bool doHerpDerp() {
for (i = 0; i < N; ++i)
{
if (!doDerp())
return false;
}
return true;
}
bool doDerp() {
for (int i=0; i<X; ++i) {
if (!doHerp())
return false;
}
return true;
}
bool doHerp() {
if (shouldSkip)
return false;
return true;
}

Is there any other that is considered a good practice? Am I wrong when
I say it is considered a bad practice to use goto?
goto can be misused and overused, but I don't see either of the two in your example. Breaking out of a deeply nested loop is most clearly expressed by a simple goto label_out_of_the_loop;.
It is bad practice to use many gotos that jump to different labels, but in such cases it isn't the keyword goto itself that makes your code bad. It is the fact that you are jumping around in the code, making it hard to follow, that makes it bad. If, however, you need a single jump out of nested loops, then why not use the tool that was made for exactly that purpose?
To use a made-up-out-of-thin-air analogy: imagine you live in a world where in the past it was hip to hammer nails into walls. In recent times it became more fashionable to drive screws into walls using screwdrivers, and hammers are completely out of fashion. Now consider that you have to (despite it being a bit old-fashioned) get a nail into a wall. You should not refrain from using a hammer to do that, but maybe you should rather ask yourself if you really need a nail in the wall instead of a screw.
(Just in case it isn't clear: the hammer is goto and the nail in the wall is a jump out of a nested loop, while the screw in the wall would be using functions to avoid the deep nesting altogether ;)

One possible way is to assign a boolean value to a variable that represents the state. This state can later be tested using an "IF" conditional statement for other purposes later on in the code.

As for your comment on efficiency: compiling both options in release mode on Visual Studio 2017 produces the exact same assembly.
for (int i = 0; i < 5; ++i)
{
for (int j = 0; j < 5; ++j)
{
for (int k = 0; k < 5; ++k)
{
if (i == 1 && j == 2 && k == 3) {
goto end;
}
}
}
}
end:;
and with a flag.
bool done = false;
for (int i = 0; i < 5; ++i)
{
for (int j = 0; j < 5; ++j)
{
for (int k = 0; k < 5; ++k)
{
if (i == 1 && j == 2 && k == 3) {
done = true;
break;
}
}
if (done) break;
}
if (done) break;
}
both produce..
xor edx,edx
xor ecx,ecx
xor eax,eax
cmp edx,1
jne main+15h (0C11015h)
cmp ecx,2
jne main+15h (0C11015h)
cmp eax,3
je main+27h (0C11027h)
inc eax
cmp eax,5
jl main+6h (0C11006h)
inc ecx
cmp ecx,5
jl main+4h (0C11004h)
inc edx
cmp edx,5
jl main+2h (0C11002h)
So there is no gain. Another option, if you're using a modern C++ compiler, is to wrap it in a lambda.
[](){
for (int i = 0; i < 5; ++i)
{
for (int j = 0; j < 5; ++j)
{
for (int k = 0; k < 5; ++k)
{
if (i == 1 && j == 2 && k == 3) {
return;
}
}
}
}
}();
Again, this produces the exact same assembly. Personally, I think using goto in your example is perfectly acceptable. It is clear to anyone else what is happening, and it makes for more concise code. Arguably the lambda is equally concise.

Specific
IMO, in this specific example, I think it is important to notice a common functionality between your loops. (Now I know that your example isn't necessarily literal here, but just bear with me for a sec) as each loop iterates N times, you can restructure your code like the following:
Example
int max_iterations = N * N * N;
for (int i = 0; i < max_iterations; i++)
{
/* do stuff, like the following for example */
*(some_ptr + i) = 0; // as opposed to *(some_3D_ptr + i*X + j*Y + Z) = 0;
// some_arr[i] = 0; // as opposed to some_3D_arr[i][j][k] = 0;
}
Now, it is important to remember that all loops, whether for or otherwise, are really just syntactic sugar for the if-goto paradigm. I agree with the others that you ought to have a function return the result; however, I wanted to show an example like the above in which that may not be the case. Granted, I'd flag the above in a code review, but if you replaced the above with a goto I'd consider that a step in the wrong direction. (NOTE -- Make sure that N * N * N reliably fits into your chosen datatype.)
General
Now, as a general answer, the exit conditions for your loop may not be the same everytime (like the post in question). As a general rule, pull as many unneeded operations out of your loops (multiplications, etc.) as far out as you can as, while compilers are getting smarter everyday, there is no replacement for writing efficient and readable code.
Example
/* matrix_length: int of m*n (row-major order) */
int num_squared = num * num;
for (int i = 0; i < matrix_length; i++)
{
some_matrix[i] *= num_squared; // some_matrix is a pointer to an array of ints of size matrix_length
}
Rather than writing *= num * num, we no longer have to rely on the compiler to optimize this out for us (though any good compiler should). Any doubly or triply nested loops that perform this kind of work benefit from the same treatment, and IMO so does your habit of writing clean and efficient code. In the first example, we could have instead had *(some_3D_ptr + i*X + j*Y + Z) = 0;! Do we trust the compiler to optimize out i*X and j*Y, so that they aren't computed every iteration?
bool check_threshold(int *some_matrix, int max_value)
{
for (int i = 0; i < rows; i++)
{
int i_row = i*cols; // no longer computed inside j loop unnecessarily.
for (int j = 0; j < cols; j++)
{
if (some_matrix[i_row + j] > max_value) return true;
}
}
return false;
}
Yuck! Why aren't we using classes provided by the STL or a library like Boost? (We must be doing some low-level, high-performance code here.) I couldn't even write a 3D version, due to the complexity. Even though we have hand-optimized something, it may even be better to use #pragma unroll or similar compiler hints if your compiler allows.
Conclusion
Generally, the higher the abstraction level you can use, the better. However, if, say, aliasing a 1-dimensional row-major order matrix of integers to a 2-dimensional array makes your code flow harder to understand or extend, is it worth it? Likewise, that may also be an indicator to make something into its own function. I hope that, given these examples, you can see that different paradigms are called for in different places, and it's your job as the programmer to figure that out. Don't go crazy with the above, but make sure you know what they mean, how to use them, and when they are called for, and most importantly, make sure the other people using your codebase know what they are as well and have no qualms about it. Good luck!

bool meetCondition = false;
for (i = 0; i < N && !meetCondition; ++i)
{
for (j = 0; j < N && !meetCondition; j++)
{
for (k = 0; k < N && !meetCondition; ++k)
{
...
if (condition)
meetCondition = true;
...
}
}
}

There are already several excellent answers that tell you how you can refactor your code, so I won’t repeat them. There isn’t a need to code that way for efficiency any more; the question is whether it’s inelegant. (Okay, one refinement I’ll suggest: if your helper functions are only ever intended to be used inside the body of that one function, you might help the optimizer out by declaring them static, so it knows for certain that the function does not have external linkage and will never be called from any other module, and the hint inline can’t hurt. However, previous answers say that, when you use a lambda, modern compilers don’t need any such hints.)
I’m going to challenge the framing of the question a bit. You’re correct that most programmers have a taboo against using goto. This has, in my opinion, lost sight of the original purpose. When Edsger Dijkstra wrote, “Go To Statement Considered Harmful,” there was a specific reason he thought so: the “unbridled” use of go to makes it too hard to reason formally about the current program state, and what conditions must currently be true, compared to control flow from recursive function calls (which he preferred) or iterative loops (which he accepted). He concluded:
The go to statement as it stands is just too primitive; it is too much an invitation to make a mess of one's program. One can regard and appreciate the clauses considered as bridling its use. I do not claim that the clauses mentioned are exhaustive in the sense that they will satisfy all needs, but whatever clauses are suggested (e.g. abortion clauses) they should satisfy the requirement that a programmer independent coordinate system can be maintained to describe the process in a helpful and manageable way.
Many C-like programming languages, for example Rust and Java, do have an additional “clause considered as bridling its use,” the break to a label. An even more restricted syntax might be something like break 2 continue; to break out of two levels of the nested loop and resume at the top of the loop containing them. This presents no more of a problem than a C-style break to what Dijkstra wanted to do: defining a concise description of the program state that programmers can keep track of in their heads or a static analyzer would find tractable.
Restricting goto to constructions like this makes it simply a renamed break to a label. The remaining problem with it is that the compiler and the programmer don’t necessarily know you’re only going to use it this way.
If there’s an important post-condition that holds after the loop, and your concern with goto is the same as Dijkstra’s, you might consider stating it in a short comment, something like // We have done foo to every element, or encountered condition and stopped. That would alleviate the problem for humans, and a static analyzer should do fine.

The best solution is to put the loops in a function and then return from that function.
This is essentially the same thing as your goto example, but with the massive benefit that you avoid having yet another goto debate.
Simplified pseudo code:
bool function (void)
{
bool result = something;
for (i = 0; i < N; ++i)
for (j = 0; j < N; j++)
for (k = 0; k < N; ++k)
if (condition)
return something_else;
...
return result;
}
Another benefit here is that you can upgrade from bool to an enum if you come across more than 2 scenarios. You can't really do that with goto in a readable way. The moment you start to use multiple gotos and multiple labels, is the moment you embrace spaghetti coding. Yes, even if you just branch downwards - it will not be pretty to read and maintain.
Notably, if you have 3 nested for loops, that may be an indication that you should try to split your code up in several functions and then this whole discussion might not even be relevant.

modifying values in pointers is very slow?

I'm working with a huge amount of data stored in an array, and am trying to optimize the amount of time it takes to access and modify it. I'm using Windows, C++ and VS2015 (Release mode).
I ran some tests and don't really understand the results I'm getting, so I would love some help optimizing my code.
First, let's say I have the following class:
class foo
{
public:
int x;
foo()
{
x = 0;
}
void inc()
{
x++;
}
int X()
{
return x;
}
void addX(int &_x)
{
_x++;
}
};
I start by initializing 10 million pointers to instances of that class into a std::vector of the same size.
#include <vector>
int count = 10000000;
std::vector<foo*> fooArr;
fooArr.resize(count);
for (int i = 0; i < count; i++)
{
fooArr[i] = new foo();
}
When I run the following code, and profile the amount of time it takes to complete, it takes approximately 350ms (which, for my purposes, is far too slow):
for (int i = 0; i < count; i++)
{
fooArr[i]->inc(); //increment all elements
}
To test how long it takes to increment an integer that many times, I tried:
int x = 0;
for (int i = 0; i < count; i++)
{
x++;
}
Which returns in <1ms.
I thought maybe the number of integers being changed was the problem, but the following code still takes 250ms, so I don't think it's that:
for (int i = 0; i < count; i++)
{
fooArr[0]->inc(); //only increment first element
}
I thought maybe the array index access itself was the bottleneck, but the following code takes <1ms to complete:
int x;
for (int i = 0; i < count; i++)
{
x = fooArr[i]->X(); //set x
}
I thought maybe the compiler was doing some hidden optimizations on the loop itself for the last example (since the value of x will be the same during each iteration of the loop, so maybe the compiler skips unnecessary iterations?). So I tried the following, and it takes 350ms to complete:
int x;
for (int i = 0; i < count; i++)
{
fooArr[i]->addX(x); //increment x inside foo function
}
So that one was slow again, but maybe only because I'm incrementing an integer with a pointer again.
I tried the following too, and it returns in 350ms as well:
for (int i = 0; i < count; i++)
{
fooArr[i]->x++;
}
So am I stuck here? Is ~350ms the absolute fastest that I can increment an integer, inside of 10million pointers in a vector? Or am I missing some obvious thing? I experimented with multithreading (giving each thread a different chunk of the array to increment) and that actually took longer once I started using enough threads. Maybe that was due to some other obvious thing I'm missing, so for now I'd like to stay away from multithreading to keep things simple.
I'm open to trying containers other than a vector too, if it speeds things up, but whatever container I end up using, I need to be able to easily resize it, remove elements, etc.
I'm fairly new to c++ so any help would be appreciated!

Let's look from the CPU point of view.
Incrementing a plain integer means I have it in a CPU register and just increment it. This is the fastest option.
With a pointer, I'm given an address (vector->member) and I must load the value into a register, increment it, and store the result back to that address. Worse: my CPU cache is filled with the vector's pointers, not with the objects they point to. Too few hits, too much cache "refueling".
If I could manage to have all those members directly in a vector, CPU cache hits would be much more frequent.

Try the following:
int count = 10000000;
std::vector<foo> fooArr;
fooArr.resize(count, foo());
for (auto it= fooArr.begin(); it != fooArr.end(); ++it) {
it->inc();
}
The new is killing you, and you don't actually need it, because resize appends elements at the end when the new size is greater than the current one (check the docs: std::vector::resize).
The other thing is the use of pointers, which IMHO should be avoided until the last moment and is unnecessary in this case. The performance should be somewhat better this way, since you get better locality of reference (see cache locality). If the objects were polymorphic or something more complicated, it might be different.
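If you want to check the effect on your own machine, a small timing harness along these lines may help. It is only a sketch: Foo is a reduced stand-in for the class in the question, and sum_after_inc and elapsed_ms are made-up helper names. Returning a sum keeps the optimizer from discarding the loops, and std::unique_ptr is used in place of raw new so that nothing leaks.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

struct Foo {
    int x = 0;
    void inc() { ++x; }
};

// Objects reached through one pointer per element (separate heap allocations).
long long sum_after_inc(std::vector<std::unique_ptr<Foo>>& items) {
    long long sum = 0;
    for (auto& p : items) {
        p->inc();
        sum += p->x;
    }
    return sum;
}

// Objects stored contiguously inside the vector itself.
long long sum_after_inc(std::vector<Foo>& items) {
    long long sum = 0;
    for (auto& f : items) {
        f.inc();
        sum += f.x;
    }
    return sum;
}

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For example, elapsed_ms([&] { total = sum_after_inc(values); }) with 10 million elements in each container gives the two numbers to compare. Run each measurement several times in Release mode; the difference depends on the allocator and the hardware.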

Change for loop condition with respect to an external flag (cpp)

I have a code block as following, where the inner for loop code remains the same but only the loop condition changes based on the reverseFlag. Is there a better way to code this without having to copy paste the content of the for loop twice ?
bool reverseFlag=false;
if (reverseFlag)
{
for(int i = 1; i < TotalFrames; i++)
{...}
}
else
{
for(int i = TotalFrames-1; i >0; i--)
{...}
}

Yes, you can do it in a single for loop, like this:
int from, to, step;
if (reverseFlag) {
from = TotalFrames-1;
to = -1;
step = -1;
} else {
from = 0;
to = TotalFrames;
step = 1;
}
for (int i = from ; i != to ; i+= step) {
...
}
A single conditional ahead of the loop prepares loop's parameters - i.e. its starting and ending values and the step, and then the loop uses these three values to iterate in the desired direction.

There are several options. You can:
Use two loops but put the loop body in a separate function/object/lambda.. to avoid duplication.
Use an increasing loop and calculate the real index within the loop:
j = reverseFlag ? TotalFrames - i : i;
Pre-calculate the loop conditions as @dasblinkenlight suggested.
Note that if you have a performance critical loop, some of these methods could hurt performance. If in doubt, check what your compiler does and measure the elapsed time.
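The first option could look roughly like this. It is a sketch with assumed names: visit_frames is made up, the lambda body just records the index where the real per-frame work would go, and the ranges follow the question (index 0 is skipped in both directions).

```cpp
#include <vector>

// Visits frames 1 .. total_frames - 1 forwards or backwards and returns
// the order in which they were visited. The loop body exists only once.
std::vector<int> visit_frames(int total_frames, bool reverse) {
    std::vector<int> order;
    auto body = [&order](int i) {
        order.push_back(i);  // placeholder for the real per-frame work
    };
    if (reverse) {
        for (int i = total_frames - 1; i > 0; --i) {
            body(i);
        }
    } else {
        for (int i = 1; i < total_frames; ++i) {
            body(i);
        }
    }
    return order;
}
```

Compilers usually inline a lambda like this, so the two loops tend to cost the same as hand-duplicated ones, but as noted above, measure if the loop is performance critical.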
