Given some code which does a bit of math/casting for each number from 1 to 500000, we have options:
Simple for loop: for ^500000 -> $i { my $result = ($i ** 2).Str; }. In my unscientific benchmark, this takes 2.8 seconds.
The most canonical parallel version does each bit of work in a Promise, then waits for the result. await do for ^500000 -> $i { start { my $result = ($i ** 2).Str; } } takes 19 seconds. This is slow! Creating a new promise must have too much overhead to be worthwhile for such a simple computation.
Using a parallel map operation is fairly fast. At 2.0 seconds, the operation seems barely slow enough to take advantage of parallelization: (^500000).race.map: -> $i { my $result = ($i ** 2).Str; }
The third option seems best. Unfortunately, it reads like a hack. We should not be writing map code purely for iteration in sink context: readers who see "map" in the source may assume the purpose is to build a list, which isn't our intent at all. It's poor communication to use map this way.
Is there any canonical fast way to use Perl 6's built in concurrency? A hyper operator would be perfect if it could accept a block instead of only functions:
(^500000)».(-> $i { my $result = ($i ** 2).Str; }) # No such method 'CALL-ME' for invocant of type 'Int'
If you want to use for with a hyper or race operation, you have to spell it hyper for @blah.hyper(:batch(10_000)) or race for @blah.race(:batch(10_000)). Or without parameters: hyper for @blah, race for @blah.
This was decided because you might have code like for some-operation() { some-non-threadsafe-code } where some-operation is part of a library or something. Now you cannot tell any more if the for loop can have thread-unsafe code in it or not, and even if you know the library doesn't return a HyperSeq at that point in time, what if the library author comes up with this great idea to make some-operation faster by hypering it?
That's why a signifier for "it's safe to run this for loop in parallel" is required right where the code is, not only where the sequence gets created.
On my PC, this is a bit (~15%) faster than the naive loop:
(^500_000).hyper(batch => 100_000).map(-> $i { my $result = ($i ** 2).Str; })
Since the computation inside the loop is really fast, typically the cost of parallelization and synchronization dwarfs any gains you get from it. The only remedy is a large batch size.
Update: with a batch size of 200_000 I get slightly better results (another few percent faster).
Related
Maybe this is a stupid question, but say I've got two functions, void F1(int x) and void F2(int x), and I want to execute them in each iteration of a for loop. How much would it differ (performance-wise) if I did one big for loop, like this:
for (int i = 0; i < 100; i++)
{
    F1(i);
    F2(i);
}
compared to doing two separate loops, one in which I call F1, one in which I call F2:
for (int i = 0; i < 100; i++)
    F1(i);
for (int i = 0; i < 100; i++)
    F2(i);
While writing this, it occurred to me that the first way is probably faster, because there are only approximately 100 increments and 100 comparisons, while in the second case we get 200 of each.
Say my loop only has to run for 200 iterations. Would the two-for loops approach be pretty much the same in terms of performance, considering, say, CPUs from 2007 and after:)?
It depends on what F1 and F2 do.
It might not matter at all, or you could see a dramatic slowdown from calling both functions one after the other.
As an example of the latter case, suppose F1 and F2 each access a different array, and on every call they read enough data to overwrite the whole cache. That would probably cause a serious slowdown.
But it is always better not to speculate; measure and benchmark your code instead. If performance is equivalent for both versions, go for the more readable one.
Well, as you pointed out, the number of operations differs. However, if you go with the second solution you can use multiple threads and achieve the same or better performance. Also consider readability, testability, expandability, and encapsulation; usually those factors are more important than any small performance gain you might get. The compiler is also usually very good at making your code run efficiently, so my advice is to focus on readability over performance in most cases.
I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related questions refer to nested loops, while I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I quite suspect that it does indeed impact performance (although I do not know how much), so I wanted to ask you. My reasoning goes as follows:
Independently of the extra code for the conditional statement that triggers the break (like an if), it necessarily adds additional instructions to my loop.
Further, it probably also interferes when my compiler tries to unroll the for-loop, as it no longer knows the number of iterations at compile time, effectively turning it into a while-loop.
Therefore, I suspect it does have a performance impact, which could be considerable for very fast and tight loops.
So this takes me to a follow-up question: is a for-loop with a break performance-wise equal to a while-loop? Like in the following snippet, where we assume that checkCondition() evaluates to true 99.9% of the time. Do I lose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while (i-- && checkCondition())
{
    // do stuff
}
// USING FOR
for (int i = 100; i; --i)
{
    if (checkCondition()) {
        // do stuff
    } else {
        break;
    }
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -S (which I usually do) for this matter, as I am not interested in the particular result of my compiler. I am rather interested in the concept itself (in an academic sense), as I am not sure I got this completely right :)
The principal answer is to avoid spending time on such micro-optimizations until you have verified that the condition evaluation is a bottleneck.
The real answer is that CPUs have powerful branch-prediction circuits which empirically work really well.
What will happen is that your CPU will guess whether the branch is going to be taken and execute the code as if the if condition were not even present. Of course this relies on several assumptions, like the condition calculation having no side effects (so that no part of the loop body depends on it) and the condition always evaluating to false up to a certain point, at which it becomes true and stops the loop.
Some compilers also allow you to specify the likeliness of an evaluation as a hint to the branch predictor.
If you want to see the semantic difference between the two code versions, just compile them with -S and examine the generated asm code; there's no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;
for (int i = 0; i < 32; i++)
{
    stop &= checkcondition(); // returns 0 or all-bits-set
    sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i < 32; i++)
{
    if (!checkcondition())
        break;
    sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.
I have two huge arrays (int source[1000], dest[1000] in the code below, but having millions of elements in reality). The source array contains a series of ints of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large += 4)
{
    dest[count_small] = source[count_large];
    dest[count_small + 1] = source[count_large + 1];
    dest[count_small + 2] = source[count_large + 2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the rest of the code takes only 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time on this loop alone.
Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.
Before you do much more, try profiling your application and determine if this is the best place to spend your time. Then, if this is a hot spot, determine how fast it is and how fast you need it to be or could make it. Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you have now noted, if you are on a single-core processor, in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
@Sneftel has also referenced other options involving SIMD integers.
One option would be to try parallel processing the loop and see if that helps. You could try the OpenMP standard (see the Wikipedia article on OpenMP), but you would have to try it for your specific situation and see if it helped. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the recent threading support in C11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
[...] (int i)
{
... do stuff
}
);
Inside the for loop, you still have other options. You could try a loop that visits every element and skips every fourth one (just skip when (i+1) % 4 == 0) instead of doing 3 copies per iteration, or do block memcpy operations for groups of 3 integers as per Sean's answer. You might get slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, j = 0; i < 1000; i++)
{
    if ((i + 1) % 4 != 0)
    {
        dest[j] = source[i];
        j++;
    }
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.
You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
Is your array size only 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array, and for a single-threaded application, this is the only way AFAIK.
However, if the datasets are huge, you could try a multi threaded application.
Also, you could explore using a bigger data type to hold the values, so that the array size decreases... that is, if this is viable for your real-life application.
If you have an Nvidia card you can consider using CUDA. If that's not the case, you can try other parallel programming methods/environments as well.
As in the topic: I learnt in school that the for loop is faster than the while loop, but someone told me that while is faster.
I must optimize the program, and I want to write while instead of for, but I am concerned that it will be slower.
For example, I can change this for loop:
for (int i = 0; i < x; i++)
{
    cout << "dcfvgbh" << endl;
}
into while loop:
int i = 0;
while (i < x)
{
    cout << "dcfvgbh" << endl;
    i++;
}
The standard requires (§6.5.3/1) that:
The for statement
for ( for-init-statement condition(opt) ; expression(opt) ) statement
is equivalent to
{
    for-init-statement
    while ( condition ) {
        statement
        expression;
    }
}
As such, you're unlikely to see much difference between them (even if execution time isn't necessarily part of the equivalence specified in the standard). There are a few exceptions to the equivalence as well (the scopes of names, and the execution of the expression before evaluating the condition when you execute a continue). The latter could, at least theoretically, affect speed a little under some conditions, but probably not enough to notice or care about as a rule, and definitely not unless you actually use a continue inside the loop.
For all intents and purposes for is just a fancy way of writing while, so there is no performance advantage either way. The main reason to use one over the other is how the intent is translated so the reader understands better what the loop is actually doing.
No.
Nope, it's not.
It is not faster.
Your cout will eat 99% of the clock cycles for this loop. Beware micro-optimization. At any rate, these two will give essentially identical code.
The only time when a for loop can be faster is when you have a known terminating condition - e.g.
for(ii = 0; ii < 24; ii++)
because some optimizing compilers will perform loop unrolling. This means they will not perform a test on every pass through the loop because they can "see" that just doing the thing inside the loop 24 times (or 6 times in blocks of 4, etc) will be a tiny bit more efficient. When the thing inside the loop is very small (e.g. jj += ii;), such optimization makes the for loop a bit faster than the while (which typically doesn't do "unrolling").
Otherwise - no difference.
Update at the request of @zeroth:
Source: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9346&rep=rep1&type=pdf
Quote from source (my emphasis):
Unrolling a loop at the source-code level involves identification of loop constructs (e.g., for, while, do-while, etc.), determination of the loop count to ensure that it is a counting loop, replication of the loop body and the adjustment of loop count of the unrolled loop. A prologue or epilogue code may also be inserted. Using this approach, it is difficult to unroll loops formed using while and goto statements since the loop count is not obvious. However, for all but the simplest of loops, this approach is tedious and error prone.
The other alternative is to unroll loops automatically. Automatic unrolling can be done early on source code, late on the unoptimized intermediate representation, or very late on an optimized representation of the program. If it is done at the source-code level, then typically only counting loops formed using for statements are unrolled. Unrolling loops formed using other control constructs is difficult since the loop count is not obvious.
To the best of my knowledge swapping out for loops for while loops is not an established optimization technique.
Both your examples will be identical in performance, but as an exercise you could time them to confirm this for yourself.
Hey Dear!
I know that the if statement is an expensive statement in C++. I remember that my teacher once said the if statement is expensive in terms of computer time.
Now, we can do everything using if statements in C++, so it is a very powerful statement from a programming perspective, but it is expensive from a computer-time perspective.
I am a beginner, and I am studying a data structures course after an introduction to C++ course.
My question to you is:
Is it better for me to use if statements extensively?
If statements are compiled into a conditional branch. This means the processor must jump (or not) to another line of code, depending on a condition. In a simple processor this can cause a pipeline stall, which in layman's terms means the processor has to throw away work it did earlier, wasting time on the assembly line. However, modern processors use branch prediction to avoid stalls, so if statements become less costly.
In summary, yes they can be expensive. No, you generally shouldn't worry about it. But Mykola brings up a separate (though equally valid) point. Polymorphic code is often preferable (for maintainability) to if or case statements
I'm not sure how you can generalise that the if statement is expensive.
If you have
if ( true ) { ... }
then this if will most likely be optimised away by your compiler.
If, on the other hand, you have..
if ( veryConvolutedMethodTogGetAnswer() ) { .. }
and the method veryConvolutedMethodTogGetAnswer() does lots of work, then you could argue that this is an expensive if statement, but not because of the if itself; rather because of the work you're doing in the decision-making process.
"if"'s themselves are not usually "expensive" in terms of clock cycles.
Premature optimization is a bad idea. Use if statements where they make sense. When you discover a part of your code where its performance needs improvement, then possibly work on removing if statements from that part of your code.
If statements can be expensive because they force the compiler to generate branch instructions. If you can figure out a way to code the same logic such that the compiler does not have to branch at all, the code will likely be a lot faster, even if there are more total instructions. I remember being incredibly surprised at how recoding a short snippet to use various bit manipulations rather than any branching sped it up by 10-20%.
But that is not a reason to avoid them in general. It's just something to keep in mind when you're trying to wring the last bit of speed out of a section of code you know is performance critical because you've already run a profiler and various other tools to prove it to yourself.
Another reason if statements can be expensive is because they can increase the complexity of your program which makes it harder to read and maintain. Keep the complexity of your individual functions low. Don't use too many statements that create different control paths through your function.
I would say a lot of if statement is expensive from the maintainability perspective.
An if statement implies a conditional branch, which can be a bit more expensive than code that doesn't branch.
As an example, counting how many times a condition is true (e.g how many numbers in a vector are greater than 10000):
for (std::vector<int>::const_iterator it = v.begin(), end = v.end(); it != end; ++it) {
    //if (*it > 10000) ++count;
    count += *it > 10000;
}
The version which simply adds 1 or 0 to the running total may be a tiny amount faster (I tried with 100 million numbers before I could discern a difference).
However, with MinGW 3.4.5, using a dedicated standard algorithm turns out to be noticeably faster:
count = std::count_if(v.begin(), v.end(), std::bind2nd(std::greater<int>(), 10000));
So the lesson is that, before starting to optimize prematurely using some tricks you've learnt off the internets, you might try out the recommended practices for the language. (And naturally make sure first that that part of the program is unreasonably slow in the first place.)
Another place where you can often avoid evaluating complicated conditions is using look-up tables (a rule of thumb: algorithms can often be made faster if you let them use more memory). For example, counting vowels (aeiou) in a word-list, where you can avoid branching and evaluating multiple conditions:
unsigned char letters[256] = {0};
letters['a'] = letters['e'] = letters['i'] = letters['o'] = letters['u'] = 1;
for (std::vector<std::string>::const_iterator it = words.begin(), end = words.end(); it != end; ++it) {
    for (std::string::const_iterator w_it = it->begin(), w_end = it->end(); w_it != w_end; ++w_it) {
        unsigned char c = *w_it;
        /*if (c == 'e' || c == 'a' || c == 'i' || c == 'o' || c == 'u') {
            ++count;
        }*/
        count += letters[c];
    }
}
You should write your code to be correct, easy to understand, and easy to maintain. If that means using if statements, use them! I would find it hard to believe that someone suggested you not use the if statement.
Maybe your instructor meant that you should avoid something like this:
if (i == 0) {
    ...
} else if (i == 1) {
    ...
} else if (i == 2) {
    ...
} ...
In that case, it might be more logical to rethink your data structure and/or algorithm, or at the very least, use switch/case:
switch (i) {
    case 1: ...; break;
    case 2: ...; break;
    ...;
    default: ...; break;
}
But even then, the above is better more because of improved readability rather than efficiency. If you really need efficiency, things such as eliminating if conditions are probably a bad way to start. You should profile your code instead, and find out where the bottleneck is.
Short answer: use if if and only if it makes sense!
In terms of computer time the "if" statement by itself is one of the cheapest statements there is.
Just don't put twenty of them in a row when there is a better way like a switch or a hash table, and you'll do fine.
You can use switch instead, which makes the code more readable, but I don't know if it is any faster. If you have something like:
if (condition1) {
    // do something
} else if (condition2) {
    // do something
} else if (condition3) {
    // do something
} else if (condition4) {
    // do something
}
I am not sure what can be done to speed it up. If condition4 occurs more frequently, you might move it to the top.
In a data structures course, the performance of an if statement doesn't matter. The small difference between an if statement and any of the obscure alternatives is totally swamped by the difference between data structures. For instance, in the following pseudocode
FOR EACH i IN container
    IF i < 100
        i = 100
    container.NEXT(i)
END FOR
the performance is mostly determined by container.NEXT(i); this is far more expensive for linked lists than it is for contiguous arrays. For linked lists this takes an extra memory access, which depending on the cache(s) may take somewhere between 2.5 ns and 250 ns. The cost of the if statement would be measured in fractions of a nanosecond.
I did confront performance issues due to too many if statements called inside loops in batch scripts, so instead I used integer math to emulate the if statement, and it dramatically improved performance.
if %var1% gtr 1543 (
    set var=1
) else (
    set var=0
)
equivalent to
set /a var=%var1%/1543
I even used much longer expressions, with many / and % operations, and it was still preferable to an if statement.
I know this is not C++, but I guess the same principle applies. So whenever you need performance, avoid conditional statements as much as you can.