I'm programming some ring buffers and this question has come up several times.
Suppose we have a counter and we need to reset it after a certain count.
I've seen several examples of ring buffers (mostly audio, wrapping read/write pointers around) that do this:
x++;
if (x == SOME_NUMBER) { // Resetting counter
    x -= x;
}
is there any difference/preference in doing this instead?
x++;
if (x == SOME_NUMBER) { // Resetting counter
    x = 0;
}
This question applies to almost all kinds of variable resets. In my case, besides ring buffers, I'm also resetting a counter that drives an average: after I've taken all my measurements, I reset that counter.
Besides the fact that the result is the same (x resetting to zero), there may be some difference between one approach and the other. Is there any preference?
Consider these slightly modified versions of your snippets:
void f(int n)
{
    int x = 0;
    for (;;)
    {
        ++x;
        if (x == n) { // Resetting counter
            x -= x;
        }
        // Ending condition to avoid UB
        if (x == 42)
            return;
    }
}
void g(int n)
{
    int x = 0;
    for (;;)
    {
        ++x;
        if (x == n) {
            x = 0;
        }
        if (x == 42)
            return;
    }
}
If you look at the generated assembly (e.g. using Compiler Explorer) you'll notice how modern optimizing compilers can take advantage of the as-if rule.
Clang (with -O2) generates the same machine code for both functions. It uses
xor eax, eax
to load a zero into a register and then
cmove ecx, eax
to "reset" the other register when needed.
GCC just generates f() and then g() becomes
jmp f(int)
That said
Is there any preference?
A common guideline is to write the more readable and maintainable code and to explore possible optimizations only after having profiled it.
In most cases I'd use the x = 0; version, because it conveys the intent better, IMHO. I can only think of a couple of reasons to adopt the x -= x; one:
It does not rely on "magic numbers". However, that would also be true of the 42 literal in my snippet, and 0 is a well-understood exception.
It doesn't need any implicit conversion. Consider any case where x is not an int (see the sketch after this list).
There may be some architectures/toolchains where it actually delivers faster code. I can't think of any, but that's immaterial.
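As an illustration of the conversion point, here's a minimal sketch (my example, not from the original post) where x is a std::chrono duration: assignment from a literal 0 doesn't compile, while x -= x does:
#include <chrono>

int main()
{
    std::chrono::milliseconds x{42};
    // x = 0;  // error: no implicit conversion from int to a duration
    x -= x;    // OK: resets x to a zero-length duration
    // x = std::chrono::milliseconds::zero(); // the explicit alternative
}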
The difference is in the number of operations: x -= x is a read, a subtraction and an assignment, whereas x = 0 is just an assignment. Besides the number of CPU cycles, this affects behavior if x is accessible from other threads.
A simple assignment x = 0 is much clearer as well IMO.
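To make the threading point concrete, here's a minimal sketch assuming x is a std::atomic<int> (my assumption, not part of the original answers):
#include <atomic>

std::atomic<int> x{0};

void reset_with_store()
{
    x = 0;   // a single atomic store: x is guaranteed to be 0 afterwards
}

void reset_with_sub()
{
    x -= x;  // expands to x.fetch_sub(x.load()): two separate operations,
             // so a concurrent increment between them can leave x nonzero
}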
Related
I was trying to write some code that would allow me to observe reordering of memory operations.
In the following example I expected that on some executions of set_values() the order of the assignments could change. In particular, notification = 1 might occur before the rest of the operations, but it doesn't happen even after thousands of iterations.
I've compiled the code with -O3 optimization.
Here is the YouTube material I'm referring to: https://youtu.be/qlkMbxUbKfw?t=200
#include <iostream>
#include <thread>

int a{0};
int b{0};
int c{0};
int notification{0};

void set_values()
{
    a = 1;
    b = 2;
    c = 3;
    notification = 1;
}

void calculate()
{
    while (notification != 1);
    a += b + c;
}

void reset()
{
    a = 0;
    b = 0;
    c = 0;
    notification = 0;
}

int main()
{
    a = 6; // just to allow the first iteration
    for (int i = 0; a == 6; i++)
    {
        reset();
        std::thread t1(calculate);
        std::thread t2(set_values);
        t1.join();
        t2.join();
        std::cout << "Iteration: " << i << ", " "a = " << a << std::endl;
    }
    return 0;
}
Now the program is stuck in an infinite loop. I expected that in some iterations the order of the instructions in set_values() could change (due to optimizations involving cache memory). For example, notification = 1 would be executed before c = 3, which would trigger calculate() and give a == 3, satisfying the loop's termination condition and proving the reordering.
Or maybe someone can provide another trivial example of code that helps observe reordering of memory operations?
The compiler can indeed reorder your assignments in the function set_values. However, it is not required to do so. In this case it has no reason to reorder anything, since you are assigning constants to all four variables.
Now the program is stuck in an infinite loop.
This is probably because while(notification != 1); will be optimized to an infinite loop.
With a bit of work, we can find a way to make the compiler reorder the assignment notify = 1 before the other statements, see https://godbolt.org/z/GY-pAw.
Notice that the program in that link reads x from the standard input; this is done to force the compiler to read from a memory location.
I've also made the variable notification volatile, so that while(notification != 1); doesn't get optimised away.
You can try this example on your machine; I've been able to consistently fail the assertion using g++ 9.2 with -O3 running on an Intel Sandy Bridge CPU.
Be aware that the CPU itself can reorder instructions if they are independent of each other, see https://en.wikipedia.org/wiki/Out-of-order_execution. This is, however, a bit tricky to test and reproduce consistently.
Your compiler optimizes in unexpected ways but is allowed to do so because you are violating a fundamental rule of the C++ memory model.
You cannot access a memory location from multiple threads if at least one of them is a writer.
To synchronize, either use a std::mutex or use std::atomic<int> instead of int for your variables.
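A minimal sketch of the atomic fix (my adaptation of the question's code, not part of the original answer): making notification an atomic flag with release/acquire ordering removes the data race while still letting set_values() publish its writes correctly:
#include <atomic>

int a = 0, b = 0, c = 0;
std::atomic<int> notification{0};

void set_values()
{
    a = 1;
    b = 2;
    c = 3;
    // release: the three stores above become visible before the flag
    notification.store(1, std::memory_order_release);
}

void calculate()
{
    // acquire: once the flag is seen, a, b and c are visible too
    while (notification.load(std::memory_order_acquire) != 1)
        ;
    a += b + c;
}
With std::memory_order_relaxed instead of release/acquire, the compiler and CPU would again be free to reorder the stores, but without undefined behavior.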
So, I am new to online competitive programming and I came across some code where I am using an if/else statement inside a for loop. I want to increase the speed of the loop, and after doing some research I came across the break and continue statements.
So my question is: does using continue really increase the speed of the loop or not?
Code:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    } else {
        // do other stuff when the multiple-of-4 sum is not updated
    }
}
In the specific code in the question, the code has the identical meaning with and without the continue: In either case, after execution leaves even_sum +=i;, it flows to the closing } of the for statement. Any compiler of even modest quality should treat the two options identically.
The intended purpose of continue is not to speed up code by requesting a jump the compiler is going to make anyway but to skip code that is undesired in the current loop iteration—it acts as if the remaining code had been enclosed in an else clause but may be more visually appealing and less disruptive to human perception of the code.
It is conceivable a very rudimentary compiler, or even a decent compiler but with optimization disabled, might generate a jump instruction for the continue and also a jump instruction for the “then” clause of the if statement to jump over the else clause. The latter would never be executed and would have no direct effect on program execution time, but it would increase the size of the program and thus could have indirect effects. This possibility is of negligible concern in typical modern environments, where you are unlikely to encounter such a rudimentary compiler.
No, there's no speed advantage when using continue here. Both of your snippets are identical, and even without optimizations they produce the same machine code.
However, sometimes continue can make your code a lot more efficient, if you have structured your loop in a specific way, e.g.
This:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
if (i % 4 == 0) {
even_sum += i;
continue;
}
if (huge_computation_but_always_false_when_multiple_of_4(i)) {
// do stuff
}
}
is a lot more efficient than:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
if (i % 4 == 0) {
even_sum += i;
}
if (huge_computation_but_always_false_when_multiple_of_4(i)) {
// do stuff
}
}
because the former doesn't have to execute the huge_computation_but_always_false_when_multiple_of_4() function every time.
So even though both of these codes would always produce the same result (given that huge_computation_but_always_false_when_multiple_of_4() has no side effects), the first one, which uses continue, would be a lot faster.
Closed. This question needs debugging details. It is not currently accepting answers. Closed 7 years ago.
I've been doing some of the LeetCode problems, and I notice that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C, runs in 3 ms:
int searchInsert(int A[], int n, int target) {
    int left = 0;
    int right = n;
    int mid = 0;
    while (left < right) {
        mid = (left + right) / 2;
        if (A[mid] < target) {
            left = mid + 1;
        }
        else if (A[mid] > target) {
            right = mid;
        }
        else {
            return mid;
        }
    }
    return left;
}
My C++ solution, exactly the same but as a member function of the Solution class, runs in 13 ms:
class Solution {
public:
    int searchInsert(int A[], int n, int target) {
        int left = 0;
        int right = n;
        int mid = 0;
        while (left < right) {
            mid = (left + right) / 2;
            if (A[mid] < target) {
                left = mid + 1;
            }
            else if (A[mid] > target) {
                right = mid;
            }
            else {
                return mid;
            }
        }
        return left;
    }
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
    long rev = x % 10;
    x /= 10;
    while (x != 0) {
        rev *= 10L;
        rev += x % 10;
        x /= 10;
        if (rev > (-1U >> 1) || rev < (1 << 31)) {
            return 0;
        }
    }
    return rev;
}
And the C++ version, exactly the same but as a member function of the Solution class, runs in 19 ms:
class Solution {
public:
    int reverse(int x) {
        long rev = x % 10;
        x /= 10;
        while (x != 0) {
            rev *= 10L;
            rev += x % 10;
            x /= 10;
            if (rev > (-1U >> 1) || rev < (1 << 31)) {
                return 0;
            }
        }
        return rev;
    }
};
I see how there would be considerable overhead from using a vector of vectors as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer that issue, because the data structures are pretty raw, especially in the second case where all you have is long and int arithmetic. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode does its benchmarking in general, because even in the C version of the integer-reversing problem you get a huge bump in running time just from replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include <limits.h> might have something to do with that, but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2d arrays in C++, and I've been pointing out to people why this really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never any reason to ever use it for real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking:
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
C++ supports building rich abstractions that don't require overhead, but eliminating that overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations and therefore C++ debug builds are often slower than debug builds of C style code because C style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using a vector of vectors in your C++ code snippet. Vectors are sequence containers in C++, like arrays that can change in size. Instead of vector<vector<int>>, statically allocated arrays would be better. You may also use your own Array class with operator[] overloaded, but vector has more overhead because it dynamically resizes when you add more elements than its original capacity. In C++, use pass-by-reference to further reduce your time when comparing with C. C++ should run just as fast, if not faster, when written well.
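A minimal sketch of such a wrapper class (my illustration, not from the original answer): one contiguous allocation indexed with a single multiplication, instead of one heap block per row as with vector<vector<int>>:
#include <cstddef>
#include <vector>

class Grid {
public:
    Grid(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}

    // row-major access: one multiply and one add, no pointer chase per row
    int& operator()(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    int  operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

private:
    std::size_t cols_;
    std::vector<int> data_;
};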
Recently I was working on an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
// store / reset points
temp = hPoint = 0;
for(int channel = 0; channel < audioData.size(); channel++)
{
if (peakmode) /* fir rms of window size */
{
for (int z = 0; z < sizeFactor; z++)
{
temp += audioData[channel][x * sizeFactor + z + offset];
}
hPoint += temp / sizeFactor;
}
else /* highest sample in window */
{
for (int z = 0; z < sizeFactor; z++)
{
temp = audioData[channel][x * sizeFactor + z + offset];
if (std::fabs(temp) > std::fabs(hPoint))
hPoint = temp;
}
}
.. some other code
}
... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audio data is sanitized beforehand to not include NaNs/infs/denormals, and it only has a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with the handling of non-normal values?
Edit: the layout of the floating-point model conforms to IEEE, and sizeof(float) == sizeof(int) == 4
Well, you set the floating-point mode to IEEE-conforming. Typically, with switches like -ffast-math, the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
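A small demonstration of that property (my sketch, assuming 32-bit IEEE-754 floats; memcpy is used to inspect the bits without the aliasing problem discussed below):
#include <cassert>
#include <cstdint>
#include <cstring>

// well-defined bit copy; compilers reduce this to a single move
static std::int32_t bits(float f)
{
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

int main()
{
    // for positive finite floats, value order matches bit-pattern order
    assert((1.5f < 2.5f) == (bits(1.5f) < bits(2.5f)));
    assert((0.1f < 100.0f) == (bits(0.1f) < bits(100.0f)));
}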
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified: handling denormals and other exceptional values is tricky, and the stdlib function needs to take those into account; the other reason is, again, the undefined behavior.
One (non-)solution if you care about performance:
Instead of casts and pointers, you can use a union. Unfortunately, that only works in C, not in C++. It won't result in UB, but it's still not portable (although it will likely work on most, if not all, platforms with IEEE-754).
union {
    float f;
    unsigned u;
} pun = { .f = -3.14 };

pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
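In C++, the well-defined counterpart is a bit copy via std::memcpy (or std::bit_cast since C++20). A minimal sketch of the same sign-clearing trick (my example, under the question's assumption that float and the 32-bit integer type share size and IEEE layout):
#include <cstdint>
#include <cstring>

float abs_bits(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u); // copy the bits out: no aliasing violation
    u &= 0x7fffffffu;              // clear the sign bit
    std::memcpy(&x, &u, sizeof x); // copy the bits back
    return x;                      // compilers typically emit a single AND
}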
You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.
How the standard library function implements it is ... implementation-dependent. So you may find different implementations of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (from <cstdint>).
Just for my own knowledge: was std::abs previously inlined, or is the optimisation you observed mainly due to the suppression of the function call?
Some observations on how refactoring may improve performance (a sketch applying several of these points follows the list):
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
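Here's a hedged sketch of the hoisting points applied to the peak-mode path (reusing the question's names, illustrative only; assumes audioData is a vector of vectors of float):
// base index hoisted out of both loops; division deferred until the end
const int base = x * sizeFactor + offset;
float sum = 0;
for (int channel = 0; channel < (int)audioData.size(); channel++)
{
    const auto& samples = audioData[channel]; // drop one indexing level
    for (int z = 0; z < sizeFactor; z++)
        sum += samples[base + z];
}
hPoint = sum / sizeFactor; // one division instead of one per channel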
In VS2008, using the following to track the absolute value of hpoint, with hIsNeg remembering whether it is positive or negative, is about twice as fast as using fabs():
int hIsNeg = 0;
...
// Inside the loop, replacing
//     if (std::fabs(temp) > std::fabs(hPoint))
//         hPoint = temp;
if (temp < 0)
{
    if (-temp > hpoint)
    {
        hpoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if (temp > hpoint)
    {
        hpoint = temp;
        hIsNeg = 0;
    }
}
...
// After the loop
if (hIsNeg)
    hpoint = -hpoint;
I'd like to write a function with some optional code that is executed or not depending on user settings. The function is CPU-intensive, and having ifs in it would be slow, since the branch predictor is not that good.
My idea is to make a copy of the function in memory and replace the NOPs with a jump when I don't want to execute some code. My working example goes like this:
int Test()
{
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        x *= 2;      // Op to skip or not
        x *= 2;
    }
    return x;
}
In my test's main, I copy this function into newly allocated executable memory and replace the NOPs with a JMP 2 so that the following x *= 2 is not executed. JMP 2 really means "skip the next 2 bytes".
The problem is that I would have to change the JMP operand every time I edit the code to be skipped and its size changes.
An alternative that would fix this problem would be:
__asm {NOP}; // to skip it replace this
__asm {NOP}; // by JMP 2 (after the goto)
goto dont_do_it;
x *= 2; // Op to skip or not
dont_do_it:
x *= 2;
I would then want to skip (or not) the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x *= 2 are removed because they are unreachable at compile time.
Hence the need to keep that dead code.
I'm using VStudio 2008.
You can cut the cost of the branch by a factor of up to 10 just by moving it out of the loop:
int Test()
{
    int x = 2;
    if (should_skip) {
        for (int i = 0; i < 10; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i = 0; i < 10; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }
    return x;
}
In this case, and others like it, that might also provoke the compiler into doing a better job of optimising the loop body, since it will consider the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.
If this results in too much duplicated code to be maintainable, use a template that takes x by reference:
int x = 2;
if (should_skip) {
    doLoop<true>(x);
} else {
    doLoop<false>(x);
}
And check that the compiler inlines it.
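The template itself might look like this (a sketch under the answer's assumptions; the skip flag is a template parameter, so each instantiation compiles to straight-line code with no runtime test):
template <bool Skip>
void doLoop(int& x)
{
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        if (!Skip)   // compile-time constant: resolved per instantiation
            x *= 2;  // the optional operation
        x *= 2;
    }
}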
Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.
If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.
You do not specify which compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms the code section is not writable, so you won't be able to replace the NOPs with a JMP.
You are trying to pick and choose the optimizations offered to you by the compiler and second-guess it. In general, that's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler from eliminating it as dead code, or put the damn if statement in there and let the compiler do its thing.
I'm also dubious that branch prediction is bad enough here for you to gain any sort of net win from what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested starting point.
Here's an actual answer to the actual question!
volatile int y = 0;

int Test()
{
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        goto dont_do_it;
    keep_my_code:
        x *= 2; // Op to skip or not
    dont_do_it:
        x *= 2;
    }
    if (y) goto keep_my_code;
    return x;
}
Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.
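A minimal sketch of the function-pointer idea (my illustration with hypothetical names; the pointer is selected once from the user settings, so the hot loop contains an indirect call instead of a data-dependent branch):
static int op_full(int x) { return x * 2 * 2; } // optional step included
static int op_skip(int x) { return x * 2; }     // optional step skipped

int Test(bool should_skip)
{
    // choose the operation once, outside the hot path
    int (*op)(int) = should_skip ? op_skip : op_full;
    int x = 2;
    for (int i = 0; i < 10; i++)
        x = op(x * 2); // same arithmetic as the original loop body
    return x;
}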
This may give insight:
#pragma optimize for Visual Studio.
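For reference, a sketch of how that pragma is used in MSVC (my example; the goal here would be to keep the optimizer from deleting the code you intend to patch):
#pragma optimize("", off) // disable optimizations for the functions below
int Test()
{
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        x *= 2; // candidate for run-time patching: survives with optimizations off
        x *= 2;
    }
    return x;
}
#pragma optimize("", on)  // restore the command-line settings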
That said, for this particular problem I would hand-code it in ASM, using the VS asm output as a reference point.
At the meta level, I would have to be very certain this was the best design and algorithm for what I was doing before I started optimizing for the CPU pipeline.
If you get this to work, then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the CPU cache, or worse, the CPU pipeline. The CPU basically has to throw out all the work in the pipeline and start again.