Speed of bit operations with bit operators

Speed of bit operations with bit operators - c++

Suppose I have
x &(num-1)
where x is an unsigned long long and num a regular int and & is the bitwise and operator.
I'm getting a significant speed reduction as the value of num increases. Is that normal behavior?
These are the other parts of the code that are affected
int* hash = new int[num]

I don't think that the bitwise operation is slowing down, I think you're using it a lot more times. And probably it isn't even the bitwise operation that's taking too long, but whatever else you're also doing more times.
Use a profiler.

If you're executing the code in a tight loop, it's wholly possibly that you'll see the performance lessen the higher num gets, I'm guessing that your C++ compiler isn't able to find a native instruction to perform the & with an unsigned long long - as you've stated your getting a slowdown for each power of two then I'd expect that the code that results from the & is repeatedly "dividing num" by 2 until it's zero and performing the and bit-by-bit.
Another possibility is that the CPU you're running on is lame and doesn't perform AND in a fixed number of cycles.

Problem solved. It had to do with the CPU cache and locality.

Related

How profitable is it to use powers of two in the calculations?

The question is whether it is possible to achieve a noticeable increase in productivity by using powers of two in multiplications and divisions, since the compiler could convert them to a shift (or it could be explicitly using a shift for this). I have a lot of multiplications by one number in my task (a coefficient that I myself entered), but I can use for example 512 instead of 500.
for(i=0;i<X;i++)
{
cout<<i*512 // or i*500
}
or i need do it same:
for(i=0;i<X;i++)
{
cout<<i>>9;
}
and an additional question - does it make sense to introduce a variable for the condition so that the compiler does not repeatedly read the condition again or does it do it automatically?
For example:
for(int i=0;i<10*K*H;i++)
{
// K and H cant change in this loop
}
I was trying to check it in Compulier Explorer, but it create less lines of code when i divide and no create same code when i multiply

About the limit in the for loop, you may want to give the compiler some assistance.
Compute the limit before the loop:
const int limit = 10 * K * H;
for (i = 0; i < limit; ++i)
{
}
This can help when compiling with no optimizations (e.g. debug mode). Your compiler may perform better optimizations when you increase the optimization level.
I recommend printing the assembly language for your for loop and comparing with the assembly language for the above code. The truth is in the assembly language.
Edit 1: shifting vs. multiplication
In most processors, bit shifting is often faster than multiplication. In modern processors, the savings is in the order of nanoseconds, or possibly microseconds.
Many compilers will convert a multiplication into a bit shift, depending on the optimization level and the context.
In your example, you will probably not notice the optimization gain, because the gain will be wasted in the call to cout. I/O consumes more time than the time gained by micro-optimizations.
Profiling your code will give you the best data for making these kinds of decisions. Also read about benchmarking to collect better data. For example, you may have to run your loop for 1E6 or more iterations to rule out outliers such as interrupts and task swaps.

Does accessing arrays slow down my function?

I have 2 functions:
unsigned long long getLineAsRow(unsigned long long board, int col) {
unsigned long long column = (board >> (7-(col - 1))) & col_mask_right;
column *= magicLineToRow;
return (column >> 56) & row_mask_bottom;
}
unsigned long long getDiagBLTR_asRow(unsigned long long board, int line, int row) {
unsigned long long result = board & diagBottomLeftToTopRightPatterns[line][row];
result = result << diagBLTR_shiftUp[line][row];
result = (result * col_mask_right) >> 56;
return result;
}
The only big difference I see is the access to a 2-dim-array. Defined like
int diagBRTL_shiftUp[9][9] = {};
I call both functions 10.000.000 times:
getLineAsRow ... time used: 1.14237s
getDiagBLTR_asRow ... time used: 2.18997s
I tested it with cl (vc++) and g++. Nearly no difference.
It is a really huge difference, do you have any advice?

The question what creates the difference between the execution times of your two functions really cannot be answered without knowing either the resulting assembler code or which of the globals you are accessing are actually constants that can be compiled right into the code. Anyway, analyzing your functions, we see that
function 1
reads two arguments from stack, returns a single value
reads three globals, which may or may not be constants
performs six arithmetic operations (the two minuses in 7-(col-1) can be collapsed into a single subtraction)
function 2
reads three arguments from stack, returns a single value
reads one global, which may or may not be a constant
dereferences two pointers (not four, see below)
does five arithmetic operations (three which you see, two which produce the array indices)
Note that accesses to 2D arrays actually boil down to a single memory access. When you write diagBottomLeftToTopRightPatterns[line][row], your compiler transforms it to something like diagBottomLeftToTopRightPatterns[line*9 + row]. That's two extra arithmetic instructions, but only a single memory access. What's more, the result of the calculation line*9 + row can be recycled for the second 2D array access.
Arithmetic operations are fast (on the order of a single CPU cycle), reads from memory may take four to twenty CPU cycles. So I guess that the three globals you access in function 1 are all constants which your compiler built right into the assembler code. This leaves function 2 with more memory accesses, making it slower.
However, one thing bothers me: If I assume you have a normal CPU with at least 2 GHz clock frequency, your times suggest that your functions consume more than 200 or 400 cycles, respectively. This is significantly more than expected. Even if your CPU has no values in cache, your functions shouldn't take more than roughly 100 cycles. So I would suggest to take a second look at how you are timing your code, I assume that you have some more code in your measuring loop which spoils your results.

Those functions do completely different things, but I assume that's no relevant to the question.
Sometimes these tests don't show the real cost of a function.
In this case the main cost is the access of the array in the memory. After the first access it will be in the cache and after that your function is going to be fast. So you don't really measure this characteristic. Even though in the test there are 10.000.000 iterations, you pay the price only once.
Now if you execute this function in a batch, calling it many times in a bulk, then it's non-issue. The cache will be warm.
If you access it sporadically, in an application which has high memory demands and frequently flushes the CPU cashes, it could be performance problem. But that of course depends on the context: how often it's called, etc..

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}

Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.

TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original Calculation Replacement Calculation
y = x / 8 y = x >> 3
y = x * 64 y = x << 6
y = x * 2 y = x << 1
y = x * 15 y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shifts operations, the compiler is generally taught to perform even smarter transformations that you would probably think about on your own and who are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).

This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take less, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 instructions specifically - and when the divisor is a runtime-variable value so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs, the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.

By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations with additional logic to handle the irregularity. This may depends on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.

C operators cannot be meaningfully compared in therms of "performance". There's no such thing as "faster" or "slower" operators at language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.

Here is the compiler (GCC 4.6) generated optimized -O3 code for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions...which are nothing for a prod environment. This would be a premature optimization and just introduce complexity

Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program launches modulus(135, 16), nowhere in the compiled code there will be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.

Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

Performance of comparisons in C++ ( foo >= 0 vs. foo != 0 )

I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;
for( int i = len; i >= 0; --i ){
while( counter > 0 && costly other stuff here ){
/* do stuff */
--counter;
}
/* do more stuff */
}
So here I have a loop that runs very often and for a certain number of runs the while block will be executed as well until the variable counter is reduced to zero and then the while loop will not be called because the first expression will be false.
The question is now, if there is a difference in performance between using
counter > 0 and counter != 0?
I suspect there would be, does anyone know specifics about this.

To measure is to know.

Do you think that what will solve your problem! :D
if(x >= 0)
00CA1011 cmp dword ptr [esp],0
00CA1015 jl main+2Ch (0CA102Ch) <----
...
if(x != 0)
00CA1026 cmp dword ptr [esp],0
00CA102A je main+3Bh (0CA103Bh) <----

In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easy to understand way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.

I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.

There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.

Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU are those from Intel we have in our PC's. Both comparisons will map to a single instruction on that CPU and I assume they will execute at the same speed. However, to be certain you will have to perform your own benchmark.

As Jim said, when in doubt see for yourself :
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
using namespace boost::posix_time;
using namespace std;
void main()
{
ptime Before = microsec_clock::universal_time(); // UTC NOW
// do stuff here
ptime After = microsec_clock::universal_time(); // UTC NOW
time_duration delta_t = After - Before; // How much time has passed?
cout << delta_t.total_seconds() << endl; // how much seconds total?
cout << delta_t.fractional_seconds() << endl; // how much microseconds total?
}
Here's a pretty nifty way of measuring time. Hope that helps.

OK, you can measure this, sure. However, these sorts of comparisons are so fast that you are probably going to see more variation based on processor swapping and scheduling then on this single line of code.
This smells of unnecessary, and premature, optimization. Right your program, optimize what you see. If you need more, profile, and then go from there.

I would add that the overwhelming performance aspects of this code on modern cpus will be dominated not by the comparison instruction but whether the comparison is well predicted since any mis-predict will waste many more cycles than any integral operation.
As such loop unrolling will probably be the biggest winner but measure, measure, measure.

Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.

In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.

There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.

Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs jne. jns will check the sign flag and jne will check the zero flag. Both operations, should as far as I know, be equally costly.

Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you could find out that one it more optimal than another (under the conditions you measured it). But your program will still run like a dog because you spent all your time optimising the wrong part of your code.
The best solution is to use what many large software companies do - blame the hardware for not runnnig fast enough and encourage your customer to upgrade their equipment (which is clearly inferior since your product doesn't run fast enough).
< /rant>

I stumbled across this question just now, 3 years after it is asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see clearly stated that answering your question requires to know two and only two things:
which processor you target
which compiler you work with
To the first point, each processor has different instructions for tests. On one given processor, two similar comparisons may turn up to take a different number of cycles. For example, you may have a 1-cycle instruction to do a gt (>), eq (==), or a le (<=), but no 1-cycle instruction for other comparisons like a ge (>=). Following a test, you may decide to execute conditional instructions, or, more often, as in your code example, take a jump. There again, cycles spent in jumps take a variable number of cycles on most high-end processors, depending whether the conditional jump is taken or not taken, predicted or not predicted. When you write code in assembly and your code is time critical, you can actually take quite a bit of time to figure out how to best arrange your code to minimize overall the cycle count and may end up in a solution that may have to be optimized based on the number of time a given comparison returns a true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code to take into account the instructions available and their latencies. Their job is harder because some assumptions an assembly code would know like "counter is small" is hard (not impossible) to know. For trivial cases like a loop counter, most modern compilers can at least recognize the counter will always be positive and that a != will be the same as a > and thus generate the best code accordingly. But that, as many mentioned in the posts, you will only know if you either run measurements, or inspect your assembly code and convince yourself this is the best you could do in assembly. And when you upgrade to a new compiler, you may then get a different answer.

A good way to do a fast divide in C++?

Sometimes I see and have used the following variation for a fast divide in C++ with floating point numbers.
// orig loop
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
// alternative
double y = 44100;
double y_div = 1.0 / y;
for(int i=0; i<10000; ++i) {
double z = x * y_div;
}
But someone hinted recently that this might not be the most accurate way.
Any thoughts?

On just about every CPU, a floating point divide is several times as expensive as a floating point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that there is a possibility that you will lose a very small portion of accuracy on certain processors - eg, on modern x86 processors, 64-bit float operations are actually internally computed using 80 bits when using the default FPU mode, and storing it off in a variable will cause those extra precision bits to be truncated according to your FPU rounding mode (which defaults to nearest). This only really matters if you are concatenating many float operations and have to worry about the error accumulation.

Wikipedia agrees that this can be faster. The linked article also contains several other fast division algorithms that might be of interest.
I would guess that any industrial-strength modern compiler will make that optimization for you if it is going to profit you at all.

Your original
// original loop:
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
can easily be optimized to
// haha:
double y = 44100.0;
double z = x / y;
and the performance is pretty nice. ;-)
EDIT: People keep voting this down, so here's the not so funny version:
If there were a general way to make division faster for all cases, don't you think compiler writers might have happened upon it by now? Of course they would have done. Also, some of the people doing FPU circuits aren't exactly stupid, either.
So the only way you're going to get better performance is to know what specific situation you have at hand and doing optimal code for that. Most likely this is a complete waste of your time, because your program is slow for some other reason such as performing math on loop invariants. Go find a better algorithm instead.

In your example using gcc the division with the options -O3 -ffast-math yields the same code as the multiplication without -ffast-math. (In a testing environment with enough volatiles around that the loop is still there.)
So if you really want to optimise those divisions away and don’t care about the consequences, that’s the way to go. Multiplication seems to be roughly 15 times faster, btw.

multiplication is faster than division so the second method is faster. It might be slightly less accurate but unless you are doing hard core numerics the level of accuracy should be more than enough.

When processing audio, I prefer to use fixed point math instead. I suppose this depends on the level of precision you need. But, let's assume that 16.16 fixed point integers will do (meaning high 16 bits is whole number, low 16 is the fraction). Now, all calculation can be done as simple integer math:
unsigned int y = 44100 << 16;
unsigned int z = x / (y >> 16); // divisor must be the whole number portion
Or with macros to help:
#define FP_INT(x) (x << 16)
#define FP_MUL(x, y) (x * (y >> 16))
#define FP_DIV(x, y) (x / (y >> 16))
unsigned int y = FP_INT(44100);
unsigned int z = FP_MUL(x, y);

Audio, hunh? It's not just 44,100 divisions per second when you have, say, five tracks of audio running at once. Even a simple fader consumes cycles, after all. And that's just for a fairly bare-bones, minimal example -- what if you want to have, say, an eq and a compressor? Maybe a little reverb? Your total math budget, so to speak, gets eaten up quickly. It does make sense to wring out a little extra performance in those cases.
Profilers are good. Profilers are your friend. Profilers deserve blowjobs and pudding. But you already know where the main bottle neck is in audio work -- it's in the loop that processes samples, and the faster you can make that, the happier your users will be. Use everything you can! Multiply by reciprocals, shift bits whenever possible (exp(x*y) = exp (x)*exp(y), after all), use lookup tables, refer to variables by reference instead of values (less pushing/popping on the stack), refactor terms, and so forth. (If you're good, you'll laugh at these elementary optimizations.)

I presume from the original post that x is not a constant shown there but probably data from an array so x[i] is likely to be the source of the data and similarly for the output, it will be stored somewhere in memory.
I suggest that if the loop count really is 10,000 as in the original post that it will make little difference which you use as the whole loop won't even take a fraction of millisecond anyway on a modern cpu. If the loop count really is very much higher, perhaps 1,000,000 or more, then I would expect that the cost of memory access would likely make the faster operation completely irrelevent anyway as it will always be waiting for the data anyway.
I suggest trying both with your code and testing if it actually makes any significant difference in run time and if it doesn't then just write the straightforward division if that's what the algorithm needs.

here's the problem with doing it with a reciprocal, you still have to do the division before you can actually divide by Y. unless your only dividing by Y then i suppose this may be useful. this is not very practical since division is done in binary with similar algorithms.

I are looping 10,000 times simply to make the code take long enough to measure the time easily? Or do you have 10000 numbers to divide by the same number? If the former, put the "y_div = 1.0 / y;" inside the loop, because it's part of the operation.
If the latter, yes, floating point multiplication is generally faster than division. Don't change your code from the obvious to the arcane based on guesses, though. Benchmark first to find slow spots, and then optimize those (and take measurements before and after to make sure your idea actually causes an improvement)

On old CPUs like the 80286, floating point maths was abysmally slow and we employed lots of trickiness to speed things up.
On modern CPUs floating point maths is blindingly fast and optimising compilers can generally do wonders with fine-tuning things.
It is almost never worth the effort to employ little micro-optimisations like that.
Try to make your code simple and idiot-proof. Only of you find a real bottleneck (using a profiler) would you think of optimisations in your floating point calculations.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js