C++ conversion from int to bool - c++

I want to know if the compiled code of an int-to-bool conversion contains a branch (jump) operation.
For example, given void func(bool b) and int i:
Is the compiled code of calling func(i) equivalent to the compiled code of func(i? 1:0)?
Or is there a more elaborate way for the compiler to perform this without the branch operation?
Update:
In other words, what code does the compiler generate in order to push 1 or 0 into the stack before jumping to the address of the function?
I assume that it really comes down to the architecture of the CPU at hand, and that some specific processors (certain DSPs, for example) may support this. So my question refers to "conventional" general-purpose CPUs (assuming that this definition is acceptable).
In terms of pure software, the question can also be phrased as: is there an efficient way for converting an integer value to 1 when it's not 0, and to 0 otherwise, without using a conditional statement?
Thanks

It's not your job, as the compiler's user, to make built-in type conversions efficient. Unless the compiler is dumb, it will map that sort of thing as closely as possible onto the CPU's own representation.
On most commercial CPUs, bool and int are exactly the same thing, and if(x) { ... } translates into bitwise-ANDing (or bitwise-ORing, whichever is faster; both are normally single-cycle instructions) x with itself, followed by a conditional jump to just after the } if the zero flag is set. (Note that this is just a trick to force the zero flag to be computed; the flag itself is an immediate consequence of the arithmetic unit's electronics.)
Variants are much more a matter of CPU electronics than of code, so don't worry about it. ifs are not triggered by a bool, but by the result of the last arithmetic operation.
Every arithmetic operation carried out by a CPU produces a result and sets some flags that describe attributes of that result: whether it is zero, whether it produced a carry or borrow, whether it has an odd or even number of bits set to 1, and so on. The result and the flags live in registers, and can be loaded from and stored to memory.
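To make the "pure software" phrasing of the question concrete, here is a minimal sketch of three equivalent ways to turn an int into 0 or 1 without writing a conditional statement. The helper names are hypothetical and not from the original post. On x86-64, optimizing compilers typically implement each of them with a test of the register followed by a flag-to-register instruction rather than a jump, but that is an expectation you should confirm in your own compiler's assembly output.

```cpp
#include <cassert>

// Hypothetical helpers: each maps 0 -> 0 and any non-zero value -> 1.
inline int to01_compare(int i) { return i != 0; }               // comparison yields a bool
inline int to01_double_not(int i) { return !!i; }               // logical NOT applied twice
inline int to01_cast(int i) { return static_cast<bool>(i); }    // explicit int-to-bool conversion

// Stand-in for the func(bool) from the question; it just reports what it received.
inline int func(bool b) { return b; }

// Sanity check that the three spellings agree for a given input.
inline void check_agree(int i) {
    assert(to01_compare(i) == to01_double_not(i));
    assert(to01_compare(i) == to01_cast(i));
}
```

Calling func(i) with an int argument performs the same conversion implicitly, so func(42) and func(42 != 0) receive the same value.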

Related

Is atomic.load() faster than reading the atomic normally?

Recently I have been using atomic numbers a lot in C++, as I use threading heavily and thread safety is important to me.
Well, I had a problem with the printf() function. Here is an example:
atomic_uint64_t count = {0};
printf("%lu",count);
// It gives a couple of errors like atomic(const atomic&) = delete; and use of deleted function atomic, so I had to write it like this to make it work:
printf("%lu",count.load());
// Or
printf("%lu",(uint64_t)count);
Anyway, I don't know which is better for performance, and I really care about speed.
So I started thinking about which is the better way to retrieve the value and use it in if conditions or anywhere else.
Like
if(count.load() < 8 ){
// Do smth
}
or
if(count < 8){
// Do smth
}
Which is better for speed and performance? Thanks.
They're exactly identical in their meaning (unless you pass a non-default memory order like count.load(std::memory_order_acquire)).
I'd expect there to be no difference in the generated assembly for all compilers across all ISAs, with optimization enabled of course. There isn't for GCC/clang/MSVC/ICC in code I've looked at on https://godbolt.org/. This is true regardless of surrounding code it's inlining into.
If there is ever a difference, and one is slower or takes more code-size, report that as a missed-optimization compiler bug in whatever compiler you're using. (Unless you had optimization disabled, then an extra level of calls to wrapper functions is possible.)
As for the error, that's because you're evaluating it in a context that doesn't already imply a type: as an operand for a variadic function (printf).
If there's enough context to imply that you want the underlying T value from an atomic<T> (which is what atomic_uint64_t is), then the operator T() overload is called, which is documented as being equivalent to .load(). Same deal for assignment and .store().
There aren't any other functions that let you access only the low 32 bits of an atomic 64-bit integer (unfortunately); even on a 32-bit machine, current compilers will actually go to the trouble of doing a 64-bit atomic load (which is efficient on some 32-bit machines, not on others), then discarding the high 32 if you cast the value to a narrower type. (This is a missed-optimization, but compilers truly don't optimize atomics for the moment.)
So there's no ambiguity being resolved by .load, or any way a cast can pick a different load.
One reason for the existence of .load() and .store() is that they take a std::memory_order parameter, which is defaulted to seq_cst but can be weaker if you just need atomicity but only acq/rel synchronization between threads. Or none at all with relaxed, just atomicity.
Another reason is to let you write foo.load() to remind readers of your code that this is an atomic variable, not just a plain primitive type. For that style reason I'd prefer count.load(). Presumably if you changed its type away from uint64_t, you'd want to change how you printed it, not still cast it to uint64_t. Using .load() will let the compiler warn you about the format-string mismatch if you change its type.
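The equivalence described above can be sketched as follows. The function names are hypothetical; the std::atomic members used (the conversion operator, load() and its std::memory_order parameter) are the standard ones. The PRIu64 macro from <cinttypes> is used in place of "%lu" as an assumption about what is most portable, since uint64_t is not unsigned long on every platform.

```cpp
#include <atomic>
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>

using Counter = std::atomic<std::uint64_t>;

// Implicit conversion: calls Counter::operator std::uint64_t(), documented as
// equivalent to load() with the default (seq_cst) memory order.
inline std::uint64_t read_implicit(const Counter& a) { return a; }

// Explicit load with the default memory order.
inline std::uint64_t read_load(const Counter& a) { return a.load(); }

// Explicit load that asks only for atomicity, with no ordering guarantees.
inline std::uint64_t read_relaxed(const Counter& a) {
    return a.load(std::memory_order_relaxed);
}

// printf is variadic, so it needs a plain value rather than the atomic object.
inline int print_count(const Counter& a) {
    return std::printf("%" PRIu64 "\n", a.load());
}

// Single-threaded sanity check: all three reads observe the same value.
inline void check_reads_agree(const Counter& a) {
    assert(read_implicit(a) == read_load(a));
    assert(read_load(a) == read_relaxed(a));
}
```

With a Counter named count, both if (count < 8) and if (count.load() < 8) go through the same seq_cst load; only the relaxed variant requests something different.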

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (possibly also interested in C++ solutions), which states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
Many compilers and pre-processors may be clever enough to recognise that the "+ 2 - 2" is redundant and optimise it away. Similarly they could recognise that the "+= 2*b" followed by the "+= c" can be written using a single FMA. Even if they don't optimise into an FMA, they may switch the order of these operations etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out of order execution, and decide it can do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
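A minimal sketch of that non-associativity, assuming IEEE-754 doubles and a compiler run without fast-math style flags. The function names and the value 1e30 are chosen purely for illustration.

```cpp
#include <cassert>

// Left-to-right grouping, which is what the language specifies for a + b + c.
inline double sum_left(double a, double b, double c) { return (a + b) + c; }

// The regrouped form that a reassociating optimisation would produce.
inline double sum_right(double a, double b, double c) { return a + (b + c); }

// Same three inputs, two groupings, two different answers.
inline void demo() {
    const double big = 1e30;
    assert(sum_left(big, -big, 1.0) == 1.0);   // (big - big) + 1 keeps the 1
    assert(sum_right(big, -big, 1.0) == 0.0);  // -big + 1 rounds back to -big, so the 1 is lost
}
```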
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproducibility, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which requires an exactly reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasible solutions (either from the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chunk of simple and largely vanilla C code as protected and untouchable by any (realistically most) optimisations, while allowing the rest of the code to be heavily optimised, covering optimisations from both the CPU and compiler.
This is not a complete answer, but it is informative, partially answers, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
t1 = 2*b;
t0 += t1;
a = t0 + c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.)
There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtraction, multiplication, division, and even square root, this does not happen if the excess precision is sufficiently greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format, and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.
So, whether this workaround works depends on the platform.
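To illustrate why a fused multiply-add matters here, the following sketch contrasts a product that is rounded to double before it is used with std::fma, which rounds only once. It assumes IEEE-754 doubles; the function names and input values are hypothetical and picked so the difference is visible. Whether a compiler fuses an ordinary a*b + c on its own depends on the target and on flags such as GCC's and Clang's -ffp-contract, which this sketch does not try to control.

```cpp
#include <cassert>
#include <cmath>

// One operation per statement: the product is rounded to double when assigned.
inline double rounded_product(double x, double y) {
    double p = x * y;
    return p;
}

// Single rounding of the exact value x*y + z, which is what an FMA computes.
inline double fused(double x, double y, double z) { return std::fma(x, y, z); }

// Inputs chosen so that rounding the product first changes the final answer.
inline void demo() {
    const double x = 1.0 + std::ldexp(1.0, -30);     // 1 + 2^-30, exactly representable
    const double z = -(1.0 + std::ldexp(1.0, -29));  // -(1 + 2^-29), exactly representable
    // The exact square is 1 + 2^-29 + 2^-60; rounding to double drops the 2^-60 term.
    assert(rounded_product(x, x) == -z);
    // The fused form keeps that term until its single final rounding.
    assert(fused(x, x, z) == std::ldexp(1.0, -60));
}
```

Adding z to the rounded product in a separate statement therefore gives 0 on an implementation that does not contract, whereas the fused form gives 2^-60.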
Other Issues
Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes) and variations between math library routines. (sin, log, and similar functions return different results on different platforms. Nobody has fully implemented correctly rounded math library routines with known bounded performance.)
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One of the objects has a name (a) and one of them does not, but they are like roses: They smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression [1], so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different from the one a non-volatile object has after assignment, so nothing is accomplished.
Footnote
[1] I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.
Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.
"clever enough to recognise the + 2 - 2 is redundant and optimise this away"
No! All decent compilers will apply constant propagation, figure out that a is constant, and optimize all your statements away into something equivalent to a = 1;. Here is the example with assembly.
Now if you make a volatile, the compiler has to assume that any change of a could have an impact outside the C++ programme. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here is the example with assembly.
If you don't want constant propagation to happen, you need to deactivate optimizations. In this case, the best approach would be to keep your code separate so that you can compile the rest with all optimizations on.
However this is not ideal. The optimizer could outperform you, and with this approach you'll lose global optimisation across the function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

DSP performance, what should be avoided?

I am starting with dsp programming right now and am writing my first low level classes and functions.
Since I want the functions to be fast (or at least not inefficient), I often wonder what I should use and what I should avoid in functions which get called per sample.
I know that the speed of an instruction varies quite a bit but I think that some of you at least can share a rule of thumb or just experience. :)
conditional statements
If I have to use conditions, switch should be faster than an if / else if block, right?
Are there differences between using two if-statements or an if-else? Somewhere I read that else should be avoided but I don't know why.
Also, compared to a multiplication, is there a rough estimate of how much more time an if-block takes? Because in some cases, multiplying by zero could be used instead of an if-statement:
//something could be an int either 1 or 0:
if(something) {
signal += something_else;
}
// or:
signal += something*something_else;
functions and function-pointers
Instead of using conditional statements, you could use function-pointer. Instead of using conditions in every call, the pointer could be redirected to a specific function. However, for every call, the pointer had to be interpreted in order to call the right function. So I don't know if this would help or not.
What I also wonder is if calling functions have an impact. If so, boxing functions should be avoided, right?
variables
I would think that defining and using many variables in a function doesn't really have an impact, at least relative to calculations. Is this true? If not, reusing declared variables would be better than declaring more.
calculations
Is there an order of calculation-types in terms of the time they take to execute? I am sure that this highly depends on the context but a rule of thumb would be nice. I often read that people only count the multiplications in an algorithm. Is this because additions are relatively fast?
Does it make a difference between multiplication and division? (*0.5 or /2.0)
I hope that you can share some experience.
Cheers
Here are some partial answers:
Calculations (assuming the native precision of the processor, for example 32 bits):
Most DSP microprocessors have single-cycle multipliers, which means a multiply costs exactly the same as an addition in terms of cycles.
Multiplication is generally faster than division.
Conditional statements:
if/else: when looking at the assembly code you can see that the code for the if branch is usually the one fetched by default, so when using if/else make sure that the case that happens more frequently is in the if.
But generally, if possible, you should avoid if/else in a loop to improve pipelining.
Good luck.
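The two per-sample variants from the question can be sketched side by side as below. The function and parameter names are hypothetical. The multiplication form is only equivalent when every gate value is exactly 0 or 1 and the samples are finite, and which form is faster depends on the DSP and compiler, so this is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>

// Conditional form: add the sample only when its gate is non-zero.
inline float mix_branch(const float* in, const int* gate, std::size_t n) {
    float signal = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (gate[i]) {
            signal += in[i];
        }
    }
    return signal;
}

// Multiplication form: no branch in the loop body, one extra multiply per sample.
inline float mix_multiply(const float* in, const int* gate, std::size_t n) {
    float signal = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        assert(gate[i] == 0 || gate[i] == 1);  // precondition of this variant
        signal += static_cast<float>(gate[i]) * in[i];
    }
    return signal;
}
```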
DSP compilers are typically good at optimizing for loops that do not contain function-calls.
Therefore, try to inline every function that you call from within a time-critical for loop.
If your DSP is a fixed-point processor, then floating-point operations are implemented in software.
This means that every such operation is essentially replaced by the compiler with a library function.
So you should basically avoid performing floating-point operations inside time-critical for loops.
The compiler should provide a special #pragma for describing the number of iterations of a for loop:
Minimum number of iterations
Maximum number of iterations
Multiplicity of the number of iterations
Use this #pragma wherever you can, in order to help the compiler perform loop unrolling.
Finally, DSPs usually support a set of unique operations for enhanced performance.
As an example, consider _dotpu4 on Texas Instruments C64xx, which computes the scalar-product of two integers src1 and src2: For each pair of 8-bit values in src1 and src2, the 8-bit value from src1 is multiplied with the 8-bit value from src2, and the four products are summed together.
Check the data-sheet of your DSP, and see if you can make use of any of these operations.
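For readers without the hardware at hand, the operation described above can be emulated in portable C++ as follows. This is not the TI intrinsic itself, only a sketch of the arithmetic as the text describes it; the name dot4_u8 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Portable emulation: treat each 32-bit word as four unsigned 8-bit lanes,
// multiply the lanes pairwise, and sum the four products.
inline std::uint32_t dot4_u8(std::uint32_t src1, std::uint32_t src2) {
    std::uint32_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const std::uint32_t a = (src1 >> (8 * lane)) & 0xFFu;
        const std::uint32_t b = (src2 >> (8 * lane)) & 0xFFu;
        sum += a * b;
    }
    assert(sum <= 4u * 255u * 255u);  // the largest possible result fits easily in 32 bits
    return sum;
}
```

For example, dot4_u8(0x01020304, 0x01010101) adds 1 + 2 + 3 + 4. On a DSP with a matching instruction the whole loop is a single operation, which is the point of using the intrinsic.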
The compiler should generate an intermediate file, which you can explore in order to analyze the expected performance of each of the optimized for loops in your code.
Based on that, you can try different assembly operations that might yield better results.

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original Calculation    Replacement Calculation
y = x / 8               y = x >> 3
y = x * 64              y = x << 6
y = x * 2               y = x << 1
y = x * 15              y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shift operations, the compiler is generally taught to perform even smarter transformations than you would probably think of on your own, and which are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
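A quick way to convince yourself of these identities is to write both forms and compare them. The sketch below does so for unsigned 32-bit values, where wraparound makes the identities hold for every input; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Readable forms: what you would normally write.
inline std::uint32_t times15_mul(std::uint32_t x) { return x * 15u; }
inline std::uint32_t div8_div(std::uint32_t x) { return x / 8u; }

// Strength-reduced forms: what an optimizer may emit instead.
inline std::uint32_t times15_shift(std::uint32_t x) { return (x << 4) - x; }  // 16x - x
inline std::uint32_t div8_shift(std::uint32_t x) { return x >> 3; }

// Both spellings agree for any unsigned input, including ones that wrap around.
inline void check_equivalent(std::uint32_t x) {
    assert(times15_mul(x) == times15_shift(x));
    assert(div8_div(x) == div8_shift(x));
}
```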
This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take fewer, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines, having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 instructions specifically - and when the divisor is a runtime-variable value so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs, the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations with additional logic to handle the irregularity. This may depend on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.
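The signed irregularity is easy to see with a negative odd number. The sketch below uses hypothetical function names and assumes two's complement integers with an arithmetic right shift on negative values; C++20 guarantees both, earlier standards leave the shift implementation-defined, although mainstream compilers behave this way.

```cpp
#include <cassert>

inline int rem2(int i) { return i % 2; }         // sign follows the dividend: -1, 0 or 1
inline int low_bit(int i) { return i & 1; }      // always 0 or 1
inline int half_div(int i) { return i / 2; }     // rounds toward zero
inline int half_shift(int i) { return i >> 1; }  // rounds toward negative infinity

// As a truth value in an if(), both odd-number tests still agree.
inline bool odd_by_rem(int i) { return (i % 2) != 0; }
inline bool odd_by_bit(int i) { return (i & 1) != 0; }

inline void check_same_truth_value(int i) { assert(odd_by_rem(i) == odd_by_bit(i)); }
```

So if (i % 2) and if (i & 1) take the same branch for every i, but i % 2 == 1 misses negative odd values, and replacing i / 2 with i >> 1 changes the result for negative odd inputs.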
C operators cannot be meaningfully compared in terms of "performance". There's no such thing as "faster" or "slower" operators at language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.
Here is the compiler (GCC 4.6) generated optimized -O3 code for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So that is 4 extra instructions, which is nothing for a production environment. This would be a premature optimization and would just introduce complexity.
Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program calls modulus(135, 16), nowhere in the compiled code will there be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.
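A slightly more defensive sketch of the same idea is shown below. The names modulus_pow2 and is_power_of_two are hypothetical, and the assert is an addition of this sketch to document the precondition that the compiler cannot know about.

```cpp
#include <cassert>

// True when b has exactly one bit set.
inline bool is_power_of_two(unsigned int b) { return b != 0 && (b & (b - 1)) == 0; }

// Generic remainder: the compiler must assume b can be anything.
inline unsigned int modulus_generic(unsigned int a, unsigned int b) { return a % b; }

// Mask-based remainder: only correct when b is a power of two.
inline unsigned int modulus_pow2(unsigned int a, unsigned int b) {
    assert(is_power_of_two(b));
    return a & (b - 1);
}
```

For the call in the text, 135 = 8 * 16 + 7, so both modulus_generic(135, 16) and modulus_pow2(135, 16) return 7.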
Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cpuMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to consider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instructions. This means I have to choose an operation that wouldn't be performed only once due to compiler optimization (we can't use the -O0 flag, as the school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task if optimizations are (or might be) enabled. You can never be sure what the compiler will do during optimization. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes in the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.
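As a rough sketch of the "reuse the previous result" idea, the loop below makes every iteration depend on the one before it and returns the final value so the work cannot be discarded. All names are hypothetical, and the conversion helper assumes the formula from the question with the time in microseconds and the clock rate in MHz. Each iteration performs an add and an xor plus loop overhead, so it does not isolate a single instruction, and an optimizer is still free to transform the loop or move it relative to the clock reads; checking the generated assembly remains necessary.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// (totalTime / iterations) * cpuMHz, with totalTime given in microseconds.
inline double cycles_per_op(double total_us, std::uint64_t iterations, double cpu_mhz) {
    assert(iterations > 0);
    return (total_us / static_cast<double>(iterations)) * cpu_mhz;
}

struct Measurement {
    std::uint64_t result;  // returned so the loop is not dead code
    double total_us;       // elapsed time in microseconds
};

// Dependency chain: each step needs the previous value of acc.
inline Measurement time_chain(std::uint64_t iterations, std::uint64_t step) {
    std::uint64_t acc = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc = (acc + step) ^ i;
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return {acc, elapsed.count()};
}
```

A caller would run time_chain with a large iteration count, print the result field, and pass total_us, the iteration count and the CPU frequency to cycles_per_op.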