Operator chaining, why grow left branch? - c++

According to this question (that is, the OP stated the belief and was not corrected), chained operators grow to the left:
a << b << c << d;
// ==
operator<<( operator<<( operator<<(a, b), c ), d);
Why is this the case? Would it not be more efficient to do:
operator<<( operator<<(a, b), operator<<(c, d) );
i.e., to balance as much as possible?
Surely the compiler could figure this out for better runtime performance?

When you overload an operator in C++, it retains the same precedence and associativity that operator has when it's not overloaded.
In the case of shift operators (<< and >>) the standard requires that:
The shift operators << and >> group left-to-right.
That means an operation like a<<b<<c must be parsed as (a<<b)<<c.
Overloading the operator changes the code that's invoked for each of those operations, but has no effect on grouping. The grouping will be the same whether a, b, and c are of built-in types (e.g., int), for which the compiler-provided operator is used, or of some class type, for which an overloaded operator is used. The parser that handles the grouping is identical either way.
Note, however, that order of evaluation is independent of precedence or associativity, so this does not necessarily affect execution speed.
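To make the grouping concrete, here is a minimal sketch (the Logger type and its member are invented for illustration, not taken from the question) showing that a chained << on a class type is still parsed left-to-right; the chain only works because each call returns a reference to the object:
#include <iostream>

struct Logger {
    // Hypothetical overload for illustration; returning *this is what makes chaining possible.
    Logger& operator<<(int v) { std::cout << v << ' '; return *this; }
};

int main() {
    Logger log;
    // Parsed as ((log << 1) << 2) << 3, i.e. log.operator<<(1).operator<<(2).operator<<(3),
    // exactly the same grouping as with the built-in shift operators.
    log << 1 << 2 << 3;
}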

Figuring this out for "better runtime performance" is better served by turning operations into CISC. (MULADDs, for example.)
The synchronization cost between two processors would far outweigh the benefits for any reasonable number of operators that were crammed into one line.
This would render out-of-order processing impossible effectively making each core operate at the speed of a Pentium Pro.
Aside from the std::cout example mentioned by Mystical, think of the following integer arithmetic problem: 3 * 5 / 2 * 4 should result in (15 / 2) * 4, meaning 28, but if split into (3 * 5) / (2 * 4) it would result in 15 / 8, meaning 1. This is true even though the operators have the same precedence.
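A quick way to see the difference between the two groupings (a tiny check using the numbers from this answer):
#include <iostream>

int main() {
    std::cout << ((3 * 5 / 2) * 4) << '\n';   // left-to-right grouping: (15 / 2) * 4 = 7 * 4 = 28
    std::cout << ((3 * 5) / (2 * 4)) << '\n'; // "balanced" grouping: 15 / 8 = 1
}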
EDIT:
It's easy to think that we could get away with splitting commutative operations across cores, for example 3 * 5 * 2 * 4.
Now let's consider that the only way to communicate between cores is through shared memory, and that the average load time is an order of magnitude slower than the average CPU operation time. That alone means that a tremendous number of mathematical operations would need to be done before accommodating even one load would make sense. But it gets worse:
The master core must write the data and operations to be performed to memory. Because these must be separate there could be a page fault on write.
After the master core has finished its work it must load a dirty bit to check whether the slave core has finished processing. It could be busy or have been preempted, so this could require multiple loads.
The result from the slave core must then be loaded.
You can see how bad that would get given the expense of loads. Let's look at optimizations that can be done if everything is kept on one core:
If some values are known ahead of time they can be factored into powers of 2, and shifts can be done instead. Even better, a shift and an add can potentially perform more than one operation at a time (see the sketch after this list).
For 32-bit or smaller numbers the upper and lower bits of a 64-bit processor may be used for addition.
Out-of-order processing may be done at the CPU's discretion. For example, floating-point operations are slower than integer operations, so in the situation 1.2 * 3 * 4 * 5 the CPU will do the 3 * 4 * 5 first and perform only one floating-point operation.
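As a small sketch of the shift-and-add idea (the multiplier 12 is just an example of mine, not from the answer):
unsigned times12(unsigned x) {
    // 12 == 8 + 4, so x * 12 can be rewritten as two shifts and an add,
    // which is the kind of rewrite a compiler does when the factor is known ahead of time.
    return (x << 3) + (x << 2);
}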

Related

One Operation in C++

I have been recently reading about space complexities of algorithms.
And I was wondering how C++ defines "one operation".
So, is 2 + 3 = 5 treated as one operation, or is it:
2 + 3 = (010 + 011)₂ = (101)₂ = 5
thus leading to 3 operations?
This question arises from curiosity about bit shifting, since it is more basic than the addition of multiple bits.
Having read that the complexity of bit-shifting depends on the language used, I wanted to ask how C++ defines "one operation".
Different compilers, or different compiler options/optimisations, can affect how such statements are handled, so there is no consistent definition of 'one operation' in C++. If you want to know how a particular piece of code is executed, you can put it into Compiler Explorer (godbolt), set the compiler and its settings, and look at the assembly output.
That said, the 'number of operations' in this sense is not what algorithmic complexity considers, for either time or space. Both are defined with respect to the input of a function: space complexity represents the amount of memory taken to perform the operation, and time complexity represents how long the function takes to execute. 3 + 5 would be O(1) in space (and time, but that's not what you were asking about) because the amount of memory taken to perform it is constant (usually 2 or 3 registers: 2 inputs and an output, which depending on other considerations may be in the same or a different register).
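For example, a snippet like the following, pasted into Compiler Explorer (the snippet itself is just an illustration of mine, not from the question), shows what "one operation" tends to become in practice: with optimisation on, 2 + 3 is usually folded to the constant 5 at compile time, while adding two runtime values is usually a single add instruction.
int sum_constants() { return 2 + 3; }      // typically compiled to "return 5": zero runtime operations
int sum(int a, int b) { return a + b; }    // typically a single add instruction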

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (though I'm possibly also interested in C++ solutions). The language states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
Many compilers and pre-processors may be clever enough to recognise that the "+ 2 - 2" is redundant and optimise it away. Similarly, they could recognise that the "+= 2*b" followed by the "+= c" can be written as a single FMA. Even if they don't optimise into an FMA, they may switch the order of these operations, etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out-of-order execution and do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproducibility, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which requires an exactly reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasible solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chunk of simple and largely vanilla C code as protected and untouchable by any (realistically most) optimisations, while allowing the rest of the code to be heavily optimised, covering optimisations from both the CPU and the compiler.
This is not a complete answer, but it is informative, partially answers the question, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
a = t0 + 4;
t1 = 2*b;
a += t1;
a += c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.)
There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtraction, multiplication, division, and even square root, this does not happen if the excess precision is sufficiently greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gives the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.
So, whether this workaround works depends on the platform.
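As a sketch of how the "pragma" idea from the question can be combined with the one-operation-per-statement workaround (my own assumption, not part of the answer above): the standard C pragma below asks the compiler not to contract a multiply and an add into an FMA; Clang also honours it in C++, while with GCC the command-line option -ffp-contract=off is the more reliable switch.
#pragma STDC FP_CONTRACT OFF    /* ask the compiler not to fuse a multiply and an add into an FMA */

double accumulate(double a, double b, double c)
{
    double t1 = 2 * b;    /* one operation per assignment: excess precision is discarded here */
    double t2 = a + t1;   /* ...and here, so the multiply and this add cannot be fused */
    return t2 + c;
}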
Other Issues
Aside from the use of excess precision, features like FMA, and the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes) and variations between math library routines. (sin, log, and similar functions return different results on different platforms; nobody has fully implemented correctly rounded math library routines with known bounded performance.)
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One of the objects has a name (a) and one of them does not, but they are like roses: they smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression1, so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different than the non-volatile object has after assignment, so nothing is accomplished.
Footnote
1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.
Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.
"clever enough to recognise the + 2 - 2 is redundant and optimise this
away"
No ! All decent compilers will apply constant propagation and figure out that a is constant and optimize all your statement away, into something equivalent to a = 1;. Here the example with assembly.
Now if you make a volatile, the compiler has to assume that any change to a could have an impact outside the C++ programme. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here is the example with assembly.
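A minimal sketch of that volatile variant (using the same values as the question's example; the compiler is still free to pre-compute each right-hand side, but each store to a must actually be emitted, in order):
void compute() {
    volatile double a;
    double b = -2, c = -3;
    a = 1 + 2 - 2 + 3 + 4;   // the right-hand side is still folded to a constant (8) at compile time
    a += 2 * b;              // but this read-modify-write of the volatile a must really be emitted...
    a += c;                  // ...and so must this one, after the previous one
}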
If you don't want constant propagation to happen, you need to deactivate optimizations. In this case, the best approach would be to keep this code separate, so that the rest can be compiled with all optimisations on.
However this is not ideal. The optimizer could outperform you, and with this approach you'll lose global optimisation across the function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

Can an expression containing post increment execute in parallel with other parts of that expression in C++?

I came up with this question from following answer:
Efficiency of postincrement v.s. preincrement in C++
There I've found this expression:
a = b++ * 2;
They said that, in the above, b++ can run in parallel with the multiplication.
How can b++ run in parallel with the multiplication?
What I've understood about the procedure is:
First we copy b's value to a temporary variable, then increment b, and finally multiply that temporary variable by 2.
We're not multiplying with b but with that temporary variable, so how will it run in parallel?
I've got above idea about temporary variable from another answer Is there a performance difference between i++ and ++i in C++?
What you are talking about is instruction level parallelism. The processor in this case can execute an increment on the copy of b while also multiplying the old b.
This is very fine-grained parallelism, at the processor level, and in general you can expect it to give you some advantages, depending on the architecture.
In the case of pre-increment, instead, the result of the increment operation must be waited for in the processor's pipeline before the multiplication can be executed, hence giving you a penalty.
However, this is not semantically equivalent as the value of a will be different.
#SimpleGuy's answer is a pretty reasonable first approximation. The trouble is, it assumes a halfway point between the simple theoretical model (no parallelization) and the real world. This answer tries to look at a more realistic model, still without assuming one particular CPU.
The chief thing to realize is that real CPUs have registers and caches. These exist because memory operations are far more expensive than simple math. Parallelization of the integer increment and the integer bit-shift (*2 is optimized to <<1 on real CPUs) is a minor concern; the optimizer will chiefly look at avoiding load stalls.
So let's assume that a and b aren't in a CPU register. The relevant memory operations are LOAD b, STORE a and STORE b. Everything starts with LOAD b so the optimizer may move that up as far as possible, even before a previous instruction when possible (aliasing is the chief concern here). The STORE b can start as soon as b++ has finished. The STORE a can happen after the STORE b so it's not a big problem that it's delayed by one CPU instruction (the <<1), and there's little to be gained by parallelizing the two operations.
The reason that b++ can run in parallel is as follows.
b++ is a post-increment, which means the value of the variable is incremented after use. The line below can be broken into two parts:
a = b++ * 2
Part-1: Multiply b with 2
Part-2: Increment the value of b by 1
Since the above two parts are not dependent on each other, they can be run in parallel.
Had it been a pre-increment, which means incrementing before use:
a = ++b * 2
The parts would have been
Part-1: Increment the value of b by 1
Part-2: Multiply (new) b with 2
As can be seen above, Part-2 can be run only after Part-1 is executed, so there is a dependency and hence no parallelism.
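A rough decomposition of the two cases (conceptual pseudo-C++ of my own, not literal compiler output) makes the dependency visible:
void post_vs_pre(int& a, int& b) {
    // Post-increment: a = b++ * 2;
    int tmp = b;    // copy the old value
    b = tmp + 1;    // increment; does not depend on the multiply below
    a = tmp * 2;    // the multiply uses the old value, so it can overlap with the increment

    // Pre-increment: a = ++b * 2;
    b = b + 1;      // must complete first
    a = b * 2;      // depends on the freshly incremented value, so it has to wait
}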

Performance wise, how fast are Bitwise Operators vs. Normal Modulus?

Does using bitwise operations in normal flow or conditional statements like for, if, and so on increase overall performance and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply Peephole Optimizations, among which figure Strength Reduction Optimizations such as (courtesy of Wikipedia):
Original Calculation    Replacement Calculation
y = x / 8               y = x >> 3
y = x * 64              y = x << 6
y = x * 2               y = x << 1
y = x * 15              y = (x << 4) - x
The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shift operations, the compiler is generally taught to perform even smarter transformations that you would probably not think about on your own and which are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take fewer, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines, having a specific instruction complete earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 specifically - and when the divisor is a runtime-variable value, so it can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs - the time taken by / and % operations in CPU cycles can be looked up here. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations plus additional logic to handle the irregularity. This may depend on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.
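A small example of that irregularity (the values are mine, not from the answer): for negative signed values % and & disagree, while for unsigned values they agree, which is why the compiler can only do the simple substitution in the unsigned case.
#include <cstdio>

int main() {
    int x = -5;
    unsigned int u = 5;
    std::printf("%d\n", x % 8);   // -5: signed % in C++ rounds toward zero, keeping the sign of x
    std::printf("%d\n", x & 7);   //  3: the bitwise AND always "rounds down"
    std::printf("%u\n", u % 8);   //  5: for unsigned values...
    std::printf("%u\n", u & 7u);  //  5: ...the two operations give the same result
}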
C operators cannot be meaningfully compared in terms of "performance". There's no such thing as "faster" or "slower" operators at the language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.
Here is the compiler (GCC 4.6) generated optimized -O3 code for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions... which are nothing for a production environment. This would be a premature optimization and would just introduce complexity.
Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
    return a % b;
}
and a program calls modulus(135, 16), nowhere in the compiled code will there be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved, and it leaves it as such.
But you can know if only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
    return a & (b - 1);
}
The compiler cannot do that for you.
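For completeness, here is a quick check with the call mentioned above (a sketch; the assertions are mine): both functions return 7 for 135 and 16, but only modulus_2 lets the compiler drop the division, because the caller, not the compiler, knows that b is a power of two.
#include <cassert>

int modulus(int a, int b) { return a % b; }
unsigned int modulus_2(unsigned int a, unsigned int b) { return a & (b - 1); }

int main() {
    assert(modulus(135, 16) == 7);    // compiled as a generic remainder; no bitwise magic possible
    assert(modulus_2(135, 16) == 7);  // a bitwise AND with b - 1 (with the call inlined, just an AND with 15)
}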
Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

Is the first operation supposed to be faster and if so then Why?

Is the first operation faster than the second one?
u+= (u << 3) + (u << 1) //first operation
u+= u*10 //second operation
Basically both of them do the same thing, that is u = u + (10*u).
But I came to know that the first operation is faster than the second.
Does the CPU time for the + operation differ from that of *? Is multiplication by 10 actually equivalent to 10 addition operations being performed?
It depends on the capabilities of the underlying CPU, and of the compiler.
Any decent compiler should optimise u*10 into the appropriate bit-shift operations if it believes they would be faster. It may not be able to do the opposite. So always write u*10 if you mean u*10, unless you know you're working with a bad compiler.
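If in doubt, a quick way to check (a sketch of the kind of comparison one might paste into Compiler Explorer; the function names are mine) is to compile both forms and compare the assembly; with optimisation enabled they usually come out identical, which is the argument for writing the readable u*10 form:
unsigned mul11_shift(unsigned u) { return u + (u << 3) + (u << 1); }  // u + 8u + 2u
unsigned mul11_plain(unsigned u) { return u + u * 10; }               // u + 10u: same value, clearer intent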
Use a profiler and observe the generated machine code.
It is unlikely that there will be any difference in execution time as the compiler will probably optimise both to the same machine code.
I just ran a quick profiling test on the two in order to verify my claims. I made 2 small binaries (one for each operation) and timed the execution for processing 10e6 integer values. Both report ~38 milliseconds on my machine (Mac i7 using g++). Therefore, it is safe to assume that both end up as an identical number of operations in the end. It is likely that the result will be the same for other compiler/processor combinations.
... if both give identical performance, use:
u += u*10 //second operation
... just because it is a lot easier to understand at a glance.
Depends on the compiler's translation and the processor. Some processors have multiplication units so that actually multiplying only takes one instruction.
So far the first requires at least 3 instructions.
When in doubt, profile.