Elimination of if-else for improving performance in pipelined architectures

My professor insists on writing code which is "truly sequential" by omitting use of "if-else" constructs or "looping" constructs.
His argument is that any branch instruction causes a pipeline flush and is inefficient.
He suggested use of signals and exception handling.
He also suggested using certain flags, viz. the overflow, sign, and carry flags, to replace if-else conditions.
My question is whether such a program is feasible. If yes, is it really efficient? Examples would be helpful.

This kind of micro-optimization makes sense in highly compute-intensive loops, where saving a few cycles can have a noticeable effect. There are a few caveats:
if the code is not mature enough, such optimization can be premature and counterproductive;
it is of little use to optimize code that already runs fast;
use profiling to be sure where the bottlenecks are;
only assembly language gives you access to the arithmetic flags;
good compilers know the tricks and do the work for you to some extent;
if you care about branches, also care about divisions and the memory access patterns.
With the current level of sophistication of modern processors and high-level languages, be they compiled or interpreted, you have less and less control over the code actually generated.

Related

Require compiler to emit branchless/constant-time code

In cryptography, any piece of code that depends on secret data (such as a private key) must execute in constant time in order to avoid side-channel timing attacks.
The most popular architectures currently (x86-64 and ARM AArch64) both support certain kinds of conditional execution instructions, such as:
CMOVcc, SETcc for x86-64
CSINCcc, CSINVcc, CSNEGcc for AArch64
Even when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it's not the primary goal -- running in constant time is.
Therefore, it should in principle be possible to write branchless code in e.g. C/C++, and indeed it is seen that gcc/clang will often emit branchless code with optimizations turned on (there is even a specific flag for this in gcc: -fif-conversion2). However, this appears to be an optimization decision, and if the compiler thinks branchless will perform worse (say, if the "then" and "else" clauses perform a lot of computation, more than the cost of flushing the pipeline in case of a wrongly predicted branch), then I assume the compiler will emit regular code.
If constant-time is a non-negotiable goal, one may be forced to use some of the aforementioned tricks to generate branchless code, making the code less clear. Also, performance is often a secondary but still important goal, so the developer has to hope that the compiler will infer the intended operation behind the branchless code and emit an efficient instruction sequence, often using the instructions mentioned above. This may require rewriting the code over and over while looking at the assembly output, until a magic incantation satisfies the compiler; and this may change from compiler to compiler, or when a new version comes out.
Overall, this is an awful situation on both sides: compiler writers must infer intent from obfuscated code, transforming it into a much simpler instruction sequence; while developers must write such obfuscated code, since there are no guarantees that simple, clear code would actually run in constant time.
Making this into a question: if a certain piece of code must be emitted in constant-time (or not at all), is there a compiler flag or pragma that will force the code to be emitted as such, even if the compiler predicts worse performance than the branched version, or abort the compilation if it is not possible? Developers would be able to write clear code with the peace of mind that it will be constant-time, while supplying the compiler with clear and easy to analyze code. I understand this is probably a language- and compiler-dependent question, so I would be satisfied with either C or C++ answers, for either gcc or clang.
I found this question by going down a similar rabbit hole. For security purposes I require my code to not branch on secret data and to not leak information through timing attacks.
While not an answer per se I can recommend this paper from the S&P 2018: https://ieeexplore.ieee.org/document/8406587.
The authors also wrote an extension for Clang/LLVM. I am not sure how well this extension works, but it's a first step and gives a good overview of where we currently stand in the research context.

Why don't likely/unlikely show performance improvements?

I have many validation checks in the code where the program crashes if any check fails, so all these checks are very unlikely to be taken.
if ((msg = newMsg()) == (void *)0) // this is very unlikely
{
    panic(); // crash
}
So I have used the unlikely macro, which hints the compiler for branch prediction. But I have seen no improvement with this (I have some performance tests). I am using gcc 4.6.3.
Why is there no improvement? Is it because there is no else case? Should I use any optimization flag while building my application?
Should I use any optimization flag while building my application?
Absolutely! Even optimizations at the lowest level, -O1 for GCC/clang/icc, are likely to outperform most of your manual optimization efforts, essentially for free, so why not?
I am using gcc4.6.3.
GCC 4.6 is old. You should consider working with modern tools, unless you're constrained otherwise.
But I have seen no improvement with this (I have some performance tests).
You haven't seen visible performance improvements, which is very common when dealing with micro-optimizations like this one. Unfortunately, achieving visible improvements is not easy on today's hardware: components are unbelievably faster than they used to be, so saving a few cycles is rarely noticeable.
It is worth noting, though, that sequential micro-optimizations can still make your code much faster, as in tight loops. Avoiding stalls and branch mispredictions, and maximizing cache use, does make a difference when handling chunks of data. SO's most-voted question shows this clearly.
It's even stated in the GCC manual:
— Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
(emphasis mine)
See other answers related to this on SO:
likely(x) and __builtin_expect((x),1)
Why do we use __builtin_expect when a straightforward way is to use if-else
etc.
Search for it: https://stackoverflow.com/search?q=__builtin_expect

Convert SSE intrinsics to readable C/C++ code?

I've inherited some highly optimized (SSE4), but uncommented, C code. Are there any tools or utilities that will convert the SSE intrinsics into more readable code or pseudocode? This would be primarily for readability, so that I could understand the code better before digging in and making changes.
I do not know of any such tool.
But it most likely would not help much anyway. If the SSE code is optimized well, the hard part is probably not decoding the intrinsics. The hard part is following all the tricks to improve locality and eliminate intra-iteration data dependencies (stripmining, polyhedral loop transformations, etc.)
I can give you a suggestion going forward, however: Always have a well-commented scalar version of the same routine written in the simplest possible way. This "reference code" should care only about readability and correctness, not speed... So it should have plenty of assertions. Also have a test suite that can exercise both the scalar version and the optimized variant(s).
Whether implementing a routine for the first time, or updating an existing routine, always start with the reference code and the test suite. Not necessarily in that order.
This approach is more expensive up front, but much much cheaper in the long run.

How much faster if write in-line assembly rather than regular c/c++ code? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
One of my senior colleagues optimizes a function (he is implementing image filtering) by writing inline assembly. Is that really necessary? Wouldn't a modern compiler do that for us? Typically, how much gain do we get by converting C code into assembly? If assembly code really brings lots of benefits, when should we convert C/C++ code into assembly, and when should we leave the code as it is, since assembly code is hard to read and maintain?
If you are smarter than the compiler, you may be able to make your code faster on one specific platform by writing it by hand in assembly.
However, most big C/C++ compilers are extremely good optimizers; you are unlikely to be smarter than them.
No, that's not really necessary, and it also makes porting the app much more difficult. This is the main concern about inline assembly.
And, of course, 80% of the time the compiler can do this better.
First find an efficient algorithm.
Then implement it in clearly readable code.
Then evaluate its performance.
If your code's performance is inadequate, consider alternative algorithms.
Repeat steps 3 and 4 until either performance is acceptable or you have exhausted all algorithmic alternatives
Drink some coffee.
Take a walk.
Repeat steps 3 and 4 again some more.
Have a beer.
Give steps 3 and 4 another few tries.
Get some rest.
Back to 3 and 4.
Spend years studying the architecture of the CPU(s) your code will run on.
Now consider hand-writing some assembly.
I'd imagine that for image filtering you might benefit from e.g. the availability of SIMD instructions, but not all compilers can automatically compile your code to use them, and not all the time. So in-line assembly or intrinsics can help with that.
One of my senior colleagues optimizes a function (he is implementing image filtering) by writing inline assembly. Is that really necessary?
Obviously I can't comment on your colleague's exact situation, but I wouldn't be surprised if it was necessary. There are many specialized instructions used for image filters that won't necessarily be emitted by the compiler. Inline assembly is often the only way to access those instructions (or through intrinsics).
Wouldn't modern compiler do that for us?
Obviously this depends on what 'that' is, but while modern compilers are certainly good at generating code, they aren't magic. It is often the case where you know something about your code that the compiler doesn't (or can't).
If your line of work involves high performance code then there are definitely places where you can get major improvements from using inline assembly (or even just compiler intrinsics).
If assembly code really brings lots of benefits, when should we convert C/C++ code into assembly and when should we leave the code as it is, since assembly code is hard to read and maintain.
Here's how:
First, profile your code to see what potential benefits are to be gained.
Look at the disassembly to see what the compiler is doing. If it is already doing things optimally then there is no point going further.
If there are opportunities for improvement, consider using compiler intrinsics before hand-written assembly as it is generally easier to maintain and more portable.
Only if all that fails should you go to inline assembly.
The short answer is no it's not necessary, the longer answer is... well, it depends. Modern compilers do indeed do a very good job of optimizing code, but they don't necessarily have access to all the assumptions a human does when optimizing. Hand coded assembler can beat compiled code, but there is a tradeoff between portability and maintenance.
Assuming that you have already determined this bit of code is a hotspot, the first thing you should do is tweak algorithms, then tweak the C++ code to make it faster (for example, unrolling loops), and then tweak compiler flags. As a last resort, if you still can't make it go as fast as you need, consider whether it's worth paying the cost of hand-optimizing, given all the future cost you will incur in maintenance and portability.
Where image processing is concerned I would be cautious either way as it depends on the input data, the algorithm and the compiler. Intel's ICC has a very good parallelizer and vectorizer for generating SSE code, it may be hard to beat by hand in most general purpose image processing cases. VCC on the other hand might not do such a good job.
However, I would expect that most benefit could be gained using compiler intrinsics rather than inline assembler.
Programming languages are implemented very well. Unless you are doing very simple bitwise operations like add or bit-shift, working with pointers, or targeting new instruction sets, you should use a practical programming language; you almost never need assembly language. Standard C operations already call the relevant CPU instructions. If somebody makes a new CPU that supports new instructions and you want to use them, the programming languages and libraries will not support them yet, and adaptation takes time. A new CPU instruction makes things faster, but you will rarely be on a team like those behind DirectX, OpenGL, MMX, or SSE writing that support yourself. Think of a day when graphics libraries like DirectX or OpenGL had not yet been developed, and Intel, say, created instruction sets supported by no language and present in no library: then you might want to call the CPU directly and pass your parameters yourself, for better performance. You could still do the same things without the new instructions. Another example: if a new Intel CPU supports MD5 hash checking, it doesn't mean you couldn't use MD5 before; it means a library using the MD5 instructions will run faster, because the CPU has a dedicated unit inside that executes the operation efficiently. But normally you would wait until somebody publishes a library that uses the CPU's MD5 instructions. CPUs today add instruction sets for compression, hash checking, encryption, and so on. You would use assembly language for such specific instructions, not for the good old add, multiply, subtract, or divide, because your programming language already uses those in the most efficient way possible.

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++ ?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There is a lot of assembler code in use.
Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go.
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often needed to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt have special responsibilities and the only way to guarantee that the right things happen is to write the outer layer of a handler in assembly. For example, on a ColdFire (or any descendant of the 68000) only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.
Why is assembly language code often needed along with C/C++?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is possible when assembly language code is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android etc) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D computer games. There is a lot of assembler code in use.
They were either written in the '80s-'90s, or assembly is used sparingly (maybe 1%-5% of total source code) inside the game engine.
Edit: To this date, compiler auto-vectorization quality is still poor. So you may see programs that contain vectorization intrinsics, and since that's not really much different from writing actual assembly (most intrinsics have a one-to-one mapping to assembly instructions), some folks might just decide to write in assembly.
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written in 99% assembly.
http://www.chrissawyergames.com/faq3.htm
In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.
There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
use of instructions like SWP for creating mutexes
specialist coprocessor instructions (such as those needed to program the MMU and manage RAM caches)
access to the carry and overflow flags
You may also be able to optimize code better in assembler than in C/C++ (e.g. memcpy on Android is written in assembler).
There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.
Why is assembly language code often needed along with C/C++?
It isn't.
What can't be done in C/C++, which is possible when assembly language code is mixed?
Accessing system registers or IO ports on the CPU.
Accessing BIOS functions.
Using specialized instructions that don't map directly to the programming language, e.g. SIMD instructions.
Provide optimized code that's better than the compiler produces.
The first two points you usually don't need unless you're writing an operating system, or code running without an operating system.
Modern CPUs are quite complex, and you'll be hard pressed to find people who can actually write better assembly than what the compiler produces. Many compilers come with libraries giving you access to the more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to assembly for that.
One more thing worth mentioning is:
C and C++ do not provide any convenient way to set up stack frames when one needs to implement binary-level interop with a scripting language, or to implement some kind of support for closures.
In certain situations, assembly can be far more optimal than anything a compiler can generate.