optimisation advice on value clamping in a loop - c++

I have a tight loop exactly like the one Chandler Carruth presented at CppCon 2017:
https://www.youtube.com/watch?v=2EWejmkKlxs
at 25 mins in this video, there is a loop like this:
for (int& i:v)
i = i>255?255:i;
where v is a vector. This is exactly the same code used in my program, which profiling shows takes a good amount of time.
In his presentation, Chandler modified the assembly and sped up the loop. My question is: in practice, in production code, what is the recommended approach to optimise this? Shall we use inline assembly in C++ code? Or, like Chandler did, compile the C++ code into assembly and then optimise the assembler?
An example of optimising the above for loop would be really appreciated, assuming x86 architecture.

My question is: in practice, in production code, what is the recommended approach to optimise this? Shall we use inline assembly in C++ code? Or, like Chandler did, compile the C++ code into assembly and then optimise the assembler?
For production code you need to consider that the software might be compiled and linked in an automatic build system.
How would you want to apply the code changes to the assembler code in such a system? You might apply a diff file, but that might break if optimisation (or other) settings are changed, if switching to another compiler, or ...
That leaves two options: write the entire function in an assembler file (.s), or have inline assembler code inside the C++ code – the latter possibly with the advantage of keeping related code in the same translation unit.
Still, I'd let the compiler generate the assembler code once – with the highest optimisation level available. This code can then serve as an (already pre-optimised) base for your hand-made optimisations, the outcome of which should then be pasted back into the C++ source file as inline assembly or placed into a separate assembly source file.
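As a very rough illustration (my own sketch, not code from this answer), a hand-tuned clamp pasted back into C++ as GNU extended inline asm could look something like this; note that per-element inline asm like this also blocks the compiler's own vectorization, which is one of the costs of the approach:
#include <cstddef>

// x86-only sketch: clamp each element to 255 with cmp + cmov inside inline asm.
void clamp255(int* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int v = data[i];
        int limit = 255;
        asm("cmpl %[lim], %[val]\n\t"      // compare v with 255
            "cmovgl %[lim], %[val]"        // if v > 255, v = 255
            : [val] "+r"(v)
            : [lim] "r"(limit)
            : "cc");
        data[i] = v;
    }
}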

Chandler modified the compiler's asm output because that's an easy way to do a one-off experiment to find out whether a change would be useful, without doing all the work you'd normally need in order to include an asm loop or function as part of the source code for a project.
Compiler-generated asm is usually a good starting point for an optimized loop, but actually keeping the whole file as-is isn't a good or even viable way to maintain an asm implementation of a loop as part of a program. See @Aconcagua's answer.
Plus it defeats the purpose of having any other functions in the file written in C++ and being available for link-time optimization.
Re: actually clamping:
Note that Chandler was just experimenting with changes to the non-vectorized code-gen, and disabled unrolling + auto-vectorization. In real life you can hopefully target SSE4.1 or AVX2 and let the compiler auto-vectorize with pminsd or pminud for clamping signed or unsigned int to an upper bound. (Also available in other element sizes. Or without SSE4.1, with just SSE2, maybe you can do 2x PACKSSDW => packuswb (unsigned saturation), then unpack with zeros back up to 4 vectors of dword elements, if you can't just use an output of uint8_t[]!)
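If you do want to spell that out manually rather than rely on auto-vectorization, a hedged sketch with SSE4.1 intrinsics (my own example, compile with -msse4.1; not code from the answer) could look like this:
#include <cstddef>
#include <smmintrin.h>   // SSE4.1: _mm_min_epi32 (pminsd)
#include <vector>

void clamp_upper_sse41(std::vector<int>& v) {
    const __m128i limit = _mm_set1_epi32(255);
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v[i]));
        x = _mm_min_epi32(x, limit);                           // per-element signed min
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&v[i]), x);
    }
    for (; i < v.size(); ++i)                                  // scalar tail
        v[i] = v[i] > 255 ? 255 : v[i];
}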
And BTW, in the comments of the video, Chandler said it turns out that he made a mistake and the effect he was seeing wasn't really due to a predictable branch vs. a cmov. It might have been a code-alignment thing, because changing from mov %ebx, (%rdi) to movl $255, (%rdi) made a difference!
(AMD CPUs aren't known to have register-read stalls the way P6-family did, and should have no trouble hiding the dep chain of a cmov coupling a store to a load, vs. breaking it with branch prediction + speculation past a branch.)
You very rarely would actually want to use a hand-written loop. Often you can hand-hold and/or trick your compiler into making asm more like what you want, just by modifying the C++ source. Then a future compiler is free to tune differently for -march=some_future_cpu.
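For example (a sketch of the portable approach, not code taken from the answer): writing the clamp with std::min leaves the compiler free to auto-vectorize it with pminsd/vpminsd at -O3 (or -O2 -ftree-vectorize), and to retune it for whatever -march you ask for later:
#include <algorithm>
#include <vector>

void clamp_upper(std::vector<int>& v) {
    for (int& i : v)
        i = std::min(i, 255);   // same meaning as i = i > 255 ? 255 : i
}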

Related

Can I improve branch prediction with my code?

This is a naive general question open to any platform, language, or compiler. Though I am most curious about Aarch64, C++, GCC.
When coding an unavoidable branch in program flow that depends on I/O state (which the compiler cannot predict), and I know that one state is much more likely than the other, how do I indicate that to the compiler?
Is this better
if(true == get(gpioVal))
unlikelyFunction();
else
likelyFunction();
than this?
if(true == get(gpioVal))
likelyFunction(); // performance critical, fill prefetch caches from this branch
else
unlikelyFunction(); // missed prediction not consequential on this branch
Does it help if the communication protocol makes the more likely or critical value true(high), or false(low)?
TL:DR: Yes, in C or C++ use a likely() macro, or C++20 [[likely]], to help the compiler make better asm. That's separate from influencing actual CPU branch-prediction, though. If writing in asm, lay out your code to minimize taken branches.
For most ISAs, there's no way in asm to hint the CPU whether a branch is likely to be taken or not. (Some exceptions include Pentium 4 (but not earlier or later x86), PowerPC, and some MIPS, which allow branch hints as part of conditional-branch asm instructions.)
Is it possible to tell the branch predictor how likely it is to follow the branch?
But not-taken straight-line code is cheaper than taken, so hinting in a high-level language to lay out code with the fast path contiguous doesn't help branch-prediction accuracy, but it can help (or hurt) performance. (I-cache locality, front-end bandwidth: remember code-fetch happens in contiguous 16- or 32-byte blocks, so a taken branch means a later part of that fetch block isn't useful. Also, branch-prediction throughput: some CPUs, like Intel Skylake for example, can't handle predicted-taken branches at more than 1 per 2 clocks, other than loop branches. That includes unconditional branches like jmp or ret.)
Taken branches are harder for the front-end; a correctly predicted not-taken branch is just a normal instruction for an execution unit (verifying the prediction), with nothing special for the front-end. See also Modern Microprocessors: A 90-Minute Guide!, which has a section on branch prediction. (And is overall excellent.)
What exactly happens when a skylake CPU mispredicts a branch?
Avoid stalling pipeline by calculating conditional early
How does the branch predictor know if it is not correct?
Many people misunderstand source-level branch hints as branch-prediction hints. That could be one effect if compiling for a CPU that supports branch hints in asm, but for most CPUs the significant effect is on layout, and on deciding whether to use branchless code (cmov) or not; a [[likely]] condition also implies that the branch should predict well.
With some CPUs, especially older ones, the layout of a branch did sometimes influence runtime prediction: if the CPU didn't remember anything about the branch in its dynamic predictors, the standard static-prediction heuristic is that forward conditional branches are assumed not-taken and backward conditional branches are assumed taken (because that's normally the bottom of a loop). See the BTFNT section in https://danluu.com/branch-prediction/.
A compiler can lay out an if(c) x; else y; either way: either matching the source, with a jump over x if !c as the opening branch, or swapping the if and else blocks and using the opposite branch condition. Or it can put one block out of line (e.g. after the ret at the end of the function), so the fast path has no taken branches, conditional or otherwise, while the less likely path has to jump there and then jump back.
It's easy to do more harm than good with branch hints in high-level source, especially if surrounding code changes without paying attention to them, so profile-guided optimization is the best way for compilers to learn about branch predictability and likelihood. (e.g. gcc -O3 -fprofile-generate / run with some representative inputs that exercise code-paths in relevant ways / gcc -O3 -fprofile-use)
But there are ways to hint in some languages, like C++20 [[likely]] and [[unlikely]], which are the portable version of GNU C likely() / unlikely() macros around __builtin_expect.
https://en.cppreference.com/w/cpp/language/attributes/likely C++20 [[likely]]
How to use C++20's likely/unlikely attribute in if-else statement syntax help
Is there a compiler hint for GCC to force branch prediction to always go a certain way? (to the literal question, no. To what's actually wanted, branch hints to the compiler, yes.)
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? The GNU C macros using __builtin_expect, same effect but different syntax than C++20 [[likely]]
What is the advantage of GCC's __builtin_expect in if else statements? example asm output. (Also see CiroSantilli's answers to some of the other questions where he made examples.)
Simple example where [[likely]] and [[unlikely]] affect program assembly?
I don't know of ways to annotate branches for languages other than GNU C / C++, and ISO C++20.
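As a small illustration of the hints discussed above (a sketch of mine; read_gpio() is a hypothetical stand-in for the question's get(gpioVal)):
#include <cstdio>

// GNU C style, works in C and in pre-C++20 C++:
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

bool read_gpio();   // hypothetical I/O read

void poll_gnu() {
    if (unlikely(read_gpio())) {
        std::puts("rare path");      // compiler may move this out of line
    } else {
        std::puts("hot path");       // laid out as the fall-through path
    }
}

// C++20 attribute form of the same hint:
void poll_cxx20() {
    if (read_gpio()) [[unlikely]] {
        std::puts("rare path");
    } else {
        std::puts("hot path");
    }
}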
Absent any hints or profile data
Without that, optimizing compilers have to use heuristics to guess which side of a branch is more likely. If it's a loop branch, they normally assume that the loop will run multiple times. On an if, they have some heuristics based on the actual condition and maybe on what's in the blocks being controlled; I don't know exactly what gcc or clang do, I haven't looked into it.
I have noticed that GCC does care about the condition, though. It's not as naive as assuming that int values are uniformly randomly distributed, although I think it normally assumes that if (x == 10) foo(); is somewhat unlikely.
JIT compilers like in a JVM have an advantage here: they can potentially instrument branches in the early stages of running, to collect branch-direction information before making final optimized asm. OTOH they need to compile fast because compile time is part of total run time, so they don't try as hard to make good asm, which is a major disadvantage in terms of code quality.

Coding for ARM NEON: How to start?

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in C++ environment?
I use Eclipse IDE in Linux Gentoo to write C++ code.
UPDATE
After reading the answers I did some tests with the software. I compiled my project with the following flags:
-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon
Keep in mind that this project includes extensive libraries such as open frameworks, OpenCV, and OpenNI, and everything was compiled with these flags.
To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3.
Would you expect this to improve the performance of the project? We experienced no changes at all, which is rather strange considering all the answers I read here.
Another question: all the for loops have a known number of iterations, but many of them iterate over custom data types (structs or classes). Can GCC optimize these loops even though they iterate over custom data types?
EDIT:
From your update, you may misunderstand what the NEON processor does. It is a SIMD (Single Instruction, Multiple Data) vector processor. That means it is very good at performing an instruction (say "multiply by 4") on several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.
Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.
For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.
My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)
That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.
If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), were enough to let me beat the compiler by 2-3x on my particular problems. On the other hand, they were not sufficient on their own: when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)
ARM Assembly language
A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
ARM NEON support in the ARM compiler
Coding for NEON
One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.
Here's a "me too" with some blog posts from ARM. FIRST, start with the following to get the background information, including 32-bit ARM (ARMv7 and below), Aarch32 (ARMv8 32-bit ARM) and Aarch64 (ARMv8 64-bit ARM):
ARM NEON programming quick reference
Second, check out the Coding for NEON series. It's a nice introduction with pictures, so things like interleaved loads make sense at a glance.
Coding for NEON - Part 1: Load and Stores
Coding for NEON - Part 2: Dealing With Leftovers
Coding for NEON - Part 3: Matrix Multiplication
Coding for NEON - Part 4: Shifting Left and Right
Coding for NEON - Part 5: Rearranging Vectors
I also went on Amazon looking for some books on ARM assembly with a treatment of NEON. I could only find two, and neither book's treatment of NEON was impressive. They reduced it to a single chapter with the obligatory matrix example.
I believe ARM intrinsics are a very good idea. The intrinsics allow you to write code for the GCC, Clang and Visual C/C++ compilers. We have one code base that works for ARM Linux distros (like Linaro), some iOS devices (using -arch armv7) and Microsoft gadgets (like Windows Phone and Windows Store Apps).
If you have access to a reasonably modern GCC (GCC 4.8 and upwards) I would recommend giving intrinsics a go. The NEON intrinsics are a set of functions that the compiler knows about, which can be used from C or C++ programs to generate NEON/Advanced SIMD instructions. To gain access to them in your program, it is necessary to #include <arm_neon.h>. The verbose documentation of all available intrinsics is available at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf , but you may find more user-friendly tutorials elsewhere online.
Advice on this site is generally against the NEON intrinsics, and certainly there are GCC versions which have done a poor job of implementing them, but recent versions do reasonably well (and if you spot bad code generation, please do raise it as a bug - https://gcc.gnu.org/bugzilla/ )
They are an easy way to program to the NEON/Advanced SIMD instruction set, and the performance you can achieve is often rather good. They are also "portable", in that when you move to an AArch64 system, a superset of the intrinsics you can use from ARMv7-A are available. They are also portable across implementations of the ARM architecture, which can vary in their performance characteristics, but which the compiler will model for performance tuning.
The principal benefit of the NEON intrinsics over hand-written assembly is that the compiler can understand them when performing its various optimization passes. By contrast, hand-written assembler is an opaque block to GCC and will not be optimized. On the other hand, expert assembler programmers can often beat the compiler's register-allocation policies, particularly when using instructions which write to or read from multiple consecutive registers.
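To make the shape of intrinsics code concrete, here is a minimal sketch (mine, not from the answer; it assumes n is a multiple of 4 to keep it short):
#include <arm_neon.h>

void add_arrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);        // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));    // lane-wise add, then store
    }
}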
In addition to Wally's answer - and this probably should be a comment, but I couldn't make it short enough: ARM has a team of compiler developers whose entire role is to improve the parts of GCC and Clang/LLVM that do code generation for ARM CPUs, including the features that provide "auto-vectorization". I have not looked deeply into it, but from my experience of x86 code generation, I'd expect the compiler to do a decent job with anything that is relatively easy to vectorize. Some code is hard for the compiler to understand when it can vectorize or not, and may need some "encouragement" - such as unrolling loops or marking conditions as "likely" or "unlikely", etc.
Disclaimer: I work for ARM, but have very little to do with the compilers or even CPUs, as I work for the group that does graphics (where I have some involvement with compilers for the GPUs in the OpenCL part of the GPU driver).
Edit:
Performance, and the use of the various instruction extensions, really depends on EXACTLY what the code is doing. I'd expect that libraries such as OpenCV are already doing a fair amount of clever stuff in their code (such as handwritten assembler as well as compiler intrinsics, and generally code that is designed to let the compiler do a good job), so it may not really give you much improvement. I'm not a computer vision expert, so I can't really comment on exactly how much such work has been done on OpenCV, but I'd certainly expect the "hottest" points of the code to have been fairly well optimised already.
Also, profile your application. Don't just fiddle with optimisation flags; measure its performance and use a profiling tool (e.g. the Linux "perf" tool) to find out WHERE your code is spending its time. Then see what can be done about that particular code. Is it possible to write a more parallel version of it? Can the compiler help, or do you need to write assembler? Is there a different algorithm that does the same thing but in a better way, etc., etc.?
Although tweaking compiler options CAN help, and often does, it tends to give tens of percent, whereas a change in algorithm can often lead to 10 times or 100 times faster code - assuming, of course, your algorithm can be improved!
Understanding what part of your application is taking the time, however, is KEY. There's no point in changing things to make code that takes 5% of the time 10% faster, when a change somewhere else could make a piece of code that is 30 or 60% of the total time 20% faster. Or in optimising some maths routine when 80% of the time is spent reading a file, where making the buffer twice the size would make it twice as fast...
Although a long time has passed since I submitted this question, I realize that it still gathers some interest, so I decided to tell what I ended up doing regarding this.
My main goal was to optimize a for-loop which was the bottleneck of the project. So, since I don't know anything about Assembly I decided to give NEON intrinsics a go. I ended up having a 40-50% gain in performance (in this loop alone), and a significant overall improvement in performance of the whole project.
The code does some math to transform a bunch of raw distance data into distance to a plane in millimetres. I use some constants (like _constant05, _fXtoZ) that are not defined here, but they are just constant values defined elsewhere.
As you can see, I'm doing the math for 4 elements at a time, talk about real parallelization :)
unsigned short* frameData = frame.ptr<unsigned short>(_depthLimits.y, _depthLimits.x);
unsigned short step = _runWidth - _actWidth; // because a ROI is being processed, not the whole image
cv::Mat distToPlaneMat = cv::Mat::zeros(_runHeight, _runWidth, CV_32F);
float* fltPtr = distToPlaneMat.ptr<float>(_depthLimits.y, _depthLimits.x); // a pointer to the start of the data
for (unsigned short y = _depthLimits.y; y < _depthLimits.y + _depthLimits.height; y++)
{
    for (unsigned short x = _depthLimits.x; x < _depthLimits.x + _depthLimits.width - 1; x += 4)
    {
        float32x4_t projX = {(float)x, (float)(x + 1), (float)(x + 2), (float)(x + 3)};
        float32x4_t projY = {(float)y, (float)y, (float)y, (float)y};
        uint16x4_t framePixels = vld1_u16(frameData); // load 4 raw 16-bit depth values
        float32x4_t floatFramePixels = {(float)framePixels[0], (float)framePixels[1], (float)framePixels[2], (float)framePixels[3]};
        float32x4_t fNormalizedY = vmlsq_f32(_constant05, projY, _yResInv);
        float32x4_t auxfNormalizedX = vmulq_f32(projX, _xResInv);
        float32x4_t fNormalizedX = vsubq_f32(auxfNormalizedX, _constant05);
        float32x4_t realWorldX = vmulq_f32(fNormalizedX, floatFramePixels);
        realWorldX = vmulq_f32(realWorldX, _fXtoZ);
        float32x4_t realWorldY = vmulq_f32(fNormalizedY, floatFramePixels);
        realWorldY = vmulq_f32(realWorldY, _fYtoZ);
        float32x4_t realWorldZ = floatFramePixels;
        realWorldX = vsubq_f32(realWorldX, _tlVecX);
        realWorldY = vsubq_f32(realWorldY, _tlVecY);
        realWorldZ = vsubq_f32(realWorldZ, _tlVecZ);
        float32x4_t distAuxX, distAuxY, distAuxZ;
        distAuxX = vmulq_f32(realWorldX, _xPlane);
        distAuxY = vmulq_f32(realWorldY, _yPlane);
        distAuxZ = vmulq_f32(realWorldZ, _zPlane);
        float32x4_t distToPlane = vaddq_f32(distAuxX, distAuxY);
        distToPlane = vaddq_f32(distToPlane, distAuxZ);
        *fltPtr = (float) distToPlane[0];
        *(fltPtr + 1) = (float) distToPlane[1];
        *(fltPtr + 2) = (float) distToPlane[2];
        *(fltPtr + 3) = (float) distToPlane[3];
        frameData += 4;
        fltPtr += 4;
    }
    frameData += step;
    fltPtr += step;
}
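One possible simplification (my suggestion, not part of the original code): the four scalar writes at the end of the inner loop can be replaced by a single vector store of all four lanes:
vst1q_f32(fltPtr, distToPlane);   // stores distToPlane[0..3] to fltPtr[0..3]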
If you don't want to mess with assembly code at all, then tweak the compiler flags to maximally optimize for speed. gcc given the proper ARM target should do this provided the number of loop iterations is apparent.
To check gcc code generation, request assembly output by adding the -S flag.
If after several tries (of reading the gcc documentation and tweaking flags) you still can't get it to produce the code you want, then take the assembly output and edit it to your satisfaction.
Beware of premature optimization. The proper development order is to get the code functional, then see if it needs optimization. Only when the code is stable does it make sense to do so.
Play with some minimal assembly examples on QEMU to understand the instructions
The following setup does not have many examples yet, but it serves as a neat playground:
v7 examples
v8 examples
setup usage
The examples run in QEMU user mode, which dispenses with extra hardware, and GDB works just fine.
The asserts are done through the C standard library.
You should be able to easily extend that setup with new instructions as you learn them.
ARM intrinsics in particular were asked about at: Is there a good reference for ARM Neon intrinsics?

the instruction cache and conditional statements

I'm trying to orient my code to use the cache as efficiently as possible using data-oriented design; it's my first time thinking about such things, as it goes. I've worked out a way to loop over the same instructions that draw a sprite on screen; the vectors passed to the function include positions and sprites for all game entities.
My question is: does the conditional statement evict the draw function from the instruction cache and therefore ruin my plan? Or is what I'm doing just generally insane?
#include <vector>
using std::vector;

struct position
{
    position(int x_, int y_) : x(x_), y(y_) {}
    int x, y;
};

vector<position> thePositions;
vector<sprite> theSprites;      // sprite is defined elsewhere and has a draw(position) member
vector<int> theNoOfEntities;    // e.g. 3 things, 4 thingies, 36 dodahs
int noOfEntitiesTotal;

// invoking the draw function
draw(&thePositions[0], &theSprites[0], &theNoOfEntities[0], noOfEntitiesTotal);

void draw(position* thepos, sprite* thesp, int* theints, int totalsize)
{
    for (int j = 0, i = 0; i < totalsize; i++)
    {
        j += i % theints[j] ? 1 : 0;    // j advances when i % theints[j] != 0 (the conditional in question)
        thesp[j].draw(thepos[i]);
    }
}
Did you verify that the conditional stays as a conditional in the assembly? Generally, with simple conditionals such as the one presented above, the expression can be optimized into a branchless sequence (either at machine level using machine-specific instructions, or at IR level using some fancy bit math).
In your case, your conditional gets folded down very nicely on x86 to a flat sequence (and AFAIK this will occur on most non-x86 platforms too, as it's a mathematical optimization, not a machine-specific one):
IDIV DWORD PTR SS:[ARG.1]
MOV EAX,EDX
NEG EAX ; Converts EAX to boolean
SBB EAX,EAX
NEG EAX
So this means there aren't any branches to predict other than your outer loop, which follows a pattern, meaning it won't cause any misprediction (it might mispredict on exit, depending on the generated assembly, but by then the loop has exited, so it doesn't matter).
This brings up a second point: never assume, always profile and test (one of the cases where assembly knowledge helps a lot). That way you can spend time optimizing where it really matters (and you can understand the inner workings of your code on your target platform better too).
If you really are concerned about branch misprediction and the penalties incurred, use the resources provided by your target architecture's manufacturer (different architectures behave very differently on misprediction), such as this and this from Intel. AMD's CodeAnalyst is a great tool for checking branch misprediction and the penalties it may be causing.
Whoa there buddy! No offence, but it looks like you've read about DOD without fully understanding the how and why of it. Now you're just following the guidelines set out in articles about DOD like they're important. They're not; what's important in DOD is understanding your data, understanding the computer architecture, and understanding how your code can manipulate that data as efficiently as possible using your knowledge of the architecture. The guidelines set out in DOD articles are only there as reminders of common things to think about.
Want to know when how and why you need to use DOD? Learn about the architecture you're working with. Do you know the cost of one cache-miss? It's really really really really low. Do the math. I'm serious, do the math yourself, I could probably give you some numbers but then you wouldn't be learning much.
So find out what you can about the architecture, how a processor works, how memory and caches work, how assembly language works, what the assembly generated by your compiler looks like. Once you know and understand all of that, DOD is really nothing more than stating some almost obvious guidelines to writing really efficient code.
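For a concrete picture of the kind of data thinking DOD is about, here is a hedged illustration (mine, not from the answer) of array-of-structs versus struct-of-arrays layouts:
#include <cstdint>
#include <vector>

// Array-of-structs: hot and cold fields interleaved per entity.
struct EntityAoS { float x, y; int spriteId; std::uint8_t alive; };
std::vector<EntityAoS> entitiesAoS;

// Struct-of-arrays: each field contiguous, so a pass that only touches
// positions streams through memory without dragging the other fields along.
struct EntitiesSoA {
    std::vector<float> x, y;
    std::vector<int> spriteId;
    std::vector<std::uint8_t> alive;
};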

When should I use ASM calls?

I'm planning on writing a game in C++, and it will be extremely CPU-intensive (pathfinding, genetic algorithms, neural networks, ...).
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If you're not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler can.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
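For instance (a hedged sketch of mine, not from the answer; it assumes x86 and that n is a multiple of 4), a dot product written with SSE intrinsics rather than assembly:
#include <xmmintrin.h>

float dot(const float* a, const float* b, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));   // 4 running partial sums
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];        // horizontal sum at the end
}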
Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively sped up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your solution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask, then you don't need it.

What is the optimization level (g++) you use while comparing two different algorithms written in C++?

I have two algorithms written in C++. As far as I know, it is conventional to compile with
-O0 -DNDEBUG (g++) while comparing the performance of two algorithms (asymptotically they are the same).
But I think the optimization level is unfair to one of them, because it uses the STL everywhere. The program which uses plain arrays is about 5 times faster than the STL-heavy algorithm when compiled with -O0, but the difference is much smaller when I compile them with -O2 -DNDEBUG.
Is there any way to get the best out of the STL (I am getting a heavy performance hit in the vector [] operator) at optimization level -O0?
What optimization level (and possibly defines like -DNDEBUG) do you use while comparing two algorithms?
It would also be a great help if someone could give some idea about the trend in academic research on comparing the performance of algorithms written in C++.
OK, to isolate the problem of optimization level, I am now using one algorithm but two different implementations.
I have changed one of the functions from raw pointers (int and boolean) to std::vector<int> and std::vector<bool>. With -O0 -DNDEBUG the timings are 5.46 s (raw pointers) and 11.1 s (std::vector). With -O2 -DNDEBUG, the timings are 2.02 s (raw pointers) and 2.21 s (std::vector). Same algorithm: one implementation uses 4/5 dynamic arrays of int and boolean, and the other uses std::vector<int> and std::vector<bool> instead. They are the same in every other respect.
You can see that at -O0 the raw-pointer version is about twice as fast as std::vector, while at -O2 they are almost the same.
But I am really confused, because in academic fields, when they publish running-time results for algorithms, they compile the programs with -O0.
Is there some compiler option I am missing?
It depends on what you want to optimize for.
Speed
I suggest using -O2 -DNDEBUG -ftree-vectorize, and if your code is designed to run specifically on x86 or x86_64, add -msse2. This will give you a broad idea of how it will perform after GCC's GIMPLE-level optimizations.
Size
I believe you should use -Os -fno-rtti -fno-exceptions -fomit-frame-pointer. This will minimize the size of the executable to a degree (assuming C++).
In both cases, algorithm's speed is not compiler dependent, but a compiler can drastically change the way the code behaves if it can "prove" it can.
GCC detects 'common' code such as hand-coded min() and max() and turns them into one SSE instruction (on x86/x86_64 when -msse is set) or uses cmov when i686 is available (SSE has higher priority). GCC will also take the liberty of reordering loops, unrolling and inlining functions if it wants to, and even removing useless code.
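As a small illustration of that point (my example, not from the answer), hand-written min/max patterns that optimizers typically recognize at -O2:
float fmax_hand(float a, float b) { return a > b ? a : b; }  // usually compiles to a single maxss on x86 with SSE
int   imin_hand(int a, int b)     { return a < b ? a : b; }  // usually cmp + cmov rather than a branch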
As for your latest edit:
You can see that at -O0 the raw-pointer version is about twice as fast as std::vector, while at -O2 they are almost the same.
That's because std::vector still has code that throws exceptions and may use RTTI. Try comparing with -O2 -DNDEBUG -ftree-vectorize -fno-rtti -fno-exceptions -fomit-frame-pointer, and you'll see that std::vector will be slightly better than your code. GCC knows what 'built-in' types are and how to exploit them in real-world use and will gladly do so - just like it knows what memset() and memcpy() do and how to optimize accordingly when the copy size is known.
The compiler optimizations usually won't change the complexity order of an algorithm, just the constant and the linear scale factor. Compilers are fairly smart, but they're not that smart.
Are you going to be compiling your code for release with just -O0? Probably not. You might as well compare the performance of the algorithms when compiled with whatever compilation flags you actually intend to use.
You have two algorithms implemented in C++. If you want to compare the relative performance of the two implementations then you should use the optimization level that you are going to use in your final product. For me, that's -O3.
If you want to analyse the complexity of an algorithm, then that's more of an analysis problem where you look at the overall count of operations that must be performed for different sizes and characteristics of inputs.
As a developer writing code where performance is an issue, it is a good idea to be aware of the range of optimizations that a compiler can, and is likely to, apply to your code. Not optimizing unfairly penalises code that is written clearly and designed to be easily optimized, relative to code that is already 'micro-optimized'.
I see no reason not to compile and run them both at -O2. Unless you're doing it as a purely academic exercise (and even then it's very unlikely the optimizations would produce fundamental changes in the properties of the algorithm - though I think I'd be happy if GCC started turning O(N) source into O(lg N) assembly), you'll want information that's consistent with what you would get when actually running the final program. You most likely won't be releasing the program built with -O0, so you don't want to compare the algorithms under -O0.
Such a comparison is less about fairness than producing useful information. You should use the optimization level that you plan to use when/if the code is put into production use. If you're basically doing research, so you don't personally plan to put it into production use, you're stuck with the slightly more difficult job of guessing what somebody who would put it into production would probably do.
Realistically, even if you are doing development, not research, you're stuck with a little of that anyway -- it's nearly impossible to predict what optimization level you might eventually use with this particular code.
Personally, I usually use -O2 with gcc. My general rule of thumb is to use the lowest level of optimization that turns on automatic inlining. I write a lot of my code with the expectation that small functions will be inlined by the compiler - and write the code specifically to assist with that (e.g. often using functors instead of functions). If the compiler isn't set to inline those, you're not getting what I really intended. The performance of the code when it's compiled that way doesn't really mean anything - I certainly would not plan on ever really using it that way.
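A short sketch of what I mean by functors helping inlining (my illustration, not part of the original answer):
#include <algorithm>
#include <vector>

struct Less { bool operator()(int a, int b) const { return a < b; } };

void sort_both(std::vector<int>& v, bool (*cmp)(int, int)) {
    std::sort(v.begin(), v.end(), Less{});   // comparator type is known, so it inlines easily
    std::sort(v.begin(), v.end(), cmp);      // indirect call unless the compiler can trace the pointer
}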