Can I improve branch prediction with my code?

This is a naive general question open to any platform, language, or compiler. Though I am most curious about Aarch64, C++, GCC.
When coding an unavoidable branch in program flow dependent on I/O state (compiler cannot predict), and I know that one state is much more likely than another, how do I indicate that to the compiler?
Is this better
if(true == get(gpioVal))
    unlikelyFunction();
else
    likelyFunction();
than this?
if(true == get(gpioVal))
    likelyFunction(); // performance critical, fill prefetch caches from this branch
else
    unlikelyFunction(); // missed prediction not consequential on this branch
Does it help if the communication protocol makes the more likely or critical value true(high), or false(low)?

TL;DR: Yes, in C or C++ use a likely() macro, or C++20 [[likely]], to help the compiler make better asm. That's separate from influencing actual CPU branch prediction, though. If writing in asm, lay out your code to minimize taken branches.
For most ISAs, there's no way in asm to hint the CPU whether a branch is likely to be taken or not. (Some exceptions include Pentium 4 (but not earlier or later x86), PowerPC, and some MIPS, which allow branch hints as part of conditional-branch asm instructions.)
Is it possible to tell the branch predictor how likely it is to follow the branch?
But not-taken straight-line code is cheaper than taken branches, so hinting the compiler to lay out code with the fast path contiguous doesn't help branch-prediction accuracy, but can help (or hurt) performance. (I-cache locality, front-end bandwidth: remember that code fetch happens in contiguous 16 or 32-byte blocks, so a taken branch means the later part of that fetch block isn't useful. Also, branch-prediction throughput: some CPUs, like Intel Skylake for example, can't handle predicted-taken branches at more than 1 per 2 clocks, other than loop branches. That includes unconditional branches like jmp or ret.)
Taken branches are hard; not-taken branches keep the CPU on its toes, but if the prediction is accurate it's just a normal instruction for an execution unit (verifying the prediction), with nothing special for the front-end. See also Modern Microprocessors: A 90-Minute Guide!, which has a section on branch prediction. (And is overall excellent.)
What exactly happens when a skylake CPU mispredicts a branch?
Avoid stalling pipeline by calculating conditional early
How does the branch predictor know if it is not correct?
Many people misunderstand source-level branch hints as branch prediction hints. That could be one effect if compiling for a CPU that supports branch hints in asm, but for most the significant effect is in layout, and deciding whether to use branchless (cmov) or not; a [[likely]] condition also means it should predict well.
With some CPUs, especially older ones, layout of a branch did sometimes influence runtime prediction: if the CPU didn't remember anything about the branch in its dynamic predictors, the standard static prediction heuristic is that forward conditional branches are not-taken and backward conditional branches are assumed taken (because that's normally the bottom of a loop). See the BTFNT section in https://danluu.com/branch-prediction/.
A compiler can lay out an if(c) x else y; either way: either matching the source, with a jump over x if !c as the first instruction, or swapping the if and else blocks and using the opposite branch condition. Or it can put one block out-of-line (e.g. after the ret at the end of the function) so the fast path has no taken branches, conditional or otherwise, while the less likely path has to jump there and then jump back.
It's easy to do more harm than good with branch hints in high-level source, especially if surrounding code changes without paying attention to them, so profile-guided optimization is the best way for compilers to learn about branch predictability and likelihood. (e.g. gcc -O3 -fprofile-generate / run with some representative inputs that exercise code-paths in relevant ways / gcc -O3 -fprofile-use)
But there are ways to hint in some languages, like C++20 [[likely]] and [[unlikely]], which are the portable version of GNU C likely() / unlikely() macros around __builtin_expect.
https://en.cppreference.com/w/cpp/language/attributes/likely C++20 [[likely]]
How to use C++20's likely/unlikely attribute in if-else statement syntax help
Is there a compiler hint for GCC to force branch prediction to always go a certain way? (to the literal question, no. To what's actually wanted, branch hints to the compiler, yes.)
How do the likely/unlikely macros in the Linux kernel work and what is their benefit? The GNU C macros using __builtin_expect, same effect but different syntax than C++20 [[likely]]
What is the advantage of GCC's __builtin_expect in if else statements? example asm output. (Also see CiroSantilli's answers to some of the other questions where he made examples.)
Simple example where [[likely]] and [[unlikely]] affect program assembly?
I don't know of ways to annotate branches for languages other than GNU C / C++, and ISO C++20.
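As a rough illustration of the two syntaxes mentioned above, here is a minimal sketch. The names likelyFunction, unlikelyFunction, dispatchAttr and dispatchMacro are hypothetical stand-ins (the handlers just return a marker value so the result can be checked), and the LIKELY / UNLIKELY macros are one common way to wrap __builtin_expect, not the only one. Whether either form changes the generated asm depends on the compiler, version, and optimization level.

```cpp
// GNU C style hint macros built on __builtin_expect (GCC and Clang).
// On other compilers they fall back to the plain condition (assumption:
// no equivalent builtin is available there).
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

// Hypothetical stand-ins for the question's two handlers.
inline int likelyFunction()   { return 1; }  // hot path
inline int unlikelyFunction() { return 2; }  // cold path

// C++20 attribute form: marks the true branch as the common case.
inline int dispatchAttr(bool gpioVal) {
    if (gpioVal) [[likely]] {
        return likelyFunction();
    } else {
        return unlikelyFunction();
    }
}

// GNU C macro form: same intent, usable before C++20.
inline int dispatchMacro(bool gpioVal) {
    if (UNLIKELY(!gpioVal)) {
        return unlikelyFunction();
    }
    return likelyFunction();  // fall-through fast path
}
```

Both functions behave identically; the hint only tells the compiler which side to treat as the fast path when it decides on layout. Comparing the asm with and without the hint (e.g. on a compiler explorer) is the way to see whether it had any effect.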
Absent any hints or profile data
Without that, optimizing compilers have to use heuristics to guess which side of a branch is more likely. If it's a loop branch, they normally assume that the loop will run multiple times. On an if, they have some heuristics based on the actual condition and maybe what's in the blocks being controlled; IDK I haven't looked into what gcc or clang do.
I have noticed that GCC does care about the condition, though. It's not as naive as assuming that int values are uniformly randomly distributed, although I think it normally assumes that if (x == 10) foo(); is somewhat unlikely.
JIT compilers like in a JVM have an advantage here: they can potentially instrument branches in the early stages of running, to collect branch-direction information before making final optimized asm. OTOH they need to compile fast because compile time is part of total run time, so they don't try as hard to make good asm, which is a major disadvantage in terms of code quality.

Related

optimisation advice on value clamping in a loop

I have a tight loop exactly like what Chandler Carruth presented in CPP CON 2017:
https://www.youtube.com/watch?v=2EWejmkKlxs
at 25 mins in this video, there is a loop like this:
for (int& i : v)
    i = i > 255 ? 255 : i;
where v is a vector. This is exactly the same code used in my program which after profiling, proves to take a good amount of time.
In his presentation, Chandler modified the assembly and speed up the loop. My question is, in practice, in a production code, what is the recommended approach to optimise this? Shall we use inline assembly in c++ code? Or like Chandler did, compile C++ code into assembly then optimise the assembler?
An example to optimise the above for loop will be really appreciated, assuming x86 architecture.

My question is, in practice, in a production code, what is the recommended approach to optimise this? Shall we use inline assembly in c++ code? Or like Chandler did, compile C++ code into assembly then optimise the assembler?
For production code you need to consider that the software might be compiled and linked in an automatic build system.
How would you want to apply the code changes to assembler code in such a system? You might apply a diff file, but that might break if optimisation (or other) settings are changed, if switching to another compiler or ...
Now remaining two options: write the entire function in an assembler file (.s) or have inline assembler code inside the C++ code – the latter possibly with the advantage of keeping related code in the same translation unit.
Still I'd let the compiler generate assembler code once – with highest optimisation level available. This code can then serve as a (already pre-optimised) base for your hand-made optimisations, of which the outcome should then be pasted back as inline assembly to the C++ source file or placed into a separate assembly source file.

Chandler modified the compiler's asm output because that's an easy way to do a one-off experiment to find out whether a change would be useful, without doing all the stuff you'd normally want to include an asm loop or function as part of the source code for a project.
Compiler-generated asm is usually a good starting point for an optimized loop, but actually keeping the whole file as-is isn't a good or even viable way to actually maintain an asm implementation of a loop as part of a program. See @Aconcagua's answer.
Plus it defeats the purpose of having any other functions in the file written in C++ and being available for link-time optimization.
Re: actually clamping:
Note that Chandler was just experimenting with changes to the non-vectorized code-gen, and disabled unrolling + auto-vectorization. In real life hopefully you can target SSE4.1 or AVX2 and let the compiler auto-vectorize with pminsd or pminud for signed or unsigned int clamping to an upper bound. (Also available in other element sizes.) Or without SSE4.1, with just SSE2, maybe you can do 2x packssdw => packuswb (unsigned saturation), then unpack with zeros back up to 4 vectors of dword elements (if you can't just use an output of uint8_t[]!).
And BTW, in the comments of the video, Chandler said it turns out that he made a mistake and the effect he was seeing wasn't really due to a predictable branch vs. a cmov. It might have been a code-alignment thing, because changing from mov %ebx, (%rdi) to movl $255, (%rdi) made a difference!
(AMD CPUs aren't known to have register-read stalls the way P6-family did, and should have no trouble hiding the dep chain of a cmov coupling a store to a load vs. breaking it with branch prediction + speculation past a branch.)
You very rarely would actually want to use a hand-written loop. Often you can hand-hold and/or trick your compiler into making asm more like what you want, just by modifying the C++ source. Then a future compiler is free to tune differently for -march=some_future_cpu.
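As one example of steering the compiler from the C++ source, the clamp can be written with std::min so the intent (a plain minimum) is explicit. This is a minimal sketch; the function name clampTo255 is hypothetical, and whether the compiler actually emits a vector min instruction depends on the compiler version, optimization level, and target flags such as -march.

```cpp
#include <algorithm>
#include <vector>

// Clamp every element to an upper bound of 255.
// Expressed as a minimum, this is a pattern auto-vectorizers commonly
// recognize, but the resulting asm still needs to be checked.
void clampTo255(std::vector<int>& v) {
    for (int& i : v) {
        i = std::min(i, 255);
    }
}
```

This stays ordinary C++, so a future compiler remains free to tune it differently for a newer target.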

Coding for ARM NEON: How to start?

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in C++ environment?
I use Eclipse IDE in Linux Gentoo to write C++ code.
UPDATE
After reading the answers I did some tests with the software. I compiled my project with the following flags:
-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon
Keep in mind that this project includes extensive libraries such as open frameworks, OpenCV, and OpenNI, and everything was compiled with these flags.
To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3.
Would you expect this to improve the performance of the project? Because we experienced no changes at all, which is rather weird considering all the answers I read here.
Another question: all the for cycles have an apparent number of iterations, but many of them iterate through custom data types (structs or classes). Can GCC optimize these cycles even though they iterate through custom data types?

EDIT:
From your update, you may misunderstand what the NEON processor does. It is a SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at applying an instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like those things, the NEON processor is going to be a huge help.
To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.
Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.
For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
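To make the point about data format concrete, here is a minimal sketch contrasting an array-of-structs layout with a struct-of-arrays layout. All names (PointAoS, PointsSoA, toSoA, scaleX) are hypothetical and not taken from the question's project; the point is only that the loop in scaleX walks one contiguous float array, which is the shape a vectorizer can work with.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs element: x values are interleaved with other fields.
struct PointAoS {
    float x, y, z;
    int id;
};

// Struct-of-arrays: each field lives in its own contiguous block.
struct PointsSoA {
    std::vector<float> x, y, z;
    std::vector<int> id;
};

// Convert once, up front, so hot loops can run on contiguous data.
PointsSoA toSoA(const std::vector<PointAoS>& in) {
    PointsSoA out;
    out.x.reserve(in.size());
    out.y.reserve(in.size());
    out.z.reserve(in.size());
    out.id.reserve(in.size());
    for (const PointAoS& p : in) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.id.push_back(p.id);
    }
    return out;
}

// Scale all x coordinates: a simple loop over one contiguous float array.
void scaleX(PointsSoA& pts, float k) {
    float* x = pts.x.data();
    const std::size_t n = pts.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= k;
    }
}
```

The conversion itself costs time and memory, so it only pays off when the data is then processed many times in this form.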
The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.
My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)
That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.
If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), were enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, they were insufficient enough that when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)
ARM Assembly language
A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
ARM NEON support in the ARM compiler
Coding for NEON
One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.

Here's a "me too" with some blog posts from ARM. First, start with the following to get the background information, including 32-bit ARM (ARMv7 and below), Aarch32 (ARMv8 32-bit ARM) and Aarch64 (ARMv8 64-bit ARM):
ARM NEON programming quick reference
Second, check out the Coding for NEON series. It's a nice introduction with pictures, so things like interleaved loads make sense at a glance.
Coding for NEON - Part 1: Load and Stores
Coding for NEON - Part 2: Dealing With Leftovers
Coding for NEON - Part 3: Matrix Multiplication
Coding for NEON - Part 4: Shifting Left and Right
Coding for NEON - Part 5: Rearranging Vectors
I also went on Amazon looking for some books on ARM assembly with a treatment of NEON. I could only find two, and neither book's treatment of NEON was impressive. They reduced it to a single chapter with the obligatory matrix example.
I believe ARM intrinsics are a very good idea. The intrinsics allow you to write code for GCC, Clang and Visual C/C++ compilers. We have one code base that works for ARM Linux distros (like Linaro), some iOS devices (using -arch armv7) and Microsoft gadgets (like Windows Phone and Windows Store Apps).

If you have access to a reasonably modern GCC (GCC 4.8 and upwards) I would recommend giving intrinsics a go. The NEON intrinsics are a set of functions that the compiler knows about, which can be used from C or C++ programs to generate NEON/Advanced SIMD instructions. To gain access to them in your program, it is necessary to #include <arm_neon.h>. The verbose documentation of all available intrinsics is available at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf , but you may find more user-friendly tutorials elsewhere online.
Advice on this site is generally against the NEON intrinsics, and certainly there are GCC versions which have done a poor job of implementing them, but recent versions do reasonably well (and if you spot bad code generation, please do raise it as a bug - https://gcc.gnu.org/bugzilla/ )
They are an easy way to program to the NEON/Advanced SIMD instruction set, and the performance you can achieve is often rather good. They are also "portable", in that when you move to an AArch64 system, a superset of the intrinsics you can use from ARMv7-A are available. They are also portable across implementations of the ARM architecture, which can vary in their performance characteristics, but which the compiler will model for performance tuning.
The principal benefit of the NEON intrinsics over hand-written assembly is that the compiler can understand them when performing its various optimization passes. By contrast, hand-written assembler is an opaque block to GCC and will not be optimized. On the other hand, expert assembler programmers can often beat the compiler's register allocation policies, particularly when using the instructions which write to or read from multiple consecutive registers.
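For a feel of what intrinsics code looks like, here is a minimal sketch of an element-wise multiply-add over float arrays. The function name mulAdd is hypothetical. The NEON path is guarded by __ARM_NEON so the same file still builds (as a plain scalar loop) on targets without NEON; the scalar tail also handles lengths that are not a multiple of 4.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#  include <arm_neon.h>
#endif

// Computes out[i] = a[i] * b[i] + c[i] for i in [0, n).
void mulAdd(const float* a, const float* b, const float* c,
            float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Process 4 floats per iteration with 128-bit NEON registers.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        // vmlaq_f32(acc, x, y) computes acc + x * y in each lane.
        vst1q_f32(out + i, vmlaq_f32(vc, va, vb));
    }
#endif
    // Scalar loop: leftovers on NEON targets, everything elsewhere.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

A loop this simple is also a good candidate for auto-vectorization, so it is worth comparing against the plain scalar version before keeping the intrinsics.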

In addition to Wally's answer - and this should probably be a comment, but I couldn't make it short enough: ARM has a team of compiler developers whose entire role is to improve the parts of GCC and Clang/LLVM that do code generation for ARM CPUs, including features that provide "auto-vectorization". I have not looked deeply into it, but from my experience with x86 code generation, I'd expect that for anything that is relatively easy to vectorize, the compiler should do a decent job. For some code it is hard for the compiler to tell whether it can vectorize or not, and it may need some "encouragement" - such as unrolling loops or marking conditions as "likely" or "unlikely", etc.
Disclaimer: I work for ARM, but have very little to do with the compilers or even CPUs, as I work for the group that does graphics (where I have some involvement with compilers for the GPUs in the OpenCL part of the GPU driver).
Edit:
Performance, and the use of various instruction extensions, really depends on EXACTLY what the code is doing. I'd expect that libraries such as OpenCV are already doing a fair amount of clever stuff in their code (such as both handwritten assembler and compiler intrinsics, and generally code that is designed to allow the compiler to already do a good job), so it may not really give you much improvement. I'm not a computer vision expert, so I can't really comment on exactly how much such work is done on OpenCV, but I'd certainly expect the "hottest" points of the code to have been fairly well optimised already.
Also, profile your application. Don't just fiddle with optimisation flags; measure its performance and use a profiling tool (e.g. the Linux "perf" tool) to measure WHERE your code is spending time. Then see what can be done to that particular code. Is it possible to write a more parallel version of it? Can the compiler help, or do you need to write assembler? Is there a different algorithm that does the same thing but in a better way, etc, etc...
Although tweaking compiler options CAN help, and often does, it can give tens of percent, where a change in algorithm can often lead to 10 times or 100 times faster code - assuming of course, your algorithm can be improved!
Understanding what part of your application is taking the time, however, is KEY. It's no point in changing things to make the code that takes 5% of the time 10% faster, when a change somewhere else could make a piece of code that is 30 or 60% of the total time 20% faster. Or optimise some math routine, when 80% of the time is spent on reading a file, where making the buffer twice the size would make it twice as fast...

Although a long time has passed since I submitted this question, I realize that it gathers some interest and I decided to tell what I ended up doing regarding this.
My main goal was to optimize a for-loop which was the bottleneck of the project. So, since I don't know anything about Assembly I decided to give NEON intrinsics a go. I ended up having a 40-50% gain in performance (in this loop alone), and a significant overall improvement in performance of the whole project.
The code does some math to transform a bunch of raw distance data into distance to a plane in millimetres. I use some constants (like _constant05, _fXtoZ) that are not defined here, but they are just constant values defined elsewhere.
As you can see, I'm doing the math for 4 elements at a time, talk about real parallelization :)
unsigned short* frameData = frame.ptr<unsigned short>(_depthLimits.y, _depthLimits.x);
unsigned short step = _runWidth - _actWidth; // because a ROI is being processed, not the whole image
cv::Mat distToPlaneMat = cv::Mat::zeros(_runHeight, _runWidth, CV_32F);
float* fltPtr = distToPlaneMat.ptr<float>(_depthLimits.y, _depthLimits.x); // a pointer to the start of the data
for (unsigned short y = _depthLimits.y; y < _depthLimits.y + _depthLimits.height; y++)
{
    // Four pixels are handled per iteration, so this presumably relies on the ROI width being a multiple of 4.
    for (unsigned short x = _depthLimits.x; x < _depthLimits.x + _depthLimits.width - 1; x += 4)
    {
        float32x4_t projX = {(float)x, (float)(x + 1), (float)(x + 2), (float)(x + 3)};
        float32x4_t projY = {(float)y, (float)y, (float)y, (float)y};
        uint16x4_t framePixels = vld1_u16(frameData); // load 4 raw 16-bit depth values
        float32x4_t floatFramePixels = {(float)framePixels[0], (float)framePixels[1], (float)framePixels[2], (float)framePixels[3]};
        float32x4_t fNormalizedY = vmlsq_f32(_constant05, projY, _yResInv); // 0.5 - y * yResInv
        float32x4_t auxfNormalizedX = vmulq_f32(projX, _xResInv);
        float32x4_t fNormalizedX = vsubq_f32(auxfNormalizedX, _constant05); // x * xResInv - 0.5
        float32x4_t realWorldX = vmulq_f32(fNormalizedX, floatFramePixels);
        realWorldX = vmulq_f32(realWorldX, _fXtoZ);
        float32x4_t realWorldY = vmulq_f32(fNormalizedY, floatFramePixels);
        realWorldY = vmulq_f32(realWorldY, _fYtoZ);
        float32x4_t realWorldZ = floatFramePixels;
        realWorldX = vsubq_f32(realWorldX, _tlVecX);
        realWorldY = vsubq_f32(realWorldY, _tlVecY);
        realWorldZ = vsubq_f32(realWorldZ, _tlVecZ);
        float32x4_t distAuxX, distAuxY, distAuxZ;
        distAuxX = vmulq_f32(realWorldX, _xPlane);
        distAuxY = vmulq_f32(realWorldY, _yPlane);
        distAuxZ = vmulq_f32(realWorldZ, _zPlane);
        float32x4_t distToPlane = vaddq_f32(distAuxX, distAuxY);
        distToPlane = vaddq_f32(distToPlane, distAuxZ);
        vst1q_f32(fltPtr, distToPlane); // store all 4 results at once
        frameData += 4;
        fltPtr += 4;
    }
    frameData += step; // skip the pixels outside the ROI
    fltPtr += step;
}

If you don't want to mess with assembly code at all, then tweak the compiler flags to maximally optimize for speed. gcc given the proper ARM target should do this provided the number of loop iterations is apparent.
To check gcc code generation, request assembly output by adding the -S flag.
If after several tries (of reading the gcc documentation and tweaking flags) you still can't get it to produce the code you want, then take the assembly output and edit it to your satisfaction.
Beware of premature optimization. The proper development order is to get the code functional, then see if it needs optimization. Only when the code is stable does it make sense to do so.

Play with some minimal assembly examples on QEMU to understand the instructions
The following setup does not have many examples yet, but it serves as a neat playground:
v7 examples
v8 examples
setup usage
The examples run in QEMU user mode, which dispenses with the need for extra hardware, and GDB works just fine.
The asserts are done through the C standard library.
You should be able to easily extend that setup with new instructions as you learn them.
ARM intrinsics in particular were asked about at: Is there a good reference for ARM Neon intrinsics?

Benefit of LLVM's SelectInst

LLVM has a SelectInst that is used to represent expressions like something = cond ? true-part : false-part.
What is the benefit of this instruction in the IR, as ?: could also always be lowered to a BranchInst by the compiler? Are there CPUs that support such instructions? Or is select lowered to jumps by the CodeGenerator anyway?
I reckon there may be benefits for analysis passes as the select guarantees two "branches" of the implicit if. But on the other hand, compilers are not required to use the instruction at all, so these passes must be able to deal with brs anyway.

Yes, you can always use a conditional branch instead of a select instruction, but a select has several advantages:
There are indeed relevant CPU instructions to lower those into, the most obvious example in x86 being cmov and the various setcc instructions.
A select is a lot easier to vectorize - in fact, one of the usual phases of vectorization is "if conversion", the process of converting control flow (a conditional branch) to data flow (a select).
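The "if conversion" idea can be sketched at the source level. In the sketch below (function names sumAboveBranchy and sumAboveSelect are hypothetical), the first loop uses control flow to decide whether the add happens, while the second always adds and uses a ?: to pick the operand, which is the data-flow shape a select expresses. A compiler may well turn the first form into the second on its own, so this only illustrates the transformation, not what any particular LLVM pass will emit.

```cpp
#include <cstddef>

// Control-flow form: a conditional branch guards the addition.
int sumAboveBranchy(const int* v, std::size_t n, int threshold) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > threshold) {
            sum += v[i];
        }
    }
    return sum;
}

// Data-flow form: the addition always happens; a select chooses
// between the element and 0 as the value to add.
int sumAboveSelect(const int* v, std::size_t n, int threshold) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += (v[i] > threshold) ? v[i] : 0;
    }
    return sum;
}
```

In the second form every iteration does the same work, so the loop body maps directly onto a vector compare plus a masked or blended add.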

Producing the fastest possible executable

I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.

Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization options, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially if it gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".

Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.

1) Reduce aliasing by using __restrict.
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.
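For point 1, a minimal sketch of __restrict on pointer parameters follows. The function name scaleInto is hypothetical. The keyword is a promise from the programmer that out and in never overlap; the compiler does not check it, and calling the function with overlapping buffers is undefined behaviour.

```cpp
#include <cstddef>

// out[i] = in[i] * scale. __restrict tells the compiler that the two
// buffers do not alias, so it does not have to assume a store to out[i]
// could change a later in[j].
void scaleInto(float* __restrict out, const float* __restrict in,
               std::size_t n, float scale) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * scale;
    }
}
```

The __restrict spelling is accepted by MSVC, GCC and Clang; whether it changes the generated code for a given loop is something to check in the asm or with a benchmark.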

You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There are no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom

Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.

Check which floating-point (/fp) mode you are using. Each one generates quite different code and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This will result in quite a big difference in performance, as the compiled code will use far fewer cycles. Details are nicely covered in the blog SomeAssemblyRequired
Since you are already profiling, I would suggest loop unrolling if it is not happening. I have seen VS2008 not doing it more frequently (templates, references etc..)
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.

You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.

I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.

You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.
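A rough sketch of the "macro parallelisation" idea, assuming the task is something as simple as summing a set of integers. The function parallel_sum and its parameters are hypothetical. Each thread gets its own chunk and its own result slot, so no locking is needed until the partial results are combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: split 'data' into num_threads chunks and sum
// each chunk on its own thread.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads)
{
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);  // one slot per thread
    std::vector<std::thread> workers;
    // Chunk size rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Reads a disjoint slice, writes only this thread's slot.
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        });
    }
    for (auto& w : workers)
        w.join();  // wait for every chunk before combining
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

On a two-core CPU you would call parallel_sum(values, 2). Some toolchains need -pthread when linking. This only works because the chunks are independent, which is the "not all algorithms can be processed like this" caveat above.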

You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.

What language/platform would you recommend for CPU-bound application?

I'm developing a non-interactive CPU-bound application which does only computations, almost no IO. Currently it runs for too long, and while I'm working on improving the algorithm, I'm also wondering whether changing language or platform would give any benefit. Currently it is C++ (no OOP so it is almost C) on Windows, compiled with the Intel C++ compiler. Can switching to ASM help and how much? Can switching to Linux and GCC help?

Just to be thorough: the first thing to do is to gather profile data and the second thing to do is consider your algorithms. I'm sure you know that, but they've got to be #included into any performance-programming discussion.
To be direct about your question "Can switching to ASM help?" the answer is "If you don't know the answer to that, then probably not." Unless you're very familiar with the CPU architecture and its ins and outs, it's unlikely that you'll do a significantly better job than a good optimizing C/C++ compiler on your code.
The next point to make is that significant speed-ups in your code (aside from algorithmic improvements) will almost certainly come from parallelism, not linear increases. Desktop machines can now throw 4 or 8 cores at a task, which has much more performance potential than a slightly better code generator. Since you're comfortable with C/C++, OpenMP is pretty much a no-brainer; it's very easy to use to parallelize your loops (obviously, you have to watch loop-carried dependencies, but it's definitely "the simplest parallelism that could possibly work").
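As a minimal sketch of what that looks like (the function dot is a made-up example, not from the question): the loop below carries sum from one iteration to the next, which is exactly the kind of loop-carried dependency you have to watch. The reduction clause is the standard OpenMP way to handle it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: dot product with an OpenMP-parallel loop.
// Build with OpenMP enabled (e.g. -fopenmp on GCC, /openmp on MSVC).
// Without it the pragma is ignored and the loop just runs serially.
double dot(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double sum = 0.0;

    // 'sum' is updated by every iteration; reduction(+:sum) gives each
    // thread a private copy and adds the copies together at the end.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[static_cast<std::size_t>(i)] * y[static_cast<std::size_t>(i)];

    return sum;
}
```

Note that a parallel reduction may add the terms in a different order than the serial loop, so floating-point results can differ in the last bits.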
Having said all that, code generation quality does vary between C/C++ compilers. The Intel C++ compiler is well-regarded for its optimization quality and has full support not just for OpenMP but for other technologies such as the Threading Building Blocks.
Moving into the question of what programming languages might be even better than C++, the answer would be "programming languages that actively promote / facilitate concepts of parallelism and concurrent programming." Erlang is the belle of the ball in that regard, and is a "hot" language right now and most people interested in performance programming are paying at least some attention to it, so if you want to improve your skills in that area, you might want to check it out.

It's always algorithm, rarely language. Here's my clue: "while I'm working on improving the algorithm".
Tweaking may not be enough.
Consider radical changes to the algorithm. You've got to eliminate processing, not make the processing go faster. The culprit is often "search" -- looping through data looking for something. Find ways to eliminate search. If you can't eliminate it, replace linear search with some kind of tree search or a hash map of some kind.
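To illustrate the last point with a small sketch (function names are hypothetical): both functions below count how many queries occur in a data set. The first rescans the data for every query. The second builds a hash set once and then answers each query in roughly constant time.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Linear search per query: about N * Q comparisons in total.
std::size_t count_hits_linear(const std::vector<int>& haystack,
                              const std::vector<int>& queries)
{
    std::size_t hits = 0;
    for (int q : queries)
        if (std::find(haystack.begin(), haystack.end(), q) != haystack.end())
            ++hits;
    return hits;
}

// Same answer, but the search is replaced by a hash lookup:
// about N + Q operations on average.
std::size_t count_hits_hashed(const std::vector<int>& haystack,
                              const std::vector<int>& queries)
{
    const std::unordered_set<int> index(haystack.begin(), haystack.end());
    std::size_t hits = 0;
    for (int q : queries)
        hits += index.count(q);  // count() is 0 or 1 for a set
    return hits;
}
```

For small data sets the linear version can still win because of the cost of building the set, so this is worth measuring rather than assuming.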

Switching to ASM is not going to help much, unless you're very good at it and/or have a specific critical path routine which you know you can do better. As several people have remarked, modern compilers are just better in most cases at taking advantage of caching/etc. than anyone can do by hand.
I'd suggest:
Try a different compiler, and/or different optimization options
Run a code coverage/analysis utility, and figure out where the critical paths are, and work on optimizing those in the code
C++ should be able to give you very near the best possible performance from the code, so I wouldn't recommend switching the language. As another suggestion, depending on the app, you may be able to get better performance on multi-core/multi-processor systems by using multiple threads.

While just switching to asm won't give any benefits, since the Intel C++ Compiler is likely better at optimizing than you, you can try one of the following options:
Try a compiler that will parallelize your code, like the VectorC compiler.
Try to switch to asm with heavy use of MMX, 3DNow!, SSE or whatever fits your needs (and your CPU). This will give more of a benefit than pure asm.
You can also try GPGPU, i.e. execute large parts of your algorithm on a GPU instead of a CPU. Depending on your algorithm, it can be dramatically faster.
Edit: I also second the profile approach. I recommend AQTime, which supports the Intel C++ compiler.

Personally I'd look at languages which allow you to take advantage of parallelism most easily, unless it's a thoroughly non-parallelisable situation. Being able to bolt on some extra cores and get (if possible!) near-linear improvement may well be a lot more cost-effective than squeezing the extra few percent of efficiency out.
When it comes to parallelisation, I believe functional languages are often regarded as the best way to go, or you could look at OpenMP for C/C++. (Personally, as a managed language guy, I'd be looking at libraries for Java/.NET, but I quite understand that not everyone has the same preferences!)

Try Fortran 77 - when it comes to computations still nothing beats the granddaddy of programming languages. Also, try it with OpenMP to take advantage of multiple cores.

Hand optimizing your ASM code compared to what C++ can do for you is rarely cost effective.
If you've done everything you can to the algorithm from a traditional algorithmic view, and you've also eliminated excesses, then you may either be SOL, or you can consider optimizing your program from a hardware point of view.
For example, any time you follow a pointer around the heap you are paying a huge cost due to cache misses, possibly paging, etc., which all affect branching predictions. Most programmers (even C gurus) tend to look at the CPU from the functional standpoint rather than what happens behind the scenes. Sometimes reorganizing memory, for example by "flattening" or manually allocating memory to fit on the same page can obtain ENORMOUS speedups. I managed to get 2X speedups on graph traversals just by flattening my structures.
These are not things that your compiler will do for you since they are based on your high-level understanding of the program.
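One possible shape of the "flattening" mentioned above, as a sketch only (the FlatGraph type and function names are my own, not the code behind the 2X figure): instead of one heap allocation per node, all edges are packed into a single contiguous array, so walking a node's neighbours touches adjacent memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened graph in a compressed-sparse-row style layout.
struct FlatGraph {
    std::vector<std::size_t> offsets;  // edges of v are targets[offsets[v] .. offsets[v+1])
    std::vector<int> targets;          // neighbour ids for all nodes, back to back
};

// Build the flat layout from a conventional vector-of-vectors adjacency list.
FlatGraph flatten(const std::vector<std::vector<int>>& adjacency)
{
    FlatGraph g;
    g.offsets.reserve(adjacency.size() + 1);
    g.offsets.push_back(0);
    for (const auto& neighbours : adjacency) {
        g.targets.insert(g.targets.end(), neighbours.begin(), neighbours.end());
        g.offsets.push_back(g.targets.size());
    }
    return g;
}

// Sum of the neighbour ids of node v: one linear walk over a contiguous slice.
long long neighbour_sum(const FlatGraph& g, std::size_t v)
{
    assert(v + 1 < g.offsets.size());
    long long sum = 0;
    for (std::size_t e = g.offsets[v]; e < g.offsets[v + 1]; ++e)
        sum += g.targets[e];
    return sum;
}
```

The trade-off is that this layout is awkward to modify after it is built, so it suits graphs that are constructed once and traversed many times.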

As lobrien said, you haven't given us any information to tell you if hand-optimized ASM code would help... which means the answer is probably, "not yet."
Have you run your code with a profiler?
Do you know if the code is slow because of memory constraints or processor constraints?
Are you using all your available cores?
Have you identified any algorithms you're using that aren't O(1)? Can you get them to O(1)? If not, why not?
If you've done all that, how much control do you have over the environment your program is running in? (presumably a lot if you're thinking of switching operating systems) Can you disable other processes, give your process highest priority, etc? What about just finding a machine with a faster processor, more cores, or more memory (depending on what you're constrained on)
And on and on.
If you've already done all that and more, it's certainly possible you'll get to a point where you think, "I wonder if these few lines of code right here could be optimized better than the assembly that I'm looking at in the debugger right now?" And at that point you can ask specifically.
Good luck! You're solving a problem that's fun to solve.

Sometimes you can find libraries that have optimized implementations of the algorithms you care about. Often times they will have done the multithreading for you.
For example switching from LINPACK to LAPACK got us a 10x speed increase in LU factorization/solve with a good BLAS library.

First, figure out if you can change the algorithm, as S.Lott suggested.
Assuming the algorithm choice is correct, you might look at the memory access patterns if you are processing a lot of data. A lot of number-crunching applications these days are bound by the memory bus, not by the ALU(s). I recently optimized some code that was of the form:
    // Assume N is a big number
    for (int i=0; i<N; i++) {
        myArray[i] = dosomething(i);
    }
    for (int i=0; i<N; i++) {
        myArray[i] = somethingElse(myArray[i]);
    }
    ...
and converted it to look like:
    for (int i=0; i<N; i++) {
        double tmp = dosomething(i);
        tmp = somethingElse(tmp);
        ...
        myArray[i] = tmp;
    }
    ...
In this particular case, this yielded about a 2x speedup.

As Oregonghost already hinted, the VectorC compiler might help. It does not really parallelize the code, though; instead you can use it to take advantage of extended instruction sets like MMX or SSE. I used it for the most time-critical parts of a software rendering engine and it resulted in a speedup of about 150%-200% on most processors.

For an alternative approach, you could look into Distributed Computing which sounds like it could suit your needs.

If you're sticking with C++ on the intel compiler, take a look at the compiler intrinsics (full reference here). I know that VC++ has similar functionality, and I'm sure you can do the same thing with gcc. These can let you take full advantage of the parallelism built into your CPU. You can use the MMX, SSE and SSE2 instructions to improve performance to a degree. Like others have said, you're probably best looking at the algorithm first.
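For a taste of what intrinsics look like, here is a small sketch that adds two integer arrays four elements at a time with SSE2. The function name add_i32 and the EXAMPLE_HAVE_SSE2 macro are made up for this example. The _mm_* calls come from <emmintrin.h>, and the code falls back to a plain loop when SSE2 is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define EXAMPLE_HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for 32-bit integers.
void add_i32(const std::int32_t* a, const std::int32_t* b,
             std::int32_t* out, std::size_t n)
{
    std::size_t i = 0;
#ifdef EXAMPLE_HAVE_SSE2
    // Four 32-bit lanes per 128-bit register. Unaligned loads and stores
    // are used so the caller need not guarantee 16-byte alignment.
    for (; i + 4 <= n; i += 4) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi32(va, vb));
    }
#endif
    // Scalar tail (and the whole job on targets without SSE2).
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

A loop this simple is usually auto-vectorized by a good optimizing compiler anyway, so hand-written intrinsics tend to pay off only where the compiler demonstrably fails to vectorize.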

I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand, maybe what you are trying to calculate just takes a lot of computing time. Have you considered making it distributed so it can run on a cluster of some sort? If you want to focus on pure code optimization by introducing assembler for your inner loops, that can often be very beneficial (if you know what you're doing).

For modern processors, learning ASM will take you a long time. Further, with all the different versions of SSE around, your code will end up very processor-dependent.
I do quite a lot of CPU-bound work, and have found that the difference between intel's C++ compiler and g++ usually isn't that big (at most 15% or so), and there is no measurable difference between Mac OS X, Windows and Linux.
You are going to have to optimise your code and improve your algorithm by hand. There is no "magic fairy dust" which can make existing code that much faster I'm afraid.
If you haven't yet, and you care about performance, you MUST run your code through a good profiler (personally, I like kcachegrind & valgrind on Linux, or Shark on Mac OS X. I don't know what is good for windows I'm afraid).
Based on my past experience, there is a very good chance you'll find some method is taking 95% of your CPU time, and some simple change or addition of caching will make a massive improvement to your performance. On a similar note, if some method is only taking 1% of your CPU time, no amount of optimising is going to gain you anything.

The 2 obvious answers to "CPU-bound" are:
1. Use more CPU (core)s
2. Use something else.
Using 2 threads instead of 1 will cut the time spent by up to 50%. In comparison, going from C++ to ASM rarely gives you 5% (and for novice ASM programmers, it's often -5%!). Some problems scale well, and may benefit from 8 or 16 cores. That kind of hardware is still pretty mainstream, so see if your problems fall in that category.
The other solution is to throw more specialized hardware at the task. This could be the vector unit of your CPU - considering Windows=x86/x64, that's going to be a flavor of SSE. Another kind of vector hardware is the modern GPU. The GPU also has its own memory bus, which is quite speedy.

First get the lead out. Then if it's as fast as it can possibly be without going to ASM, so be it. But thinking you have to go to ASM assumes you know what's making it slow, and I'll bet a donut that you're guessing.

If you feel you have optimized your code to the point where there is no further improvement, increase the number of CPUs. This can be done on different platforms. One I develop with is Appistry. A few links:
http://www.appistry.com/resource-library/index.html
and you can download the product free from here:
http://www.appistry.com/developers/
I work for Appistry and we have done many installations for tasks that were CPU-bound by spreading work out over tens or hundreds of machines.
Hope this helps,
-Brett

These may be of some small help:
Optimization of 64-bit programs
AMD64 (EM64T) architecture
Debugging and optimization of multi-thread OpenMP-programs
Introduction into the problems of developing parallel programs
Development of Resource-intensive Applications in Visual C++

Linux
Switching to Linux can help, if you strip it down to only the parts you actually need.

CrowdProcess has about 2000 workers you can use to compute your algorithm. The API is extremely simple and we've been observing speedups close to the number of workers. Also you can write Javascript which should make you more productive than C++ or ASM.
So if you're choosing between C++ and ASM, I'd say you should first use all your CPU cores; then, if that's not enough, CrowdProcess should be an interesting platform.
Disclaimer: I built CrowdProcess.

It is hard to produce ASM code that is faster than naive C or C++ code. Even if you do the job really well, you will probably gain no more than a few percent; a 10% speedup is considered a great success, and in most cases even that is impossible.
Compilers are capable of understanding how to compile efficiently. You should profile in order to figure out where to optimize.
