Operations speed

Operations speed - c++

I'm coding a game and it's very important to make speed calculations in render-code.
How can I get the speed of some operations?
For example, how to know whether multiplying is faster then sqrt, etc? Or I have to make tests and calculate the time.
Programming language is c++, thanks.

This kind of micro-optimisation is just the thing to waste your time for minimal gain.
Use a profiler and start by improving your own algorithms and code wherever the profiler tells you that the game is spending most of its time.
Note that in some cases you may have to overhaul the whole software - or a major part of it - in order to implement a more efficient design. In that case the profiler results can be misleading to the inexperienced. E.g. optimising a complex computation may procure minimal gain, when compared to caching its result once and for all.
See also this somewhat related thread.

Determining the speed of a particular operation is often known as profiling.The best solution for profiling an operation is to use a profiler. Visual Studio has a good profiler. Linux has gprof . If your compiler doesn't have a profiler, it might be worthwhile purchasing a compiler that does if you will often be profiling your code.
If you have to get by without using a professional profiler, then you can usually get by embedding your own into your program
check this out for codes of some profilers.

Your best bet is to use a tool like AQTime and do a profiling run. Then you will know where to spend your time optimizing. But doing it prematurely or based on guess work likely wont get you much, and just complicate your code or break something. The best thing is to take any floating point calculations, especially sin, cos and the like, and sqrt out of any loops if you can.
I once had something like this:
for i = 0 to nc
for j = 0 to nc
aij = sqrt(a[i]*b[j])
which calculates nc*nc square roots. But since sqrt(a*b) is equal to sqrt(a)*sqrt(b), you can precompute the square roots for all the a's and b's beforehand so the loop then just becomes what is shown below. So instead of nc*nc square roots, you have 2*nc square roots.
for i = 0 to nc
for j = 0 to nc
aij = asqrt[i]*bsqrt[j]

The question you are asking is highly dependent on the platform you are developing for at the hardware level. Not only will there be variation between different chipsets (Intel / AMD) but there will also be variations on the platform (I suspect that the iPhone doesn't have as many instructions for doing certain things quicker).
You state in your question that you are talking about 'render code'. The rules change massively if you're talking about code that will actually run on the GPU (shader code) instead of the CPU.
As #thkala states, I really wouldn't worry about this before you start. I've found it not only easier, but quicker to code it in a way that works first, and then (only if it needs improving) rewriting the bits that are slow when you profile your code. Better algorithms will usually provide better performance than trying to make use of only specific functions.
In the game(s) that we're developing for the iPhone, the only thing that I've kept in mind are that big math operations (sqrt) are slow (not basic maths) and that for loops that run every frame can quickly eat up CPU. Keeping that in mind, we've not had to optimise hardly any code - as it's all running at 60fps anyway - so I'm glad I didn't worry about it at the start.

Related

About optimized math functions, ranges and intervals

I'm trying to wrap my head around how people that code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is an high need for fast trigonometric functions in those fields, at times you can optimize a sin, a cos or other functions by rewriting them in a different form that is valid only for a given interval, often times this means that your approximation of f(x) is just for the first quadrant, meaning 0 <= x <= pi/2 .
Now the input for your f(x) is still about all 4 quadrants, but the real formula only covers 1/4 of that interval, the straightforward solution is to detect the quadrant by analyzing the input and see in which range it belongs to, then you adjust the result of the formula accordingly if the input comes from a quadrant that is not the first quadrant .
This is good in theory but this also presents a couple of really bad problems, especially considering the fact that you are doing all this to steal a couple of cycles from your CPU ( you also get a consistent implementation, that is not platform dependent like an hardcoded fsin in Intel x86 asm that only works on x86 and has a certain error range, all of this may differ on other platforms with other asm instructions ), so you should keep things working at a concurrent and high performance level .
The reason I can't wrap my head around the "switch case" with quadrants solution is:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same functions that actually computes the f(x), probably the situation can be improved by implementing the formula for f(x) outside said function, but this is will lead to a doubling in the number of functions to maintain for any given math library
increase probability of more branching with a concurrent execution
generally speaking doesn't lead to better, clean, dry code, and conditional statements are often times a potential source of bugs, I don't really like switch-cases and similar things .
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?

Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code that is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or avoiding bugs out of your control. But the first version should be most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory but this also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One simply is that most trigonometric functions have identities so you can easily use these to reduce range of input parameters. This is important as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyways for most numerical techniques to work or at least work in a decent amount of time and have sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs.
it just prevents possible optimizations: Chances are your c++ compiler these days is way better at optimizations than you are. Sometimes it isn't but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need it. Theses days it is very non-intuitive to tell what code is faster by just looking at it (you can read all the questions on SO about performance issues and how crazy some of the root causes are).
namely memoization: I've never seen memoization in place for a double function. Just think how many doubles are there between 0 and 1. Now in reduced accuracy situations you can take advantage of it but this is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function that actually means anything and doesn't loose accuracy or performance in the process.
increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner but I suppose its entirely possible to get some performance benefits. But again, the compiler is generally better at optimizations than you so let it do its jobs and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "dry code" is for that matter. Yes, sometimes you can get into trouble by too many or too complex if/switch blocks but I can't see a simple check for 4 quadrants apply here...it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest try coding a few of these functions in as many ways as you can find and do some accuracy and benchmark tests. This will give you some practical knowledge and give you some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you some practical experience with implementing these types of functions and chances are answer a lot of your questions in the process.

Performance Tradeoff - When is MATLAB better/slower than C/C++

I am aware that C/C++ is a lower-level language and generates relatively optimized machine code when we compare with any other high-level language. But I guess there is pretty much more than that, which is also evident from the practice.
When I do simple calculations like montecarlo averaging of a Gaussian sample collection or so, I see there is not much of a difference between a C++ implementation or MATLAB implementation, sometimes in fact MATLAB performs a bit better in time.
When I move on to larger scale simulations with thousands of lines of code, slowly the real picture shows up. C++ simulations show superior performance like 100x better in time complexity than an equivalent MATLAB implementation.
The code in C++ most of the times, is pretty much serial and no hi-fi optimization is done explicitly. Whereas, as per my awareness, MATLAB inherently does a lot of optimization. This shows up for example when I try to generate a huge chunk of random samples, where as the equivalent in C++ using some library like IT++/GSL/Boost performs relatively slower (the algorithm used is the same namely mt19937).
My question is simply to know if there is a simpler tradeoff between MATLAB/C++ in performance. Is it just like what people say, "Whenever you can, C/C++ is the better"(The frequently experienced)?. In a different perspective, "What is MATLAB good for, other than comfort?"
By the way, I don't see coding efficiency parameter being significant here, thinking of the same programmer in both cases. And also, I think the other alternatives like python,R are not relevant here. But dependence on the specific libraries we use should be interesting.
[I am a phd student in Coding Theory in communication systems. I do simulations using matlab/C++ all the time, and have reasonable experience of coding few 10K's of lines in both cases]

I have been using Matlab and C++ for about 10 years. For every numerical algorithms implemented for my research, I always start from prototyping with Matlab and then translate the project to C++ to gain a 10x to 100x (I am not kidding) performance improvement. Of course, I am comparing optimized C++ code to the fully vectorized Matlab code. On average, the improvement is about 50x.
There are lot of subtleties behind both of the two programming languages, and the following are some misunderstandings:
Matlab is a script language but C++ is compiled
Matlab uses JIT compiler to translate your script to machine code, you can improve your speed at most by a factor 1.5 to 2 by using the compiler that Matlab provides.
Matlab code might be able to get fully vectorized but you have to optimize your code by hand in C++
Fully vectorized Matlab code can call libraries written in C++/C/Assembly (for example Intel MKL). But plain C++ code can be reasonably vectorized by modern compilers.
Toolboxes and routines that Matlab provides should be very well tuned and should have reasonable performance
No. Other than linear algebra routines, the performance is generally bad.
The reasons why you can gain 10x~100x performance in C++ comparing to vectorized Matlab code:
Calling external libraries (MKL) in Matlab costs time.
Memory in Matlab is dynamically allocated and freed. For example, small matrices multiplication:
A = B*C + D*E + F*G
requires Matlab to create 2 temporary matrices. And in C++, if you allocate your memory before hand, you create NONE. And now imagine you loop that statement for 1000 times. Another solution in C++ is provided by C++11 Rvalue reference. This is the one of the biggest improvement in C++, now C++ code can be as fast as plain C code.
If you want to do parallel processing, Matlab model is multi-process and the C++ way is multi-thread. If you have many small tasks needing to be parallelized, C++ provides linear gain up to many threads but you might have negative performance gain in Matlab.
Vectorization in C++ involves using intrinsics/assembly, and sometimes SIMD vectorization is only possible in C++.
In C++, it is possible for an experienced programmer to completely avoid L2 cache miss and even L1 cache miss, hence pushing CPU to its theoretical throughput limit. Performance of Matlab can lag behind C++ by a factor of 10x due to this reason alone.
In C++, computational intensive instructions sometimes can be grouped according to their latencies (code carefully in assembly or intrinsics) and dependencies (most of time is done automatically by compiler or CPU hardware), such that theoretical IPC (instructions per clock cycle) could be reached and CPU pipelines are filled.
However, development time in C++ is also a factor of 10x comparing to Matlab!
The reasons why you should use Matlab instead of C++:
Data visualization. I think my career can go on without C++ but I won't be able to survive without Matlab just because it can generate beautiful plots!
Low efficiency but mathematically robust build-in routines and toolboxes. Get the correct answer first and then talk about efficiency. People can make subtle mistakes in C++ (for example implicitly convert double to int) and get sort of correct results.
Express your ideas and present your code to your colleagues. Matlab code is much easier to read and much shorter than C++, and Matlab code can be correctly executed without compiler. I just refuse to read other people's C++ code. I don't even use C++ GNU scientific libraries because the code quality is not guaranteed. It is dangerous for a researcher/engineer to use a C++ library as a black box and take the accuracy as granted. Even for commercial C/C++ libraries, I remember Intel compiler had a sign error in its sin() function last year and numerical accuracy problems also occurred in MKL.
Debugging Matlab script with interactive console and workspace is a lot more efficient than C++ debugger. Finding an index calculation bug in Matlab could be done within minutes, but it could take hours in C++ figuring out why the program crashes randomly if boundary check is removed for the sake of speed.
Last but not the least:
Because once Matlab code is vectorized, there is not much left for a programmer to optimize, Matlab code performance is much less sensitive to the quality of the code comparing with C++ code. Therefore it is best to optimize computation algorithms in Matlab, and marginally better algorithms normally have marginally better performance in Matlab. On the other hand, algorithm test in C++ requires decent programmer to write algorithms optimized more or less in the same way, and to make sure the compiler does not optimize the algorithms differently.
My recent experience in C++ and Matlab:
I made several large Matlab data analysis tools in the past year and suffered from the slow speed of Matlab. But I was able to improve my Matlab program speed by 10x through the following techniques:
Run/profile the Matlab script, re-implement critical routines in C/C++ and compile with MEX. Critical routines are mostly likely logically simple but numerically heavy. This improves speed by 5x.
Simplify ".m" files shipped with Matlab tool boxes by commenting all unnecessary safety checks and output parameter computations. Please be reminded that the modified code cannot be distributed with the rest of the user scripts. This improves speed by another 2x (after C/C++ and MEX).
The improved code is ~98% in Matlab and ~2% in C++.
I believe it is possible to improve the speed by another 2x (total 20x) if the entire tool is coded in C++, this is ~100x speed improvement of the computation routines. The hard drive I/O will then dominate the program run time.
Question for Mathworks engineers:
When Matlab code is fully vectorized, one of the performance limiting factor is the matrix indexing operation. For instance, a finite difference operation needs to be performed on Matrix A which has a dimension of 5000x5000:
B = A(:,2:end)-A(:,1:end-1)
The matrix indexing operation makes the Matlab code multiple times slower than the C++ code. Can the matrix indexing performance be improved?

In my experience (several years of Computer Vision and image processing in both languages) there is no simple answer to this question, as Matlab performance depends strongly (and much more than C++ performance) on your coding style.
Generally, Matlab wraps the classic C++ / Fortran based linear algebra libraries. So anything like x = A\b is going to be very fast. Also, Matlab does a good job in choosing the most efficient solver for these types of problems, so for x = A\b Matlab will look at the size of your matrices and chose the appropriate low-level routines.
Matlab also shines in data manipulation of large matrices if you "vectorize" your code, i.e. if you avoid for loops and use index arrays or boolean arrays to access your data. This stuff is highly optimised.
For other routines, some are written in Matlab code, while others point to a C/C++ implementation (e.g. the Delaunay stuff). You can check this yourself by typing edit some_routine.m. This opens the code and you see whether it is all Matlab or just a wrapper for something compiled.
Matlab, I think, is primarily for comfort - but comfort translates to coding time and ultimately money which is why Matlab is used in the industry. Also, it is easy to learn for engineers from other fields than computer science, with little training in programming.

As a PhD Student too, and a 10years long Matlab user, I'm glad to share my POV:
Matlab is a great tool for developing and prototyping algorithms, especially when dealing with GUIs, high-level analysis (Frequency Domain, LS Optimization etc.): fast coding, powerful syntaxis (think about [],{},: etc.).
As soon as your processing chain is more stable and defined and data dimensions grows move to C/C++.
The main Matlab limit rises when considering its language is script-like: as long as you avoid any cycle (using arrayfun, cellfun or other matrix procedures) performances are high since the called subroutine is again in C/C++.

Your question is difficult to answer. In general C++ is faster, but if make use of the well written algorithms of Matlab it can outperform C++. In some cases Matlab can parallelize your code which has to be done manually in many cases for C++. Mathlab can kind of export C++ code.
So my conclusion is, that you have to measure the performance of both programs to get an answer. But then you compare your two implementations and not Matlab and C++ in general.

Matlab does very well with linear algebra and array/matrix operations, since they seem to have been doing some extra optimizations on the underlying operations - if you want to beat Matlab there, you would need a similarly optimized BLAS/LAPACK library.
As an interpreted language, Matlab loses time whenever a Matlab function is called, due to internal overhead, which traditionally meant that Matlab loops were slow. This has been alleviated somewhat in recent years thanks to significant improvement in the JIT compiler (search for "performance" questions on Matlab on SO for examples). As a consequence of the function call overhead, all Matlab functions that have not been implemented in C/C++ behind the scenes (call edit functionName to see whether it's written in Matlab) risks being slower than a C/C++ counterpart.
Finally, Matlab attempts to be user friendly, and may do "unnecessary" input checking that can take time (due to function call overhead). For example, if you know that ismember gets sorted inputs, you can call ismembc directly (the behind-the-scene compiled function), saving quite a bit of time.

I think you can consider the difference in four folds at least.
Compiled vs Interpreted
Strongly-typed vs Dynamically-typed
Performance vs Fast-prototyping
Special strength
For 1-3 can be easily generalized into comparison between two family of programming languages.
For 4, MATLAB is optimized for matrix operations. So if you can vectorize more code in MATLAB, the performance can be drastically boosted. Conversely, if many loops are required, never hesitate to use C++ or create a mex file.
It is a difficult quesion after all.

I saw a 5.5x speed improvement when switching from MATLAB to C++. This was for a robot controller- lots of loops and ode solving. I spent many hours trying to optimize the MATLAB code, hardly any time optimizing the C++ (I'm sure it could have been 10x faster with a little more effort).
However, it was easy to add a GUI for the MATLAB code, so I still use it more often. Like others have said, it was nice to prototype first on MATLAB. That made the implementation on C++ much simpler.

Besides the speed of the final program, you should also take into account the total development time of your code, ie., not only the time to write, but also to debug, etc. Matlab (and its open-source counterpart, Octave) can be good for quick prototyping due to its visualisation capabilities.
If you're using straight C++ (ie. no matrix libraries), it may take you much longer to write C++ code that's equivalent to Matlab code (eg. there might be no point in spending 10 hours writing C++ code that only runs 10 seconds quicker, compared to a Matlab program that took 5 minutes to write).
However, there are dedicated C++ matrix libraries, such as Armadillo, which provide a Matlab-like API. This can be useful for writing performance critical code that can be called from Matlab, or for converting Matlab code into "real" programs.

Some Matlab code uses standard linear algebra fictions with multithreading built into it. So, it appears that they are faster than a sequential C code.

When should I use ASM calls?

I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding,genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?

Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.

If your not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.

Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....

A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.

Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.

"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.

If you need to ask than you don't need it.

Producing the fastest possible executable

I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.
Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.
So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "properties" dialog.
EDIT: I already make extensive use of profilers.

Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.
Also, beyond optimization operations, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially it if gives feedback on the cache-friendliness of your code).
And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".

Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.

1) Reduce aliasing by using __restrict.
2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.
3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.
4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.
5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.

You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:
1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.
2) There's no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!
3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!
Hope this helps,
-Tom

Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.
Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.

Check which /precision mode you are using. Each one generates quite different code and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code generation options).
Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This will result is quite a big difference in performance, as compile will use very few cycles. Details are nicely covered in the blog SomeAssemblyRequired
Since you are already profiling, I would suggest loop unrolling if it is not happening. I have seen VS2008 not doing it more frequently (templates, references etc..)
Use __forceinline in hotspots if applicable.
Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.

You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.
Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.

I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.
I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.

You have three ways to speed up your application:
Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.
Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.
Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.

You say the program is very large. That tells me it probably has many classes in a hierarchy.
My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.
Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.
Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.
This is an example of what I'm talking about.

What language/platform would you recommend for CPU-bound application?

I'm developing non-interactive cpu-bound application which does only computations, almost no IO. Currently it works too long and while I'm working on improving the algorithm, I also think if it can give any benefit to change language or platform. Currently it is C++ (no OOP so it is almost C) on windows compiled with Intel C++ compiler. Can switching to ASM help and how much? Can switching to Linux and GCC help?

Just to be thorough: the first thing to do is to gather profile data and the second thing to do is consider your algorithms. I'm sure you know that, but they've got to be #included into any performance-programming discussion.
To be direct about your question "Can switching to ASM help?" the answer is "If you don't know the answer to that, then probably not." Unless you're very familiar with the CPU architecture and its ins and outs, it's unlikely that you'll do a significantly better job than a good optimizing C/C++ compiler on your code.
The next point to make is that significant speed-ups in your code (aside from algorithmic improvements) will almost certainly come from parallelism, not linear increases. Desktop machines can now throw 4 or 8 cores at a task, which has much more performance potential than a slightly better code generator. Since you're comfortable with C/C++, OpenMP is pretty much a no-brainer; it's very easy to use to parallelize your loops (obviously, you have to watch loop-carried dependencies, but it's definitely "the simplest parallelism that could possibly work").
Having said all that, code generation quality does vary between C/C++ compilers. The Intel C++ compiler is well-regarded for its optimization quality and has full support not just for OpenMP but for other technologies such as the Threading Building Blocks.
Moving into the question of what programming languages might be even better than C++, the answer would be "programming languages that actively promote / facilitate concepts of parallelism and concurrent programming." Erlang is the belle of the ball in that regard, and is a "hot" language right now and most people interested in performance programming are paying at least some attention to it, so if you want to improve your skills in that area, you might want to check it out.

It's always algorithm, rarely language. Here's my clue: "while I'm working on improving the algorithm".
Tweaking may not be enough.
Consider radical changes to the algorithm. You've got to eliminate processing, not make the processing go faster. The culprit is often "search" -- looping through data looking for something. Find ways to eliminate search. If you can't eliminate it, replace linear search with some kind of tree search or a hash map of some kind.

Switching to ASM is not going to help much, unless you're very good at it and/or have a specific critical path routine which you know you can do better. As several people have remarked, modern compilers are just better in most cases at taking advantages of caching/etc. than anyone can do by hand.
I'd suggest:
Try a different compiler, and/or different optimization options
Run a code coverage/analysis utility, and figure out where the critical paths are, and work on optimizing those in the code
C++ should be able to give you very near the best possible performance from the code, so I wouldn't recommend switching the language. Depending on the app, you may be able to get better performance on multi code/processor systems using multiple thread, as another suggestion.

While just switching to asm won't give any benefits, since the Intel C++ Compiler is likely better at optimizing than you, you can try one of the following options:
Try a compiler that will parallelize your code, like the VectorC compiler.
Try to switch to asm with heavy use of MMX, 3DNow!, SSE or whatever fits your needs (and your CPU). This will give more of a benefit than pure asm.
You can also try GPGPU, i.e. execute large parts of your algorithm on a GPU instead of a CPU. Depending on your algorithm, it can be dramatically faster.
Edit: I also second the profile approach. I recommend AQTime, which supports the Intel C++ compiler.

Personally I'd look at languages which allow you to take advantage of parallelism most easily, unless it's a thoroughly non-parallelisable situation. Being able to bolt on some extra cores and get (if possible!) near-linear improvement may well be a lot more cost-effective than squeezing the extra few percent of efficiency out.
When it comes to parallelisation, I believe functional languages are often regarded as the best way to go, or you could look at OpenMP for C/C++. (Personally, as a managed language guy, I'd be looking at libraries for Java/.NET, but I quite understand that not everyone has the same preferences!)

Try Fortran 77 - when it comes to computations still nothing beats the granddaddy of programming languages. Also, try it with OpenMP to take advantage of multiple cores.

Hand optimizing your ASM code compared to what C++ can do for you is rarely cost effective.
If you've done anything you can to the algorithm from a traditional algorithmic view, and you've also eliminated excesses, then you may either be SOL, or you can consider optimizing your program from a hardware point of view.
For example, any time you follow a pointer around the heap you are paying a huge cost due to cache misses, possibly paging, etc., which all affect branching predictions. Most programmers (even C gurus) tend to look at the CPU from the functional standpoint rather than what happens behind the scenes. Sometimes reorganizing memory, for example by "flattening" or manually allocating memory to fit on the same page can obtain ENORMOUS speedups. I managed to get 2X speedups on graph traversals just by flattening my structures.
These are not things that your compiler will do for you since they are based on your high-level understanding of the program.

As lobrien said, you haven't given us any information to tell you if hand-optimized ASM code would help... which means the answer is probably, "not yet."
Have you run your code with a profiler?
Do you know if the code is slow because of memory constraints or processor constraints?
Are you using all your available cores?
Have you identified any algorithms you're using that aren't O(1)? Can you get them to O(1)? If not, why not?
If you've done all that, how much control do you have over the environment your program is running in? (presumably a lot if you're thinking of switching operating systems) Can you disable other processes, give your process highest priority, etc? What about just finding a machine with a faster processor, more cores, or more memory (depending on what you're constrained on)
And on and on.
If you've already done all that and more, it's certainly possible you'll get to a point where you think, "I wonder if these few lines of code right here could be optimized better than the assembly that I'm looking at in the debugger right now?" And at that point you can ask specifically.
Good luck! You're solving a problem that's fun to solve.

Sometimes you can find libraries that have optimized implementations of the algorithms you care about. Often times they will have done the multithreading for you.
For example switching from LINPACK to LAPACK got us a 10x speed increase in LU factorization/solve with a good BLAS library.

First, figure out if you can change the algorithm, as S.Lott suggested.
Assuming the algorithm choice is correct, you might look a the memory access patterns, if you have a lot of data you are processing. For a lot of number crunching applications these days, they're bound by the memory bus, not by the ALU(s). I recently optimized some code that was of the form:
// Assume N is a big number
for (int i=0; i<N; i++) {
myArray[i] = dosomething(i);
}
for (int i=0; i<N; i++) {
myArray[i] = somethingElse(myArray[i]);
}
...
and converted it to look like:
for (int i=0; i<N; i++) {
double tmp = dosomething(i);
tmp = somethingElse(tmp);
...
myArray[i] = tmp;
}
...
In this particular case, this yielded about a 2x speedup.

As Oregonghost already hinted - The VectorC compiler might help. It does not really parallelize the code though, instead you can use it to leverage on extended command sets like mmx or sse. I used it for the most time-critical parts in a software rendering engine and it resulted in a speedup of about 150%-200% on most processors.

For an alternative approach, you could look into Distributed Computing which sounds like it could suit your needs.

If you're sticking with C++ on the intel compiler, take a look at the compiler intrinsics (full reference here). I know that VC++ has similar functionality, and I'm sure you can do the same thing with gcc. These can let you take full advantage of the parallelism built into your CPU. You can use the MMX, SSE and SSE2 instructions to improve performance to a degree. Like others have said, you're probably best looking at the algorithm first.

I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand maybe what you are trying to calculate just takes a lot of computing time. Have you considered to make it distributed so it can run in a cluster of some sort? If you want to focus on pure code optimization by introducing Assembler for your inner loops then often that can be very beneficial (if you know what you're doing).

For modern processors, learning ASM will take you a long time. Further, with all the different versions of SSE around, your code will end up very processor dependant.
I do quite a lot of CPU-bound work, and have found that the difference between intel's C++ compiler and g++ usually isn't that big (at most 15% or so), and there is no measurable difference between Mac OS X, Windows and Linux.
You are going to have to optimise your code and improve your algorithm by hand. There is no "magic fairy dust" which can make existing code that much faster I'm afraid.
If you haven't yet, and you care about performance, you MUST run your code through a good profiler (personally, I like kcachegrind & valgrind on Linux, or Shark on Mac OS X. I don't know what is good for windows I'm afraid).
Based on my past experience, there is a very good chance you'll find some method is taking 95% of your CPU time, and some simple change or addition of caching will make a massive improvement to your performance. On a similar note, if some method is only taking 1% of your CPU time, no amount of optimising is going to gain you anything.

The 2 obvious answers to "CPU-bound" are:
1. Use more CPU (core)s
2. Use something else.
Using 2 threads instead of 1 will cut the time spent by up to 50%. In comparision, C++ to ASM rarely gives you 5% (and for novice ASM programmers, it's often -5%!). Some problems scale well, and may benefit from 8 or 16 cores. That kind of hardware is still pretty mainstream, so see if your problems fall in that category.
The other solution is to throw more specialized hardware at the task. This could be the vector unit of your CPU - considering Windows=x86/x64, that's going to be a flavor of SSE. Another kind of vector hardware is the modern GPU. The GPU also has its own memory bus, which is quite speedy.

First get the lead out. Then if it's as fast as it can possibly be without going to ASM, so be it. But thinking you have to go to ASM assumes you know what's making it slow, and I'll bet a donut that you're guessing.

If you feel you have optimized your code to a point there is no improvement, increase your CPU's. This can be done on different platforms. One I develop with is Appistry. A few links:
http://www.appistry.com/resource-library/index.html
and you can download the product free from here:
http://www.appistry.com/developers/
I work for Appistry and we have done many installations for tasks that were cpu bound by spreading work out over 10's or 100's of machines.
Hope this helps,
-Brett

Probable small help:
Optimization of 64-bit programs
AMD64 (EM64T) architecture
Debugging and optimization of multi-thread OpenMP-programs
Introduction into the problems of developing parallel programs
Development of Resource-intensive Applications in Visual C++

Linux
Switching to Linux can help, if you strip it down to only the parts you actually need.

CrowdProcess has about 2000 workers you can use to compute your algorithm. The API is extremely simple and we've been observing speedups close to the number of workers. Also you can write Javascript which should make you more productive than C++ or ASM.
So if you're in between C++ or ASM, I'd say you should first use all your CPU cores, then if it's not enough, CrowdProcess should be an interesting platform.
Disclaimer: I built CrowdProcess.

It is hard to produce ASM code that is faster than naive C or C++ code. In most cases if you do this job really well, you probably gain not much than few percents and getting like 10% speedup is considered great success but in most cases it is just impossible.
Compilers are capable of understanding how to compile efficiently. You should profile in order to figure out where to optimize.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js