Performance Tradeoff - When is MATLAB better/slower than C/C++


I am aware that C/C++ is a lower-level language and generates relatively optimized machine code compared with other high-level languages. But I guess there is more to it than that, which is also evident in practice.
When I do simple calculations, like Monte Carlo averaging of a collection of Gaussian samples, I see little difference between a C++ implementation and a MATLAB implementation; sometimes MATLAB is in fact a bit faster.
When I move on to larger-scale simulations with thousands of lines of code, the real picture slowly shows up. C++ simulations show superior performance, something like 100x faster in run time than an equivalent MATLAB implementation.
The C++ code is, most of the time, pretty much serial, with no sophisticated optimization done explicitly. MATLAB, as far as I know, inherently does a lot of optimization. This shows up, for example, when I generate a huge chunk of random samples: the equivalent in C++ using a library like IT++/GSL/Boost is relatively slower (the algorithm used is the same, namely mt19937).
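For reference, the kind of C++ test described above could look roughly like the sketch below. It uses the standard library's std::mt19937 rather than IT++/GSL/Boost, and the function name gaussian_sample_mean and the seed parameter are made up for illustration, so treat it as an assumption about the setup rather than the exact code being compared.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Average n standard-normal samples drawn from a seeded mt19937 engine.
// Hypothetical helper for illustration only.
double gaussian_sample_mean(std::size_t n, unsigned seed) {
    assert(n > 0);
    std::mt19937 engine(seed);                         // Mersenne Twister, as in the question
    std::normal_distribution<double> gauss(0.0, 1.0);  // zero mean, unit variance
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += gauss(engine);                          // accumulate without storing the samples
    }
    return sum / static_cast<double>(n);
}
```

With a large n the returned mean should be close to zero; timing this loop against the MATLAB equivalent mean(randn(n,1)) is the comparison in question.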
My question is simply whether there is a simple rule for the MATLAB/C++ performance tradeoff. Is it just what people say, "Whenever you can, C/C++ is better" (the frequent experience)? From a different perspective, "What is MATLAB good for, other than comfort?"
By the way, I don't see coding efficiency as a significant parameter here, assuming the same programmer in both cases. I also think other alternatives like Python and R are not relevant here. But the dependence on the specific libraries we use should be interesting.
[I am a PhD student in coding theory for communication systems. I do simulations using MATLAB/C++ all the time, and have reasonable experience writing a few tens of thousands of lines of code in both.]

I have been using Matlab and C++ for about 10 years. For every numerical algorithm I implement for my research, I start by prototyping in Matlab and then translate the project to C++ to gain a 10x to 100x (I am not kidding) performance improvement. Of course, I am comparing optimized C++ code to fully vectorized Matlab code. On average, the improvement is about 50x.
There are a lot of subtleties behind both programming languages, and the following are some common misunderstandings:
Matlab is a script language but C++ is compiled
Matlab uses a JIT compiler to translate your script to machine code; you can improve your speed by at most a factor of 1.5 to 2 by using the compiler that Matlab provides.
Matlab code can be fully vectorized, but in C++ you have to optimize your code by hand
Fully vectorized Matlab code can call libraries written in C++/C/Assembly (for example Intel MKL). But plain C++ code can also be reasonably vectorized by modern compilers.
Toolboxes and routines that Matlab provides should be very well tuned and should have reasonable performance
No. Other than linear algebra routines, the performance is generally bad.
The reasons why you can gain 10x~100x performance in C++ compared to vectorized Matlab code:
Calling external libraries (MKL) in Matlab costs time.
Memory in Matlab is dynamically allocated and freed. For example, a product of small matrices:
A = B*C + D*E + F*G
requires Matlab to create 2 temporary matrices. In C++, if you allocate your memory beforehand, you create NONE. Now imagine you loop over that statement 1000 times. Another solution in C++ is provided by C++11 rvalue references. This is one of the biggest improvements in C++; now C++ code can be as fast as plain C code.
If you want to do parallel processing, the Matlab model is multi-process and the C++ way is multi-threaded. If you have many small tasks that need to be parallelized, C++ provides linear gain up to many threads, but in Matlab you might even see a slowdown.
Vectorization in C++ involves using intrinsics/assembly, and sometimes SIMD vectorization is only possible in C++.
In C++, it is possible for an experienced programmer to completely avoid L2 cache misses and even L1 cache misses, hence pushing the CPU to its theoretical throughput limit. Performance of Matlab can lag behind C++ by a factor of 10x for this reason alone.
In C++, computationally intensive instructions can sometimes be grouped according to their latencies (coded carefully in assembly or intrinsics) and dependencies (most of the time this is done automatically by the compiler or CPU hardware), such that the theoretical IPC (instructions per clock cycle) can be reached and the CPU pipelines are kept full.
However, development time in C++ is also about 10x that of Matlab!
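To illustrate the point about temporaries, here is a minimal sketch of evaluating A = B*C + D*E + F*G into a buffer that the caller allocated once. The Matrix alias, the row-major layout and both function names are assumptions made for this example; real code would more likely call a BLAS gemm with an accumulate flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Square n-by-n matrix stored row-major in a flat vector (assumed layout).
using Matrix = std::vector<double>;

// out += lhs * rhs, written straight into the caller's buffer.
// out must not alias lhs or rhs.
void multiply_accumulate(const Matrix& lhs, const Matrix& rhs, Matrix& out, std::size_t n) {
    assert(lhs.size() == n * n && rhs.size() == n * n && out.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const double l = lhs[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                out[i * n + j] += l * rhs[k * n + j];
            }
        }
    }
}

// a = b*c + d*e + f*g without any temporary matrix:
// a is zeroed in place and each product is accumulated into it.
void sum_of_three_products(const Matrix& b, const Matrix& c,
                           const Matrix& d, const Matrix& e,
                           const Matrix& f, const Matrix& g,
                           Matrix& a, std::size_t n) {
    assert(a.size() == n * n);
    std::fill(a.begin(), a.end(), 0.0);  // reuse existing storage, no allocation
    multiply_accumulate(b, c, a, n);
    multiply_accumulate(d, e, a, n);
    multiply_accumulate(f, g, a, n);
}
```

If the statement runs 1000 times in a loop, the same output buffer can be passed in on every iteration, so nothing is allocated or freed inside the loop.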
The reasons why you should use Matlab instead of C++:
Data visualization. I think my career can go on without C++ but I won't be able to survive without Matlab just because it can generate beautiful plots!
Low efficiency but mathematically robust built-in routines and toolboxes. Get the correct answer first and then talk about efficiency. People can make subtle mistakes in C++ (for example implicitly converting double to int) and get sort of correct results.
Express your ideas and present your code to your colleagues. Matlab code is much easier to read and much shorter than C++, and Matlab code can be correctly executed without a compiler. I just refuse to read other people's C++ code. I don't even use the C++ GNU scientific libraries because the code quality is not guaranteed. It is dangerous for a researcher/engineer to use a C++ library as a black box and take the accuracy for granted. Even for commercial C/C++ libraries, I remember the Intel compiler had a sign error in its sin() function last year, and numerical accuracy problems also occurred in MKL.
Debugging a Matlab script with the interactive console and workspace is a lot more efficient than using a C++ debugger. Finding an index calculation bug in Matlab can be done within minutes, but it can take hours in C++ to figure out why the program crashes randomly if boundary checks are removed for the sake of speed.
Last but not least:
Because there is not much left for a programmer to optimize once Matlab code is vectorized, Matlab code performance is much less sensitive to the quality of the code than C++ code performance. Therefore it is best to optimize computation algorithms in Matlab, and marginally better algorithms normally have marginally better performance in Matlab. On the other hand, testing algorithms in C++ requires a decent programmer to write the algorithms optimized in more or less the same way, and to make sure the compiler does not optimize them differently.
My recent experience in C++ and Matlab:
I made several large Matlab data analysis tools in the past year and suffered from the slow speed of Matlab. But I was able to improve my Matlab program speed by 10x through the following techniques:
Run/profile the Matlab script, re-implement critical routines in C/C++ and compile with MEX. Critical routines are most likely logically simple but numerically heavy. This improves speed by 5x.
Simplify ".m" files shipped with Matlab toolboxes by commenting out all unnecessary safety checks and output parameter computations. Please be reminded that the modified code cannot be distributed with the rest of the user scripts. This improves speed by another 2x (after C/C++ and MEX).
The improved code is ~98% in Matlab and ~2% in C++.
I believe it is possible to improve the speed by another 2x (total 20x) if the entire tool is coded in C++; this would be a ~100x speed improvement of the computation routines. The hard drive I/O would then dominate the program run time.
Question for Mathworks engineers:
When Matlab code is fully vectorized, one of the performance-limiting factors is the matrix indexing operation. For instance, a finite difference operation needs to be performed on a matrix A of dimension 5000x5000:
B = A(:,2:end)-A(:,1:end-1)
The matrix indexing operation makes the Matlab code multiple times slower than the C++ code. Can the matrix indexing performance be improved?
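For comparison, the C++ side of that finite difference could be sketched as below. It assumes the matrix is stored column-major in a flat vector (the layout Matlab uses), and the function name column_difference is made up for this example. With that layout, B(i,j) = A(i,j+1) - A(i,j) becomes a single pass over contiguous memory with no index arrays or sub-matrix copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Equivalent of B = A(:,2:end) - A(:,1:end-1) for a column-major matrix.
// Element (i, j) of A lives at a[i + j * rows]; B has cols - 1 columns.
std::vector<double> column_difference(const std::vector<double>& a,
                                      std::size_t rows, std::size_t cols) {
    assert(cols >= 1 && a.size() == rows * cols);
    std::vector<double> b(rows * (cols - 1));
    for (std::size_t k = 0; k < b.size(); ++k) {
        // The same row in the next column is exactly `rows` elements further on.
        b[k] = a[k + rows] - a[k];
    }
    return b;
}
```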

In my experience (several years of Computer Vision and image processing in both languages) there is no simple answer to this question, as Matlab performance depends strongly (and much more than C++ performance) on your coding style.
Generally, Matlab wraps the classic C++ / Fortran based linear algebra libraries. So anything like x = A\b is going to be very fast. Also, Matlab does a good job of choosing the most efficient solver for these types of problems, so for x = A\b Matlab will look at the size of your matrices and choose the appropriate low-level routines.
Matlab also shines in data manipulation of large matrices if you "vectorize" your code, i.e. if you avoid for loops and use index arrays or boolean arrays to access your data. This stuff is highly optimised.
For other routines, some are written in Matlab code, while others point to a C/C++ implementation (e.g. the Delaunay stuff). You can check this yourself by typing edit some_routine.m. This opens the code and you see whether it is all Matlab or just a wrapper for something compiled.
Matlab, I think, is primarily for comfort - but comfort translates to coding time and ultimately money which is why Matlab is used in the industry. Also, it is easy to learn for engineers from other fields than computer science, with little training in programming.

As a PhD student too, and a Matlab user for 10 years, I'm glad to share my POV:
Matlab is a great tool for developing and prototyping algorithms, especially when dealing with GUIs and high-level analysis (frequency domain, LS optimization etc.): fast coding, powerful syntax (think about [],{},: etc.).
As soon as your processing chain is more stable and defined and the data dimensions grow, move to C/C++.
The main Matlab limitation arises from its script-like language: as long as you avoid loops (using arrayfun, cellfun or other matrix procedures), performance is high, since the called subroutine is again in C/C++.

Your question is difficult to answer. In general C++ is faster, but if you make use of Matlab's well-written algorithms, it can outperform C++. In some cases Matlab can parallelize your code, which in many cases has to be done manually in C++. Matlab can also, to some extent, export C++ code.
So my conclusion is that you have to measure the performance of both programs to get an answer. But then you are comparing your two implementations, not Matlab and C++ in general.

Matlab does very well with linear algebra and array/matrix operations, since they seem to have been doing some extra optimizations on the underlying operations - if you want to beat Matlab there, you would need a similarly optimized BLAS/LAPACK library.
As an interpreted language, Matlab loses time whenever a Matlab function is called, due to internal overhead, which traditionally meant that Matlab loops were slow. This has been alleviated somewhat in recent years thanks to significant improvements in the JIT compiler (search for "performance" questions on Matlab on SO for examples). As a consequence of the function call overhead, any Matlab function that has not been implemented in C/C++ behind the scenes (call edit functionName to see whether it's written in Matlab) risks being slower than a C/C++ counterpart.
Finally, Matlab attempts to be user friendly, and may do "unnecessary" input checking that can take time (due to function call overhead). For example, if you know that ismember gets sorted inputs, you can call ismembc directly (the behind-the-scene compiled function), saving quite a bit of time.

I think you can consider the difference along at least four dimensions:
1. Compiled vs interpreted
2. Strongly typed vs dynamically typed
3. Performance vs fast prototyping
4. Special strengths
Points 1-3 generalize easily into a comparison between two families of programming languages.
For point 4, MATLAB is optimized for matrix operations. So if you can vectorize more code in MATLAB, the performance can be drastically boosted. Conversely, if many loops are required, never hesitate to use C++ or create a MEX file.
It is a difficult question after all.

I saw a 5.5x speed improvement when switching from MATLAB to C++. This was for a robot controller: lots of loops and ODE solving. I spent many hours trying to optimize the MATLAB code and hardly any time optimizing the C++ (I'm sure it could have been 10x faster with a little more effort).
However, it was easy to add a GUI to the MATLAB code, so I still use it more often. Like others have said, it was nice to prototype first in MATLAB. That made the implementation in C++ much simpler.

Besides the speed of the final program, you should also take into account the total development time of your code, i.e., not only the time to write it, but also to debug it, etc. Matlab (and its open-source counterpart, Octave) can be good for quick prototyping due to its visualisation capabilities.
If you're using straight C++ (i.e. no matrix libraries), it may take you much longer to write C++ code that's equivalent to Matlab code (e.g. there might be no point in spending 10 hours writing C++ code that only runs 10 seconds quicker than a Matlab program that took 5 minutes to write).
However, there are dedicated C++ matrix libraries, such as Armadillo, which provide a Matlab-like API. This can be useful for writing performance critical code that can be called from Matlab, or for converting Matlab code into "real" programs.

Some Matlab code uses standard linear algebra functions with multithreading built in. So it can appear faster than sequential C code.

Related

GPGPU programming architecture for HSA in C++ for Matrix Math

GPU Compute Programmers,
I have a C++ program which currently relies on ACML (LAPACK) to invert and multiply fairly large matrices of single-precision fp values (e.g. 4,000 x 4,000). These matrices are very sparse, although they do not always fit nicely into a diagonal matrix, so I cannot presently reduce them. The other thing about this program is that I have to do this invert and multiply several times (serially) as part of a Newton-Raphson iteration. However, I have several thousand permutations which can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single-precision fp, and seems perfectly suited for the GPU. My question is this...
I suspect I will need to use the AMD Accelerated Parallel Processing Math Libraries (APPML) for OpenCL, as that is the only thing (non-CUDA, I want to be GPU agnostic) I know of which is available with BLAS functionality. My problem is that I do not see the LAPACK dgetrf and dgetri functions included in APPML (yes, these are fp64 but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in the HSA feature of passing pointers rather than copying data, as there is a lot of data in flight here and some calculations are still done on the CPU. I believe that copy overhead would kill me otherwise. Ultimately, performance is the key and I want to make the right architectural decisions to set myself up for the most performance I can wring out of the HSA GPUs coming out over the next 6 months.
I am using VS 2013 Ultimate preview and would be able to take advantage of C++ AMP for these HSA capabilities. I just want to make sure I am making the right long term architectural decision now while my program is in its infancy. Here is a link and snippet from some interesting data I found on Anandtech:
http://anandtech.com/show/7118/windows-81-and-vs2013-bring-gpu-computing-updates-to-direct3d-and-c-amp-
C++ AMP, Microsoft's C++ extension for GPU computing, has also been updated with the upcoming VS2013. I think the biggest feature update is that C++ AMP programs will also gain a shared memory feature on APUs/SoCs where the compiler and runtime will be able to eliminate extra data copies between CPU and GPU. This feature will also be available only on Windows 8.1 and it is likely built on top of the "map default buffer" as Microsoft's AMP implementation uses Direct3D under the hood. C++ AMP also brings some other nice additions including enhanced texture support and better debugging abilities.
Any thoughts, additional questions or discussion would be greatly appreciated!

Fortran vs C++, does Fortran still hold any advantage in numerical analysis these days? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
With the rapid development of C++ compilers, especially the Intel ones, and the ability to apply SIMD functions directly in your C/C++ code, does Fortran still hold any real advantage in the world of numerical computation?
I am from an applied maths background; my job involves a lot of numerical analysis, computation, optimisation and such, with strictly defined performance requirements.
I hardly know anything about Fortran. I have some experience in C/CUDA/Matlab (if you consider the latter a computer language to begin with), and my daily task involves analysis of very large data (e.g. a 10 GB matrix). It seems the program spends at least 2/3 of its time on memory access (that's why I send some of its work to the GPU). Do you think it may be worth the effort for me to try the Fortran route on at least some performance-critical parts of my code to improve the performance of my program?
Because of the complexity and the amount of work involved, I will only go that route if there is a significant performance benefit. Thanks in advance.

Fortran has stricter aliasing semantics than C++ and has been aggressively tuned for numerical performance for decades. Algorithms that use the CPU to work with arrays of data often have the potential to benefit from a Fortran implementation.
The programming languages shootout should not be taken too seriously, but of the 15 benchmarks, Fortran ranks #1 for speed on four of them (for Intel Q6600 one core), more than any other single language. You can see that the benchmarks where Fortran shines are the heavily numerical ones:
spectral norm 27% faster
fasta 67% faster
mandelbrot 56% faster
pidigits 18% faster
Counterexample:
k-nucleotide 500% slower (this benchmark focuses heavily on more sophisticated data structures and string processing, which is not Fortran's strength)
You can also see a summary page "how many times slower" that shows that out of all implementations, the Fortran code is on average closest to the fastest implementation for each benchmark -- although the quantile bars are much larger than for C++, indicating Fortran is unsuited for some tasks that C++ is good at, but you should know that already.
So the questions you will need to ask yourself are:
Is the speed of this function so critical that reimplementing it in Fortran is worth my time?
Is performance so important that my investment in learning Fortran will pay off?
Is it possible to use a library like ATLAS instead of writing the code myself?
Answering these questions would require detailed knowledge of your code base and business model, so I can't answer those. But yes, Fortran implementations are often faster than C++ implementations.
Another factor in your decision is the amount of sample code and the quantity of reference implementations available. Fortran's strong history means that there is a wealth of numerical code available for download and even with a trip to the library. As always you will need to sift through it to find the good stuff.

The complete and correct answer to your question is, "yes, Fortran does hold some advantages".
C++ also holds some, different, advantages. So do Python, R, etc etc. They're different languages. It's easier and faster to do some things in one language, and some in others. All are widely used in their communities, and for very good reasons.
Anything else, in the absence of more specific questions, is just noise and language-war-bait, which is why I've voted to close the question and hope others will too.

Fortran is just naturally suited for numerical programming. You tend to have a large amount of numbers in such programs, typically arranged in arrays. Arrays are first-class citizens in Fortran, and it is often pretty straightforward to translate numerical kernels from Matlab into Fortran.
Regarding potential performance advantages, see the other answers, which cover this quite nicely. The bottom line is probably that you can create highly efficient numerical applications with most compiled languages today, but you might have to jump through some hoops to get there. Fortran was carefully designed so that its language features allow the compiler to recognize most opportunities for optimization. Of course you can also write arbitrarily slow code in any compiled language, including Fortran.
In any case you should pick the tool that suits the job. Fortran suits numerical applications, C suits system-related development. As a final remark, learning the basics of Fortran is not hard, and it is always worthwhile to have a look at other languages. This opens a different view on the problems you want to solve.

Also worth mentioning is that Fortran is a lot easier to master than C++. In fact, Fortran has a shorter language spec than plain C and its syntax is arguably simpler. You can pick it up very quickly.
Meaning that if you are only interested in learning C++ or Fortran to solve a single specific problem you have at the moment (say, to speed up the bottlenecks in something you wrote in a prototyping language), Fortran might give you a better return on investment.

Fortran code is generally better for matrix and vector operations. But you can also get similar performance from C/C++ code by passing hints to the compiler so that it produces vector instructions of similar quality. One option that gave me a good boost was telling the compiler to assume no memory aliasing among input variables that are array objects. This way, the compiler can aggressively unroll and pipeline inner loops for ILP, overlapping load and store operations across loop iterations with the right prefetches.
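One way to give such a no-aliasing hint is sketched below, as an illustration rather than the exact option used above. It relies on the __restrict keyword, which is a compiler extension accepted by GCC, Clang and MSVC (standard C++ has no restrict; C99 does). The function name multiply_add is made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// out[i] = a[i] * x[i] + y[i]
// __restrict promises the compiler that the four arrays do not overlap,
// which is the guarantee Fortran gives for dummy array arguments by default.
// It is a non-standard extension, and breaking the promise is undefined behaviour.
void multiply_add(float* __restrict out,
                  const float* __restrict a,
                  const float* __restrict x,
                  const float* __restrict y,
                  std::size_t n) {
    assert(out != a && out != x && out != y);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * x[i] + y[i];
    }
}
```

Without the qualifier the compiler has to allow for out overlapping an input, which can force it to reload values after each store or to emit runtime overlap checks before vectorizing the loop.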

C++ or Python for an Extensive Math Program?

I'm debating whether to use C++ or Python for a largely math-based program.
Both have great math libraries, but which language is generally faster for complex math?

You could also consider a hybrid approach. Python is generally easier and faster to develop in, especially for things like user interface, input/output etc.
C++ should certainly be faster for some math operations (although if your problem can be formulated in terms of vector operations or linear algebra then numpy provides a Python interface to very efficient vector manipulations).
Python is easy to extend with Cython, Swig, Boost Python etc. so one strategy is write all the bookkeeping type parts of the program in Python and just do the computational code in C++.

I guess it is safe to say that C++ is faster, simply because it is a compiled language, which means that only your code is running, not an interpreter as with Python.
It is possible to write very fast code with python and very slow code with C++ though. So you have to program wisely in any language!
Another advantage is that C++ is type safe, which will help you to program what you actually want.
A disadvantage in some situations is that C++ is type safe, which will result in a design overhead. You have to think (maybe long and hard) about function and class interfaces, for instance.
I like Python for many reasons, so don't take this as a plea against Python.

It all depends on whether faster means "faster to execute" or "faster to develop". Overall, Python will be quicker for development, C++ faster for execution. For integer arithmetic, Python has arbitrary-precision integers, and it has a lot of external tools (numpy, pylab...). My advice would be to go with Python first and, if you have a performance issue, then switch to C++ (or use external libraries written in C++ from Python, in a hybrid approach).
There is no single good answer; it all depends on what you want to do in terms of research / calculation.

It goes without saying that C++ is going to be faster for intensive numeric computations. However, there are so many pre-existing libraries out there (written in C/C++/Haskell etc.) with Python wrappers that it's just more convenient to use Python and let the existing libraries carry the load.
One comprehensive system is http://www.sagemath.org and a fairly interesting link is the components it uses at http://sagemath.org/links-components.html.
A system with numpy/scipy and pandas from my experience is normally sufficient for most things.

Use the one you like better (and you should like python better :)).
In either case, any math-intensive computations should be carried out by existing libraries - which aren't language dependent (usually BLAS / LAPACK are used to perform the actual math).
If you choose to go with Python, use numpy for calculations.
Edit: From your comments, it seems that you are very concerned with the speed of your program. The only way to know for sure how much time is wasted by the high level pythonic code is to profile your program (for example, use ipython with run -p).
In most cases, you will find that the high level stuff takes about 10% of the total running time, and therefore switching from python to C++ will only improve that 10% by some factor, for a total gain of perhaps 5% in running time.
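The arithmetic behind that estimate is Amdahl's law. As a small sketch (the function name remaining_runtime is made up here, and the 2x figure is only an assumed speedup that happens to reproduce the 5% above):

```cpp
#include <cassert>

// Fraction of the original run time left after speeding up only part of a program.
// accelerated_fraction: share of run time spent in the part being rewritten (0..1).
// speedup: how many times faster that part becomes (> 0).
double remaining_runtime(double accelerated_fraction, double speedup) {
    assert(accelerated_fraction >= 0.0 && accelerated_fraction <= 1.0);
    assert(speedup > 0.0);
    return (1.0 - accelerated_fraction) + accelerated_fraction / speedup;
}
```

With 10% of the time in high-level code and a 2x speedup of that part, remaining_runtime(0.10, 2.0) gives 0.95, i.e. a 5% gain. Even an arbitrarily large speedup of that part can never save more than the 10% it accounts for.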

I sincerely doubt that Google and Stanford don't know C++.
"Generally faster" is more than just language. Algorithms can make or break a solution, regardless of what language it's written in. A poor choice written in C++ can be beaten by Java or Python if either makes a better algorithm choice.
For example, an in-memory, single CPU linear algebra library will have its doors blown in by a parallelized version done properly.
An implicit algorithm may actually be slower than an explicit one, despite time step stability restrictions, because the latter doesn't have to invert a matrix. This is often true for hyperbolic partial differential equations.
You shouldn't care about "generally faster". You ought to look deeply into the problem you're trying to solve and the algorithms used to solve it. You'll do better that way than a blind language choice.

I would go with Python running on the Java platform. This approach is implemented in the DataMelt program. Algorithms that call Java libraries from Python can be faster, since the JVM optimizes the code for you.

OpenCV: C++ and C performance comparison

Right now I'm developing an application using the OpenCV API (C++). This application processes video.
On the PC everything works really fast. Today I decided to port this application to Android (to use the camera as video input). Fortunately, there's OpenCV for Android, so I just added my native code to a sample Android application. Everything works fine except performance. I benchmarked my application and found that it runs at 4-5 fps, which is not acceptable (my device has a single-core 1 GHz processor); I want it to run at about 10 fps.
Does it make sense to fully rewrite my application in C? I know that things like std::vector are much more comfortable for the developer, but I don't care about that.
It seems that OpenCV's C interface has the same functions/methods as the C++ interface.
I googled this question but didn't find anything.
Thanks for any advice.

I've worked quite a lot with Android and optimizations (I wrote a video processing app that processes a frame in 4ms) so I hope I will give you some pertinent answers.
There is not much difference between the C and C++ interfaces in OpenCV. Some of the code is written in C and has a C++ wrapper, and some vice versa. Any significant differences between the two (as measured by Shervin Emami) are either regressions, bug fixes or quality improvements. You should stick with the latest OpenCV version.
Why not rewrite?
You will spend a good deal of time, which you could use much better. The C interface is cumbersome, and the chance to introduce bugs or memory leaks is high. You should avoid it, in my opinion.
Advice for optimization
A. Turn on optimizations.
Both compiler optimizations and the lack of debug assertions can make a big difference in your running time.
B. Profile your app.
Do it first on your computer, since it is much easier. Use the Visual Studio profiler to identify the slow parts. Optimize them. Never optimize because you think something is slow, but because you measured it. Start with the slowest function, optimize it as much as possible, then take the second slowest. Measure your changes to make sure the code is indeed faster.
C. Focus on algorithms.
A faster algorithm can improve performance by orders of magnitude (100x). A C++ trick will give you maybe a 2x performance boost.
Classical techniques:
Resize your video frames to be smaller. Often you can extract the information from a 200x300px image instead of a 1024x768 one. The area of the first one is about 13 times smaller.
Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.
Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?
D. Use C where it matters
In loops, it may make sense to use C style instead of C++. A pointer to a data matrix or a float array is much faster than mat.at or std::vector<>. Often the bottleneck is a nested loop. Focus on it. It doesn't make sense to replace vector<> all over the place and spaghettify your code.
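A rough sketch of that pattern, without pulling in OpenCV: the GrayImage struct below is a made-up stand-in for an 8-bit single-channel image, not an OpenCV type, and count_bright_pixels is a hypothetical example routine. With a real cv::Mat the per-row pointer would typically come from mat.ptr<uchar>(r) instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal stand-in for an 8-bit grayscale image (NOT an OpenCV type).
struct GrayImage {
    std::size_t rows;
    std::size_t cols;
    std::vector<std::uint8_t> pixels;  // row-major, rows * cols entries
};

// Count pixels brighter than `threshold`.
// The row pointer is computed once per row, so the inner loop is a plain
// walk over contiguous bytes with no per-pixel index arithmetic or checks.
std::size_t count_bright_pixels(const GrayImage& img, std::uint8_t threshold) {
    assert(img.pixels.size() == img.rows * img.cols);
    std::size_t count = 0;
    for (std::size_t r = 0; r < img.rows; ++r) {
        const std::uint8_t* row = img.pixels.data() + r * img.cols;
        for (std::size_t c = 0; c < img.cols; ++c) {
            if (row[c] > threshold) {
                ++count;
            }
        }
    }
    return count;
}
```

Only the hot nested loop is written this way; the rest of the program can keep using containers as usual.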
E. Avoid hidden costs
Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.
F. Use vectorization
ARM processors implement vectorization with a technology called NEON. Learn to use it. It is powerful!
A small example:
float* a, *b, *c;
// init a and b to 1000001 elements
for(int i=0;i<1000001;i++)
c[i] = a[i]*b[i];
can be rewritten as follows. It's more verbose, but much faster.
#include <arm_neon.h>
float *a, *b, *c;
// init a, b and c to 1000001 elements
float32x4_t a_, b_, c_;
int i;
// vector loop: runs only while a full group of 4 floats is left
for(i=0;i+4<=1000001;i+=4)
{
    a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
    b_ = vld1q_f32( &b[i] ); // load 4 floats from b
    c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parallel
    vst1q_f32( &c[i], c_); // store the four results in c
}
// the vector size is not always multiple of 4 or 8 or 16.
// Process the remaining elements
for(;i<1000001;i++)
    c[i] = a[i]*b[i];
Purists say you must write in assembler, but for a regular programmer that's a bit daunting. I had good results using gcc intrinsics, like in the above example.
Another way to jump-start is to convert handcoded SSE-optimized code in OpenCV into NEON. SSE is the NEON equivalent in Intel processors, and many OpenCV functions use it, like here. This is the image filtering code for uchar matrices (the regular image format). You shouldn't blindly convert instructions one by one, but take it as an example to start with.
You can read more about NEON in this blog and the following posts.
G. Pay attention to image capture
It can be surprisingly slow on a mobile device. Optimizing it is device and OS specific.

Before making any decision like this, you should profile your code to locate the hotspots in your code. Without this information, any changes you make to speed things up will be guesswork. Have you tried this Android NDK profiler?

There are some performance tests done by Shervin Emami on his website. You can check them to get some ideas.
http://www.shervinemami.info/timingTests.html
Hope it helps.
(And also, it would be nice if you share your own findings somewhere if you get any way for performance boost.)

I guess the question needs to be reformulated as: is C faster than C++? And the answer is no. Both are compiled to native machine code, and C++ is designed to be as fast as C.
The STL containers (especially the ISO standard ones) are also designed to be as fast as raw pointers, and they offer more flexibility.
The only reason to use C is that your platform doesn't support C++.
In my humble opinion, don't convert everything to C, as you'll probably get almost the same performance. Try instead to improve your code or use other OpenCV functionality to do what you want.
Not convinced? Then write a simple function, once in C and once in C++, run each in a loop 100 million times, and measure the time yourself. Maybe this helps you make the right decision.

I've never used C or C++ on Android. But on a PC you can get C++ to run as fast as C code (sometimes even faster). Most of C++ was designed specifically to allow more features, but not at the cost of speed (templates are resolved at compile time). Most compilers are pretty good at optimizing your code, and your std::vector calls will be inlined and the code will be almost the same as using a native C array.
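To check the std::vector claim on your own machine, a minimal sketch like the following can be used. The names sum_raw, sum_vector and time_ms are made up for this illustration, and the numbers you get depend entirely on your compiler and optimization flags.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sum a plain C-style array through a raw pointer.
double sum_raw(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Sum a std::vector through operator[]. With optimization enabled the
// accessor is normally inlined, so this loop should compile to much the
// same machine code as sum_raw.
double sum_vector(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Hypothetical helper: run a callable once and return elapsed milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Call both functions on the same large buffer inside time_ms and compare. Keep the returned sums (for example, print them) so the optimizer cannot drop the loops, and build with optimization turned on, otherwise the comparison says little.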
I'd suggest you look for another way of improving your performance. Maybe there are some multimedia hardware extensions in the Android you can get access to and use to optimize the code.

I noticed in multiple tests that:
The C interface (IplImage) is several times faster when accessing the pixels directly instead of using the Mat.at(x,y) method. When I converted my C++ application to C, I had a 3x performance increase in my blob detection routine.
C++ interface crashes in certain routines when called from external applications (e.g. LabView) whereas it works when calling the same routines in C. Example of this is FindContours and cvFindContours
C is far more compatible with embedded devices. However, I have not done anything in this field yet.

I had similar problems on iOS devices, and the discussion "Maximum speed from IOS/iPad/iPhone" includes some hints applicable to other mobile platforms too.

What language/platform would you recommend for CPU-bound application?

I'm developing a non-interactive CPU-bound application which does only computations, almost no IO. Currently it runs too long, and while I'm working on improving the algorithm, I also wonder whether changing the language or platform could bring any benefit. Currently it is C++ (no OOP, so it is almost C) on Windows, compiled with the Intel C++ compiler. Can switching to ASM help, and how much? Can switching to Linux and GCC help?

Just to be thorough: the first thing to do is to gather profile data and the second thing to do is consider your algorithms. I'm sure you know that, but they've got to be #included into any performance-programming discussion.
To be direct about your question "Can switching to ASM help?" the answer is "If you don't know the answer to that, then probably not." Unless you're very familiar with the CPU architecture and its ins and outs, it's unlikely that you'll do a significantly better job than a good optimizing C/C++ compiler on your code.
The next point to make is that significant speed-ups in your code (aside from algorithmic improvements) will almost certainly come from parallelism, not linear increases. Desktop machines can now throw 4 or 8 cores at a task, which has much more performance potential than a slightly better code generator. Since you're comfortable with C/C++, OpenMP is pretty much a no-brainer; it's very easy to use to parallelize your loops (obviously, you have to watch loop-carried dependencies, but it's definitely "the simplest parallelism that could possibly work").
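As a rough illustration of how little code an OpenMP loop needs, here is a minimal sketch of a parallel dot product. The function name dot is just an example. The reduction clause handles the one loop-carried dependency (the running sum); without OpenMP enabled at compile time (for example -fopenmp on gcc) the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Dot product of two equally sized vectors.
// Each thread accumulates a private partial sum; OpenMP combines the
// partial sums at the end of the loop because of reduction(+:sum).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[static_cast<std::size_t>(i)] * b[static_cast<std::size_t>(i)];
    }
    return sum;
}
```

Note that a parallel reduction may add the terms in a different order than the serial loop, so floating-point results can differ in the last bits.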
Having said all that, code generation quality does vary between C/C++ compilers. The Intel C++ compiler is well-regarded for its optimization quality and has full support not just for OpenMP but for other technologies such as the Threading Building Blocks.
Moving into the question of what programming languages might be even better than C++, the answer would be "programming languages that actively promote / facilitate concepts of parallelism and concurrent programming." Erlang is the belle of the ball in that regard, and is a "hot" language right now and most people interested in performance programming are paying at least some attention to it, so if you want to improve your skills in that area, you might want to check it out.

It's always algorithm, rarely language. Here's my clue: "while I'm working on improving the algorithm".
Tweaking may not be enough.
Consider radical changes to the algorithm. You've got to eliminate processing, not make the processing go faster. The culprit is often "search" -- looping through data looking for something. Find ways to eliminate search. If you can't eliminate it, replace linear search with some kind of tree search or a hash map of some kind.
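A small sketch of what "eliminate search" can look like in C++, assuming the task is to test many query values against a fixed data set (the function names are invented for this example). The first version scans the data for every query; the second builds a hash set once and then answers each query in expected constant time.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Linear search per query: roughly data.size() * queries.size() comparisons.
std::size_t count_hits_linear(const std::vector<int>& data,
                              const std::vector<int>& queries) {
    std::size_t hits = 0;
    for (int q : queries) {
        if (std::find(data.begin(), data.end(), q) != data.end()) ++hits;
    }
    return hits;
}

// Build a hash set once, then each lookup takes expected O(1) time.
std::size_t count_hits_hashed(const std::vector<int>& data,
                              const std::vector<int>& queries) {
    const std::unordered_set<int> index(data.begin(), data.end());
    std::size_t hits = 0;
    for (int q : queries) {
        if (index.count(q) != 0) ++hits;
    }
    return hits;
}
```

Both functions return the same count; only the amount of work differs, and the gap grows with the size of the data.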

Switching to ASM is not going to help much, unless you're very good at it and/or have a specific critical path routine which you know you can do better. As several people have remarked, modern compilers are just better in most cases at taking advantages of caching/etc. than anyone can do by hand.
I'd suggest:
Try a different compiler, and/or different optimization options
Run a code coverage/analysis utility, and figure out where the critical paths are, and work on optimizing those in the code
C++ should be able to give you very near the best possible performance from the code, so I wouldn't recommend switching the language. As another suggestion, depending on the app, you may be able to get better performance on multi-core/multi-processor systems by using multiple threads.

While just switching to asm won't give any benefits, since the Intel C++ Compiler is likely better at optimizing than you, you can try one of the following options:
Try a compiler that will parallelize your code, like the VectorC compiler.
Try to switch to asm with heavy use of MMX, 3DNow!, SSE or whatever fits your needs (and your CPU). This will give more of a benefit than pure asm.
You can also try GPGPU, i.e. execute large parts of your algorithm on a GPU instead of a CPU. Depending on your algorithm, it can be dramatically faster.
Edit: I also second the profile approach. I recommend AQTime, which supports the Intel C++ compiler.

Personally I'd look at languages which allow you to take advantage of parallelism most easily, unless it's a thoroughly non-parallelisable situation. Being able to bolt on some extra cores and get (if possible!) near-linear improvement may well be a lot more cost-effective than squeezing the extra few percent of efficiency out.
When it comes to parallelisation, I believe functional languages are often regarded as the best way to go, or you could look at OpenMP for C/C++. (Personally, as a managed language guy, I'd be looking at libraries for Java/.NET, but I quite understand that not everyone has the same preferences!)

Try Fortran 77 - when it comes to computations still nothing beats the granddaddy of programming languages. Also, try it with OpenMP to take advantage of multiple cores.

Hand optimizing your ASM code compared to what C++ can do for you is rarely cost effective.
If you've done anything you can to the algorithm from a traditional algorithmic view, and you've also eliminated excesses, then you may either be SOL, or you can consider optimizing your program from a hardware point of view.
For example, any time you follow a pointer around the heap you are paying a huge cost due to cache misses, possibly paging, etc., which all affect branching predictions. Most programmers (even C gurus) tend to look at the CPU from the functional standpoint rather than what happens behind the scenes. Sometimes reorganizing memory, for example by "flattening" or manually allocating memory to fit on the same page can obtain ENORMOUS speedups. I managed to get 2X speedups on graph traversals just by flattening my structures.
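One possible reading of "flattening", sketched under the assumption that the graph starts out as a vector of per-vertex neighbor vectors (each with its own heap allocation). The FlatGraph layout below is a common compressed adjacency format, not necessarily what the author used; all names are made up for the example.

```cpp
#include <cstddef>
#include <vector>

// All adjacency lists stored back to back in one contiguous buffer.
// The neighbors of vertex v are neighbors[offsets[v]] up to, but not
// including, neighbors[offsets[v + 1]].
struct FlatGraph {
    std::vector<std::size_t> offsets;
    std::vector<int> neighbors;
};

// Convert a pointer-heavy adjacency list into the flat layout.
FlatGraph flatten(const std::vector<std::vector<int>>& adj) {
    FlatGraph g;
    g.offsets.reserve(adj.size() + 1);
    g.offsets.push_back(0);
    for (const auto& list : adj) {
        g.neighbors.insert(g.neighbors.end(), list.begin(), list.end());
        g.offsets.push_back(g.neighbors.size());
    }
    return g;
}

// Example traversal step: add up the neighbor ids of one vertex.
// The inner loop walks consecutive memory, which is friendly to the cache.
long neighbor_sum(const FlatGraph& g, std::size_t v) {
    long total = 0;
    for (std::size_t i = g.offsets[v]; i < g.offsets[v + 1]; ++i) {
        total += g.neighbors[i];
    }
    return total;
}
```

Whether this helps, and by how much, depends on the access pattern, so measure before and after.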
These are not things that your compiler will do for you since they are based on your high-level understanding of the program.

As lobrien said, you haven't given us any information to tell you if hand-optimized ASM code would help... which means the answer is probably, "not yet."
Have you run your code with a profiler?
Do you know if the code is slow because of memory constraints or processor constraints?
Are you using all your available cores?
Have you identified any algorithms you're using that aren't O(1)? Can you get them to O(1)? If not, why not?
If you've done all that, how much control do you have over the environment your program is running in? (presumably a lot if you're thinking of switching operating systems) Can you disable other processes, give your process highest priority, etc? What about just finding a machine with a faster processor, more cores, or more memory (depending on what you're constrained on)
And on and on.
If you've already done all that and more, it's certainly possible you'll get to a point where you think, "I wonder if these few lines of code right here could be optimized better than the assembly that I'm looking at in the debugger right now?" And at that point you can ask specifically.
Good luck! You're solving a problem that's fun to solve.

Sometimes you can find libraries that have optimized implementations of the algorithms you care about. Often times they will have done the multithreading for you.
For example switching from LINPACK to LAPACK got us a 10x speed increase in LU factorization/solve with a good BLAS library.

First, figure out if you can change the algorithm, as S.Lott suggested.
Assuming the algorithm choice is correct, you might look at the memory access patterns if you are processing a lot of data. A lot of number-crunching applications these days are bound by the memory bus, not by the ALU(s). I recently optimized some code that was of the form:
// Assume N is a big number
for (int i=0; i<N; i++) {
myArray[i] = dosomething(i);
}
for (int i=0; i<N; i++) {
myArray[i] = somethingElse(myArray[i]);
}
...
and converted it to look like:
for (int i=0; i<N; i++) {
double tmp = dosomething(i);
tmp = somethingElse(tmp);
...
myArray[i] = tmp;
}
...
In this particular case, this yielded about a 2x speedup.

As Oregonghost already hinted, the VectorC compiler might help. It does not really parallelize the code, though; instead you can use it to take advantage of extended instruction sets like MMX or SSE. I used it for the most time-critical parts of a software rendering engine, and it resulted in a speedup of about 150%-200% on most processors.

For an alternative approach, you could look into Distributed Computing which sounds like it could suit your needs.

If you're sticking with C++ on the intel compiler, take a look at the compiler intrinsics (full reference here). I know that VC++ has similar functionality, and I'm sure you can do the same thing with gcc. These can let you take full advantage of the parallelism built into your CPU. You can use the MMX, SSE and SSE2 instructions to improve performance to a degree. Like others have said, you're probably best looking at the algorithm first.
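For a feel of what such intrinsics look like, here is a minimal sketch of an element-wise multiply using SSE. The HAVE_SSE macro and the function name multiply are inventions of this example; the intrinsics themselves (_mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps) come from xmmintrin.h. On targets without SSE the code falls back to the plain scalar loop.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1  // hypothetical macro, local to this example
#endif

// c[i] = a[i] * b[i] for i in [0, n)
void multiply(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE
    // Process 4 floats per iteration while a full group of 4 remains.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar tail (or the whole array when SSE is not available).
    for (; i < n; ++i) c[i] = a[i] * b[i];
}
```

A loop this simple is often auto-vectorized by the compiler anyway, so intrinsics tend to pay off mainly in loops the compiler cannot vectorize on its own.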

I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand maybe what you are trying to calculate just takes a lot of computing time. Have you considered to make it distributed so it can run in a cluster of some sort? If you want to focus on pure code optimization by introducing Assembler for your inner loops then often that can be very beneficial (if you know what you're doing).

For modern processors, learning ASM will take you a long time. Further, with all the different versions of SSE around, your code will end up very processor dependent.
I do quite a lot of CPU-bound work, and have found that the difference between intel's C++ compiler and g++ usually isn't that big (at most 15% or so), and there is no measurable difference between Mac OS X, Windows and Linux.
You are going to have to optimise your code and improve your algorithm by hand. There is no "magic fairy dust" which can make existing code that much faster I'm afraid.
If you haven't yet, and you care about performance, you MUST run your code through a good profiler (personally, I like kcachegrind & valgrind on Linux, or Shark on Mac OS X. I don't know what is good for windows I'm afraid).
Based on my past experience, there is a very good chance you'll find some method is taking 95% of your CPU time, and some simple change or addition of caching will make a massive improvement to your performance. On a similar note, if some method is only taking 1% of your CPU time, no amount of optimising is going to gain you anything.
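As an illustration of the "addition of caching" idea, here is a minimal memoization sketch. The expensive function is a made-up stand-in (it counts Collatz steps and assumes n >= 1); in real code it would be whatever the profiler shows being called repeatedly with the same arguments.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for an expensive pure function.
// Counts Collatz steps until n reaches 1; requires n >= 1.
std::uint64_t collatz_steps(std::uint64_t n) {
    std::uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        ++steps;
    }
    return steps;
}

// Wraps collatz_steps with a cache: repeated arguments are answered
// from the map instead of being recomputed.
class CachedSteps {
public:
    std::uint64_t operator()(std::uint64_t n) {
        const auto it = cache_.find(n);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;
        }
        const std::uint64_t result = collatz_steps(n);
        cache_.emplace(n, result);
        return result;
    }
    std::size_t hits() const { return hits_; }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> cache_;
    std::size_t hits_ = 0;
};
```

This only pays off when the same inputs recur and the function has no side effects, which is exactly what a profile run should tell you.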

The 2 obvious answers to "CPU-bound" are:
1. Use more CPU (core)s
2. Use something else.
Using 2 threads instead of 1 will cut the time spent by up to 50%. In comparison, C++ to ASM rarely gives you 5% (and for novice ASM programmers, it's often -5%!). Some problems scale well, and may benefit from 8 or 16 cores. That kind of hardware is still pretty mainstream, so see if your problems fall in that category.
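The "up to" matters: how close you get depends on how much of the run time can actually be spread over cores. Amdahl's law (not named in the answer, but the usual way to estimate this) gives the best-case speedup; a small sketch with an illustrative function name:

```cpp
// Best-case speedup according to Amdahl's law.
// parallel_fraction: share of the serial run time that can be parallelized (0..1)
// workers: number of cores or threads used (>= 1)
double amdahl_speedup(double parallel_fraction, int workers) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

With a fully parallel workload, 2 workers give a 2x speedup (the 50% time cut mentioned above). If only half of the run time is parallelizable, 2 workers give about 1.33x, and no number of cores can get past 2x.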
The other solution is to throw more specialized hardware at the task. This could be the vector unit of your CPU - considering Windows=x86/x64, that's going to be a flavor of SSE. Another kind of vector hardware is the modern GPU. The GPU also has its own memory bus, which is quite speedy.

First get the lead out. Then if it's as fast as it can possibly be without going to ASM, so be it. But thinking you have to go to ASM assumes you know what's making it slow, and I'll bet a donut that you're guessing.

If you feel you have optimized your code to the point where no further improvement is possible, increase the number of CPUs. This can be done on different platforms. One I develop with is Appistry. A few links:
http://www.appistry.com/resource-library/index.html
and you can download the product free from here:
http://www.appistry.com/developers/
I work for Appistry and we have done many installations for tasks that were cpu bound by spreading work out over 10's or 100's of machines.
Hope this helps,
-Brett

Probable small help:
Optimization of 64-bit programs
AMD64 (EM64T) architecture
Debugging and optimization of multi-thread OpenMP-programs
Introduction into the problems of developing parallel programs
Development of Resource-intensive Applications in Visual C++

Linux
Switching to Linux can help, if you strip it down to only the parts you actually need.

CrowdProcess has about 2000 workers you can use to compute your algorithm. The API is extremely simple and we've been observing speedups close to the number of workers. Also you can write Javascript which should make you more productive than C++ or ASM.
So if you're in between C++ or ASM, I'd say you should first use all your CPU cores, then if it's not enough, CrowdProcess should be an interesting platform.
Disclaimer: I built CrowdProcess.

It is hard to produce ASM code that is faster than naive C or C++ code. Even if you do the job really well, you will probably gain no more than a few percent; a 10% speedup is considered a great success, and in most cases it is simply out of reach.
Compilers are capable of understanding how to compile efficiently. You should profile in order to figure out where to optimize.
