Python maxes one core while c++ doesn't - c++

I wrote a simple program(calculating the number of steps for the numbers 1-10^10 using the collatz conjecture)in both python and c++. They were nearly identical, and neither were written for multithreading. I ran the one in python and according to my system manager, one core went straight to 100% usage, the others staying the same. I ran the c++ program and the cores stayed at their same fluctuating between 10 and 15% usage states, never really changing. They both completed around the same time, within seconds. Could someone explain to me why this is happening?

Python is, in general, quite slow at raw number crunching. This is because it uses its full, general purpose object model for everything, including numbers. You can contrast this with Java and C++, which have "native types" which don't offer any of the niceties of a real class (methods, inheritance, data attributes, etc), but do offer access to the raw speed of the underlying CPU.
So, x = a + b in C++ generally has far less work to do at runtime than x = a + b in Python, despite the superficially identical syntax. Python's unified object model is one of the things that makes it comparatively easy to use, but it can have a downside on the raw speed front.
There are multiple alternative approaches to recovering that lost speed:
use a custom C extension to drop back down to raw CPU calculations and recover the speed directly
use an existing numeric library to do the same thing
use a just-in-time compiler (e.g. via the psyco project or PyPy)
use multiprocessing or concurrent.futures to take advantage of multiple cores, or even a distributed computing library to make use of multiple machines
P.S. This is a much better question now that the algorithm is described :)

Related

C++ or Python for an Extensive Math Program?

I'm debating whether to use C++ or Python for a largely math-based program.
Both have great math libraries, but which language is generally faster for complex math?
You could also consider a hybrid approach. Python is generally easier and faster to develop in, specially for things like user interface, input/output etc.
C++ should certainly be faster for some math operations (although if your problem can be formulated in terms of vector operations or linear algebra than numpy provides a python interface to very efficient vector manipulations).
Python is easy to extend with Cython, Swig, Boost Python etc. so one strategy is write all the bookkeeping type parts of the program in Python and just do the computational code in C++.
I guess it is safe to say that C++ is faster. Simply because it is a compiled language which means that only your code is running, not an interpreter as with python.
It is possible to write very fast code with python and very slow code with C++ though. So you have to program wisely in any language!
Another advantage is that C++ is type safe, which will help you to program what you actually want.
A disadvantage in some situations is that C++ is type safe, which will result in a design overhead. You have to think (maybe long and hard) about function and class interfaces, for instance.
I like python for many reasons. So don't understand this a plea against python.
It all depends if faster is "faster to execute" or "faster to develop". Overall, python will be quicker for development, c++ faster for execution. For working with integers (arithmetic), it has full precision integers, it has a lot of external tools (numpy, pylab...) My advice would be go python first, if you have performance issue, then switch to cpp (or use external libraries written in cpp from python, in an hybrid approach)
There is no good answer, it all depends on what you want to do in terms of research / calculus
It goes witout saying that C++ is going to be faster for intensive numeric computations. However, there are so many pre-existing libraries out there (written in C/C++/Haskell etc..), with Python wrappers - it's just more convenient to utilise the convenience of Python and let the existing libraries carry the load.
One comprehensive system is http://www.sagemath.org and a fairly interesting link is the components it uses at http://sagemath.org/links-components.html.
A system with numpy/scipy and pandas from my experience is normally sufficient for most things.
Use the one you like better (and you should like python better :)).
In either case, any math-intensive computations should be carried out by existing libraries - which aren't language dependent (usually BLAS / LAPACK are used to perform the actual math).
If you choose to go with python, use numpy
for calculations.
Edit: From your comments, it seems that you are very concerned with the speed of your program. The only way to know for sure how much time is wasted by the high level pythonic code is to profile your program (for example, use ipython with run -p).
In most cases, you will find that the high level stuff takes about 10% of the total running time, and therefore switching from python to C++ will only improve that 10% by some factor, for a total gain of perhaps 5% in running time.
I sincerely doubt that Google and Stanford don't know C++.
"Generally faster" is more than just language. Algorithms can make or break a solution, regardless of what language it's written in. A poor choice written in C++ and be beaten by Java or Python if either makes a better algorithm choice.
For example, an in-memory, single CPU linear algebra library will have its doors blown in by a parallelized version done properly.
An implicit algorithm may actually be slower than an explicit one, despite time step stability restrictions, because the latter doesn't have to invert a matrix. This is often true for hyperbolic partial differential equations.
You shouldn't care about "generally faster". You ought to look deeply into the problem you're trying to solve and the algorithms used to solve it. You'll do better that way than a blind language choice.
I would go with Python running on Java platform. This approach is implemented in DataMelt program. Algorithm that call Java libraries from Python can be faster, since JVM optimizes the code for you.

OpenCV: C++ and C performance comparison

Right now I'm developing some application using OpenCV API (C++). This application does processing with video.
On the PC everything works really fast. And today I decided to port this application on Android (to use camera as videoinput). Fortunately, there's OpenCV for Android so I just added my native code to sample Android application. Everything works fine except perfomance. I benchmarked my application and found that application works with 4-5 fps, what is actually not acceptable (my device has singlecore 1ghz processor) - I want it to work with about 10 fps.
Does it make a sence to fully rewrite my application on C? I know that using such things as std::vector is much comfortable for developer, but I don't care about it.
It seems that OpenCV's C interface has same functions/methods as C++ interface.
I googled this question but didn't find anything.
Thanks for any advice.
I've worked quite a lot with Android and optimizations (I wrote a video processing app that processes a frame in 4ms) so I hope I will give you some pertinent answers.
There is not much difference between the C and C++ interface in OpenCV. Some of the code is written in C, and has a C++ wrapper, and some viceversa. Any significant differences between the two (as measured by Shervin Emami) are either regressions, bug fixes or quality improvements. You should stick with the latest OpenCV version.
Why not rewrite?
You will spend a good deal of time, which you could use much better. The C interface is cumbersome, and the chance to introduce bugs or memory leaks is high. You should avoid it, in my opinion.
Advice for optimization
A. Turn on optimizations.
Both compiler optimizations and the lack of debug assertions can make a big difference in your running time.
B. Profile your app.
Do it first on your computer, since it is much easier. Use visual studio profiler, to identify the slow parts. Optimize them. Never optimize because you think is slow, but because you measure it. Start with the slowest function, optimize it as much as possible, then take the second slower. Measure your changes to make sure it's indeed faster.
C. Focus on algorithms.
A faster algorithm can improve performance with orders of magnitude (100x). A C++ trick will give you maybe 2x performance boost.
Classical techniques:
Resize you video frames to be smaller. Often you can extract the information from a 200x300px image, instead of a 1024x768. The area of the first one is 10 times smaller.
Use simpler operations instead of complicated ones. Use integers instead of floats. And never use double in a matrix or a for loop that executes thousands of times.
Do as little calculation as possible. Can you track an object only in a specific area of the image, instead of processing it all for all the frames? Can you make a rough/approximate detection on a very small image and then refine it on a ROI in the full frame?
D. Use C where it matters
In loops, it may make sense to use C style instead of C++. A pointer to a data matrix or a float array is much faster than mat.at or std::vector<>. Often the bottleneck is a nested loop. Focus on it. It doesn't make sense to replace vector<> all over the place and spaghettify your code.
E. Avoid hidden costs
Some OpenCV functions convert data to double, process it, then convert back to the input format. Beware of them, they kill performance on mobile devices. Examples: warping, scaling, type conversions. Also, color space conversions are known to be lazy. Prefer grayscale obtained directly from native YUV.
F. Use vectorization
ARM processors implement vectorization with a technology called NEON. Learn to use it. It is powerful!
A small example:
float* a, *b, *c;
// init a and b to 1000001 elements
for(int i=0;i<1000001;i++)
c[i] = a[i]*b[i];
can be rewritten as follows. It's more verbose, but much faster.
float* a, *b, *c;
// init a and b to 1000001 elements
float32x4_t _a, _b, _c;
int i;
for(i=0;i<1000001;i+=4)
{
a_ = vld1q_f32( &a[i] ); // load 4 floats from a in a NEON register
b_ = vld1q_f32( &b[i] );
c_ = vmulq_f32(a_, b_); // perform 4 float multiplies in parrallel
vst1q_f32( &c[i], c_); // store the four results in c
}
// the vector size is not always multiple of 4 or 8 or 16.
// Process the remaining elements
for(;i<1000001;i++)
c[i] = a[i]*b[i];
Purists say you must write in assembler, but for a regular programmer that's a bit daunting. I had good results using gcc intrinsics, like in the above example.
Another way to jump-start is to convrt handcoded SSE-optimized code in OpenCV into NEON. SSE is the NEON equivalent in Intel processors, and many OpenCV functions use it, like here. This is the image filtering code for uchar matrices (the regular image format). You should't blindly convert instructions one by one, but take it as an example to start with.
You can read more about NEON in this blog and the following posts.
G. Pay attention to image capture
It can be surprisingly slow on a mobile device. Optimizing it is device and OS specific.
Before making any decision like this, you should profile your code to locate the hotspots in your code. Without this information, any changes you make to speed things up will be guesswork. Have you tried this Android NDK profiler?
There is some performance tests done by shervin imami on his website. You can check it to get some ideas.
http://www.shervinemami.info/timingTests.html
Hope it helps.
(And also, it would be nice if you share your own findings somewhere if you get any way for performance boost.)
I guess the question needs to be formulated to: is C faster than C++? and the answer is NO. Both are compiled to the native machine language and C++ is designed to be as fast as C
As for the STL (espeically ISO standard) are also designed and taken care that they are as fast as pointers + they offer flexibility.
The only reason to use C is that your platform doesn't support C++
In my humble openion, don't convert everything to C, as you'll probably get almost the same performance. and try instead to improve your code or use other functionalities of opencv to do what you want.
Not convinced? well then write a simple function, once in C and once in C++, and run it in a loop of 100 million times and measure the time yourself. Maybe this helps you taking the right decision
I've never used C or C++ in Android. But in a PC you can get C++ to run as fast as C code (sometimes even faster). Most of C++ was designed specifically to allow more features, but not at the cost of speed (Templates are solved at compile time). Most compilers are pretty good at optimizing your code, and your std::vector calls will be inlined and the code will be almost the same as using a native C array.
I'd suggest you look for another way of improving your performance. Maybe there are some multimedia hardware extensions in the Android you can get access to and use to optimize the code.
I noticed in multiple tests that:
C interface (IplImage) is a number of times faster when accessing the pixels directly instead of using Mat.at(x,y) method, when I converted my C++ application to C, I had a 3x performance increase in my blob detection routine
C++ interface crashes in certain routines when called from external applications (e.g. LabView) whereas it works when calling the same routines in C. Example of this is FindContours and cvFindContours
C is far more compatible with embedded devices. However, I have not done anything in this field yet.
I had similar problems on IOS devices and discussion Maximum speed from IOS/iPad/iPhone includes some hints applicable to other mobile platforms too.

processing an image using CUDA implementation, python (pycuda) or C++?

I am in a project to process an image using CUDA. The project is simply an addition or subtraction of the image.
May I ask your professional opinion, which is best and what would be the advantages and disadvantages of those two?
I appreciate everyone's opinions and/or suggestions since this project is very important to me.
General answer: It doesn't matter. Use the language you're more comfortable with.
Keep in mind, however, that pycuda is only a wrapper around the CUDA C interface, so it may not always be up-to-date, also it adds another potential source of bugs, …
Python is great at rapid prototyping, so I'd personally go for Python. You can always switch to C++ later if you need to.
If the rest of your pipeline is in Python, and you're using Numpy already to speed things up, pyCUDA is a good complement to accelerate expensive operations. However, depending on the size of your images and your program flow, you might not get too much of a speedup using pyCUDA. There is latency involved in passing the data back and forth across the PCI bus that is only made up for with large data sizes.
In your case (addition and subtraction), there are built-in operations in pyCUDA that you can use to your advantage. However, in my experience, using pyCUDA for something non-trivial requires knowing a lot about how CUDA works in the first place. For someone starting from no CUDA knowledge, pyCUDA might be a steep learning curve.
Take a look at openCV, it contains a lot of image processing functions and all the helpers to load/save/display images and operate cameras.
It also now supports CUDA, some of the image processing functions have been reimplemented in CUDA and it gives you a good framework to do your own.
Alex's answer is right. The amount of time consumed in the wrapper is minimal. Note that PyCUDA has some nice metaprogramming constructs for generating kernels which might be useful.
If all you're doing is adding or subtracting elements of an image, you probably shouldn't use CUDA for this at all. The amount of time it takes to transfer back and forth across the PCI-E bus will dwarf the amount of savings you get from parallelism.
Any time you deal with CUDA, it's useful to think about the CGMA ratio (computation to global memory access ratio). Your addition/subtraction is only 1 float point operation for 2 memory accesses (1 read and 1 write). This ends up being very lousy from a CUDA perspective.

Comparison of performance between Scala etc. and C/C++/Fortran?

I wonder if there is any reliable comparison of performance between "modern" multithreading-specialized languages like e.g. scala and "classic" "lower-level" languages like C, C++, Fortran using parallel libs like MPI, Posix or even Open-MP.
Any links and suggestions welcome.
Given that Java, and, therefore, Scala, can call external libraries, and given that those highly specialized external libraries will do most of the work, then the performance is the same as long as the same libraries are used.
Other than that, any such comparison is essentially meaningless. Scala code runs on a virtual machine which has run-time optimization. That optimization can push long-running programs towards greater performance than programs compiled with those other languages -- or not. It depends on the specific program written in each language.
Here's another non-answer: go to your local supercomputer centre and ask what fraction of the CPU load is used by each language you are interested in. This will only give you a proxy answer to your question, it will tell you what the people who are concerned with high performance on such machines use when tackling the kind of problem that they tackle. But it's as instructive as any other answer you are likely to get for such a broad question.
PS The answer will be that Fortran, C and C++ consume well in excess of 95% of the CPU cycles.
I'd view such comparisons as a fraction. The numerator is a constant (around 0.00001, I believe). The denominator is the number of threads multiplied by the number of logical processors.
IOW, for a single thread, the comparison has about a one chance in a million of meaning something. For a quad core processor running an application with (say) 16 threads, you're down to one chance in 64 million of a meaningful result.
In short, there are undoubtedly quite a few people working on it, but the chances of even a single result from any of them providing a result that's useful and meaningful is still extremely low. Worse, even if one of them really did mean something, it would be almost impossible to find, and even more difficult to verify to the point that you actually knew it meant something.

Will it be possible to run C code emulated on GA144?

This company have an interesting CPU that run at an amazing speed. Will it be possible to emulate C or is the memory too small?
There is C translator for SEAforth40 chip (previous version of GA144 chip)
Presentation:
http://www.asu.ru/files/documents/00002990.pdf
A first cursory glance at the instruction set suggests that "colorForth" can be thought of as a simple machine language. Given that, it may be possible to write a C compiler that compiles to colorForth as its target instruction set.
Of course, it may be easier to write code in colorForth in the first place.
From the looks of it, if someone writes a compiler which can output the machine code (33 instructions, not too complex), you won't need to emulate C, you could just directly compile it.
Of course, it would be extremely limited, since it looks like each chip gets a tiny amount of internal RAM (64 words isn't a lot to work with). There's an 18-bit memory address port attached to one of the cores, so you can have 256MB of external RAM, but it can only be directly accessed by a single one of the cores, and then it would need to be passed to the other.
It's possible that different cores could be used for different functions, but that would complicate the compiler quite a bit.
It could be done, but their interpreter should handle parallel tasks, load distribution, etc. It will probability be best to just go with their Forth interpreter.
Chlorophyll has some ideas of general interest. I also happens to look similar to C:
We developed Chlorophyll, a synthesis-aided programming model and
compiler for the GreenArrays GA144, an extremely minimalist
low-power spatial architecture that requires partitioning the program
into fragments of no more than 256 instructions and 64 words of
data. This processor is 100-times more energy efficient than its
competitors, but currently can only be programmed using a low-level
stack-based language. The Chlorophyll programming model allows
programmers to provide human insight by specifying partial
partitioning of data and computation. The Chlorophyll compiler relies
on synthesis, sidestepping the need to develop classical
optimizations, which may be challenging given the unusual
architecture. To scale synthesis to real problems, we decompose the
compilation into smaller synthesis subproblems—partitioning, layout,
and code generation. We show that the synthesized programs are no more
than 65% slower than highly optimized expert-written programs and are
faster than programs produced by a heuristic, non-synthesizing version
of our compiler.
http://www.eecs.berkeley.edu/~mangpo/www/talks/1311_forthday_handout.pdf
http://www.eecs.berkeley.edu/~nishant/papers/Chlorophyll.pdf
You would need to use external memory, but apart from that, it is certainly doable, according to this white paper by Greg Bailey:
It would not be difficult to build a virtual machine supporting C,
and there are many people and companies in the US alone for whom
building such a machine and completing a “port” of the C language
compiler and library to the virtual machine would be simply a
repetition of something they had done before. Once this has been
done, the GreenArray chip can run any C program which fits in the
external memory and will satisfy any C application requirement that
is met by the resulting execution speed.
-- excerpt from page 4
He also discuss their implementation of a eForth virtual machine in that paper.