(How) Does the compiler compile a monolithic program as a threaded one? - c++

I wrote a monolithic program that is quite demanding on the processor. Since I have a dual-core machine, I figured that only one CPU should be at 100%. But both my CPUs are at 100% all the time. Now I am guessing that my compiler somehow turned my monolithic application into a threaded one. What are the limits of this optimization feature, and when is it still necessary to explicitly make something threaded?
I am using gcc on 64-bit Ubuntu Linux.

It doesn't, at least not without using something like Cilk. You must be inadvertently using multiple threads (or processes) without realizing it. Perhaps you're using a third-party library that creates an extra thread or two in your process?
[EDIT]
As per the comments, use a program like top(1) to verify that it is in fact your program's process that is using both CPUs at 100%. In your case, the Xorg process is jumping to 100% because your program is producing a large amount of output.

Any calls to the OS, or to other libraries (the CRT, for instance), may use other threads as well. I would hardly be surprised if the console ran in its own thread, and if you're doing a lot of I/O of any sort, that could cause the other CPU to max out.
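If you want to check from inside the program whether some library has quietly started extra threads, one rough option on Linux is to count the entries in /proc/self/task, which holds one directory per thread of the calling process. This is only a sketch under that assumption; the function name count_own_threads is made up for illustration, and it returns 0 on systems where that directory does not exist.

```cpp
#include <cstddef>
#include <dirent.h>

// Linux-only sketch: /proc/self/task contains one entry per thread of the
// calling process. Returns 0 if the directory cannot be opened.
std::size_t count_own_threads() {
    DIR* dir = opendir("/proc/self/task");
    if (dir == nullptr) {
        return 0;
    }
    std::size_t count = 0;
    while (dirent* entry = readdir(dir)) {
        // Skip "." and ".."; every other entry is named after a thread ID.
        if (entry->d_name[0] != '.') {
            ++count;
        }
    }
    closedir(dir);
    return count;
}
```

A truly single-threaded program should report 1 here; anything higher means something in the process created threads.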

Related

Speed performance of a Qt program: Windows vs Linux

I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my chance here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few details:
The computations that I make the program do are just brute-force mathematical calculations: basically, it computes products and cosines in a loop that runs a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores @1.73GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity, and according to the small graph, both cores are running at 100%.
Then I thought the C++ compiler for Windows didn’t use the optimization options (things like -O1, -O2) that the C++ compiler for Linux automatically used (in release build), but apparently it does.
I’m bothered that the app is so much slower (2 to 4 times) on Windows, and it’s really weird. On the other hand, I haven’t tried on other computers with Windows yet. Still, do you have any idea what causes the difference?
Additional info: some data…
Even though Windows seems to be using the two cores, I’m thinking this might have something to do with threads management, here’s why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2-core laptop (@1.73GHz) with Windows 7
PC1-linux = my 2-core laptop (@1.73GHz) with Ubuntu 12.04
PC2-linux = my 8-core laptop (@2.20GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores @~3GHz, don't remember exactly) under Mac OS, speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in debug build.
Lifetime of threads etc: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what is being computed), and they are terminated when the computation ends. Nothing too fancy. Every thread just does brute-force computations (except one, actually, which makes a (not so) small computation every 30ms, basically checking whether the error is small enough).
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation only uses 1 thread; that way we are pretty much comparing the performance on "pure C++ code". It turned out that now Windows was only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to thread management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the compiler for Microsoft Visual Studio (MSVC), instead of MinGW. And surprise, the computation (with all the threads and everything) was now "only" 20% to 50% slower than Linux! I think I'm going to go ahead and be content with that. I noticed that weirdly though, the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC except that it handles threads like a moron.
So, I’m thinking either there’s something I can do to make MinGW (ideally I’d rather use it than MSVC) handle threads better, or it just can’t. I would be amazed: how could it not be well known and documented? Although I guess I should be careful about drawing conclusions too quickly, I’ve only compared things on one computer (for the moment).
Another possibility: on Linux the Qt libraries may already be loaded (for example, if you use KDE), while on Windows the libraries must be loaded first, which adds to the computation time. To check how much time library loading costs your application, you could write a dummy test in pure C++ code.
I have noticed exactly the same behavior on my PC.
I am running Windows 7(64bits), Ubuntu (64bits) and OSX (Lion 64bits) and my program compares 2 XML files (more than 60Mb each). It uses Multithreading too (2 threads) :
-Windows : 40sec
-Linux : 14sec (!!!)
-OSX : 22sec.
I use a personal class for threads (and not Qt one) which uses "pthread" on linux/OSX and "threads" on windows.
I use Qt/mingw compiler as I need the XML class from Qt.
I have found no way (for now) to have the 3 OS having similar performances... but I hope I will !
I think that another reason may be the memory : my program uses about 500Mb of RAM. So I think that Unix is managing it best because, in mono-thread, Windows is exactly 1.89 times slower and I don't think that Linux could be more than 2 times slower !
I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).
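To make the "flushing too often" point concrete, here is a small sketch with standard C++ streams. It is not the SQLite case described above, just the same pattern in miniature: the first function flushes after every line (std::endl does that), the second lets the stream buffer and flushes once when the file is closed. The function names and the idea of writing line numbers are made up for illustration. Note that a stream flush only hands the data to the OS; fsync() goes further and forces it to disk, so per-record fsync() calls are more expensive still.

```cpp
#include <cstddef>
#include <fstream>
#include <ostream>
#include <string>

// Flushes the stream buffer after every single line (std::endl = '\n' + flush).
void write_flush_each(const std::string& path, std::size_t lines) {
    std::ofstream out(path);
    for (std::size_t i = 0; i < lines; ++i) {
        out << i << std::endl;
    }
}

// Writes the same content, but '\n' does not flush; the buffer is written
// out when it fills up and once more when the file is closed.
void write_flush_once(const std::string& path, std::size_t lines) {
    std::ofstream out(path);
    for (std::size_t i = 0; i < lines; ++i) {
        out << i << '\n';
    }
}
```

Both functions produce identical files; timing them with a large line count on each OS would show whether per-write flushing is part of the gap you are seeing.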
You might experience performance differences by how mutexes run on Windows and Linux.
Plain mutex code on Windows can incur a wait of about 15ms every time there is contention for a resource when locking. A better-performing synchronization mechanism on Windows is the critical section, which in most cases does not suffer the locking penalty that regular mutexes do.
I have found that on Linux, regular mutexes perform the same as Critical Sections on Windows.
It's probably the memory allocator; try using jemalloc or tcmalloc from Google. Glibc's allocator (derived from ptmalloc2) is significantly better than the old crusty allocator in MSVC's CRT. The comparable option from Microsoft is the Concurrency CRT, but you cannot simply drop it in as a replacement.

Increasing C++ Program CPU Use

I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing usage of CPU, and if so, how.
That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.
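A quick bit of arithmetic behind the first case: task managers usually report CPU use as a share of all logical cores, so one fully busy thread can never show more than 100/N percent on an N-core machine. The sketch below just spells that out; the function names are made up, and std::thread::hardware_concurrency() is allowed to return 0 when the core count is unknown.

```cpp
#include <thread>

// Highest share of total CPU (in percent) that a single busy thread can
// show on a machine with `cores` logical cores. Treats 0 as "unknown".
double single_thread_share_percent(unsigned cores) {
    return cores == 0 ? 100.0 : 100.0 / cores;
}

// Number of logical cores reported by the standard library (0 if unknown).
unsigned detected_cores() {
    return std::thread::hardware_concurrency();
}
```

On an 8-core machine this gives 12.5%, so a single-threaded program sitting near 10% may already be using everything one thread can get.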
Assuming your application has the power (PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priority (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (I would strongly suggest that you probably shouldn't give this one a shot).
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.
It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.
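For reference, a request like that could look roughly like the sketch below. SetPriorityClass, GetCurrentProcess and ABOVE_NORMAL_PRIORITY_CLASS are the real Win32 names; the wrapper function raise_process_priority is made up for illustration, and on anything other than Windows it simply does nothing and returns false.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Asks Windows to schedule this process at "above normal" priority.
// Returns true on success; always returns false on non-Windows builds.
bool raise_process_priority() {
#ifdef _WIN32
    return SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS) != 0;
#else
    return false;
#endif
}
```

This only changes how the scheduler ranks the process against others; it will not make a single-threaded or I/O-bound program use more cores.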
Creating a thread & giving higher priority to the thread might be one way.
If you use C++, consider using Intel Threading Building Blocks. You can find some examples here.
Some profilers give very nice indications of where the bottlenecks in your code are. For example, CodeAnalyst (for AMD chips only) shows the instructions-per-cycle ratio. I'm sure Intel's profilers are similar.
As Billy O'Neal says, though, if you're running on an 8-core machine, being stuck at around 10 percent of CPU is about right (one fully busy core out of eight is 12.5%). If this is your problem, then MSVC on Windows has the Parallel Patterns Library, which offers parallel versions of the standard algorithms. This can give you parallelisation almost for free if you have written your loops the C++ way (it's still your responsibility to make sure your loops are thread-safe). I've not used the MSVC version, but the GNU parallel mode equivalents (__gnu_parallel::for_each etc.) work a treat.

Does/can Valgrind use multiple processors?

Is there a way to get valgrind to use multiple processors?
I'm doing some bottleneck profiling with valgrind's callgrind and noticed significantly different resource usage behavior in my application vs when run outside of valgrind/callgrind.
When run outside valgrind, it maxes out several processors, but run inside valgrind only uses one. This makes me worry that my bottle necks will be in different places, and thus invalidate my profiling.
According to the Valgrind Docs, they do not support multiple processors:
The main thing to point out with respect to threaded programs is that your program will use the native threading library, but Valgrind serialises execution so that only one (kernel) thread is running at a time. This approach avoids the horrible implementation problems of implementing a truly multithreaded version of Valgrind, but it does mean that threaded apps run only on one CPU, even if you have a multiprocessor or multicore machine.
Valgrind doesn't schedule the threads itself. It merely ensures that only one thread runs at once, using a simple locking scheme. The actual thread scheduling remains under control of the OS kernel. What this does mean, though, is that your program will see very different scheduling when run on Valgrind than it does when running normally. This is both because Valgrind is serialising the threads, and because the code runs so much slower than normal.
This difference in scheduling may cause your program to behave differently, if you have some kind of concurrency, critical race, locking, or similar, bugs. In that case you might consider using the tools Helgrind and/or DRD to track them down.
Take a look at:
http://valgrind.org/docs/manual/manual-core.html#manual-core.pthreads_perf_sched
They added a --fair-sched option, which may help.

How to profile multi-threaded C++ application on Linux?

I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, its output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago and in my gprof output, it appears my gprof is listing functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?
Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications---as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications)
Have a look at poor man's profiler. Surprisingly there are few other tools that for multithreaded applications do both CPU profiling and mutex contention profiling, and PMP does both, while not even requiring to install anything (as long as you have gdb).
Try perf (perf_events), the modern Linux profiling tool: https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report
Have a look at Valgrind.
As Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.
I tried valgrind and gprof. It is a crying shame that neither of them works well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.
You can randomly run pstack to find out the stack at a given point. E.g. 10 or 20 times.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you have also sar to measure cpu, memory, i/o, etc. to profile the overall system.
Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).
Putting a slightly different twist on matters, you can actually get a pretty good idea as to what's going on in a multithreaded application using ftrace and kernelshark. Collect the right trace and press the right buttons, and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).
Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.

Force Program / Thread to use 100% of processor(s) resources

I do some C++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to perform and output a result; however, they only consume 50% of my core duo. I tried the code on another dual processor based machine with the same result.
Is there a way to force a program to use all available processor resources and memory?
Note: I'm using ubuntu and g++
A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]
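As a minimal sketch of the "do half the work in another thread" idea, here is a sum split across two threads. It uses std::thread from the C++11 standard library rather than Boost.Thread or pthreads (that is my substitution, the interface is very similar), and summing a vector is just a stand-in for whatever your real computation is. The function names are made up.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums the elements in [begin, end) of v.
long long sum_range(const std::vector<int>& v, std::size_t begin, std::size_t end) {
    return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(begin),
                           v.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
}

// A worker thread sums the first half while the calling thread sums the
// second half; the two partial results are combined after join().
long long sum_with_two_threads(const std::vector<int>& v) {
    const std::size_t mid = v.size() / 2;
    long long first_half = 0;
    std::thread worker([&v, &first_half, mid] { first_half = sum_range(v, 0, mid); });
    const long long second_half = sum_range(v, mid, v.size());
    worker.join();  // first_half is safe to read only after this
    return first_half + second_half;
}
```

This works without locks only because the two halves are independent and each thread writes to its own variable; with g++ you may need to add -pthread when compiling.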
The quickest method would be to read up about OpenMP and use it to parallelise your program.
Compile with the command g++ -fopenmp, provided that your g++ version is >= 4.2.
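To give an idea of how little code that can take, here is a sketch of an OpenMP loop. The function name and the choice of summing cosines are made up for illustration; the pragma itself is standard OpenMP. If you forget -fopenmp the pragma is ignored and the loop simply runs on one thread, giving the same result.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sums cos(x) over all inputs. With -fopenmp the iterations are divided
// among threads and the per-thread totals are combined by the reduction
// clause; without it this is an ordinary serial loop.
double sum_of_cosines(const std::vector<double>& xs) {
    double total = 0.0;
    const long n = static_cast<long>(xs.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += std::cos(xs[static_cast<std::size_t>(i)]);
    }
    return total;
}
```

This only works this easily because each iteration is independent of the others; loops where one iteration needs the result of the previous one cannot be split up like this.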
You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many ways to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.
You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the run before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.
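Here is a rough sketch of the "divide into independent chunks, process in parallel, combine" shape described above. It uses std::async from the C++11 standard library instead of Boost (my choice, not something from the question), and summing integers is only a placeholder for a real computation. The function name chunked_sum is made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Splits v into at most `chunks` contiguous pieces, sums each piece in its
// own asynchronous task, then combines the partial sums.
long long chunked_sum(const std::vector<int>& v, std::size_t chunks) {
    chunks = std::max<std::size_t>(1, chunks);
    const std::size_t chunk_size = (v.size() + chunks - 1) / chunks;  // round up
    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk_size) {
        const std::size_t end = std::min(v.size(), begin + chunk_size);
        parts.push_back(std::async(std::launch::async, [&v, begin, end] {
            return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(begin),
                                   v.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        }));
    }
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();  // waits for that chunk to finish
    }
    return total;
}
```

Following the advice about verifying correctness, the easy check is to compare the result against a plain single-threaded std::accumulate over the same data.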
To take full use of a multicore processor, you need to make the program multithreaded.
An alternative to multi-threading is to use more than one process. You would still need to divide & conquer your problem into multiple independent chunks.
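A sketch of the multi-process route on Linux (you mention Ubuntu), assuming plain POSIX fork() and a pipe to send the child's partial result back. The function name is made up, error handling is minimal, and summing a vector again stands in for the real work.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// POSIX-only sketch: the child process sums the first half and sends the
// result through a pipe; the parent sums the second half meanwhile.
// Returns false if any system call fails.
bool sum_with_two_processes(const std::vector<int>& v, long long& result) {
    int fds[2];
    if (pipe(fds) != 0) {
        return false;
    }
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    const pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (child == 0) {
        // Child: compute its half, write it to the pipe, exit immediately.
        close(fds[0]);
        const long long part = std::accumulate(v.begin(), mid, 0LL);
        const bool ok = write(fds[1], &part, sizeof part) == static_cast<ssize_t>(sizeof part);
        _exit(ok ? 0 : 1);
    }
    // Parent: compute the other half, then collect the child's result.
    close(fds[1]);
    const long long second = std::accumulate(mid, v.end(), 0LL);
    long long first = 0;
    const bool got = read(fds[0], &first, sizeof first) == static_cast<ssize_t>(sizeof first);
    close(fds[0]);
    int status = 0;
    waitpid(child, &status, 0);
    if (!got || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        return false;
    }
    result = first + second;
    return true;
}
```

The child gets its own copy of the data, so there is nothing to lock, but every result has to be passed back explicitly.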
By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.
Add a while(1) { } somewhere in main()?
Or to echo real advice, either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes since that is easier, although if you need to speed up a single run it doesn't really help.
To get to 100% for each thread, you will need to (in each thread):
Eliminate all secondary storage I/O (disk reads/writes)
Eliminate all display I/O (screen writes/prints)
Eliminate all locking mechanisms (mutexes, semaphores)
Eliminate all primary storage I/O (operate strictly out of registers and cache, not DRAM).
Good luck on your rewrite!