Multi-threaded performance issues - C++

I have a multi-threaded program. We use our own implementation of a thread pool. At first the results looked fine: compared to a single thread, the program ran noticeably faster with two threads.
When we increase the number of threads beyond 2, however, performance becomes terrible. Obviously we have run into a multi-threaded performance issue.
Then we started using Intel® VTune™ Amplifier XE 2017 for performance analysis, with the tool integrated into VS2013 as a plug-in. A surprising thing happened: when I click the start button of Intel® VTune™ Amplifier XE, the project begins to run and the plug-in collects data. We find that when we start the project through the plug-in, performance improves as the number of threads increases and the running time shrinks. We can go up to 20 threads, and the running time drops by roughly a factor of 20.
So we want to know: can Intel® VTune™ Amplifier XE 2017 change the operating mode of a multithreaded program? Why does this happen?
I have been troubled by this problem for a long time.

Finally, I resolved this question. The answer is simple. The cause of the problem is that I was running the program under the debugger. If I run the *.exe directly, performance is fine. It has nothing to do with VTune; it is just that VTune starts the *.exe directly, without attaching the debugger.
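For context on why the debugger makes such a difference: when a process is launched from the Visual Studio debugger, Windows enables the debug heap, which adds bookkeeping to every allocation and can hurt allocation-heavy multithreaded code badly. This is an inference, since the question shows no code; but a tight allocation loop like the hedged sketch below typically shows a large gap between "started from the debugger" and "run directly" (setting the environment variable _NO_DEBUG_HEAP=1 before debugging removes it):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>

    int main() {
        using clock = std::chrono::steady_clock;
        std::size_t sink = 0;  // keeps the loop from being optimized away
        auto t0 = clock::now();
        // Tight allocation loop: exactly the pattern the Windows debug heap
        // (enabled when the process is launched by a debugger) slows down.
        for (int i = 0; i < 1000000; ++i) {
            char* p = new char[64];
            sink += reinterpret_cast<std::size_t>(p) & 1;
            delete[] p;
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      clock::now() - t0).count();
        std::printf("alloc loop: %lld ms (sink=%llu)\n",
                    static_cast<long long>(ms),
                    static_cast<unsigned long long>(sink));
        return 0;
    }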

Related

Threading analysis in VTune hangs at __kmp_acquire_ticket_lock

I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. When running the program normally it runs for ~15 minutes. Using the hotspot analysis in VTune I can see that the function __kmp_fork_barrier is taking up roughly 40% of the total CPU time.
Therefore, I also wanted to look at the threading efficiency, but when starting the threading analysis in VTune, it does not start the project at all; instead it hangs at __kmp_acquire_ticket_lock when running in hardware event-based sampling mode. When running in user-mode sampling mode instead, the project immediately fails with a segfault (which does not occur when running it without VTune and checking it with Valgrind). When using HPC performance characterization instead, VTune crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?
__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached. If you spend 40% of your time in this function, it means you have a load-balancing issue between the OpenMP threads in your program. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtime to track what threads are doing and when they do it. VTune has at least minimal support for profiling OpenMP programs. A VTune crash is likely a bug and should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, see the Intel VTune OpenMP tutorial.
Note that these VTune results also suggest that your OpenMP runtime is configured so that threads actively poll the state of other threads, which is good for reducing latency but not always for throughput or energy savings. You can control this behaviour of the runtime with the environment variable OMP_WAIT_POLICY.
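To make the load-imbalance point concrete, here is a minimal sketch (not the asker's code; the triangular workload is made up) where the usual static schedule leaves threads idling at the implicit barrier, while schedule(dynamic) hands out chunks as threads free up; the wait policy can then be chosen at run time with, e.g., OMP_WAIT_POLICY=passive:

    // Compile with: g++ -O2 -fopenmp imbalance.cpp
    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 20000;
        double sum = 0.0;
        // Iteration i does i units of work, so schedule(static) gives the
        // last thread far more work than the first and everyone else waits
        // at the barrier; schedule(dynamic) shrinks that wait.
        #pragma omp parallel for schedule(dynamic, 64) reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < i; ++j)
                sum += std::cos(static_cast<double>(j));
        std::printf("%f\n", sum);
        return 0;
    }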

Maximum Windows 8.1 CPU usage <= 30%

I'm writing a C++ application using Visual Studio 2013. The application iterates through an image doing some complicated analysis. To test code efficiency I am running the analysis (say) 100 times and seeing how long it takes. Then I modify the code, re-run the test and see if there is an improvement (or degradation) in performance.
Problem is that while I have a powerful 4-core i5 (i5-4200U @ 1.6 GHz, to be specific) and plenty of RAM, overall CPU utilisation never exceeds about 30%. My process never seems to get beyond about 29.5%. I've tried setting the priority class of my application to "High" (using SetPriorityClass) and this doesn't help. There is zero disk and network access; everything is in memory (with about 5 GB of memory to spare).
Is this some secret Windows 8.1 setting (to preserve performance)? Can I change this programmatically or through some Control Panel applet?
Well, how do you expect your application to use 100% CPU when it is (most likely) only running on one core, because you aren't using threads?
30% is slightly above the usage for one core (25%), so it is almost certain you aren't using threads here.
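For what it's worth, a minimal sketch of going wide with std::thread (which VS2013 supports); the image size and the per-row analysis stub are made up, standing in for the asker's actual code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for the asker's per-pixel analysis.
    static void analyzeRows(const std::vector<float>& img, int width,
                            int rowBegin, int rowEnd, double& out) {
        double acc = 0.0;
        for (int y = rowBegin; y < rowEnd; ++y)
            for (int x = 0; x < width; ++x)
                acc += img[static_cast<std::size_t>(y) * width + x];
        out = acc;
    }

    int main() {
        const int width = 4096, height = 4096;
        std::vector<float> image(static_cast<std::size_t>(width) * height, 1.0f);

        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        std::vector<double> partial(n, 0.0);

        // One horizontal band of the image per hardware thread.
        for (unsigned t = 0; t < n; ++t)
            workers.emplace_back(analyzeRows, std::cref(image), width,
                                 height * t / n, height * (t + 1) / n,
                                 std::ref(partial[t]));
        for (auto& w : workers) w.join();

        double total = 0.0;
        for (double p : partial) total += p;
        std::printf("total = %f\n", total);
        return 0;
    }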

Android NDK - Multithreading is slowing down rendering

I have an Android app with a C++ library which uses pthreads to break down rendering tasks. This is for devices running Android 4+.
Let's say I have a 100 x 100 array of elements on which I repeatedly do CPU-intensive processing. Currently I'm breaking the array up into four 25 x 100 element chunks and handing them off to four POSIX threads (from a pool of stalled, pre-created threads). This gives an almost 4x speed increase on iOS and desktop Mac, but slower results than single-threading under Android.
So the same code successfully speeds up the app on iOS and desktop Mac, but on Android it often makes things even slower.
I have done some tests, and only quite big chunks of data get faster with multithreading. If the whole process (all threads) takes around 2 seconds or more, it speeds up in multithreaded mode; but if it takes less (say about 400 ms), it is either the same speed as or slower than just calling the rendering function normally, which could point to thread switching being really slow. The bigger the processing tasks, the more they profit from multithreading. My tasks are usually not that big, but they are not fast enough in single-threaded mode.
I have also noticed that on ARM builds the speed difference between the slower multithreaded and the faster single-threaded runs is quite significant (single-threading almost twice as fast as multithreading), whereas on x86 builds the multithreaded and single-threaded versions run at about the same speed as single-threading does on ARM builds. So x86 builds do not get slower with multithreading, but they do not get faster either.
Has anyone else seen the same behaviour, or does anyone know where the slowdown could come from? Are there any special requirements for multithreading on Android? Unfortunately I can't really post any code at the moment, but it is all standard POSIX threading code which works fine on iOS and Mac in general and has been in use for years.
Android vendors aggressively optimize for battery life, which includes keeping the number of online (hot-plugged) cores and, where possible, their individual frequencies low.
The generic approach to managing the number of online cores is to watch the system load over a period of time (a window). If the load persists above a threshold, the system brings additional available cores online. As far as I know, such decisions always happen via a user-level daemon. This approach is quite different from desktops, since the ability to bring cores online/offline, and the benefit of doing so, is mostly SoC-dependent.
Managing CPU frequency is similar: if load persists, the frequency is raised. But here Linux provides a more settled mechanism, called cpufreq, so this part behaves similarly on desktop and mobile.
So it is very possible that you are creating a CPU load pattern that does not trigger core bring-up or a frequency increase (as your own description suggests). You can verify this while the workload runs with the sysfs sketch below.
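A minimal sketch of that check; the paths are standard Linux cpufreq sysfs entries, but on Android they may be restricted or absent depending on the device and its SELinux policy:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Governor and current frequency of cpu0; repeat for other cpus as needed.
        std::ifstream gov("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
        std::ifstream freq("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        std::ifstream online("/sys/devices/system/cpu/online");
        std::string s;
        if (gov >> s)                std::cout << "governor:     " << s << '\n';
        if (freq >> s)               std::cout << "cur freq:     " << s << " kHz\n";
        if (std::getline(online, s)) std::cout << "online cores: " << s << '\n';
        return 0;
    }

Sampling these values during the multithreaded run should show whether extra cores ever come online and whether the frequency actually ramps up.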

Speed performance of a Qt program: Windows vs Linux

I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my chance here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few details:
The computations I make the program do are just brute-force mathematical calculations: basically, it computes products and cosines in a loop that is executed a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads.
The laptop has two cores @ 1.73 GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity and, according to the small graph, both cores are running at 100%.
Then I thought the C++ compiler for Windows didn't use the optimization options (things like -O1, -O2) that the C++ compiler for Linux automatically applies (in a release build), but apparently it does.
I'm bothered that the app is so much slower (2 to 4 times) on Windows; it's really weird. On the other hand, I haven't tried it on other Windows computers yet. Still, do you have any idea where the difference comes from?
Additional info: some data…
Even though Windows seems to be using both cores, I'm thinking this might have something to do with thread management. Here's why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2-core laptop (@ 1.73 GHz) with Windows 7
PC1-linux = my 2-core laptop (@ 1.73 GHz) with Ubuntu 12.04
PC2-linux = my 8-core laptop (@ 2.20 GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores @ ~3 GHz, I don't remember exactly) under Mac OS; speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically applied when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in the debug build.
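For reference, the usual qmake idiom is to adjust the release-specific flags variable; appending to plain QMAKE_CXXFLAGS places the flag before qmake's default -O2, and with GCC the last -O option wins, which would explain the behaviour described above. A sketch for the .pro file:

    # qmake adds -O2 itself in release mode; replace it rather than prepend to it.
    QMAKE_CXXFLAGS_RELEASE -= -O2
    QMAKE_CXXFLAGS_RELEASE += -O3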
Lifetime of threads etc.: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what is being computed), and they are terminated when the computation ends. Nothing too fancy. Every thread just does raw computation (except one, actually, which performs a (not so) small "computation" every 30 ms, basically checking whether the error is small enough).
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation uses only 1 thread; that way we are pretty much comparing the performance of "pure C++ code". It turned out that now Windows was only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to thread management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the Microsoft Visual Studio compiler (MSVC) instead of MinGW. And surprise: the computation (with all the threads and everything) was now "only" 20% to 50% slower than on Linux! I think I'm going to go ahead and be content with that. Weirdly though, I noticed that the "pure C++" computation (with only one thread) was now even slower than with MinGW, which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC, except that it handles threads like a moron.
So I'm thinking either there's something I can do to make MinGW (ideally I'd rather use it than MSVC) handle threads better, or it just can't. I would be amazed if that were so: how could it not be well known and documented? Although I guess I should be careful about drawing conclusions too quickly; I've only compared things on one computer (for the moment).
Another possibility: on Linux the Qt libraries may already be loaded (this can happen, e.g., if you use KDE), while on Windows the libraries must be loaded from scratch, which slows down the measured computation time. To check how much time library loading costs your application, you could write a dummy test with pure C++ code.
I have noticed exactly the same behavior on my PC.
I am running Windows 7 (64-bit), Ubuntu (64-bit) and OS X (Lion, 64-bit), and my program compares 2 XML files (more than 60 MB each). It uses multithreading too (2 threads):
- Windows: 40 sec
- Linux: 14 sec (!!!)
- OS X: 22 sec.
I use a personal thread class (not the Qt one) which uses pthreads on Linux/OS X and Windows threads on Windows.
I use the Qt/MinGW compiler, as I need the XML classes from Qt.
I have found no way (for now) to get the 3 OSes to perform similarly... but I hope I will!
I think that another factor may be memory: my program uses about 500 MB of RAM. So I think that Unix manages memory best, because in single-threaded mode Windows is exactly 1.89 times slower, and I don't see how Linux could otherwise be nearly 2 times faster!
I have heard of one case where Windows was extremely slow with writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
So if you're writing the results of your computation to disk, you might want to check if you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace for this (Linux-only, but should give you a basic idea).
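For concreteness, a minimal sketch of the commit-batching fix described above, using the sqlite3 C API (the table, values, and row count are made up):

    #include <cstdio>
    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("test.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(v INTEGER)",
                     nullptr, nullptr, nullptr);

        // One transaction around all inserts: one commit (one sync to disk)
        // instead of one synchronous commit per row, which is the slow pattern.
        sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
        for (int i = 0; i < 10000; ++i) {
            char sql[64];
            std::snprintf(sql, sizeof sql, "INSERT INTO t VALUES(%d)", i);
            sqlite3_exec(db, sql, nullptr, nullptr, nullptr);
        }
        sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

        sqlite3_close(db);
        return 0;
    }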
You might experience performance differences by how mutexes run on Windows and Linux.
Pure mutex code on Windows can incur a 15 ms wait every time there is contention for a resource when locking. A better-performing synchronization mechanism on Windows is the critical section; it doesn't suffer the locking penalty that regular mutexes experience in most cases.
I have found that, on Linux, regular mutexes perform about the same as critical sections do on Windows.
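A minimal Win32 sketch of the pattern; the counter workload is made up, but the API calls are the standard ones:

    #include <windows.h>

    // A critical section spins briefly in user space before falling back to
    // the kernel, avoiding the scheduler round-trip a kernel mutex pays on
    // every contended lock.
    CRITICAL_SECTION g_cs;
    long g_counter = 0;

    DWORD WINAPI worker(LPVOID) {
        for (int i = 0; i < 1000000; ++i) {
            EnterCriticalSection(&g_cs);
            ++g_counter;
            LeaveCriticalSection(&g_cs);
        }
        return 0;
    }

    int main() {
        // Optional spin count: spin ~4000 times before sleeping in the kernel.
        InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
        HANDLE h[2];
        for (int i = 0; i < 2; ++i)
            h[i] = CreateThread(nullptr, 0, worker, nullptr, 0, nullptr);
        WaitForMultipleObjects(2, h, TRUE, INFINITE);
        for (int i = 0; i < 2; ++i) CloseHandle(h[i]);
        DeleteCriticalSection(&g_cs);
        return 0;
    }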
It's probably the memory allocator; try using jemalloc or tcmalloc from Google. Glibc's ptmalloc3 is significantly better than the old, crusty allocator in MSVC's CRT. The comparable option from Microsoft is the Concurrency CRT, but you cannot simply drop it in as a replacement.
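Trying an alternative allocator is cheap to test on the Linux side; the library path and program name below are placeholders:

    # Swap in tcmalloc without recompiling (path varies by distro):
    LD_PRELOAD=/usr/lib/libtcmalloc.so ./myprogram
    # Or link it in at build time:
    g++ main.cpp -o myprogram -ltcmalloc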

How can I find out how much time is spent on each line in C/C++ code?

I am trying to find a profiling tool with which I can find out how much time is spent on each line of code in a C/C++ program. I am working on Linux platforms (Ubuntu, Gentoo, SL), mainly with gcc. I use gprof, but sometimes I need the per-line information.
Any suggestions? Thank you!
On Linux you can use OProfile. This is a sampling-based profiler which runs on almost any platform and supports the performance-monitoring registers if they are available. On x86 it works with both AMD and Intel.
You can use it as a standalone program which will give you annotated source, but there is also a plugin (Linux Tools) for Eclipse which integrates nicely into the IDE.
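A typical command-line session looks like this (myprogram is a placeholder; operf is the interface in OProfile 0.9.8 and later):

    operf ./myprogram                  # collect samples for one run
    opreport --symbols                 # per-function summary
    opannotate --source ./myprogram    # source annotated with per-line sample counts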
AMD CodeAnalyst is your best bet; it is totally free, and it works on Windows and Linux, though it's primarily for AMD CPUs, so non-AMD CPUs won't get the MSR-based profiling options. Under Windows it also has great integration with Visual Studio 2008 and 2010.
For a non-vendor-specific, free profiler, you can try Very Sleepy, which also happens to be open source.
What Zoom does is take stack samples on wall-clock time.
Then the percent of time any function or line of code is responsible for is the fraction of samples on which it appears.
For example, if a line of code appears on 30% of stack samples and you could avoid executing it, the total execution time would decrease by 30%.
This is true regardless of I/O, recursion, competing processes, swapping, all the things that confuse many profilers.
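The same measurement can be approximated by hand with any debugger when no sampling profiler is available: interrupt the running program a few times at random and note which lines keep appearing on the stacks (<pid> is a placeholder):

    gdb -p <pid>                 # attach to the running program
    (gdb) thread apply all bt    # one stack sample, all threads
    (gdb) continue               # resume; interrupt (Ctrl-C) and repeat a few times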