I have a Cilk program that uses the libpuzzle library. My task is to parallelize the sorting of images by resemblance, and I am using a parallel cilk_for loop to compare all the images with a reference image. I noticed that the first run of the program is slow, but on the second run it speeds up and I can see all the logical cores working at 100%. I repeated this every time I built the project: I always ran two runs, and the same pattern showed up each time. Any ideas what might cause a parallel program to run poorly on the first run and well on the second? I changed the image distributions as well, and the pattern still holds. If anyone has had a similar experience, could you please share what you did to fix the problem?
Thank you
Related
I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
#pragma omp parallel for num_threads(12)
for (int i=0; ...)
{
// do some heavy computation (pure memory and CPU work, no I/O, no waiting)
}
// ... some more for-loops of this kind
}
The application executes this algorithm n times in parallel from n different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
when I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads).
when I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
if I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice how to approach this? I have really come to a point where I have tried all I can imagine, but without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.
I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many avenues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem or that the cause will be obvious to someone who has more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lockless queue (Boost.Lockfree) and a back-end thread that performs string formatting and writes the log entries to disk.
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Run it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret them differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but results were still weird and anyway I couldn't do the change in any of the multithreaded variants.
Compare the generated assembly code. There were differences but the logging build was clearly doing more things. Nothing stood out as being an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.
I have a question about testing an MPI program. I wrote the FW algorithm with Open MPI. The program works and gives correct results, but the problem is that it takes more time than my sequential program (I have only tested it on one computer). Does anyone have an idea why that happens? Thanks
It is a common misconception that a parallel implementation of a program will always be quicker than its sequential version.
The trouble with parallelizing a program is it introduces a fairly large overhead with the use of multiple threads, which a sequential program running from a single thread does not suffer from. Not only do we have to initially set up these threads, there is also communication taking place which wasn't necessary with the sequential program.
For relatively small problems, you will find that a sequential solution will almost always outperform the parallel program. As the size of your problem grows, the cost of managing multiple processes gradually becomes negligible compared with the computational cost of the problem itself. As a result, your parallel version will begin to outperform your sequential program.
I am using Callgrind in order to see how many times specific functions are called. However, I am also interested in the execution time.
I know that programs take much longer when running under Callgrind, since it has to collect information. What surprises me, however, is how the times change. In my case, I am running two different versions of the Fast Marching Method (FMM and Simplified FMM) on 2D and 3D grids. The results are as follows:
In 2D the FMM/SFMM ratio is not preserved at all, but at least it is always >1 (FMM always takes longer than SFMM). In 3D, however, the effect of Callgrind is completely the opposite and the ordering is reversed: SFMM takes less time with Callgrind but more time in regular execution.
The compilation I am using (-Ofast, -fno-finite-math-only) is the same all the time and the same binaries are being run in callgrind and regular running ./bin-name
The time measuring functions are those from std::chrono.
Therefore, the question is: as I am using the same binary in all cases, how is it possible that the same binary behaves so differently? Are the other data I am getting (function calls,% time cost, etc) reliable in this case? Callgrind-like results were expected when running the binaries with regular execution command.
EDIT: in the implementation, the main change is that in FMM I am using the Boost Fibonacci heap and in SFMM I am using a small modification with a Boost priority queue.
Thank you!
I have a C++ program using the OpenCV library which takes an image as input and performs pose estimation, color detection and PHOG. When I run this program from the command line it takes around 4-5 sec to complete and uses around 60% CPU. When I run the same program from two different command line windows at the same time, each process takes around 10-15 sec to finish, and both processes finish at almost the same time. The CPU usage reaches up to 100%.
I have a website which calls this C++ exe using the exec() command. So when two users upload an image and run it at the same time, it takes more time, as I explained above for the command line. Is it slowing down because the C++ program involves heavy computation and the CPU reaches 100%? But I read that the CPU reaching 100% is not a bad thing, as the computer is using its full capacity to run the program. So is this because of my C++ program, or is it something to do with my server (computer) settings? It is probably not an Apache server problem, because it also slows down when I run it from the command line. I am using a quad-core processor and all 4 CPUs reach 100% when I run the same process twice at the same time, so I think the work is distributed among all the processors. So I have a few more questions:
1) Can this be solved by using multithreading in my C++ code? At the moment I am not using it, but will multithreading make the C++ code more computationally expensive and increase the CPU usage (if this is the problem)?
2) What can be the reason for it slowing down? Are the processes in a queue, where each process runs only for a certain amount of time and the CPU switches between the two processes?
3) If this is because it involves heavy computation, will it help if I change some functions to OpenCV GPU functions?
4) Is there a way I can solve this problem? Any ideas or tips?
I have inserted the result of top when running one process and running the same process twice at the same time:
Version5 is the process,running it once
Two Version5 running at the same time
The CPU info:
Thanks in advance.
After zooming so that your picture fills almost my entire 22" screen, I can make out that the CPU flags show "ht", which means "hyperthreading", so you actually only have two genuine cores, that are shared between two hyperthreads. So running on all four CPU cores at once will not give the same performance as running on two genuine cores.
In other words, the "loss of performance" is entirely as you'd expect, because you have four threads fighting for the actual computational resources of two CPU cores. Hyperthreading helps if the code has a lot of memory interaction that can be "hidden" by running a second thread. But if you have CPU-intensive code that isn't "missing in the cache" much, then the gain is much smaller, and in extreme cases hyperthreading will actually cause slow-downs (because the code in one thread disrupts the caches and otherwise "gets in the way" of the other thread). You may want to experiment by going into the BIOS settings, turning off hyperthreading, and comparing the results. Sure, running two instances of the code will clearly still take longer, but the question is "is it longer than running with hyperthreading" - unfortunately, it's impossible to say for sure which is better from a theoretical standpoint (even if I could see the assembly code and understood the memory access patterns - without that level of detail, it's completely impossible to judge).
1) Since a single process only reaches 60% CPU usage, it is possible that multithreading would speed up the execution. However, the CPU usage is then likely to be higher.
2) That's true. There might be additional overhead from context switching (multitasking).
3) Changing functions can bring some improvement, but without seeing your code it is hard to say.
4) Since the computational effort is that high, I think you have to decide whether you accept high CPU usage or a longer execution time (after optimizing the code itself, of course).
Greetz