I have some questions regarding Eigen parallelization.
To my understanding, Eigen handles its internal parallelization, but I want to activate multithreading. I compile with g++ using the flag -fopenmp and run my executable with OMP_NUM_THREADS=4 ./exec.
For the parts of the code that are plain C++ (no Eigen), I used:
#pragma omp parallel
{
}
Looking at my system monitor, I can see that sometimes more than one thread is used, but most of the time only one is. I don't know if I have to add more OpenMP code.
In the following link:
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
They mention that "in the case your application is parallelized with OpenMP, you might want to disable Eigen's own parallelization as detailed in the previous section", but I don't really understand whether I have to do this or how to do it.
I hope I am not mixing concepts here.
My thanks in advance.
Quoting from the link you posted:
Currently, the following algorithms can make use of multi-threading: general matrix-matrix products, PartialPivLU
Thus, without knowing exactly what your program is doing, I'd hazard a guess that it is not mostly large matrix-matrix multiplications and/or PartialPivLU. This only concerns Eigen's internal parallelization; what you do within your omp parallel blocks will run as expected (on multiple threads). To disable Eigen's own parallelization, call Eigen::setNbThreads(1) at runtime or define EIGEN_DONT_PARALLELIZE at compile time, as described on the page you linked.
I am trying to stitch images, and the code I am working on uses SeamFinder and ExposureCompensator along with other functions. While running the code, these two take a lot of time. Is there an alternative, or is there a way to improve their performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
The above are the two functions which are taking time.
Please help me in solving the problem.
Thanks in advance.
Image stitching via OpenCV is known to be slow in many cases. Maybe you can give OpenMP a shot here and counter the delay you are facing by using parallelization.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, but without any parallelism.
When different iterations of a loop have nothing to do with each other, those loops are a prime target for parallelization. OpenMP effectively exploits this common program characteristic, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few lines of compiler directives to your source code.
If you are running a loop in which sets of images are being stitched, you can make sure that the stitching for each set of images runs in parallel.
#pragma omp parallel for
for( ... )
{
// Image-stitching algorithms go here.
}
The compiler directive #pragma omp parallel for tells the compiler to auto-parallelize the for loop with OpenMP.
For non-loop code, or just sections of code, you can do something of this sort:
#pragma omp parallel sections
{
#pragma omp section
{
DoSomething();
}
#pragma omp section
{
DoSomethingElseParallely();
}
}
I know that the answer might not directly help you out, but it might give you some avenues to dig into.
You can go through more about the usage of OpenMP loop Parallelism and OpenMP: Sections before using it.
I parallelized some C++ code with OpenMP.
But what if my program will work on a 1 core machine?
Do I need to disable threading at runtime:
- Check the number of cores
- If cores > 1, use OpenMP
- Else ignore OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
That context switching between threads is costly overhead, though. Typical OpenMP code, which is compute-bound and relies on work-sharing constructs to divide work between threads, treats the number of threads as a parameter and launches as many threads as there are cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP uses only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In this case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to set the thread count at runtime via the environment variable.
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip counts may not be unrolled or vectorized as effectively, because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth compiling without OpenMP as well, and using that binary for single-core runs. You can also use this approach to test whether the difference in optimizations matters, running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing its timing to that of the serial binary.
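That comparison can be sketched as follows (prog.cpp is a placeholder for your source file):

```shell
# Build both variants of the same source.
g++ -O2          prog.cpp -o prog_serial
g++ -O2 -fopenmp prog.cpp -o prog_omp

# Compare serial timing against the one-thread OpenMP timing.
time ./prog_serial
OMP_NUM_THREADS=1 time ./prog_omp
```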
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
nThreadMax = omp_get_max_threads();
#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
// Parallel code path
} else {
// Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test if it does matter, with overhead and disabled optimizations, by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will tell you how many processors are available for OpenMP to use. Then there are many ways to disable OpenMP. From your code you can call:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost from OpenMP. It all depends on the specifics of your case.
I have a question related to parallel computing. I have a pretty big code written in C++ and which is parallelized using OpenMP on a shared memory basis. I wanted to ask is it possible to convert this shared memory code into a distributed memory code?
If it is possible, what are the steps that need to be performed?
Thank you for your cooperation.
Thanks,
Rahul Singh
Most programs which can be parallelised to run on shared memory computers can also be parallelised to run on distributed memory computers. So yes, the problem that your OpenMP program solves can probably be solved on a distributed memory computer. However, converting your OpenMP program to a distributed memory program is a different matter. You might be better advised to start with a serial implementation and to parallelise that than to try to adapt one mode of parallel thinking for another mode of parallel execution.
So, the first step you seek might be to unparallelise your program. But, as the commentators have already indicated, it's very difficult to provide more useful advice than I already have (and I haven't provided any very useful advice at all) without knowing a lot more about your application.
Shared memory and distributed memory are two distinct paradigms in parallel computing, and they often require different thinking strategies. Some parallel programming frameworks, like UPC or MPI, can be emulated on either shared or distributed machines, although it's better not to do so since, e.g., UPC is meant for shared-memory machines and MPI for distributed-memory machines. I'm not sure about OpenMP.
In either case, my advice is to first think about how you could extract parallelism from your code on a distributed architecture, and then go with MPI. If you happen to be in the computational science business, there are already very well-written packages, such as PETSc from Argonne National Lab and Trilinos from Sandia National Lab, that may help you develop much faster.
I have a loop that has been parallelized by OpenMP, but due to the nature of the task, there are 4 critical clauses.
What would be the best way to profile the speed up and find out which of the critical clauses (or maybe non-critical(!) ) take up the most time inside the loop?
I use Ubuntu 10.04 with g++ 4.4.3
Scalasca is a nice tool for profiling OpenMP (and MPI) codes and analyzing the results. Tau is also very nice but much harder to use. The Intel tools, like VTune, are also good but very expensive.
Arm MAP has OpenMP and pthreads profiling - and works without needing to instrument or modify your source code. You can see synchronization issues and where threads are spending time to the source line level. The OpenMP profiling blog entry is worth reading.
MAP is widely used for high-performance computing, as it also profiles multiprocess applications such as MPI programs.
OpenMP includes the functions omp_get_wtime() and omp_get_wtick() for measuring timing performance (docs here), I would recommend using these.
Otherwise try a profiler. I prefer the google CPU profiler which can be found here.
There is also the manual way described in this answer.
There is also the ompP tool which I have used a number of times in the last ten years. I have found it to be really useful to identify and quantify load imbalance and parallel/serial regions. The web page seems to be down now but I also found it on web archive earlier this year.
edit: updated home directory
I do some c++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to produce a result; however, they only use 50% of my Core Duo. I tried the code on another dual-processor machine with the same result.
Is there a way to force a program to use all available processor resources and memory?
Note: I'm using ubuntu and g++
A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]
The quickest method would be to read up on OpenMP and use it to parallelise your program.
Compile with g++ -fopenmp, provided that your g++ version is >= 4.
You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many way to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.
You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the run before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.
To make full use of a multicore processor, you need to make the program multithreaded.
An alternative to multithreading is to use more than one process. You would still need to divide and conquer your problem into multiple independent chunks.
By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.
Add a while(1) { } somewhere in main()?
Or to echo real advice, either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes since that is easier, although if you need to speed up a single run it doesn't really help.
To get to 100% for each thread, you will need to (in each thread):
- Eliminate all secondary storage I/O (disk reads/writes)
- Eliminate all display I/O (screen writes/prints)
- Eliminate all locking mechanisms (mutexes, semaphores)
- Eliminate all primary storage I/O (operate strictly out of registers and cache, not DRAM)
Good luck on your rewrite!