In C++17, I want to use several OpenBLAS subroutines with a different number of threads for each. Is there any way to accomplish this?
In the past, I have used openblas_set_num_threads();
to set the number of threads for my OpenBLAS subroutines. While this works, it sets the OpenBLAS thread count globally, preventing each subroutine from using a different number of threads when running in parallel. Because of this, I use the same number of threads for all of my OpenBLAS subroutines so they can run in parallel.
Unfortunately, this seems to be impossible with OpenBLAS so far.
Based on their user manual:
If your application is already multi-threaded, it will conflict with
OpenBLAS multi-threading
Actually, this feature is essential for most multithreaded libraries that want to use BLAS.
One easy option is to use MKL instead of OpenBLAS and use its mkl_set_num_threads_local, which plays nicely with multithreaded callers and gives the developer good control over threads. Look here.
A harder option is to call single-threaded OpenBLAS and implement the multithreading yourself. This works with either OpenBLAS or MKL, but it is cumbersome and you will probably lose performance if you don't know what you are doing.
For this problem it makes no difference whether you use C++17, C++11, any other flavor of C++, or C.
Related
I'm working on a project (Hardware: RaspberryPI 3B+), which has lots of computation and parallel processing. At present, I'm noticing some sort of lag in the code performance. Therefore, I'm constantly looking for efficient ways to improve my code and its performance.
Currently, I'm using the C language (because I can access and manipulate lower-level drivers easily) and developing my own set of functions, libraries, and drivers, which run faster than any other pre-defined or ready-made libraries or plugins.
Now, instead of software-based multi-threading (Pthreads), I want to use separate cores for performing the corresponding tasks. So, any suggestions or guidelines on how I can use the different cores of the Raspberry Pi?
Moreover, how can I check the CPU utilization to choose the best core to perform a certain task?
Thanking with regards,
Aatif Shaikh
At the C/C++ level you do not have direct control over which CPU core will run which thread. Just use the C++11 standard threads and let the OS scheduler decide which thread runs where.
That said, Linux has the taskset tool to check and set thread affinity, and there's also the sched_setaffinity() function.
I have a code in Fortran which uses DGESVD from MKL and runs on 8 cores with the Intel compiler. The code is accelerated via OpenMP. I also know that OpenMP and MKL have their own settings for the number of threads (OMP_NUM_THREADS and MKL_NUM_THREADS). I want to know the optimum number of threads. Am I supposed to set OMP_NUM_THREADS=1 before calling the LAPACK routine? Does the number of OpenMP threads affect the MKL number of threads?
MKL also uses OpenMP for its multithreaded driver. This means that the number of OpenMP threads does affect the number of MKL threads, but in a very intricate way.
First, being OpenMP code, MKL is also controlled by the usual OpenMP ways to set the number of threads, e.g. OMP_NUM_THREADS and calls to omp_set_num_threads. But it also provides override configuration mechanisms in the form of MKL_NUM_THREADS and mkl_set_num_threads(). This allows one to have different number of threads in the user code and in the MKL routines.
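In practice, the override mechanism described above is often exercised via environment variables. A minimal sketch (the program name is hypothetical):

```shell
# Threads for the user code's own OpenMP parallel regions
export OMP_NUM_THREADS=4
# Override for MKL routines only; takes precedence inside MKL
export MKL_NUM_THREADS=8
# ./my_mkl_program   (hypothetical binary: its OpenMP regions would use 4
#                     threads while MKL internally uses 8)
```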
Having configured the desired number of threads, one should also know how MKL behaves in nested parallelism cases. That is, MKL would by default run single-threaded if called from inside an active parallel region in the user code. MKL provides the MKL_DYNAMIC switch that can override this behaviour but it requires that the same OpenMP compiler is used for the user code as for MKL (read that - you must use Intel's compiler) as no compatibility is guaranteed between different OpenMP runtimes.
Generally speaking, you do not need to set the number of threads to 1 before calling into MKL; doing so would make MKL itself single-threaded, unless the number of MKL threads was explicitly overridden. You should, however, be careful when calling it from inside parallel regions when nested parallelism is enabled.
Further read about controlling the number of threads in MKL is available in MKL's User Guide:
Using Additional Threading Control (mirror of otherwise dead link)
Techniques to Set the Number of Threads
I’m looking for a portable method for creating threads specifically for output of data in C++. I’d prefer to stay away from Boost if possible, but I’m not against using it if it’s the best option.
Here is the situation:
I have a program that does a complex computation on some data that it reads and produces three output streams with a large amount of textual data. These three streams are being compressed on the fly using the Bzip2 library.
What I would like to do is to have the main computation run in the main thread, while the compression and output of the data is done in three additional threads. The idea being that in this way I can utilise the available computing cores and eliminate any bottleneck that the Bzip2 compression may be causing to the actual processing.
The way I imagine this working is for the three output threads to have open output file streams and to be waiting for string data that will then be compressed and output. The main thread will run its computation sending output to the other threads when necessary. Obviously, adequate buffering will have to be designed, but that’s not a problem.
I’d appreciate any suggestions regarding the best way to tackle this problem, in particular, what C++ libraries are the most appropriate for the task at hand. Keep in mind, that I would like to handle the buffering in the output threads and they should receive string class data.
Thanks in advance!
C++ doesn't support threads in its standard (at least not now), and to have threads portably you must use some library. There are many C++ libraries giving you portable threads out there, and your particular problem doesn't seem special in any way. Boost is very well received and adopted and has the best chance to influence future versions of the C++ standard. It is efficient and portable, so why not use it?
You should really use the Boost.Thread library. It is well documented, tested and light-weight (compared to full-featured libraries with multi-platform threading support, such as Qt).
Take a look at Boost ASIO: http://www.boost.org/doc/libs/1_44_0/doc/html/boost_asio/overview/core/async.html
It's very flexible in terms of threading organization. In fact, you may re-think your design and get rid of the additional threads altogether. But it supports your idea as well.
I got a C++ program (source) that is said to work in parallel. However, if I compile it (I am using Ubuntu 10.04 and g++ 4.4.3) with g++ and run it, one of my two CPU cores gets full load while the other is doing "nothing".
So I spoke to the one who gave me the program. I was told that I had to set specific flags for g++ in order to get the program compiled for 2 CPU cores. However, if I look at the code I'm not able to find any lines that point to parallelism.
So I have two questions:
Are there any C++-intrinsics for multithreaded applications, i.e. is it possible to write parallel code without any extra libraries (because I did not find any non-standard libraries included)?
Is it true that there are indeed flags for g++ that tell the compiler to compile the program for 2 CPU cores and to compile it so it runs in parallel (and if: what are they)?
AFAIK there are no compiler flags designed to make a single-threaded application exploit parallelism (it's definitely a nontrivial transformation), with the exception of parallelization of loop iterations (-ftree-parallelize-loops), which still must be activated carefully. Also, even if there's no explicit thread creation, there may be OpenMP directives in the code that parallelize certain instruction sequences.
Look for the occurrence of "thread" and/or "std::thread" in the source code.
The current C++ language standard has no support for multi-processing in the language or the standard library. The proposed C++0x standard does have some support for threads, locks etc. I am not aware of any flags for g++ that would magically make your program do multi-processing, and it's hard to see what such flags could do.
The only thing I can think of is openMosix or LinuxPMI (the successor of openMosix). If the code uses processes, then the process "migration" technique makes it possible to put processes to work on different machines (which have the specified Linux distribution installed).
Check for threads (grep -i thread) and processes (grep fork) in your code. If neither exists, then check for MPI. MPI requires some extra configuration, as I recall (I only used it for some homework assignments at university).
As mentioned, gcc (and other compilers) implement some forms of parallelism via OpenMP pragmas.
I do some C++ programming related to mapping software and mathematical modeling.
Some programs take anywhere from one to five hours to produce a result; however, they only use 50% of my Core Duo. I tried the code on another dual-processor machine with the same result.
Is there a way to force a program to use all available processor resources and memory?
Note: I'm using ubuntu and g++
A thread can only run on one core at a time. If you want to use both cores, you need to find a way to do half the work in another thread.
Whether this is possible, and if so how to divide the work between threads, is completely dependent on the specific work you're doing.
To actually create a new thread, see the Boost.Thread docs, or the pthreads docs, or the Win32 API docs.
[Edit: other people have suggested using libraries to handle the threads for you. The reason I didn't mention these is because I have no experience of them, not because I don't think they're a good idea. They probably are, but it all depends on your algorithm and your platform. Threads are almost universal, but beware that multithreaded programming is often difficult: you create a lot of problems for yourself.]
The quickest method would be to read up on OpenMP and use it to parallelise your program.
Compile with g++ -fopenmp, provided that your g++ version is >= 4.
You need to have as many threads running as there are CPU cores available in order to be able to potentially use all the processor time. (You can still be pre-empted by other tasks, though.)
There are many way to do this, and it depends completely on what you're processing. You may be able to use OpenMP or a library like TBB to do it almost transparently, however.
You're right that you'll need to use a threaded approach to use more than one core. Boost has a threading library, but that's not the whole problem: you also need to change your algorithm to work in a threaded environment.
There are some algorithms that simply cannot run in parallel -- for example, SHA-1 makes a number of "passes" over its data, but they cannot be threaded because each pass relies on the output of the run before it.
In order to parallelize your program, you'll need to be sure your algorithm can "divide and conquer" the problem into independent chunks, which it can then process in parallel before combining them into a full result.
Whatever you do, be very careful to verify the correctness of your answer. Save the single-threaded code, so you can compare its output to that of your multi-threaded code; threading is notoriously hard to do, and full of potential errors.
It may be more worth your time to avoid threading entirely, and try profiling your code instead: you may be able to get dramatic speed improvements by optimizing the most frequently-executed code, without getting near the challenges of threading.
To take full advantage of a multicore processor, you need to make the program multithreaded.
An alternative to multi-threading is to use more than one process. You would still need to divide and conquer your problem into multiple independent chunks.
By 50%, do you mean just one core?
If the application isn't either multi-process or multi-threaded, there's no way it can use both cores at once.
Add a while(1) { } somewhere in main()?
Or to echo real advice, either launch multiple processes or rewrite the code to use threads. I'd recommend running multiple processes since that is easier, although if you need to speed up a single run it doesn't really help.
To get to 100% utilization for each thread, you will need to (in each thread):
- Eliminate all secondary storage I/O (disk reads/writes)
- Eliminate all display I/O (screen writes/prints)
- Eliminate all locking mechanisms (mutexes, semaphores)
- Eliminate all primary storage I/O (operate strictly out of registers and cache, not DRAM)
Good luck on your rewrite!