I'm using the Armadillo C++ library to do 2D Fourier transforms, and I'm finding that the results are inconsistent when I use multiple threads. Specifically, I'm getting different results from the fft2 function.
The data I'm passing to fft2 is thread-local. I've also verified that the input data is not affected by the presence of other threads working on parallel problems. fft2 produces different results only when other threads are also calling fft2. Does anyone know about threading issues with fft2, or with Armadillo in general?
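Roughly, the setup looks like the sketch below (heavily simplified, not my actual code; matrix size and thread count are placeholders). Each thread works on its own copy of the input and compares its fft2 result against a serial reference:

// compile e.g. with: g++ -fopenmp repro.cpp -o repro -larmadillo
#include <armadillo>
#include <omp.h>
#include <iostream>

int main()
{
    arma::arma_rng::set_seed(1);
    arma::mat input = arma::randu<arma::mat>(256, 256);

    // Reference result computed before any threads are started.
    arma::cx_mat reference = arma::fft2(input);

    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
    {
        arma::mat local = input;                 // thread-local copy of the data
        arma::cx_mat result = arma::fft2(local);
        double diff = arma::abs(result - reference).max();
        #pragma omp critical
        std::cout << "thread " << omp_get_thread_num()
                  << ": max difference = " << diff << std::endl;
    }
    return 0;
}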
Armadillo itself does not seem to have any kind of state that could make it non-thread-safe (maybe the random number generation part could be a problem). That is, it seems to be thread-safe as long as the libraries it depends on are thread-safe.
I also had problems in the past with incorrect results when using multithreading. In my case the culprit was OpenBLAS, which I was compiling myself. To investigate the problem, I created a small project to check that results from some SVD and matrix multiplications were the same when running in parallel and serially. They were not. I then stumbled upon an issue in the OpenBLAS repository about thread safety, where I saw a flag (USE_LOCKING) that I could set in CMake when compiling OpenBLAS. After setting USE_LOCKING to true and recompiling OpenBLAS, I had no more problems with wrong results from Armadillo.
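For illustration, a rough sketch of that kind of consistency check (not the original project; sizes and thread count are arbitrary). The idea is that with a problematic BLAS the differences printed here come out non-zero in some runs:

// compile e.g. with: g++ -fopenmp check.cpp -o check -larmadillo
#include <armadillo>
#include <iostream>
#include <vector>

int main()
{
    arma::arma_rng::set_seed(123);
    arma::mat A = arma::randu<arma::mat>(300, 300);
    arma::mat B = arma::randu<arma::mat>(300, 300);

    // Serial reference results.
    arma::vec s_ref = arma::svd(A);
    arma::mat C_ref = A * B;

    const int n_threads = 8;
    std::vector<double> svd_diff(n_threads), mul_diff(n_threads);

    // Repeat the same computations from several threads at once.
    #pragma omp parallel for
    for (int i = 0; i < n_threads; ++i)
    {
        svd_diff[i] = arma::abs(arma::svd(A) - s_ref).max();
        mul_diff[i] = arma::abs(A * B - C_ref).max();
    }

    for (int i = 0; i < n_threads; ++i)
        std::cout << "thread " << i << ": svd diff = " << svd_diff[i]
                  << ", mul diff = " << mul_diff[i] << std::endl;
    return 0;
}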
You are probably experiencing something similar, but with the FFT implementation behind fft2. Especially since you mention that work performed in other threads does not pose a problem as long as it is not related to fft2. Thus, you should check whether the library that fft2 relies on is thread-safe instead of suspecting Armadillo itself.
Absolute TAPI beginner here. I recently ported my CV code to use UMat instead of Mat since my CPU was at its limit; especially the morphological operations seemed to consume quite some computing power.
Now with UMat I cannot see any change in my framerate; it is exactly the same whether I use UMat or not, and Process Explorer reports no GPU usage whatsoever. I did a small test with a few calls of dilation and closing on a full HD image -- no effect.
Am I missing something here? I'm using the latest OpenCV 3.2 build for Windows and a GTX 980 with driver 378.49. cv::ocl::haveOpenCL() and cv::ocl::useOpenCL() both return true and cv::ocl::Context::getDefault().device( 0 ) also gives me the correct device, everything looks good as far as I can tell. Also, I'm using some custom CL code via cv::ocl::Kernel which is definitely invoked.
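For reference, the small test mentioned above looks roughly like this (simplified; the kernel size, iteration count and image file are placeholders rather than my exact values):

#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>

int main()
{
    std::cout << "haveOpenCL: " << cv::ocl::haveOpenCL()
              << ", useOpenCL: " << cv::ocl::useOpenCL() << std::endl;

    cv::Mat cpuImg = cv::imread("frame_1080p.png", cv::IMREAD_GRAYSCALE);
    cv::UMat img;
    cpuImg.copyTo(img);                       // explicit upload into a UMat

    cv::UMat dilated, closed;
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(15, 15));

    int64 t0 = cv::getTickCount();
    for (int i = 0; i < 100; ++i)
    {
        cv::dilate(img, dilated, kernel);
        cv::morphologyEx(img, closed, cv::MORPH_CLOSE, kernel);
    }
    cv::ocl::finish();                        // OpenCL calls are queued asynchronously
    double ms = (cv::getTickCount() - t0) * 1000.0 / cv::getTickFrequency();
    std::cout << "100 iterations: " << ms << " ms" << std::endl;
    return 0;
}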
I realize it is naive to think that just changing Mat to UMat will result in a huge performance gain (although every one of the very limited number of resources covering TAPI that I found online suggests exactly that). Still, I was hoping to get some gain for starters and then further optimize step by step from there. However, the fact that I can't detect any GPU usage whatsoever highly irritates me.
Is there something I have to watch out for? Maybe my usage of TAPI prevents streamlined execution of the OpenCL code, perhaps through accidental/hidden readbacks I'm not aware of? Do you see any way of profiling the code with respect to that?
Are there any how-to's, best practices or common pitfalls for using TAPI? Things like "don't use local UMat instances within functions", "use getUMat() instead of copyTo()", "avoid calls of function x since it will cause a cv::ocl::flush()", things of that sort?
Are there OpenCV operations that are not ported to OpenCL yet? Is there documentation on that? In the OpenCV source code I saw that, if built with the HAVE_OPENCL flag, the functions try to run CL code using the CV_OCL_RUN macro; however, a few conditions are checked beforehand, otherwise it falls back to the CPU. It does not seem like I have any possibility to figure out whether the GPU or the CPU was actually used, apart from stepping into each and every OpenCL function with the debugger, am I right?
Any ideas/experiences apart from that? I'd appreciate any input that relates to this matter.
In Armadillo C++, is there any way to disable the default parallelization when compiling with -fopenmp? I would like the parallelization to be used in other parts of the code instead.
The function I'm particularly interested in is eig_sym().
Thanks very much,
Yantao
Armadillo isn't parallelized with OpenMP, with slight caveats:
The underlying LAPACK or BLAS implementation may be parallelized. If you are using OpenBLAS, it is.
The Armadillo gmm_diag implementation uses OpenMP.
So the simplest way to go is "don't use OpenBLAS; instead use a single-threaded BLAS". But that's not the only way to go.
It sounds to me like you want to disable nested parallelism, so that the only parts of the code that are parallelized are at the higher levels of your code and not in eig_sym(). Here's some documentation on OMP_NESTED:
https://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
So you could either set the environment variable OMP_NESTED to false at runtime, or call omp_set_nested() in your code.
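A minimal sketch of the second option, assuming the BLAS threads are OpenMP threads (an OpenBLAS built with its own pthread pool is limited via OPENBLAS_NUM_THREADS instead); matrix size and loop count are placeholders:

#include <armadillo>
#include <omp.h>

int main()
{
    // Disable nested parallelism; same effect as setting OMP_NESTED=false
    // in the environment before running the program.
    omp_set_nested(0);

    // Outer-level parallelism stays in your own code...
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
    {
        arma::mat A = arma::randu<arma::mat>(200, 200);
        arma::mat S = A + A.t();      // eig_sym() expects a symmetric matrix
        arma::vec eigval;
        arma::mat eigvec;
        // ...while the decomposition inside each iteration runs serially.
        arma::eig_sym(eigval, eigvec, S);
    }
    return 0;
}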
In my code I'm trying to capitalize on the power of a CUDA-capable GPU when one is present. While this code works well on computers that have CUDA available (and where OpenCV was compiled with CUDA support), I'm having trouble implementing a fallback to the CPU. Even building fails, since the headers I'm including
#include "opencv2/core/cuda.hpp"
#include "opencv2/cudaimgproc.hpp"
#include "opencv2/cudaarithm.hpp"
are not found. I'm quite a novice regarding C++ program architecture. How would I need to structure my code to support such fallback functionality?
If you are implementing a fallback you probably want to switch to it at runtime. But the fact that you are getting compiler error messages suggests that you are compiling with different flags. In general, you probably want something like this:
if (HasCuda()) {
  RunCudaCode(...);   // GPU path, taken when a CUDA-capable device is present
} else {
  RunCpuCode(...);    // CPU fallback
}
Alternatively, you could build two shared libraries, one with and one without CUDA, and load the one you need based on HasCuda(). However, that approach only makes sense if your binary is huge and you're running into memory issues.
It might be necessary to have a similar block in your startup code that initializes Cuda.
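As a concrete illustration, here is one possible way (a sketch, not the definitive layout) to combine a compile-time guard for the headers with the runtime check; binarize() and the threshold values are placeholders, and HAVE_OPENCV_CUDAARITHM comes from <opencv2/opencv_modules.hpp>, which lists the modules your OpenCV build actually contains:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/opencv_modules.hpp>

#ifdef HAVE_OPENCV_CUDAARITHM
  #include <opencv2/core/cuda.hpp>
  #include <opencv2/cudaarithm.hpp>
#endif

bool HasCuda()
{
#ifdef HAVE_OPENCV_CUDAARITHM
    // Runtime check: returns 0 when no usable CUDA device is present.
    return cv::cuda::getCudaEnabledDeviceCount() > 0;
#else
    return false;   // built without the CUDA modules
#endif
}

cv::Mat binarize(const cv::Mat& input)
{
    cv::Mat output;
#ifdef HAVE_OPENCV_CUDAARITHM
    if (HasCuda()) {
        cv::cuda::GpuMat gpuIn(input), gpuOut;
        cv::cuda::threshold(gpuIn, gpuOut, 128, 255, cv::THRESH_BINARY);
        gpuOut.download(output);
        return output;
    }
#endif
    cv::threshold(input, output, 128, 255, cv::THRESH_BINARY);  // CPU fallback
    return output;
}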
I use C++ and odeint to solve a set of differential equations. I compile the code in Matlab using mex and g++ on Mac OS X. For some time everything worked perfectly, but now something curious is happening:
I can run the same simulation (with the same parameters) twice and in one run the results are fine and in the other a whole column (or multiple columns) of the output is NaN (this also influences the other outputs hinting at a problem occurring during the integration process).
I tried switching between various solvers. Now I am using runge_kutta_fehlberg78 with a fixed step size. (One possibility for NaN output is that the initial step size of an adaptive solver is too large.)
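For context, the integration call looks roughly like the following sketch (with a stand-in right-hand side; my actual system, state size and step size differ):

#include <boost/numeric/odeint.hpp>
#include <array>
#include <iostream>

using state_type = std::array<double, 2>;

// Stand-in right-hand side (simple harmonic oscillator), not my real system.
void rhs(const state_type& x, state_type& dxdt, double /*t*/)
{
    dxdt[0] = x[1];
    dxdt[1] = -x[0];
}

int main()
{
    namespace ode = boost::numeric::odeint;
    state_type x = {1.0, 0.0};

    ode::runge_kutta_fehlberg78<state_type> stepper;
    ode::integrate_const(stepper, rhs, x, 0.0, 10.0, 0.01);   // fixed step size

    std::cout << x[0] << " " << x[1] << std::endl;
    return 0;
}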
What can be the cause of this? Especially the randomness makes me wonder.
Edit: I am starting to suspect that the problem has something to do with Matlab. I compiled a version without the Matlab interface with Xcode as a normal executable, and so far I haven't had an issue with NaN results. It's hard to tell whether the problem is truly resolved, and I don't understand why that would resolve it.
I have a program written in Fortran with more than 100 subroutines, around 30 of which contain OpenMP code. I was wondering what the best procedure is to compile these subroutines. When I compiled all the files at once, I found that the OpenMP-compiled code runs even slower than the version without OpenMP. Should I compile the subroutines with OpenMP directives separately? What is the best practice under these conditions?
Thank you so much.
Best Regards,
Jdbaba
The OpenMP-aware compilers look for the OpenMP directives (in Fortran, the sentinel after a comment symbol at the beginning of the line, e.g. !$OMP). Therefore, sources without OpenMP code compiled with an OpenMP-aware compiler should result in identical or very similar object files (and executables).
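To illustrate the point (in C++ here; in Fortran the directive would sit behind the !$OMP sentinel, but the mechanism is the same): the pragma below is simply ignored when the file is compiled without OpenMP support, and the loop becomes ordinary serial code.

#include <cstdio>

int main()
{
    double sum = 0.0;

    // Without an OpenMP-aware compile (e.g. g++ without -fopenmp) this
    // directive is ignored and the loop runs serially; the code generated
    // for the loop body is otherwise essentially the same.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; ++i)
        sum += 1.0 / i;

    std::printf("sum = %f\n", sum);
    return 0;
}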
Edit: One should note that, as stated by Hristo Iliev below, enabling OpenMP could affect the serial code, for example through OpenMP versions of libraries that may differ in algorithm (to be more effective in parallel) and in optimizations.
Most likely, the problem here is more related to your code algorithms.
Or perhaps you did not compile with the same optimization flags when comparing OpenMP and non-OpenMP versions.