Disable Armadillo's default parallelization in C++ when compiled with -fopenmp

In Armadillo C++, is there any way to disable the default parallelization when the code is compiled with -fopenmp? I would like parallelization to apply only to other parts of the code.
The function I'm particularly interested in is eig_sym().
Thanks very much,
Yantao

Armadillo isn't parallelized with OpenMP, with two slight caveats:
The underlying LAPACK or BLAS implementation may be parallelized. If you are using OpenBLAS, it is.
The Armadillo gmm_diag implementation uses OpenMP.
So the simplest fix is "don't use OpenBLAS; use a single-threaded BLAS instead". But that's not the only way to go.
It sounds to me like you want to disable nested parallelism, so that the only parts of the code that are parallelized are at the higher levels of your code and not in eig_sym(). Here's some documentation on OMP_NESTED:
https://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
So you could either set the environment variable OMP_NESTED to false at runtime, or call omp_set_nested() in your code.
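For example, a minimal sketch. This assumes the BLAS/LAPACK underneath eig_sym() uses OpenMP for its threading; a BLAS with its own thread pool (e.g. a pthreads build of OpenBLAS) would need its own controls, such as openblas_set_num_threads().

    #include <armadillo>
    #include <omp.h>

    int main() {
        // Keep parallelism at the outer level only: with nesting disabled,
        // an OpenMP-based BLAS/LAPACK invoked inside our parallel region
        // runs single-threaded within each of our threads.
        omp_set_nested(0);  // same effect as exporting OMP_NESTED=false

        #pragma omp parallel for
        for (int i = 0; i < 8; ++i) {
            arma::mat A = arma::randu<arma::mat>(100, 100);
            arma::mat S = A + A.t();           // eig_sym() expects a symmetric matrix
            arma::vec eigval;
            arma::mat eigvec;
            arma::eig_sym(eigval, eigvec, S);  // no nested threads spawned here
        }
        return 0;
    }

Compile with something like g++ -fopenmp example.cpp -larmadillo (plus your usual flags).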

Related

threading issue with armadillo fft2

I'm using the Armadillo C++ library to do 2D Fourier transforms, and I'm finding that the results are inconsistent when I use multiple threads. Specifically, I'm getting different results from the fft2 function.
The data I'm passing to fft2 is thread-local. I've also verified that the input data is not affected by the presence of other threads working on parallel problems. fft2 produces different results only when there are other threads also calling fft2. Does anyone know about threading issues with fft2, or with Armadillo in general?
Armadillo itself does not seem to have any kind of state that could make it not thread-safe (maybe the random generation part could be a problem). That is, it seems to be thread-safe as long as the libraries it depends on are thread-safe.
I also had problems in the past with incorrect results when using multithreading. In my case the culprit was OpenBLAS, which I was compiling myself. To investigate the problem, I created a small project to check that results from some SVD and matrix multiplications were the same when running in parallel and serially. They were not. Then I stumbled on an issue in the OpenBLAS repository about thread safety, where I saw a flag (USE_LOCKING) that I could set in CMake when compiling OpenBLAS. After setting USE_LOCKING to true and recompiling OpenBLAS, I had no more problems with wrong results from Armadillo.
You are probably experiencing something similar, but with the FFT library that backs fft2, especially since you mention that work done in other threads poses no problem unless it involves fft2. So you should check whether that FFT implementation is thread-safe instead of suspecting Armadillo.
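A minimal sketch of the kind of check described above, adapted to fft2 (the matrix size, thread count, and 1e-12 tolerance are arbitrary choices for illustration):

    #include <armadillo>
    #include <iostream>

    int main() {
        arma::arma_rng::set_seed(42);
        arma::cx_mat input = arma::randu<arma::cx_mat>(64, 64);
        arma::cx_mat reference = arma::fft2(input);  // serial baseline

        bool ok = true;
        #pragma omp parallel for
        for (int i = 0; i < 16; ++i) {
            arma::cx_mat local = arma::fft2(input);  // thread-local result
            if (arma::norm(local - reference, "fro") > 1e-12) {
                #pragma omp critical
                ok = false;
            }
        }
        std::cout << (ok ? "consistent" : "MISMATCH") << std::endl;
        return 0;
    }

Any mismatch here points at a thread-safety problem in the underlying libraries rather than in your own code.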

Are there any instruction sets that support MIMD architectures?

I already know that SIMD instruction sets include SSE1 through SSE5.
But I have not found much discussion of instruction sets that support MIMD architectures.
In C++ code, we can use intrinsics to write SIMD code.
Is there any way to write MIMD code?
If MIMD is more powerful than SIMD,
it would be better to write C++ code that supports MIMD.
Is my thinking correct?
The Wikipedia page Flynn's taxonomy describes MIMD as:
Multiple autonomous processors simultaneously executing different instructions on different data. MIMD architectures include multi-core superscalar processors, and distributed systems, using either one shared memory space or a distributed memory space.
Any time you divide an algorithm across threads (using OpenMP, for example), you may be using MIMD. Generally, you don't need a special "MIMD instruction set" - the ISA is the same as for SISD, as each instruction stream operates independently of the others, on its own data. EPIC (explicitly parallel instruction computing) is an alternative approach where the functional units operate in lockstep, but with independent(ish) instructions and data.
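As a small illustration, here is a sketch using plain OpenMP sections: each section is its own instruction stream operating on its own data, yet no special instructions are involved.

    #include <omp.h>
    #include <cstdio>
    #include <cmath>

    int main() {
        double a = 0.0;
        long b = 0;
        #pragma omp parallel sections
        {
            #pragma omp section
            {   // stream 1: floating-point work on its own data
                for (int i = 1; i <= 1000000; ++i) a += std::sqrt((double)i);
            }
            #pragma omp section
            {   // stream 2: entirely different instructions on different data
                for (long i = 0; i < 1000000; ++i) b ^= i * 2654435761L;
            }
        }
        std::printf("a=%f b=%ld\n", a, b);
        return 0;
    }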
As to which is "more powerful" (or more energy-efficient, or lowest latency, or whatever matters in your use case), there's no single answer. As with many complex issues, "it depends".
Is my thinking correct?
It is certainly naive, and implementation-specific. Remember the following facts:
optimizing compilers generate very clever code (when you enable optimizations). Try for example some recent GCC invoked as g++ -march=native -O3 -Wall (and perhaps also -fverbose-asm -S if you want to look into the generated assembler code); see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
there are some extensions (done through standardized pragmas) to improve optimizations for MIMD; look into OpenMP and OpenACC.
consider explicit parallelization approaches: multi-threading (read some pthreads programming tutorial; see the sketch after this list), MPI...
look also into dialects for GPGPU computing like OpenCL & CUDA.
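For the multi-threading bullet, a minimal C++ sketch (using std::thread rather than raw pthreads, purely for brevity):

    #include <thread>
    #include <vector>
    #include <numeric>
    #include <cstdio>

    int main() {
        std::vector<int> v(1000);
        std::iota(v.begin(), v.end(), 1);
        long s1 = 0, s2 = 0;
        // Two independent instruction streams, each summing its own half
        // of the data - MIMD with nothing but the ordinary scalar ISA.
        std::thread t1([&] { s1 = std::accumulate(v.begin(), v.begin() + 500, 0L); });
        std::thread t2([&] { s2 = std::accumulate(v.begin() + 500, v.end(), 0L); });
        t1.join();
        t2.join();
        std::printf("sum=%ld\n", s1 + s2);
        return 0;
    }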
See also this answer to a related question.
If MIMD is more powerful than SIMD, it would be better to write C++ code that supports MIMD.
Certainly not always, if you just care about performance. As usual, it depends, and you need to benchmark.

Is there a way to force Halide not to generate code that uses vector instructions?

We have implemented a few algorithms in the Halide language that use trigonometric functions such as arctan. For instrumentation purposes, we want to force Halide not to generate vector instructions.
We are using Visual C++ on Windows, with the cl compiler from the Visual Studio 2013 toolchain. So far we have tried forcing cl with /arch:IA32, but it still generates vector instructions.
Is there a way to force this from the Halide side, or some way to intercept math library calls so that Halide uses arctan functions written by us that are not optimized with vector instructions?
Generally Halide will not generate any code for atan and the implementation will come from the system math library (libm). (This is not true for all math routines as we provide internal implementations for some, but usually this is made explicit via names such as fast_log, fast_exp, etc.) To override this, you would generally provide your own implementation of libm or atan (and atan2, etc.), but Halide may allow you to define atan_f32 and atan_f64 to do the override. This may be advantageous as those should be declared with weak linkage, though that likely does not work on Windows. You could also change the definitions of these routines in src/runtime/posix_math.ll to point to your own.
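For example, a hedged sketch of the override route. Whether the runtime actually resolves these as weak symbols depends on your platform, and as noted above it likely does not work on Windows:

    #include <cmath>

    // Hypothetical scalar replacements for the atan_f32/atan_f64 lookups
    // mentioned above. These simply forward to libm's scalar atan; a
    // deliberately unvectorized implementation of your own could go here.
    extern "C" float atan_f32(float x) {
        return static_cast<float>(std::atan(static_cast<double>(x)));
    }

    extern "C" double atan_f64(double x) {
        return std::atan(x);
    }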
In general Halide will only generate vectorized code if the schedule says to do so. However, llvm has automatic vectorization passes that can generate vector instructions. On x86_64, the SIMD instructions will generally be used for scalar floating-point computation. On 32-bit x86, if you do not turn on any of the x86 SIMD flags in the Target (e.g. none of SSE41, AVX, etc.) then we should set the llvm target machine to disallow SIMD instructions entirely. But that will not affect stuff in libm unless you take measures to do so at final link time.
You can also use HalideExtern to declare a call to a routine of your own choosing and use that instead of atan.
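A sketch of that route, where my_atan is a hypothetical routine you supply and the HalideExtern_1 macro declares the pipeline-side wrapper:

    #include "Halide.h"
    #include <cmath>

    // The scalar routine we want the pipeline to call instead of atan.
    extern "C" float my_atan(float x) {
        return atanf(x);  // deliberately plain scalar code
    }

    // Declares Halide::Expr my_atan(Halide::Expr) so Funcs can call it.
    HalideExtern_1(float, my_atan, float);

    Halide::Func make_pipeline(Halide::Func input) {
        Halide::Var x, y;
        Halide::Func out;
        out(x, y) = my_atan(input(x, y));  // calls our routine, not libm's atan
        return out;
    }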
You ought to be able to set the target to be, say, host-x86-64, which should prevent Halide from using any vectorization (i.e. using sse4/avx* instructions).
If you are using AOT with generators, look at: http://halide-lang.org/tutorials/tutorial_lesson_15_generators_usage.html The my_first_generator_basic example should not use any SIMD instructions.
I'm not too familiar with JIT, but this example shows how to set the target while JITing: https://github.com/halide/Halide/wiki/Minimal-GPU-example You should be able to use a similar approach to specify the target as x86-64.
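For instance, a sketch assuming a reasonably recent Halide (the exact realize/Target API has shifted between versions):

    #include "Halide.h"

    int main() {
        Halide::Var x;
        Halide::Func f;
        f(x) = Halide::atan(Halide::cast<float>(x));

        // A target string with no SIMD feature flags appended (no sse41,
        // avx, ...). Swap "windows" for your host OS when JITing.
        Halide::Target t("x86-64-windows");
        Halide::Buffer<float> out = f.realize({64}, t);
        return 0;
    }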

What is the best way to use OpenMP with multiple subroutines in Fortran

I have a program written in Fortran with more than 100 subroutines, around 30 of which contain OpenMP code. I was wondering what the best procedure is for compiling these subroutines. When I compiled all the files at once, I found that the OpenMP-compiled code runs even slower than the version without OpenMP. Should I compile the subroutines with OpenMP directives separately? What is the best practice under these conditions?
Thank you so much.
Best Regards,
Jdbaba
OpenMP-aware compilers look for the OpenMP directives (the sentinel after a comment symbol at the beginning of the line). Therefore, compiling sources without OpenMP code with an OpenMP-aware compiler should result in identical, or very nearly identical, object files (and executable).
Edit: One should note that, as stated by Hristo Iliev below, enabling OpenMP could affect the serial code, for example by using OpenMP versions of libraries that may differ in algorithm (to be more effective in parallel) and in optimizations.
Most likely, the problem here is more related to your code algorithms.
Or perhaps you did not compile with the same optimization flags when comparing OpenMP and non-OpenMP versions.

Best compiler flags for an Objective-C project with the OpenCV framework

I'm compiling an iOS project using the OpenCV framework, so I'm interested to know the best compiler flags for my project.
The project processes a lot of pixel matrices, so from the compiler side I need SIMD instructions to process these matrices as efficiently as possible.
I am using these flags: -mfpu=neon, -mfloat-abi=softfp and -O3.
I have also found these other flags:
-mno-thumb
-mfpu=maverick
-ftree-vectorize
-DNS_BLOCK_ASSERTIONS=1
I don't really know whether they will save me much CPU time. I searched Google, but didn't find anything giving good reasons to pick one set of compiler flags over another.
Thanks
I am also using the same flags that you use for NEON. Note that the optimization level (-O3 or anything else) does little for NEON intrinsic code; it mainly optimizes the plain ARM code.
As Vasile said, the best performance is gained by writing the NEON code in assembly.
The easiest way is to write a program that uses NEON intrinsics, compile it with the flags you mentioned, and then take the generated assembly code as the starting point for further optimization.
A lot of optimization can be done by parallelizing or by making use of NEON's dual-issue capabilities.
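For instance, here is a minimal intrinsics sketch of the kind you would compile with those flags and then inspect and hand-tune in assembly (the function name and sizes are illustrative):

    #include <arm_neon.h>

    // Add two float arrays four lanes at a time with NEON intrinsics.
    // Assumes n is a multiple of 4; a real version needs a scalar tail loop.
    void add_f32(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(out + i, vaddq_f32(va, vb));
        }
    }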
The problem is that compilers are not very good at generating vectorized code, so just enabling NEON will not gain you much (maybe 10%?).
What you can do is profile your app and hand-write, using NEON, the parts that eat your time. And if you do, why not patch them into the public OpenCV source?
As of now, OpenCV has little to no code optimized for NEON (for x86 SSE2 it is much better optimized).