Clang + OpenMP inefficient loop invariants - c++

I came across some inefficient code generation by Clang while answering a different question (How do i parallelize this code using openmp with reduction)
Let's consider this simple code:
void scale(float* inout, ptrdiff_t n, ptrdiff_t m, ptrdiff_t stride, float value)
{
const float inverse = 1.f / value;
# pragma omp parallel for
for(ptrdiff_t i = 0; i < n; ++i) {
# pragma omp simd
for(ptrdiff_t j = 0; j < m; ++j)
inout[i * stride + j] *= inverse;
}
}
Where do you put the computation of the inverse and does it matter? Options that I've explored:
Outside the loop, where it is in the example
In the parallel section but before the loop
In the outer loop
In the inner loop
For GCC-11, option 1 generates the best code: one division, then a single memory load and broadcast per thread. Options 2-4 all generate basically the same code, doing the division once per thread.
Clang assembly
However, with Clang-13 the code is vastly different.
Option 1: Does a redundant memory load and broadcast in the inner loop. It also doesn't load via the stack pointer but wastes a general-purpose register as a pointer to the constant. If you change the code to require multiple constants, Clang wastes multiple GP registers.
Option 2: Same code pattern as GCC
Option 3: Repeats the division once per iteration of the outer loop
Option 4: Repeats the division in the inner loop
Summary
It seems as if Clang's code generation has some issues with pulling redundant computations out of OpenMP loops. Interestingly, it doesn't seem to affect the array index computation. That gets pulled out of the inner loop just fine.
If I want code that works well on both GCC and Clang, I have to write something like this:
void scale(float* inout, ptrdiff_t n, ptrdiff_t m, ptrdiff_t stride, float value)
{
# pragma omp parallel
{
const float inverse = 1.f / value;
# pragma omp for nowait
for(ptrdiff_t i = 0; i < n; ++i) {
# pragma omp simd
for(ptrdiff_t j = 0; j < m; ++j)
inout[i * stride + j] *= inverse;
}
}
}
But that is awfully verbose.
This whole thing is a minor nuisance in this code example, but if you check out the code in the other answer above, it gets so bad (especially with the GP register waste) that it seriously impacts performance.
So in conclusion, am I missing something? Should I write loops differently to ensure good code in both Clang and GCC?
Supplementary information
Here is a version of the code that allows easy testing and here is a Godbolt link
#include <cstddef>
// using std::ptrdiff_t
#define CONST_LOCATION 1
void scale(float* inout, std::ptrdiff_t n, std::ptrdiff_t m, std::ptrdiff_t stride,
float value)
{
# if CONST_LOCATION == 1
/*
* Clang-13.0.1: Redundant broadcast from memory in inner loop.
* Wastes GP register for pointer to constant
* GCC-11.2: Optimal
*/
const float inv = 1.f / value;
#endif
# pragma omp parallel
{
# if CONST_LOCATION == 2
/*
* Clang: Redundant computation in outer loop setup. Otherwise optimal
* GCC: Same as Clang
*/
const float inv = 1.f / value;
# endif
# pragma omp for nowait
for(std::ptrdiff_t i = 0; i < n; ++i) {
# if CONST_LOCATION == 3
/*
* Clang: Redundant computation in inner loop setup!
* GCC: Same as 2
*/
const float inv = 1.f / value;
# endif
# pragma omp simd
for(std::ptrdiff_t j = 0; j < m; ++j) {
# if CONST_LOCATION == 4
/*
* Clang: Redundant computation in inner loop!
* GCC: Same as 2
*/
const float inv = 1.f / value;
# endif
inout[i*stride + j] *= inv;
}
}
}
}
Tested with -O3 -mavx2 -mfma -fopenmp for a reasonably generic, modern compilation.

To answer myself: it's the #pragma omp simd in the inner loop that messes up Clang's code generation. Which is a shame, because it does have a positive effect in some cases on some compilers.
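For completeness, here is a minimal sketch of the implied workaround: drop the inner #pragma omp simd and leave the inner loop to the auto-vectorizer. This is only a sketch against the flags above (-O3 -mavx2 -mfma -fopenmp); whether it matches the simd version's throughput would have to be checked on Godbolt.
#include <cstddef>
// Sketch: same kernel as above, without "#pragma omp simd" on the inner loop,
// so Clang is free to hoist the broadcast of `inverse` out of the loop itself.
void scale_no_simd(float* inout, std::ptrdiff_t n, std::ptrdiff_t m,
                   std::ptrdiff_t stride, float value)
{
    const float inverse = 1.f / value;
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        for (std::ptrdiff_t j = 0; j < m; ++j)
            inout[i * stride + j] *= inverse;
    }
}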

Related

_kmp huge overhead and spin time for unknown calls in OpenMP?

I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom in at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea if it's called by one of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspot analysis, I found out a little bit more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are there, why they take so long, and why only the master thread works during them while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
Compute some serial, non-parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
Start the parallel region, where 3 parallel for loops are used:
The first one calls hessianResponse.
A single thread adds the results to a shared vector.
The second parallel for (which fills localfindAffineShapeArgs) generates the data used by the next parallel for. The two can't be merged because of load imbalance.
The third one generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason for the spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
If you consider a 50 ms one-time cost (a fraction of the blink of an eye) to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, and your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, which is too little for a confident analysis. VTune might also have some form of OpenMP instrumentation that provides more accurate results at small time scales. In any case, I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact the measurement has and how it works.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
shows an initial thread-creation overhead of between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.

OpenMP: parallel for doesn't do anything

I'm trying to make a parallel version of SIFT algorithm in OpenCV.
In particular in sift.cpp:
static void calcDescriptors(const std::vector<Mat>& gpyr, const std::vector<KeyPoint>& keypoints,
Mat& descriptors, int nOctaveLayers, int firstOctave )
{
...
#pragma omp parallel for
for( size_t i = 0; i < keypoints.size(); i++ )
{
...
calcSIFTDescriptor(img, ptf, angle, size*0.5f, d, n, descriptors.ptr<float>((int)i));
...
}
This already gives a speed-up from 84 ms to 52 ms on a quad-core machine. It doesn't scale that well, but it's already a good result for adding one line of code.
Most of the computation inside the loop is performed by calcSIFTDescriptor(), which takes on average 100 us per call. So most of the computation time comes from the really high number of times calcSIFTDescriptor() is called (thousands of times), and accumulating all these 100 us adds up to several ms.
Anyway, I'm trying to optimize calcSIFTDescriptor() itself. In particular, the code is divided between two for loops, and the following one takes on average 60 us:
for( k = 0; k < len; k++ )
{
float rbin = RBin[k], cbin = CBin[k];
float obin = (Ori[k] - ori)*bins_per_rad;
float mag = Mag[k]*W[k];
int r0 = cvFloor( rbin );
int c0 = cvFloor( cbin );
int o0 = cvFloor( obin );
rbin -= r0;
cbin -= c0;
obin -= o0;
if( o0 < 0 )
o0 += n;
if( o0 >= n )
o0 -= n;
// histogram update using tri-linear interpolation
float v_r1 = mag*rbin, v_r0 = mag - v_r1;
float v_rc11 = v_r1*cbin, v_rc10 = v_r1 - v_rc11;
float v_rc01 = v_r0*cbin, v_rc00 = v_r0 - v_rc01;
float v_rco111 = v_rc11*obin, v_rco110 = v_rc11 - v_rco111;
float v_rco101 = v_rc10*obin, v_rco100 = v_rc10 - v_rco101;
float v_rco011 = v_rc01*obin, v_rco010 = v_rc01 - v_rco011;
float v_rco001 = v_rc00*obin, v_rco000 = v_rc00 - v_rco001;
int idx = ((r0+1)*(d+2) + c0+1)*(n+2) + o0;
hist[idx] += v_rco000;
hist[idx+1] += v_rco001;
hist[idx+(n+2)] += v_rco010;
hist[idx+(n+3)] += v_rco011;
hist[idx+(d+2)*(n+2)] += v_rco100;
hist[idx+(d+2)*(n+2)+1] += v_rco101;
hist[idx+(d+3)*(n+2)] += v_rco110;
hist[idx+(d+3)*(n+2)+1] += v_rco111;
}
So I tried to add #pragma omp parallel for private(k) before it, and a weird thing happens: nothing happens at all!
Introducing this parallel for makes the computation take on average 53 ms (against 52 ms before). I would have expected one or more of the following results:
Taking >52 ms because of the overhead of a new parallel for
Taking <52 ms because of the gain obtained from the parallel for
Some sort of inconsistency in the result, since as you can see the shared vector hist is updated concurrently. None of this happens: the result is still correct and no atomic or critical is used.
I'm an OpenMP newbie, but from what I see it's as if this inner parallel for is simply ignored. Why does this happen?
NOTE: all reported times are averages over 10,000 runs with the same input.
UPDATE:
I tried to remove the first parallel for, leaving only the one in calcSIFTDescriptor, and what I was expecting happened: inconsistency was observed due to the lack of any thread-safety mechanism. Introducing #pragma omp critical(dataupdate) before updating hist gave consistency again, but now performance is horrible: 245 ms on average.
I think this is because of the overhead of the parallel for in calcSIFTDescriptor, which isn't worth paying to parallelize 30 us of work.
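For reference, a hedged sketch of what an alternative to the critical section could look like: with a compiler that supports OpenMP 4.5 array-section reductions, each thread gets a private copy of the histogram that is summed at the end. The function and the histLen parameter below are illustrative, not part of the original OpenCV code.
// Sketch only: demonstrates an OpenMP 4.5 array-section reduction as an
// alternative to a critical section around every histogram update.
// histLen is a hypothetical name for the total number of bins in hist.
void accumulate_hist(const int *idx, const float *val, int len,
                     float *hist, int histLen)
{
    #pragma omp parallel for reduction(+ : hist[0:histLen])
    for (int k = 0; k < len; k++)
        hist[idx[k]] += val[k];   // concurrent updates go to per-thread copies
}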
BUT THE QUESTION STILL REMAINS: why didn't the first version (with two parallel for) produce any change, in either performance or consistency?
I found the answer by myself: the second (nested) parallel for doesn't have any effect, for the reason described here:
OpenMP parallel regions can be nested inside each other. If nested
parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists
only of the encountering thread. If nested parallelism is enabled,
then the new team may consist of more than one thread.
So since the first parallel for takes all the available threads, the second one's team consists only of the encountering thread itself, and so nothing happens.
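For illustration (not part of the original answer), a minimal sketch showing how nested parallelism can be enabled and observed; whether it actually pays off for this workload is a separate question:
#include <omp.h>
#include <stdio.h>
int main() {
    // Allow two active levels of parallelism (older runtimes: omp_set_nested(1)).
    omp_set_max_active_levels(2);
    #pragma omp parallel num_threads(2)
    {
        // With nesting enabled, each outer thread gets its own inner team
        // instead of a team consisting only of itself.
        #pragma omp parallel num_threads(2)
        #pragma omp critical
        printf("outer thread %d, inner thread %d of %d\n",
               omp_get_ancestor_thread_num(1),
               omp_get_thread_num(), omp_get_num_threads());
    }
}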
Cheers to myself!

Fastest way to calculate the abs()-values of a complex array

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
Of course, for using this you should enable OpenMP support while compiling your code (usually by using /openmp flag or setting the project options).
You can find several examples of OpenMP usage in wiki.
Or use Concurrency::parallel_for like this:
Concurrency::parallel_for(0, N, [&a, &b](int i)
{
b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
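As a rough illustration of the idea (not taken from any of the answers): computing the magnitude explicitly lets a single loop combine threads and SIMD with #pragma omp parallel for simd, without depending on a vectorized cabs being available. This assumes C99 complex arrays and an OpenMP 4.0 compiler:
#include <complex.h>
#include <math.h>
#include <stddef.h>
// Sketch: threads + SIMD on one loop; note that sqrt(re*re + im*im) can
// overflow/underflow in cases where cabs/hypot would not.
void cabs_all(const double complex *a, double *b, size_t n)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++) {
        double re = creal(a[i]);
        double im = cimag(a[i]);
        b[i] = sqrt(re * re + im * im);
    }
}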
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so or not.
Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization are more examples of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. To use SIMD efficiently for this you need to use an array of structs of arrays. For example, for single-precision float with a SIMD width of 4 you could use
//struct of arrays of four complex numbers
struct c4 {
float x[4]; // real values of four complex numbers
float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>
#define N 10
struct c4{
float x[4];
float y[4];
};
static inline void cabs_soa4(struct c4 *a, float *b) {
__m128 x4 = _mm_loadu_ps(a->x);
__m128 y4 = _mm_loadu_ps(a->y);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
int main(void)
{
int n4 = ((N+3)&-4)/4; //choose next multiple of 4 and divide by 4
printf("%d\n", n4);
struct c4 a[n4]; //array of struct of arrays
for(int i=0; i<n4; i++) {
for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
}
float b[4*n4];
for(int i=0; i<n4; i++) {
cabs_soa4(&a[i], &b[4*i]);
}
for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
It may help to unroll the loop a few times. In any case, all this is moot for large N because the operation is memory-bandwidth bound. For large N (meaning when the memory usage is much larger than the last-level cache), although #pragma omp parallel may help some, the best solution is not to do this over the whole array at once. Instead, do it in chunks which fit in the lowest-level cache, together with your other compute operations. I mean something like this:
for(int i = 0; i < nchunks; i++) {
for(int j = 0; j < chunk_size; j++) {
b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
}
foo(&b[i*chunk_size]); // foo is computationally intensive.
}
I did not implement the array of structs of arrays here, but it should be easy to adjust the code for that (see the sketch below).
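A possible sketch of that adjustment, reusing cabs_soa4 and the hypothetical foo from the snippets above, and assuming chunk_size is a multiple of 4:
// Sketch: chunked processing over the struct-of-arrays layout.
// Assumes chunk_size % 4 == 0 and nchunks * chunk_size == 4 * n4.
void process_chunked(struct c4 *a, float *b, int nchunks, int chunk_size)
{
    for (int i = 0; i < nchunks; i++) {
        for (int j = 0; j < chunk_size / 4; j++) {
            int base = i * (chunk_size / 4) + j;   // index into the c4 array
            cabs_soa4(&a[base], &b[4 * base]);     // four magnitudes at a time
        }
        foo(&b[i * chunk_size]);                   // computationally intensive stage
    }
}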
If you are using a modern compiler (GCC 5, for example), you can use Cilk+, which gives you nice array notation, automatic usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>
cilk_for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide at which level it should be SIMDed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing a near optimal scheduling, and that it should be able to make an optimal use of the cache.
Also, you can use std::future and std::async (they are part of C++11); maybe it's a clearer way of achieving what you want to do:
#include <future>
...
int main()
{
...
// Create async calculations
std::future<void> *futures = new std::future<void>[N];
for (int i = 0; i < N; ++i)
{
futures[i] = std::async([&a, &b, i]
{
b[i] = std::sqrt(a[i]);
});
}
// Wait for calculation of all async procedures
for (int i = 0; i < N; ++i)
{
futures[i].get();
}
...
return 0;
}
IdeOne live code
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is. I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com
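A hedged variant of the same idea that launches one task per chunk rather than one future per element (the function name and chunking policy are illustrative, not from the original answer):
#include <algorithm>
#include <cmath>
#include <future>
#include <thread>
#include <vector>
// Sketch: split [0, N) into roughly hardware_concurrency() ranges.
void abs_async(const float *a, float *b, int N)
{
    unsigned tasks = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::future<void>> futures;
    for (unsigned t = 0; t < tasks; ++t) {
        int begin = static_cast<int>(static_cast<long long>(N) * t / tasks);
        int end   = static_cast<int>(static_cast<long long>(N) * (t + 1) / tasks);
        futures.push_back(std::async(std::launch::async, [=] {
            for (int i = begin; i < end; ++i)
                b[i] = std::sqrt(a[i]);   // sqrt stands in for cabs, as above
        }));
    }
    for (auto &f : futures)
        f.get();   // wait for all chunks
}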

Parallel for with omp

I'm trying to optimise the following loop with OpenMP:
#pragma omp parallel for private(diff)
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
#pragma omp atomic
d2 += diff * diff;
}
But it actually runs 4x slower than without the #pragma.
EDIT
As Piotr S., coincoin and erenon pointed out, in my case x.d is so small that parallelism makes my code run slower. I'm posting the outer loop too; maybe there is some possibility for multithreading (x.n is over 100 million):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
//#pragma omp parallel for private(diff) reduction(+:d2)
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
if (dist2[i].first > max_dist) {
max_dist = dist2[i].first;
}
sum_distribution += dist2[i].first;
}
If someone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169, but as I measured 85% of the elapsed time comes from this loop.
Yes, the outer loop, as posted, can be parallelized with OpenMP.
All variables modified in the loop are either local to an iteration or are used for aggregation over the loop. And I assume that calls to x() in the calculation of diff have no side effects.
To do aggregation in parallel correctly and efficiently, you need to use an OpenMP loop with reduction clause. For sum_distribution the reduction operation is +, and for max_dist it's max. So, adding the following pragma in front of the outer loop should do the job:
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
Note that max as a reduction operation is only available since OpenMP 3.1. It's not that new, so most OpenMP-enabled compilers already support it, but not all; or you might be using an older version. So it makes sense to consult the documentation for your compiler.
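Put together, the outer loop from the question with the reduction pragma applied looks like this (loop body taken verbatim from the question, with the commented-out inner pragma dropped):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0, diff;
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;
    }
    if (dist2[i].first > max_dist) {
        max_dist = dist2[i].first;
    }
    sum_distribution += dist2[i].first;
}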

How to hint OpenMP Stride?

I am trying to understand the conceptual reason why OpenMP breaks loop vectorization. Also any suggestions for fixing this would be helpful. I am considering manually parallelizing this to fix this issue, but that would certainly not be elegant and result in a massive amount of code bloat, as my code consists of several such sections that lend themselves to vectorization and parallelization.
I am using
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.60315.1 for x64
With OpenMP:
info C5002: loop not vectorized due to reason '502'
Without OpenMP:
info C5001: loop vectorized
The VS vectorization page says this error happens when:
Induction variable is stepped in some manner other than a simple +1
Can I force it to step in stride 1?
The loop
#pragma omp parallel for
for (int j = 0; j < H*W; j++)//A,B,C,D,IN are __restricted
{
float Gs = D[j]-B[j];
float Gc = A[j]-C[j];
in[j]=atan2f(Gs,Gc);
}
Best Effort(?)
#pragma omp parallel
{ // This seems to vectorize, but it still requires quite a lot of boilerplate code
int middle = H*W/2;
#pragma omp sections nowait
{
#pragma omp section
for (int j = 0; j < middle; j++)
{
float Gs = D[j]-B[j];
float Gc = A[j]-C[j];
in[j]=atan2f(Gs,Gc);
}
#pragma omp section
for (int j = middle; j < H*W; j++)
{
float Gs = D[j]-B[j];
float Gc = A[j]-C[j];
in[j]=atan2f(Gs,Gc);
}
}
}
I recommend that you do the vectorization manually. One reason is that auto-vectorization does not seem to handle loop-carried dependencies well (loop unrolling).
To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it's just as fast as using intrinsics and it automatically takes advantage of SSE2-AVX2 (AVX2 is tested on an Intel emulator) depending on how you compile. I have written GEMM code using the vectorclass that works on SSE2 up to AVX2, and when I run on a system with AVX my code is already faster than Eigen, which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).
#include "omp.h"
#include "math.h"
#include "vectorclass.h"
#include "vectormath.h"
void loop(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
#pragma omp parallel for
for (int j = 0; j < H*W; j+=8)//A,B,C,D,IN are __restricted, W*H must be a multiple of 8
{
Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
Vec8f invec = atan2(Gs, Gc); // vector two-argument arctangent, matching atan2f
invec.store(&in[j]);
}
}
When doing the vectorization yourself you have to be careful with array bounds. In the function above, H*W needs to be a multiple of 8. There are several solutions for that, but the easiest and most efficient is to make the arrays (A, B, C, D, in) a bit larger (at most 7 floats larger) if necessary so that H*W rounds up to a multiple of 8. However, another solution is to use the following code, which does not require H*W to be a multiple of 8, but it's not as pretty.
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
void loop_fix(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
#pragma omp parallel for
for (int j = 0; j < ROUND_DOWN(H*W,8); j+=8)//A,B,C,D,IN are __restricted
{
Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
Vec8f invec = atan2(Gs, Gc); // vector two-argument arctangent, matching atan2f
invec.store(&in[j]);
}
for(int j=ROUND_DOWN(H*W,8); j<H*W; j++) {
float Gs = D[j]-B[j];
float Gc = A[j]-C[j];
in[j]=atan2f(Gs,Gc);
}
}
One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports three options: non-SIMD, LIBM by AMD, and SVML by Intel (I used the non-SIMD option in the code above).
SIMD math libraries for SSE and AVX
Some last comments you might want to consider. Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try this instead of OpenMP to reduce code bloat.
http://msdn.microsoft.com/en-us/library/hh872235.aspx
Additionally, Microsoft has the Parallel Patterns Library. It's worth looking into since Microsoft's OpenMP support is limited, and it's nearly as easy to use as OpenMP. It's possible that one of these options works better with auto-vectorization (though I doubt it). Like I said, I would do the vectorization manually with the vectorclass.
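For example, a hedged sketch of the same loop with the PPL's concurrency::parallel_for (assuming the MSVC toolchain above; whether the auto-vectorizer then vectorizes the body would still need to be checked):
#include <ppl.h>
#include <math.h>
// Sketch: PPL version of the loop from the question.
void compute_ppl(int H, int W, const float *A, const float *B,
                 const float *C, const float *D, float *in)
{
    concurrency::parallel_for(0, H * W, [=](int j) {
        float Gs = D[j] - B[j];
        float Gc = A[j] - C[j];
        in[j] = atan2f(Gs, Gc);
    });
}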
You may try loop unrolling instead of sections:
#pragma omp parallel for
for (int j = 0; j < H*W; j += outer_stride)//A,B,C,D,IN are __restricted
{
for (int ii = 0; ii < outer_stride; ii++) {
float Gs = D[j+ii]-B[j+ii];
float Gc = A[j+ii]-C[j+ii];
in[j+ii] = atan2f(Gs,Gc);
}
}
where outer_stride is a suitable multiple of your SIMD width. Also, you may find this answer useful.