I am trying to understand the conceptual reason why OpenMP breaks loop vectorization here. Any suggestions for fixing this would also be helpful. I am considering parallelizing the loop manually, but that would certainly not be elegant and would result in a massive amount of code bloat, as my code consists of several such sections that lend themselves to vectorization and parallelization.
I am using
Microsoft (R) C/C++ Optimizing Compiler Version 17.00.60315.1 for x64
With OpenMP:
info C5002: loop not vectorized due to reason '502'
Without OpenMP:
info C5001: loop vectorized
The VS vectorization page says this error happens when:
Induction variable is stepped in some manner other than a simple +1
Can I force it to step in stride 1?
The loop
#pragma omp parallel for
for (int j = 0; j < H*W; j++) // A,B,C,D,IN are __restricted
{
    float Gs = D[j] - B[j];
    float Gc = A[j] - C[j];
    in[j] = atan2f(Gs, Gc);
}
Best Effort(?)
#pragma omp parallel
{   // This seems to vectorize, but it still requires quite a lot of boilerplate code
    int middle = H*W/2;
    #pragma omp sections nowait
    {
        #pragma omp section
        for (int j = 0; j < middle; j++)
        {
            float Gs = D[j] - B[j];
            float Gc = A[j] - C[j];
            in[j] = atan2f(Gs, Gc);
        }
        #pragma omp section
        for (int j = middle; j < H*W; j++)
        {
            float Gs = D[j] - B[j];
            float Gc = A[j] - C[j];
            in[j] = atan2f(Gs, Gc);
        }
    }
}
I recommend that you do the vectorization manually. One reason is that auto-vectorization does not seem to handle loop-carried dependencies well (loop unrolling).
To avoid code bloat and arcane intrinsics I use Agner Fog's vectorclass. In my experience it's just as fast as using intrinsics, and it automatically takes advantage of SSE2 through AVX2 (AVX2 is tested on an Intel emulator) depending on how you compile. I have written GEMM code using the vectorclass that works on SSE2 up to AVX2, and when I run it on a system with AVX my code is already faster than Eigen, which only uses SSE. Here is your function with the vectorclass (I did not try unrolling the loop).
#include "omp.h"
#include "math.h"
#include "vectorclass.h"
#include "vectormath.h"
void loop(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
#pragma omp parallel for
for (int j = 0; j < H*W; j+=8)//A,B,C,D,IN are __restricted, W*H must be a multiple of 8
{
Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
Vec8f invec = atan(Gs, Gc);
invec.store(&in[j]);
}
}
When doing the vectorization yourself you have to be careful with array bounds. In the function above, H*W needs to be a multiple of 8. There are several solutions for that, but the easiest and most efficient is to make the arrays (A, B, C, D, in) a bit larger (at most 7 floats larger) if necessary, so that their size is a multiple of 8 (a sketch of this padding approach is shown after the code below). Another solution is to use the following code, which does not require H*W to be a multiple of 8, but it's not as pretty.
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

void loop_fix(const int H, const int W, const int outer_stride, float *A, float *B, float *C, float *D, float* in) {
    #pragma omp parallel for
    for (int j = 0; j < ROUND_DOWN(H*W,8); j += 8) // A,B,C,D,IN are __restricted
    {
        Vec8f Gs = Vec8f().load(&D[j]) - Vec8f().load(&B[j]);
        Vec8f Gc = Vec8f().load(&A[j]) - Vec8f().load(&C[j]);
        Vec8f invec = atan2(Gs, Gc);
        invec.store(&in[j]);
    }
    for (int j = ROUND_DOWN(H*W,8); j < H*W; j++) {
        float Gs = D[j] - B[j];
        float Gc = A[j] - C[j];
        in[j] = atan2f(Gs, Gc);
    }
}
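For the first approach (padding the arrays), here is a minimal sketch of what the allocation might look like; the exact rounding logic is illustrative and not part of the original code:

// Sketch: round the element count up to the next multiple of 8 so the
// vectorized loop can safely run over the whole padded range.
size_t count  = size_t(H) * W;
size_t padded = (count + 7) & ~size_t(7);   // next multiple of 8
float* A  = new float[padded]();            // value-initialized, so the padding is zero
float* B  = new float[padded]();
float* C  = new float[padded]();
float* D  = new float[padded]();
float* in = new float[padded]();
// Fill the first H*W elements as usual; the vector loop can then step
// j = 0 .. padded in strides of 8 and only computes harmless values in the tail.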
One challenge with doing the vectorization yourself is finding a SIMD math library (e.g. for atan2f). The vectorclass supports 3 options. Non-SIMD, LIBM by AMD, and SVML by Intel (I used the non-SIMD option in the code above).
SIMD math libraries for SSE and AVX
Some last comments you might want to consider. Visual Studio has auto-parallelization (off by default) as well as auto-vectorization (on by default, at least in release mode). You can try this instead of OpenMP to reduce code bloat.
http://msdn.microsoft.com/en-us/library/hh872235.aspx
Additionally, Microsoft has the parallel patterns library. It's worth looking into since Microsoft's OpenMP support is limited. It's nearly as easy as OpenMP to use. It's possible that one of these options works better with auto-vectorization (though I doubt it). Like I said, I would do the vectorization manually with the vectorclass.
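For illustration, a minimal sketch of what the PPL route might look like for the loop in the question (untested; it assumes the same A, B, C, D, in arrays and splits the work by row so each task gets a contiguous range):

// Sketch only (needs <ppl.h> and <math.h>): Concurrency::parallel_for
// distributes the rows across tasks; the inner loop over a contiguous
// range is left to the auto-vectorizer.
Concurrency::parallel_for(0, H, [&](int row)
{
    for (int j = row * W; j < (row + 1) * W; ++j) {
        float Gs = D[j] - B[j];
        float Gc = A[j] - C[j];
        in[j] = atan2f(Gs, Gc);
    }
});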
You may try loop unrolling instead of sections:
#pragma omp parallel for
for (int j = 0; j < H*W; j += outer_stride) // A,B,C,D,IN are __restricted
{
    for (int ii = 0; ii < outer_stride; ii++) {
        float Gs = D[j+ii] - B[j+ii];
        float Gc = A[j+ii] - C[j+ii];
        in[j+ii] = atan2f(Gs, Gc);
    }
}
where outer_stride is a suitable multiple of your SIMD line. Also, you may find this answer useful.
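If your compiler supports OpenMP 4.0 (the MSVC version from the question does not, so treat this as an untested sketch), the combined parallel for simd construct asks for both threading and vectorization of the same loop:

// Sketch: OpenMP 4.0 combined construct on the original loop.
#pragma omp parallel for simd
for (int j = 0; j < H*W; ++j)   // A,B,C,D,in as in the question
{
    float Gs = D[j] - B[j];
    float Gc = A[j] - C[j];
    in[j] = atan2f(Gs, Gc);
}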
Related
I came across some inefficient code generation by Clang while answering a different question (How do i parallelize this code using openmp with reduction)
Let's consider this simple code:
void scale(float* inout, ptrdiff_t n, ptrdiff_t m, ptrdiff_t stride, float value)
{
    const float inverse = 1.f / value;
    # pragma omp parallel for
    for(ptrdiff_t i = 0; i < n; ++i) {
        # pragma omp simd
        for(ptrdiff_t j = 0; j < m; ++j)
            inout[i * stride + j] *= inverse;
    }
}
Where do you put the computation of the inverse and does it matter? Options that I've explored:
Outside the loop, where it is in the example
In the parallel section but before the loop
In the outer loop
In the inner loop
For GCC-11, option 1 generates the best code: one division, then a single memory load and broadcast per thread. Options 2-4 all generate basically the same code, doing the division once per thread.
Clang assembly
However, with Clang-13 the code is vastly different.
Option 1: Does a redundant memory load and broadcast in the inner loop. And it doesn't load via the stack pointer but wastes a general-purpose register as a pointer to the constant. If you change the code to require multiple constants, Clang will waste multiple GP registers.
Option 2: Same code pattern as GCC
Option 3: Repeats the division once per iteration of the outer loop
Option 4: Repeats the division in the inner loop
Summary
It seems as if Clang's code generation has some issues with pulling redundant computations out of OpenMP loops. Interestingly, it doesn't seem to affect the array index computation. That gets pulled out of the inner loop just fine.
If I want code that works well on both GCC and Clang, I have to write something like this:
void scale(float* inout, ptrdiff_t n, ptrdiff_t m, ptrdiff_t stride, float value)
{
    # pragma omp parallel
    {
        const float inverse = 1.f / value;
        # pragma omp for nowait
        for(ptrdiff_t i = 0; i < n; ++i) {
            # pragma omp simd
            for(ptrdiff_t j = 0; j < m; ++j)
                inout[i * stride + j] *= inverse;
        }
    }
}
But that is awfully verbose.
This whole thing is a minor nuisance in this code example but if you check out the code in the other answer above, it gets so bad (especially with the GP register waste) that it seriously impacts performance.
So in conclusion, am I missing something? Should I write loops differently to ensure good code in both Clang and GCC?
Supplementary information
Here is a version of the code that allows easy testing and here is a Godbolt link
#include <cstddef>
// using std::ptrdiff_t
#define CONST_LOCATION 1
void scale(float* inout, std::ptrdiff_t n, std::ptrdiff_t m, std::ptrdiff_t stride,
float value)
{
# if CONST_LOCATION == 1
/*
* Clang-13.0.1: Redundant broadcast from memory in inner loop.
* Wastes GP register for pointer to constant
* GCC-11.2: Optimal
*/
const float inv = 1.f / value;
#endif
# pragma omp parallel
{
# if CONST_LOCATION == 2
/*
* Clang: Redundant computation in outer loop setup. Otherwise optimal
* GCC: Same as Clang
*/
const float inv = 1.f / value;
# endif
# pragma omp for nowait
for(std::ptrdiff_t i = 0; i < n; ++i) {
# if CONST_LOCATION == 3
/*
* Clang: Redundant computation in inner loop setup!
* GCC: Same as 2
*/
const float inv = 1.f / value;
# endif
# pragma omp simd
for(std::ptrdiff_t j = 0; j < m; ++j) {
# if CONST_LOCATION == 4
/*
* Clang: Redundant computation in inner loop!
* GCC: Same as 2
*/
const float inv = 1.f / value;
# endif
inout[i*stride + j] *= inv;
}
}
}
}
Tested with -O3 -mavx2 -mfma -fopenmp for a reasonably generic, modern compilation.
To answer myself: it's the #pragma omp simd on the inner loop that messes up Clang's code generation, which is a shame because it does have a positive effect in some cases on some compilers.
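For completeness, a sketch of the variant that sidesteps the issue: drop the inner pragma and leave the inner loop to the compiler's own vectorizer (whether it still vectorizes then depends on the compiler and flags):

#include <cstddef>

void scale(float* inout, std::ptrdiff_t n, std::ptrdiff_t m, std::ptrdiff_t stride, float value)
{
    const float inverse = 1.f / value;
    // No "#pragma omp simd" on the inner loop; this avoids the redundant
    // broadcast/division patterns described above.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
        for (std::ptrdiff_t j = 0; j < m; ++j)
            inout[i * stride + j] *= inverse;
}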
I'm trying to learn about intrinsics and how to properly use and optimize them, so I decided to implement a function that computes the dot product of two arrays as a starting point.
I created two functions that compute the dot product of two int arrays. One is written the normal way: loop over every element of the two arrays, multiply each pair of elements, and accumulate the products into the dot product.
The other uses intrinsics: I process four elements of each array at a time, multiply them with _mm_mullo_epi32, then use two horizontal adds (_mm_hadd_epi32) to sum the current four products and add the result to the dot product. I proceed to the next four elements and repeat until the computed limit vec_loop, then handle the remaining elements the normal way to avoid reading past the end of the arrays. Finally I compare the performance of the two.
header file with the two types of dot product function:
// main.hpp
#ifndef main_hpp
#define main_hpp
#include <iostream>
#include <immintrin.h>
template<typename T>
T scalar_dot(T* a, T* b, size_t len){
T dot_product = 0;
for(size_t i=0; i<len; ++i) dot_product += a[i]*b[i];
return dot_product;
}
int sse_int_dot(int* a, int* b, size_t len){
size_t vec_loop = len/4;
size_t non_vec = len%4;
size_t start_non_vec_i = len-non_vec;
int dot_prod = 0;
for(size_t i=0; i<vec_loop; ++i)
{
__m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
__m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
va = _mm_mullo_epi32(va,vb);
va = _mm_hadd_epi32(va,va);
va = _mm_hadd_epi32(va,va);
dot_prod += _mm_cvtsi128_si32(va);
}
for(size_t i=start_non_vec_i; i<len; ++i) dot_prod += a[i]*b[i];
return dot_prod;
}
#endif
cpp code to measure the time taken of each function
// main.cpp
#include <iostream>
#include <chrono>
#include <random>
#include "main.hpp"
int main()
{
// generate random integers
unsigned seed = std::chrono::steady_clock::now().time_since_epoch().count();
std::mt19937_64 rand_engine(seed);
std::mt19937_64 rand_engine2(seed/2);
std::uniform_int_distribution<int> random_number(0,9);
size_t LEN = 10000000;
int* a = new int[LEN];
int* b = new int[LEN];
for(size_t i=0; i<LEN; ++i)
{
a[i] = random_number(rand_engine);
b[i] = random_number(rand_engine2);
}
#ifdef SCALAR
int dot1 = 0;
#endif
#ifdef VECTOR
int dot2 = 0;
#endif
// timing
auto start = std::chrono::high_resolution_clock::now();
#ifdef SCALAR
dot1 = scalar_dot(a,b,LEN);
#endif
#ifdef VECTOR
dot2 = sse_int_dot(a,b,LEN);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout<<"proccess taken "<<duration.count()<<" nanoseconds\n";
#ifdef SCALAR
std::cout<<"\nScalar : Dot product = "<<dot1<<"\n";
#endif
#ifdef VECTOR
std::cout<<"\nVector : Dot product = "<<dot2<<"\n";
#endif
return 0;
}
compilation:
intrinsic version : g++ main.cpp -DVECTOR -msse4.1 -o main.o
normal version : g++ main.cpp -DSCALAR -msse4.1 -o main.o
my machine:
Architecture: x86_64
CPU(s) : 1
CPU core(s): 4
Thread(s) per core: 1
Model name: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
L1d cache: 96 KiB
L1i cache: 128 KiB
L2 cache: 2 MiB
some Flags : sse, sse2, sse4_1, sse4_2
In main.cpp there are 10000000 elements in each int array. When I compile the code above on my machine, the intrinsic function runs slower than the normal version most of the time: the intrinsic version takes around 97529675 nanoseconds (and sometimes even longer), while the normal code only takes around 87568313 nanoseconds. I thought my intrinsic function should run faster when the optimization flags are off, but it turns out it is actually a little bit slower.
so my questions are:
Why does my intrinsic function run slower? (Am I doing something wrong?)
How can I correct my intrinsic implementation; what is the proper way?
Does the compiler auto-vectorize/unroll the normal code even when the optimization flag is off?
What is the fastest way to get the dot product given the specs of my machine?
I hope someone can help, thanks
So with #Peter Cordes, #Qubit and #j6t's suggestions, I tweaked the code a little bit: I now only do the multiplication inside the loop and moved the horizontal addition outside the loop. This increased the performance of the intrinsic version from around 97529675 nanoseconds to around 56444187 nanoseconds, which is significantly faster than my previous implementation, with the same compilation flags and 10000000 elements per int array.
here is the new function from main.hpp
int _sse_int_dot(int* a, int* b, size_t len){
size_t vec_loop = len/4;
size_t non_vec = len%4;
size_t start_non_vec_i = len-non_vec;
int dot_product;
__m128i vdot_product = _mm_set1_epi32(0);
for(size_t i=0; i<vec_loop; ++i)
{
__m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
__m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
__m128i vc = _mm_mullo_epi32(va,vb);
vdot_product = _mm_add_epi32(vdot_product,vc);
}
vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
dot_product = _mm_cvtsi128_si32(vdot_product);
for(size_t i=start_non_vec_i; i<len; ++i) dot_product += a[i]*b[i];
return dot_product;
}
If there is more to improve in this code, please point it out; for now I'm just going to leave it here as the answer.
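One further tweak that might squeeze out a bit more (an untested sketch along the lines of the unrolling suggestions): keep two independent vector accumulators so the adds don't all wait on a single loop-carried dependency chain:

// Sketch: two accumulators, 8 ints (two vectors) per iteration; the tail is
// handled scalar as before.
int sse_int_dot_2acc(int* a, int* b, size_t len){
    size_t vec_loop = len/8;
    size_t start_tail = vec_loop*8;
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for(size_t i=0; i<vec_loop; ++i)
    {
        __m128i va0 = _mm_loadu_si128((__m128i*)(a+(i*8)));
        __m128i vb0 = _mm_loadu_si128((__m128i*)(b+(i*8)));
        __m128i va1 = _mm_loadu_si128((__m128i*)(a+(i*8)+4));
        __m128i vb1 = _mm_loadu_si128((__m128i*)(b+(i*8)+4));
        acc0 = _mm_add_epi32(acc0, _mm_mullo_epi32(va0,vb0)); // independent chains
        acc1 = _mm_add_epi32(acc1, _mm_mullo_epi32(va1,vb1));
    }
    __m128i acc = _mm_add_epi32(acc0, acc1);
    acc = _mm_hadd_epi32(acc, acc);
    acc = _mm_hadd_epi32(acc, acc);
    int dot_product = _mm_cvtsi128_si32(acc);
    for(size_t i=start_tail; i<len; ++i) dot_product += a[i]*b[i];
    return dot_product;
}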
I'm using Intel VTune to analyze my parallel application.
As you can see, there is a huge Spin Time at the beginning of the application (represented as the orange section on the left side):
It's more than 28% of the application duration (which is roughly 0.14 seconds)!
As you can see, these functions are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier, and they look like OpenMP internals or system calls, but it's not specified where these functions are called from.
In addition, if we zoom in at the beginning of this section, we can notice a region instantiation, represented by the selected region:
However, I never call initInterTab2d and I have no idea if it's called by some of the libraries that I'm using (especially OpenCV).
Digging deeper and running an Advanced Hotspots analysis, I found a little bit more about the first unknown functions:
And expanding the Function/Call Stack tab:
But again, I can't really understand why these functions are called, why they take so long, and why only the master thread works during them while the others are in a "barrier" state.
If you're interested, this is the link to part of the code.
Notice that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):
The code structure is the following:
Compute some serial, non-parallelizable stuff. In particular, compute a chain of blurs, which is represented by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function which exploits IPP.
Start the parallel region, where 3 parallel for loops are used:
The first one calls hessianResponse.
A single thread adds the results to a shared vector.
The second parallel for fills localfindAffineShapeArgs, the data used by the next parallel region. The two regions can't be merged because of load imbalance.
The third region generates the final result in a balanced way.
Note: according to the lock analysis of VTune, the critical and barrier sections are not the reason for the spinning.
This is the main function of the code:
void HessianDetector::detectPyramidKeypoints(const Mat &image, cv::Mat &descriptors, const AffineShapeParams ap, const SIFTDescriptorParams sp)
{
float curSigma = 0.5f;
float pixelDistance = 1.0f;
cv::Mat octaveLayer;
// prepare first octave input image
if (par.initialSigma > curSigma)
{
float sigma = sqrt(par.initialSigma * par.initialSigma - curSigma * curSigma);
octaveLayer = gaussianBlur(image, sigma);
}
// while there is sufficient size of image
int minSize = 2 * par.border + 2;
int rowsCounter = image.rows;
int colsCounter = image.cols;
float sigmaStep = pow(2.0f, 1.0f / (float) par.numberOfScales);
int levels = 0;
while (rowsCounter > minSize && colsCounter > minSize){
rowsCounter/=2; colsCounter/=2;
levels++;
}
int scaleCycles = par.numberOfScales+2;
//-------------------Shared Vectors-------------------
std::vector<Mat> blurs (scaleCycles*levels+1, Mat());
std::vector<Mat> hessResps (levels*scaleCycles+2); //+2 because high needs an extra one
std::vector<Wrapper> localWrappers;
std::vector<FindAffineShapeArgs> findAffineShapeArgs;
localWrappers.reserve(levels*(scaleCycles-2));
vector<float> pixelDistances;
pixelDistances.reserve(levels);
for(int i=0; i<levels; i++){
pixelDistances.push_back(pixelDistance);
pixelDistance*=2;
}
//compute blurs at all layers (not parallelizable)
for(int i=0; i<levels; i++){
blurs[i*scaleCycles+1] = octaveLayer.clone();
for (int j = 1; j < scaleCycles; j++){
float sigma = par.sigmas[j]* sqrt(sigmaStep * sigmaStep - 1.0f);
blurs[j+1+i*scaleCycles] = gaussianBlur(blurs[j+i*scaleCycles], sigma);
if(j == par.numberOfScales)
octaveLayer = halfImage(blurs[j+1+i*scaleCycles]);
}
}
#pragma omp parallel
{
//compute all the hessianResponses
#pragma omp for collapse(2) schedule(dynamic)
for(int i=0; i<levels; i++)
for (int j = 1; j <= scaleCycles; j++)
{
int scaleCyclesLevel = scaleCycles * i;
float curSigma = par.sigmas[j];
hessResps[j+scaleCyclesLevel] = hessianResponse(blurs[j+scaleCyclesLevel], curSigma*curSigma);
}
//we need to allocate here localWrappers to keep alive the reference for FindAffineShapeArgs
#pragma omp single
{
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
int scaleCyclesLevel = scaleCycles * i;
localWrappers.push_back(Wrapper(sp, ap, hessResps[j+scaleCyclesLevel-1], hessResps[j+scaleCyclesLevel], hessResps[j+scaleCyclesLevel+1],
blurs[j+scaleCyclesLevel-1], blurs[j+scaleCyclesLevel]));
}
}
std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
#pragma omp for collapse(2) schedule(dynamic) nowait
for(int i=0; i<levels; i++)
for (int j = 2; j < scaleCycles; j++){
size_t c = (scaleCycles-2) * i +j-2;
//toDo: octaveMap is shared, need synchronization
//if(j==1)
// octaveMap = Mat::zeros(blurs[scaleCyclesLevel+1].rows, blurs[scaleCyclesLevel+1].cols, CV_8UC1);
float curSigma = par.sigmas[j];
// find keypoints in this part of octave for curLevel
findLevelKeypoints(curSigma, pixelDistances[i], localWrappers[c]);
localfindAffineShapeArgs.insert(localfindAffineShapeArgs.end(), localWrappers[c].findAffineShapeArgs.begin(), localWrappers[c].findAffineShapeArgs.end());
}
#pragma omp critical
{
findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
}
#pragma omp barrier
std::vector<Result> localRes;
#pragma omp for schedule(dynamic) nowait
for(int i=0; i<findAffineShapeArgs.size(); i++){
hessianKeypointCallback->onHessianKeypointDetected(findAffineShapeArgs[i], localRes);
}
#pragma omp critical
{
for(size_t i=0; i<localRes.size(); i++)
descriptors.push_back(localRes[i].descriptor);
}
}
Mat gaussianBlur(const Mat input, const float sigma)
{
Mat ret(input.rows, input.cols, input.type());
int size = (int)(2.0 * 3.0 * sigma + 1.0); if (size % 2 == 0) size++;
GaussianBlur(input, ret, Size(size, size), sigma, sigma, BORDER_REPLICATE);
return ret;
}
If you consider a 50 ms one-time cost (a fraction of the blink of an eye) to be a huge overhead, then you should probably focus on your workflow as such. Try to use one fully initialized process (with its threads and data structures) in a persistent way, to increase the work done during each run.
That said, it may be possible to reduce the overhead, but in any case you will be very dependent on the runtime and initialization cost of your library, thus limiting your performance portability.
Your performance analysis may also be problematic. AFAIK VTune uses sampling, your data indicates a 1 ms sampling interval. That means you may have just 50 samples during the critical initialization path of your application, too little for a confident analysis. VTune might also have some forms of OpenMP instrumentation that provides more accurate results at small time scales. In any case I would take any performance measurement over just 150 ms with a grain of salt unless I knew exactly what impact and method the measurement has.
P.S. Running a simple code like:
#include <stdio.h>
#include <omp.h>
int main() {
double start = omp_get_wtime();
#pragma omp parallel
{
#pragma omp barrier
#pragma omp master
printf("%f s\n", omp_get_wtime() - start);
}
}
Shows an initial thread creation overhead between 3 ms and 200 ms on different systems / thread counts with the Intel OpenMP runtime.
I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
Of course, for using this you should enable OpenMP support while compiling your code (usually by using /openmp flag or setting the project options).
You can find several examples of OpenMP usage in wiki.
Or use Concurrency::parallel_for (from <ppl.h>) like this:
Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
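As a rough sketch of what using the OpenMP 4.0 SIMD construct on the loop in the question might look like (here in the C++ spelling with std::complex; whether the abs call actually maps onto libmvec's SIMD routines depends on the compiler, the flags, and the glibc version):

#include <complex>

void vabs(const std::complex<double>* a, double* b, int N)
{
    // OpenMP 4.0 combined construct: distribute iterations across threads and
    // ask each thread to vectorize its chunk.
    #pragma omp parallel for simd
    for (int i = 0; i < N; ++i)
        b[i] = std::abs(a[i]);
}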
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so or not.
Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization are more examples of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. In order to use SIMD efficiently for this you need to use an array of struct of arrays. For example, for single float with a SIMD width of 4 you could use
//struct of arrays of four complex numbers
struct c4 {
float x[4]; // real values of four complex numbers
float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>
#define N 10
struct c4{
float x[4];
float y[4];
};
static inline void cabs_soa4(struct c4 *a, float *b) {
__m128 x4 = _mm_loadu_ps(a->x);
__m128 y4 = _mm_loadu_ps(a->y);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
int main(void)
{
int n4 = ((N+3)&-4)/4; //choose next multiple of 4 and divide by 4
printf("%d\n", n4);
struct c4 a[n4]; //array of struct of arrays
for(int i=0; i<n4; i++) {
for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
}
float b[4*n4];
for(int i=0; i<n4; i++) {
cabs_soa4(&a[i], &b[4*i]);
}
for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
It may help to unroll the loop a few times. In any case all this is moot for large N because the operation is memory bandwidth bound. For large N (meaning when the memory usage is much larger than the last level cache), although #pragma omp parallel may help some, the best solution is not to do this for large N. Instead do this in chunks which fit in the lowest level cache along with other compute operations. I mean something like this
for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}
I did not implement an array of struct of arrays here, but it should be easy to adjust the code for that.
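A rough sketch of what that adjustment might look like, combining the chunking with the struct-of-arrays version above (foo is the same placeholder for the computationally intensive follow-up work):

void foo(float *b);   // placeholder follow-up stage, as above

// Sketch: run cabs_soa4 over a cache-sized chunk of c4 blocks, then hand that
// chunk of results to foo() while it is still hot in cache.
void process_chunked(struct c4 *a, float *b, int nblocks, int chunk_blocks) {
    for (int i = 0; i < nblocks; i += chunk_blocks) {
        int end = i + chunk_blocks < nblocks ? i + chunk_blocks : nblocks;
        for (int j = i; j < end; ++j)
            cabs_soa4(&a[j], &b[4*j]);
        foo(&b[4*i]);
    }
}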
If you are using a modern compiler (GCC 5, for example), you can use Cilk+, which will give you nice array notation, automatic usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>
cilk_for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide to which level it should be SIMDed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing a near optimal scheduling, and that it should be able to make an optimal use of the cache.
Also, you can use std::future and std::async (they are part of C++11); maybe it's a clearer way of achieving what you want to do:
#include <future>
...
int main()
{
...
// Create async calculations
std::future<void> *futures = new std::future<void>[N];
for (int i = 0; i < N; ++i)
{
futures[i] = std::async([&a, &b, i]
{
b[i] = std::sqrt(a[i]);
});
}
// Wait for calculation of all async procedures
for (int i = 0; i < N; ++i)
{
futures[i].get();
}
...
return 0;
}
IdeOne live code
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is. I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com
I'm trying to optimise the following loop with OpenMP:
#pragma omp parallel for private(diff)
for (int j = 0; j < x.d; ++j) {
    diff = x(example,j) - x(chosen_pts[ndx - 1],j);
    #pragma omp atomic
    d2 += diff * diff;
}
But it actually runs 4x slower than without the #pragma.
EDIT
As Piotr S., coincoin and erenon pointed out, in my case x.d is so small that the parallelization overhead makes the code run slower. I'm posting the outer loop too; maybe there is some possibility for multithreading (x.n is over 100 million):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
//#pragma omp parallel for private(diff) reduction(+:d2)
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
if (dist2[i].first > max_dist) {
max_dist = dist2[i].first;
}
sum_distribution += dist2[i].first;
}
If someone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169, but as I measured 85% of the elapsed time comes from this loop.
Yes, the outer loop, as posted, can be parallelized with OpenMP.
All variables modified in the loop are either local to an iteration or are used for aggregation over the loop. And I assume that calls to x() in the calculation of diff have no side effects.
To do aggregation in parallel correctly and efficiently, you need to use an OpenMP loop with reduction clause. For sum_distribution the reduction operation is +, and for max_dist it's max. So, adding the following pragma in front of the outer loop should do the job:
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
Note that max as a reduction operation can only be used since OpenMP 3.1. It's not that new, so most OpenMP-enabled compilers already support it, but not all do; or you might be using an older version. So it makes sense to consult your compiler's documentation.
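For reference, a sketch of the outer loop with the clause applied (same code as in the question; d2 and diff are declared inside the body so each iteration gets its own copies):

float sum_distribution = 0.0f;
float max_dist = 0.0f;
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0f, diff;                 // private to each iteration
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) dist2[i].first = d2;   // each i is distinct, so no race
    if (dist2[i].first > max_dist) max_dist = dist2[i].first;
    sum_distribution += dist2[i].first;
}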