SIMD optimisation for cross-pattern access

I'm trying to write a Monte Carlo simulation of the Ising model, and I was wondering whether it is possible to use SIMD optimisations for accessing data in a cross pattern.
I basically want to know if there's any way of speeding up this function.
//up/down/left/right stencil accumulation
float lattice::compute_point_energy(int row, int col) {
int accumulator=0;
accumulator+= get(row? row-1: size_-1, col);
accumulator+= get((row+1)%size_, col);
accumulator+= get(row, col? col-1: size_-1);
accumulator+= get(row, (col+1)%size_) ;
return -get(row, col) * (accumulator * J_ + H_);
}
get(i, j) is a method that accesses a flat std::vector of shorts. I see that there might be a few problems: the access has lots of ternary logic going on (for periodic boundary conditions), and none of the vector elements are adjacent. Is it possible to make SIMD optimisations for this chunk, or should I keep digging? Re-implementing the adjacency matrix and/or using a different container (e.g. an array, or a vector of a different type) are options.

SIMD is the last thing you'll want to try with this function.
I think you're trying to use an up/down/left/right 4-stencil for your computation. If so, your code should have a comment noting this.
You're losing a lot of speed in this function because of the potential for branching at your ternary operators and because modulus is relatively slow.
You'd do well to surround the two-dimensional space you're operating over with a ring of cells set to values appropriate for handling edge effects. This allows you to eliminate checks for edge effects.
For accessing your stencil, I find it often works to use something like the following:
const int width = 10;
const int height = 10;
const int offset[4] = {-1,1,-width,width};
double accumulator=0;
for(int i=0;i<4;i++)
accumulator += get(current_loc+offset[i]);
Notice that the mini-array has precalculated offsets to the neighbouring cells in your domain. A good compiler will likely unroll the foregoing loop.
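The ring of boundary cells and the precalculated offsets can be combined as in the following minimal sketch. The PaddedLattice type, its members and refresh_halo are hypothetical names chosen for illustration, not part of the question's lattice class; the sketch assumes periodic boundaries and only handles the four-neighbour sum.

```cpp
#include <vector>

// Hypothetical padded lattice: an n x n grid of spins stored inside an
// (n + 2) x (n + 2) buffer whose outer ring mirrors the opposite edges.
struct PaddedLattice {
    int n;                     // interior size
    int stride;                // padded row length, n + 2
    std::vector<short> cells;  // row-major padded storage, all spins start at +1
    int offset[4];             // precomputed left/right/up/down offsets

    explicit PaddedLattice(int size)
        : n(size), stride(size + 2), cells((size + 2) * (size + 2), 1),
          offset{-1, 1, -(size + 2), size + 2} {}

    // Flat index of interior cell (row, col), both in [0, n).
    int index(int row, int col) const { return (row + 1) * stride + (col + 1); }

    // Copy the opposite edges into the ring (periodic boundaries).
    // Must be called again whenever an edge cell changes.
    void refresh_halo() {
        for (int i = 1; i <= n; ++i) {
            cells[i * stride] = cells[i * stride + n];          // left ring <- right edge
            cells[i * stride + n + 1] = cells[i * stride + 1];  // right ring <- left edge
            cells[i] = cells[n * stride + i];                   // top ring <- bottom edge
            cells[(n + 1) * stride + i] = cells[stride + i];    // bottom ring <- top edge
        }
    }

    // Sum of the four neighbours, with no branches and no modulus.
    int neighbour_sum(int row, int col) const {
        const int loc = index(row, col);
        int acc = 0;
        for (int k = 0; k < 4; ++k) acc += cells[loc + offset[k]];
        return acc;
    }
};
```

The trade-off is that the ring has to be refreshed after every update that touches an edge cell, so this pays off mainly when interior reads dominate.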
Once you've done all this, appropriate choice of optimization flags may lead to automatic vectorization.
As it is, the branching and mods in your code are likely preventing auto-vectorization. You can check this by enabling appropriate flags. For the Intel C++ Compiler (icc), you'll want:
-qopt-report=5 -qopt-report-phase:vec
For GCC you'll want (if I recall correctly):
-fopt-info-vec -fopt-info-missed

Related

Can OpenMP's SIMD directive vectorize indexing operations?

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the same position column in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
double& bfVal) {
Eigen::VectorXd bfPTVec(idxt.size());
#pragma omp simd
for (int j = 0; j < idxt.size(); j++) {
int vIDX = static_cast<int>(idxt(j));
double interp1 = vIDX + 1 - idxt(j);
double interp2 = idxt(j) - vIDX;
bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
}
bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.

Theoretically, it should be possible, assuming the processor supports this operation. In practice, however, it usually does not work out, for several reasons.
First of all, mainstream x86-64 processors supporting the AVX2 (or AVX-512) instruction set do have instructions for this: the SIMD gather instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, gathers are not implemented very efficiently on mainstream processors yet. They fetch every item separately, which is not faster than scalar code, but they can still be useful if the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although the manual approach was surprisingly faster on old processors due to a quite inefficient early implementation of the gather instructions). Note that if the SIG matrix is big, cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default by mainstream compilers because not all x86-64 processors support it. Thus, you need to enable AVX2 (e.g. using -mavx2) so that compilers can vectorize the loop efficiently. Unfortunately, this is not enough: most compilers currently fail to automatically detect when gather instructions can/should be used. Even if they could, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler that operations can be assumed to be associative and that you use only finite/basic real numbers (e.g. using -ffast-math, which can be unsafe). The same thing applies to Eigen types/operators if compilers fail to completely inline all the functions (which is the case for ICC).
To speed up the code, you can try to change the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to process the loop in small fixed-size chunks (e.g. 32 items) and split the body into several loops, computing the indirection in a separate loop so that compilers can vectorize at least some of the loops. Some compilers, like Clang, are able to do that automatically for you: they generate a fast SIMD implementation for part of the loop and do the indirections using scalar instructions. If this is not enough (which appears to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use a SIMD library that does that for you).
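The chunked, split-loop idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the question: SIG is replaced by a hypothetical flat column-major std::vector (column j starts at sig[j * rows]), interpolate_sum and kChunk are made-up names, and every index is assumed to satisfy 0 <= idxt[j] < rows - 1. Whether a given compiler actually vectorizes the first and third inner loops has to be checked with its vectorization report.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat, column-major signal: column j starts at sig[j * rows].
double interpolate_sum(const std::vector<double>& idxt,
                       const std::vector<std::int16_t>& sig,
                       std::size_t rows) {
    constexpr std::size_t kChunk = 32;  // small fixed-size chunk
    std::int32_t base[kChunk];
    double frac[kChunk], lo[kChunk], hi[kChunk];
    double total = 0.0;

    for (std::size_t start = 0; start < idxt.size(); start += kChunk) {
        const std::size_t len = std::min(kChunk, idxt.size() - start);

        // Pass 1: pure arithmetic, a candidate for auto-vectorization.
        for (std::size_t k = 0; k < len; ++k) {
            base[k] = static_cast<std::int32_t>(idxt[start + k]);
            frac[k] = idxt[start + k] - base[k];
        }
        // Pass 2: the data-dependent loads, kept in their own scalar loop.
        for (std::size_t k = 0; k < len; ++k) {
            const std::int16_t* col = sig.data() + (start + k) * rows;
            lo[k] = col[base[k]];
            hi[k] = col[base[k] + 1];
        }
        // Pass 3: blend and accumulate; entries with idxt <= 0 contribute 0.
        for (std::size_t k = 0; k < len; ++k) {
            const double v = lo[k] * (1.0 - frac[k]) + hi[k] * frac[k];
            total += (idxt[start + k] > 0.0) ? v : 0.0;
        }
    }
    return total;
}
```

Note that the summation in pass 3 is a floating-point reduction, so compilers will typically only vectorize it when reassociation is allowed (e.g. with -ffast-math).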

Probably not, but I would expect a manually vectorized version to be faster.
Below is an untested example of that inner loop. It doesn't use AVX, only SSE up to 4.1, and should be compatible with the Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column major. It’s normally the default memory layout in Eigen but possible to override with template arguments, and probably with macros as well.
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
// Load the index value into both lanes of the vector
__m128d idx = _mm_loaddup_pd( pIndex );
// Convert into int32 with truncation; this assumes the number there ain't negative.
const int iFloor = _mm_cvttsd_si32( idx );
// Compute fractional part
idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
// Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
// Load two int16_t values from sequential addresses
const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
// Upcast them to int32, then to fp64
const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
// Compute the result
__m128d res = _mm_mul_pd( idx, signs );
res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
// The above 2 lines (3 instructions) can be replaced with the following one:
// const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
// It may or may not be better, the dppd instruction is not particularly fast.
return _mm_cvtsd_f64( res );
}

data locality for implementing 2d array in c/c++

A long time ago, inspired by "Numerical Recipes in C", I started to use the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently noticed that many people implement matrices as follows
double *x = (double *)malloc(NumRows * NumCol * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, which stores an auxiliary array of double* pointers, really bad, or will the compiler eventually optimize it so that it is more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
P.S. Do not worry about the extra memory for storing the double* pointers; for large matrices it is just a small percentage.
P.P.S. since many people did not understand my question very well, I will try to re-shape it: do I understand right that the first method is kind of locality-hell, when each time x[m][n] is accessed first x array will be loaded into CPU cache and then x[m] array will be loaded thus making each access at the speed of talking to RAM. Or am I wrong and the first method is also OK from data-locality point of view?

For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or as a flattened 2D array (x[0][i * NumCol + j]) interchangeably. You also have fewer allocation/free calls (2 versus NumRows + 1).

No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
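A minimal sketch of such a wrapper is shown below. The class name Matrix and its members are hypothetical, and the sketch assumes row-major storage in a std::vector with (row, col) argument order and no bounds checking.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal wrapper around contiguous row-major storage.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element access; the index arithmetic lives in one place only.
    double& operator()(std::size_t row, std::size_t col) {
        return data_[row * cols_ + col];
    }
    double const& operator()(std::size_t row, std::size_t col) const {
        return data_[row * cols_ + col];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;  // one contiguous block, freed automatically
};
```

Because the accessors are defined inside the class, they are implicitly inline, so with optimization enabled the call typically compiles down to the same multiply-and-add as the hand-written index expression.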
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.

The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * NumCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values. Because you need two memory lookups as you described x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, and the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.

In the first method you use, the double* entries in the master array point to logical rows (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, and if element access was done the way you specified (i.e. x[NumCol*m + n]), you still get the same benefit. This is because you treat the array to be in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses given that the array size is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
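The two traversal orders discussed above can be written out as follows for the flat layout. This is only an illustrative sketch; the function names are made up, and the data is assumed to be stored row-major in a std::vector. Both functions compute the same sum, but only the first walks memory sequentially.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: the inner loop visits consecutive addresses.
double sum_row_major(const std::vector<double>& x,
                     std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t m = 0; m < rows; ++m)
        for (std::size_t n = 0; n < cols; ++n)
            total += x[cols * m + n];
    return total;
}

// Column-major traversal of the same data: the inner loop jumps by
// `cols` elements each step, which can cause cache misses on large arrays.
double sum_col_major(const std::vector<double>& x,
                     std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t n = 0; n < cols; ++n)
        for (std::size_t m = 0; m < rows; ++m)
            total += x[cols * m + n];
    return total;
}
```

Timing both on a matrix much larger than the last-level cache is a simple way to see how much the access order matters on a given machine.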

If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.

For loop or no loop? (dataset is small and not subject to change)

Let's say I have a situation where I have a matrix of a small, known size where the size is unlikely to change over the life of the software. If I need to examine each matrix element, would it be more efficient to use a loop or to manually index into each matrix location?
For example, let's say I have a system made up of 3 windows, 2 panes per window. I need to keep track of state for each window pane. In my system, there will only ever be 3 windows, 2 panes per window.
static const int NUMBER_OF_WINDOWS = 3;
static const int NUMBER_OF_PANES = 2;
static const int WINDOW_LEFT = 0;
static const int WINDOW_MIDDLE = 1;
static const int WINDOW_RIGHT = 2;
static const int PANE_TOP = 0;
static const int PANE_BOTTOM = 1;
paneState windowPanes[NUMBER_OF_WINDOWS][NUMBER_OF_PANES];
Which of these accessing methods would be more efficient?
loop version:
for (int ii=0; ii<NUMBER_OF_WINDOWS; ii++)
{
for (int jj=0; jj<NUMBER_OF_PANES; jj++)
{
doSomething(windowPanes[ii][jj]);
}
}
vs.
manual access version:
doSomething(windowPanes[WINDOW_LEFT][PANE_TOP]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_TOP]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_TOP]);
doSomething(windowPanes[WINDOW_LEFT][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_MIDDLE][PANE_BOTTOM]);
doSomething(windowPanes[WINDOW_RIGHT][PANE_BOTTOM]);
Will the loop code generate branch instructions, and will those be more costly than the instructions that would be generated on the manual access?

The classic Efficiency vs Organization. The for loops are much more human readable and the manual way is more machine readable.
I recommend you use the loops, because the compiler, if optimization is enabled, will typically generate the manual code for you when it sees that the upper bounds are small constants. That way you get the best of both worlds.

First of all: how complex is your function doSomething? If it is complex (most likely it is), then you will not notice any difference.
In general, calling your function sequentially will be slightly more efficient than the loop. But once again, the gain will be so tiny that it is not worth discussing.
Bear in mind that optimizing compilers do loop unrolling. This essentially generates code that runs your loop a smaller number of times while doing more work in each iteration (calling your function 2-4 times in sequence). When the number of iterations is small and fixed, the compiler may easily eliminate the loop completely.
Look at your code from the point of view of clarity and ease of modification. In many cases compiler will do a lot of useful tricks related to performance.

You may linearize your multi-dimensional array
paneState windowPanes[NUMBER_OF_WINDOWS * NUMBER_OF_PANES];
and then
for (auto& pane : windowPanes) {
doSomething(pane);
}
This avoids the extra loop if the compiler doesn't optimize it away.
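If you still need to address an individual pane after linearizing, a small index helper keeps the named constants usable. The sketch below is hypothetical: the paneState enum, paneIndex and countOpen are stand-ins invented for illustration, since the question does not show the real paneState type or doSomething.

```cpp
#include <cstddef>

enum class paneState { Closed, Open };  // hypothetical stand-in type

constexpr std::size_t NUMBER_OF_WINDOWS = 3;
constexpr std::size_t NUMBER_OF_PANES = 2;

// Flat index of (window, pane), matching the layout of the 2D array.
constexpr std::size_t paneIndex(std::size_t window, std::size_t pane) {
    return window * NUMBER_OF_PANES + pane;
}

// Visits every pane with a single flat range-based loop.
inline int countOpen(
    const paneState (&panes)[NUMBER_OF_WINDOWS * NUMBER_OF_PANES]) {
    int open = 0;
    for (const auto& pane : panes) {
        if (pane == paneState::Open) ++open;
    }
    return open;
}
```

For example, panes[paneIndex(WINDOW_RIGHT, PANE_BOTTOM)] refers to the same element that windowPanes[WINDOW_RIGHT][PANE_BOTTOM] did in the 2D version.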

Why would this search method not be scalable?

I want to parallelize my search algorithm using OpenMP. vTree is a binary search tree, and I want to apply my search algorithm to each point in the point set. Below is a snippet of my code. The search procedures for two different points are completely independent, so they can run in parallel. They do need to read the same tree, but once constructed, the tree is not modified any more, so it is read-only.
However, the code below shows terrible scalability: on my 32-core platform, only a 2x speed-up is achieved. Is it because vTree is read by all threads? If so, how can I further optimize the code?
auto results = vector<vector<Point>>(particleNum);
auto t3 = high_resolution_clock::now();
double radius = 1.6;
#pragma omp parallel for
for (decltype(points.size()) i = 0; i < points.size(); i++)
{
vTree.search(points[i], radius, results[i]);
}
auto t4 = high_resolution_clock::now();
double searchTime = duration_cast<duration<double>>(t4 - t3).count();
the type signature for search is
void VPTree::search(const Point& p, double radius, vector<Point>& result) const
search result would be put into result.

My best guess would be that you are cache ping-pong'ing on the result vectors. I would assume that your "search" function uses the passed-in result vector as a place to put points and that you use it throughout the algorithm to insert neighbors as you encounter them in the search tree. Whenever you add a point to that result vector, the internal data of that vector object will be modified. And because all of your result vectors are packed together in contiguous memory, it is likely that different result vectors occupy the same cache lines. So, when the CPU maintains cache coherence, it will constantly lock the relevant cache lines.
The way to solve it is to use an internal, temporary vector that you only assign to the results vector once at the end (which can be done cheaply if you use move semantics). Something like this:
void VPTree::search(const Point& p, double radius, vector<Point>& result) const {
vector<Point> tmp_result;
// ... add results to "tmp_result"
result = std::move(tmp_result);
return;
}
Or, you could also just return the vector by value (which is implicitly using a move):
vector<Point> VPTree::search(const Point& p, double radius) const {
vector<Point> result;
// ... add results to "result"
return result;
}
Welcome to the joyful world of move-semantics and how awesome it can be at solving these types of concurrency / cache-coherence issues.
It is also conceivable that you're experiencing problems related to accessing the same tree from all threads, but since these are all read-only operations, I'm pretty sure that even on a conservative architecture like x86 (and other Intel / AMD CPUs) this should not pose a significant issue, though I might be wrong (maybe a kind of "over-subscription" problem is at play, but that's dubious). Other problems might include the fact that OpenMP does incur quite a bit of overhead (spawning threads, synchronizing, etc.), which has to be weighed against the computational cost of the actual operations you are doing within those parallel loops (and it's not always a favorable trade-off). Also, if your VPTree (which I imagine stands for "Vantage-point Tree") does not have good locality of reference (e.g., you implemented it as a linked tree), then the performance is going to be terrible whichever way you use it (as I explain here).
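Putting the return-by-value variant together with the parallel loop gives something like the sketch below. It is only an illustration: search_radius is a hypothetical brute-force 1D stand-in for VPTree::search, and search_all is a made-up driver. The point is that each thread fills a local vector and writes to the shared results array once per query.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tree search: a brute-force 1D radius query.
std::vector<double> search_radius(const std::vector<double>& data,
                                  double p, double radius) {
    std::vector<double> found;  // local to the calling thread
    for (double q : data) {
        if (std::fabs(q - p) <= radius) found.push_back(q);
    }
    return found;  // moved (or elided), not copied
}

std::vector<std::vector<double>> search_all(const std::vector<double>& data,
                                            const std::vector<double>& queries,
                                            double radius) {
    std::vector<std::vector<double>> results(queries.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(queries.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        // One move-assignment into the shared array per query.
        results[k] = search_radius(data, queries[k], radius);
    }
    return results;
}
```

Without OpenMP enabled (e.g. without -fopenmp on GCC/Clang) the pragma is ignored and the loop simply runs serially, which makes it easy to compare timings.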

Simultaneously multiply all struct-elements with a scalar

I have a struct that represents a vector. This vector consists of two one-byte integers. I use them to keep values from 0 to 255.
typedef unsigned char uint8_T;
struct Vector
{
uint8_T x;
uint8_T y;
};
Now, the main use case in my program is to multiply both elements of the vector with a 32bit float value:
typedef float real32_T;
Vector Vector::operator * ( const real32_T f ) const {
return Vector( (uint8_T)(x * f), (uint8_T)(y * f) );
};
This needs to be performed very often. Is there a way that these two multiplications can be performed simultaneously? Maybe by vectorization, SSE or similar? Or is the Visual Studio compiler already doing this simultaneously?
Another use case is to interpolate between two Vectors.
Vector Vector::interpolate(const Vector& rhs, real32_T z) const
{
return Vector(
(uint8_T)(x + z * (rhs.x - x)),
(uint8_T)(y + z * (rhs.y - y))
);
}
This already uses an optimized interpolation approach (https://stackoverflow.com/a/4353537/871495).
But again the values of the vectors are multiplied by the same scalar value.
Is there a possibility to improve the performance of these operations?
Thanks
(I am using Visual Studio 2010 with a 64-bit compiler)

In my experience, Visual Studio (especially an older version like VS2010) does not do a lot of vectorization on its own. They have improved it in the newer versions, so if you can, you might see if a change of compiler speeds up your code.
Depending on the code that uses these functions and the optimization the compiler does, it may not even be the calculations that slow down your program. Function calls and cache misses may hurt a lot more.
You could try the following:
If not already done, define the functions in the header file, so the compiler can inline them
If you use these functions in a tight loop, try doing the calculations 'by hand' without any function calls (temporarily expose the variables) and see if it makes a speed difference
If you have a lot of vectors, look at how they are laid out in memory. Store them contiguously to minimize cache misses.
For SSE to work really well, you'd have to work with 4 values at once - so multiply 2 vectors with 2 floats. In a loop, use a step of 2 and write a static function that calculates 2 vectors at once using SSE instructions. Because your vectors are not aligned (and hardly ever will be with 8 bit variables), the code could even run slower than what you have now, but it's worth a try.
If applicable and if you don't depend on the clamping that occurs with your cast from float to uint8_t (e.g. if your floats are in range [0,1]), try using float everywhere. This may allow the compiler to do far better optimization.

You haven't shown the full algorithm, but conversions between integer and floating-point numbers are slow operations. Eliminating them and using only one type (preferably integers, if possible) can greatly improve performance.
Alternatively, you can use lrint() to do the conversion as explained here.
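One way to stay in integers is fixed-point arithmetic: convert the scalar to an integer factor once, then scale every vector with multiplies and shifts. The sketch below is an assumption-laden illustration, not a drop-in replacement: Vec2u8, to_fixed and scale are hypothetical names, f is assumed to lie in [0, 1], and the results can differ by one from the float version because the factor itself is truncated.

```cpp
#include <cstdint>

struct Vec2u8 {  // hypothetical stand-in for the question's Vector
    std::uint8_t x;
    std::uint8_t y;
};

// Converts f (assumed to lie in [0, 1]) to 16.16 fixed point, once.
inline std::uint32_t to_fixed(float f) {
    return static_cast<std::uint32_t>(f * 65536.0f);
}

// Scales both components using integer multiplies and shifts only.
inline Vec2u8 scale(Vec2u8 v, std::uint32_t factor) {
    return Vec2u8{
        static_cast<std::uint8_t>((v.x * factor) >> 16),
        static_cast<std::uint8_t>((v.y * factor) >> 16)};
}
```

The float-to-integer conversion then happens once per scalar instead of once per component, which only helps when the same factor is applied to many vectors.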
