I have two arrays and I want to copy one array into the other with some stride. For example, I have
A A A A A A A A ...
B B B B B B B B ...
and I want to copy the elements of B into every third position of A, to obtain
B A A B A A B A ...
From the post "Is there a standard, strided version of memcpy?", it seems that there is no such possibility in C.
However, I have experienced that, in some cases, memcpy is faster than a for loop based copy.
My question is: is there any way to perform a strided memory copy in C++ that is at least as efficient as a standard for loop?
Thank you very much.
EDIT - CLARIFICATION OF THE PROBLEM
To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs only the following for loop
for (int i = 0; i < NumElements; i++)
    a_[i] = b_[i];
where both []'s are overloaded operators (I'm using an expression templates technique), so that they can actually mean, for example,
a[3*i]=b[i];
This might be too specific an answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make a strided copy even faster. This can be life-saving in an environment where resources are relatively limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android, where most devices still use the ARMv7-A architecture, which supports NEON.
The following examples demonstrate this. The loop copies the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image; the source and destination buffers are both 640*480/2 bytes. All of the examples were compiled with g++ 4.8 from Android NDK r9d and executed on a Samsung Exynos Octa 5420 processor:
Level 1: Regular
void convertUVsp2UVp(
    unsigned char* __restrict srcptr,
    unsigned char* __restrict dstptr,
    int stride)
{
    for (int i = 0; i < stride; i++) {
        dstptr[i] = srcptr[i*2];
        dstptr[i + stride] = srcptr[i*2 + 1];
    }
}
Compiled with -O3 only, takes about 1.5 ms on average.
Level 2: Unrolled and squeezed a bit more with moving pointers
void convertUVsp2UVp(
    unsigned char* __restrict srcptr,
    unsigned char* __restrict dstptr,
    int stride)
{
    unsigned char* endptr = dstptr + stride;
    while (dstptr < endptr) {
        *(dstptr + 0) = *(srcptr + 0);
        *(dstptr + stride + 0) = *(srcptr + 1);
        *(dstptr + 1) = *(srcptr + 2);
        *(dstptr + stride + 1) = *(srcptr + 3);
        *(dstptr + 2) = *(srcptr + 4);
        *(dstptr + stride + 2) = *(srcptr + 5);
        *(dstptr + 3) = *(srcptr + 6);
        *(dstptr + stride + 3) = *(srcptr + 7);
        *(dstptr + 4) = *(srcptr + 8);
        *(dstptr + stride + 4) = *(srcptr + 9);
        *(dstptr + 5) = *(srcptr + 10);
        *(dstptr + stride + 5) = *(srcptr + 11);
        *(dstptr + 6) = *(srcptr + 12);
        *(dstptr + stride + 6) = *(srcptr + 13);
        *(dstptr + 7) = *(srcptr + 14);
        *(dstptr + stride + 7) = *(srcptr + 15);
        srcptr += 16;
        dstptr += 8;
    }
}
Compiled with -O3 only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.
Level 3: Regular + GCC automatic NEON vectorization
void convertUVsp2UVp(
    unsigned char* __restrict srcptr,
    unsigned char* __restrict dstptr,
    int stride)
{
    for (int i = 0; i < stride; i++) {
        dstptr[i] = srcptr[i*2];
        dstptr[i + stride] = srcptr[i*2 + 1];
    }
}
Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp, takes about 0.6 ms on average. For reference, a memcpy of 640*480 bytes, or double the amount of what's tested here, takes about 0.6 ms on average.
As a side note, the second code (unrolled and pointered) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.
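For reference, a hand-written NEON intrinsics version of the same loop could look roughly like the sketch below. This is an untested illustration (not part of the measurements above); it de-interleaves 32 source bytes per iteration and assumes stride is a multiple of 16, which holds for these buffer sizes.

#include <arm_neon.h>

void convertUVsp2UVp_neon(
    unsigned char* __restrict srcptr,
    unsigned char* __restrict dstptr,
    int stride)
{
    for (int i = 0; i < stride; i += 16) {
        // de-interleave 32 bytes of UVUVUV... into two 16-byte vectors
        uint8x16x2_t uv = vld2q_u8(srcptr + 2 * i);
        vst1q_u8(dstptr + i,          uv.val[0]);   // U plane
        vst1q_u8(dstptr + i + stride, uv.val[1]);   // V plane
    }
}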
Is there any way to efficiently perform a strided memory copy in C++ that performs at least as well as a standard for loop?
Edit 2: There is no function for strided copying in the C++ libraries.
Since strided copying is not as common as plain memory copying, neither chip manufacturers nor language designers provide specialized support for it.
Assuming a standard for loop, you may be able to gain some performance by using loop unrolling. Some compilers have options to unroll loops; it's not a "standard" option.
Given a standard for loop:
#define RESULT_SIZE 72
#define SIZE_A 48
#define SIZE_B 24
unsigned int A[SIZE_A];
unsigned int B[SIZE_B];
unsigned int result[RESULT_SIZE];
unsigned int index_a = 0;
unsigned int index_b = 0;
unsigned int index_result = 0;
for (index_result = 0; index_result < RESULT_SIZE;)
{
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
}
Loop unrolling would repeat the contents of the "standard" for loop:
for (index_result = 0; index_result < RESULT_SIZE;)
{
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
    result[index_result++] = B[index_b++];
    result[index_result++] = A[index_a++];
    result[index_result++] = A[index_a++];
}
In the unrolled version, the number of loops has been cut in half.
The performance improvement may be negligible compared to other options.
The following issues affect performance and each may have different speed improvements:
Processing data cache misses
Reloading of instruction pipeline (depends on processor)
Operating System swapping memory with disk
Other tasks running concurrently
Parallel processing (depends on processor / platform)
One example of parallel processing is to have one processor copy the B items to the new array and another processor copy the A items to the new array.
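A minimal sketch of that two-thread split, using std::thread and the arrays from the example above (hypothetical helper functions; each thread writes disjoint elements of result, so no locking is needed):

#include <thread>

void copyB(unsigned int* result, const unsigned int* B, unsigned int count)
{
    for (unsigned int k = 0; k < count; ++k)
        result[3 * k] = B[k];               // B goes into every third slot
}

void copyA(unsigned int* result, const unsigned int* A, unsigned int count)
{
    for (unsigned int k = 0; k < count; ++k) {
        result[3 * k + 1] = A[2 * k];       // A fills the remaining two slots
        result[3 * k + 2] = A[2 * k + 1];
    }
}

// usage:
// std::thread t1(copyB, result, B, SIZE_B);
// std::thread t2(copyA, result, A, SIZE_A / 2);
// t1.join();
// t2.join();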
Related
I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.
I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.
Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;
int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);
double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {
    Timer::time_point start = Timer::now();
    arma::fmat result(rows, cols, arma::fill::zeros);
    for (int i = 0; i < slices; i++)
        result += tensor.slice(i) * w(i);
    Timer::time_point end = Timer::now();
    Duration span = end - start;
    double t = span.count();
    overallTime += t;
    cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}
cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;
I need to optimize this code by at least 2x and I would very much appreciate any suggestions.
First of all, I have to admit I'm not familiar with the arma framework or the memory layout; least of all whether the syntax result += slice(i) * weight evaluates lazily.
Two primary problems (and their solutions) anyway lie in the memory layout and the memory-to-arithmetic ratio.
The expression a += b*c is problematic because it needs to read b and a, write a, and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).
If the memory layout is of form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels and should be expressed as such.
If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i)*w(i) + slice(i+1)*w(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.
It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.
There's a possibility of a minor improvement by padding the slice count to multiples of 4 or 8, by compiling with -ffast-math to enable fused multiply accumulate (if available) and with multithreading.
The constraints indicate the need to perform 13.5 GFlop/s, which is a reasonable number in terms of arithmetic (for many modern architectures), but it also implies at least 54 GB/s of memory bandwidth, which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
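As a back-of-the-envelope check of those figures (my own arithmetic, counting a fused multiply-accumulate as two flops and assuming the bandwidth number counts both the tensor read and the accumulator re-read of the naive a += b*c form):

35000 * 45 * 150 elements * 2 flops / 0.035 s ≈ 13.5 GFlop/s
6.75e9 FMAs/s * 8 bytes (4 B tensor read + 4 B accumulator read) = 54 GB/s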
EDIT
Knowing the memory order to be float tensor[150][45][35000], or float tensor[kSlices][kRows * kCols == kCols * kRows], suggests first trying to unroll the outer loop by 4 (or maybe even 5, since 150 is not divisible by 4, which would require a special case for the excess) streams.
void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that the cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;

    // try also s+=6, s+=3, or s+=4, which would require a dedicated inner loop (for s+=2)
    for (int s = 0; s < 150; s += 5) {
        auto src0 = tensor + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the inner loop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s+1] +
                       src2[x] * w[s+2] +
                       src3[x] * w[s+3] +
                       src4[x] * w[s+4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}
EDIT 2
Another starting point (before adding multithreading) would be to consider changing the layout to
float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional
The downside now is that with kSlices = 150 the weights can no longer all fit in registers (and, secondly, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.
The upside is that the reduction no longer needs to go through memory, which is a big thing with the added multithreading.
void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
    // each thread will read from 4 positions in order
    // to share the weights -- finding the best distance
    // might need some iterations
    auto src0 = tensor;
    auto src1 = src0 + c;
    auto src2 = src1 + c;
    auto src3 = src2 + c;
    for (int i = 0; i < n / 4; i++) {
        vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
        // #pragma unroll?
        for (int j = 0; j < c; j += 8) {        // 8 floats (one vec8) per step
            vec8 wj(w + j);                     // load 8 weights once, reuse 4 times
            acc0 += wj * vec8(src0 + j);
            acc1 += wj * vec8(src1 + j);
            acc2 += wj * vec8(src2 + j);
            acc3 += wj * vec8(src3 + j);
        }
        vec4 sum = horizontal_reduct(acc0, acc1, acc2, acc3);
        sum.store(dst); dst += 4;
        // advance to the next group of 4 positions
        src0 += 4 * c; src1 += 4 * c; src2 += 4 * c; src3 += 4 * c;
    }
}
These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile using vec4 = float __attribute__((vector_size(16))); into efficient SIMD code.
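For illustration, a minimal vec8 along those lines could be built directly on that vector_size extension. The sketch below is hypothetical and untested (and leaves out vec4 and the horizontal reduction), not the class used for the reasoning above:

struct vec8 {
    typedef float v8sf __attribute__((vector_size(32)));                 // 8 packed floats
    v8sf v;

    vec8() = default;
    explicit vec8(float x)        { v = v8sf{x, x, x, x, x, x, x, x}; }  // broadcast
    explicit vec8(const float *p) { __builtin_memcpy(&v, p, sizeof v); } // unaligned load
    vec8& operator+=(vec8 o)      { v += o.v; return *this; }
    friend vec8 operator*(vec8 a, vec8 b) { vec8 r; r.v = a.v * b.v; return r; }
    void store(float *p) const    { __builtin_memcpy(p, &v, sizeof v); } // unaligned store
};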
As #hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved by almost 65%. The code now runs at 45 ms as opposed to the initial 70 ms.
int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    result += tensor.slice(i) * w_id(i) + tensor.slice(i + 1) * w_id(i + 1) + tensor.slice(i + 2) * w_id(i + 2) + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    result += tensor.slice(i) * w_id(i);
    i++;
}
Without having the actual code, I'm guessing that
+= tensor.slice(i) * w_id(i)
creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function
addto( lhs, slice1, w1, slice2, w2, ....unroll to 4... )
which translates to pure loops over the elements:
for (i = ...)
    for (j = ...)
        lhs[i][j] += slice1[i][j]*w1 + slice2[i][j]*w2 + ...;
It would surprise me if that doesn't buy you an extra factor.
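For concreteness, a minimal sketch of such an addto() over raw float pointers might look as follows (hypothetical: it assumes each slice and the result are plain contiguous arrays of rows*cols floats, not Armadillo objects):

#include <cstddef>

void addto(float *lhs, std::size_t n,               // n = rows * cols elements per slice
           const float *slice1, float w1,
           const float *slice2, float w2,
           const float *slice3, float w3,
           const float *slice4, float w4)
{
    // one pass over lhs per group of four slices
    for (std::size_t i = 0; i < n; ++i)
        lhs[i] += slice1[i] * w1 + slice2[i] * w2 + slice3[i] * w3 + slice4[i] * w4;
}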
I have this code snippet I came across and I'm trying to use OpenMP to make it run faster than the original version. However, it takes about the same amount of time as the old version, and I'm not sure why this multithreading approach is not optimizing it. What can I do to make it run faster?
void sobel(unsigned char *data_out,
           unsigned char *data_in, unsigned height,
           unsigned width)
{
    /* Sobel matrices for convolution */
    int sobelv[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
    int sobelh[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    unsigned int size, i, j;
    int lay;
    size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(64) shared(data_in, data_out, sobelv, sobelh, size) private(i, j, lay)
#endif
    for (lay = 0; lay < 3; lay++) {
        for (i = 1; i < height - 1; ++i) {
            for (j = 1; j < width - 1; j++) {
                int sumh, sumv;
                int k = -1, l = -1;
                sumh = 0;
                sumv = 0;
                /* Convolution part */
                for (k = -1; k < 2; k++)
                    for (l = -1; l < 2; l++) {
                        sumh = sumh + sobelh[k + 1][l + 1] * (int) data_in[lay * size + (i + k) * width + (j + l)];
                        sumv = sumv + sobelv[k + 1][l + 1] * (int) data_in[lay * size + (i + k) * width + (j + l)];
                    }
                int temp = abs(sumh / 8) + abs(sumv / 8);
                data_out[lay * size + i * width + j] = (temp > 255 ? 255 : temp);
            }
        }
    }
}
the main function is simply calling this function like this:
sobel(data_out, data_in, header.height, header.width);
any help would be appreciated!! :)
The best optimization you can apply is to vectorize the code. Compilers can often auto-vectorize code when it is sufficiently simple, but this one is too complex for most compilers (including GCC and Clang) to vectorize.
Manual code vectorization is cumbersome, error-prone, and often makes the code (more) dependent on a specific architecture (e.g. x86-64). However, you can help the compiler generate it for you. To do that, it is better to:
avoid mixing signed/unsigned types and type of different size;
use the smallest possible types fitting your needs;
avoid loops and conditions in the vectorized loop;
access data contiguously;
avoid integer multiplication/division with small types (on x86-64 and/or with some compilers);
prefer using local short-scoped variables when this is possible;
enable advanced optimizations like -O3 for GCC/Clang, possibly coupled with -mavx2 if your target platform supports the AVX-2 instruction set, or with -march=native if your target platform is the one where the program is built;
be careful about aliasing (possibly using temporary arrays, strict aliasing rules, memcpy calls, restrict compiler extensions, etc.) [thanks to #Laci].
You can check the generated assembly code to see if the code is vectorized or not.
Moreover, using collapse(2) should be enough here to get a good speed-up. collapse(3) can introduce some unwanted overheads due to the last loop being shared amongst threads. collapse(64) is not correct (the argument cannot be bigger than the number of nested loops).
Here is the resulting untested code:
#include <cmath>

void sobel(unsigned char *data_out,
           unsigned char *data_in, int height,
           int width)
{
    const int size = height * width;
#ifdef OPENMP
#pragma omp parallel for collapse(2) shared(data_in, data_out, size)
#endif
    for (int lay = 0; lay < 3; lay++)
    {
        for (int i = 1; i < height - 1; ++i)
        {
            for (int j = 1; j < width - 1; j++)
            {
                short a11 = data_in[lay * size + (i-1) * width + (j-1)];
                short a12 = data_in[lay * size + (i-1) * width + j];
                short a13 = data_in[lay * size + (i-1) * width + (j+1)];
                short a21 = data_in[lay * size + i * width + (j-1)];
                short a23 = data_in[lay * size + i * width + (j+1)];
                short a31 = data_in[lay * size + (i+1) * width + (j-1)];
                short a32 = data_in[lay * size + (i+1) * width + j];
                short a33 = data_in[lay * size + (i+1) * width + (j+1)];
                short sumh = a13 - a11 + (a23 - a21) + (a23 - a21) + a33 - a31;
                short sumv = a31 + a32 + a32 + a33 - (a11 + a12 + a12 + a13);
                short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
                data_out[lay * size + i * width + j] = (temp > 255 ? 255 : temp);
            }
        }
    }
}
I expect the code to be several times faster (especially true in sequential), typically about 10 times faster with AVX-2, since the processor can work on 16 values at once (despite a bit more work related to SIMD instructions).
Another possible optimization you can do is called register blocking. The idea is to change the loop so that you work on small fixed-size tiles (e.g. 2x2 or 4x2 SIMD values). This should reduce the number of L1-cache loads and the number of char-to-short/short-to-char conversions to perform. However, it is hard to help the compiler do this optimization correctly on such code; it is probably better to use SIMD intrinsics and do the register blocking yourself if performance is critical.
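To illustrate the tiling idea in scalar form, a 1x2 register-blocked version of the inner j loop from the code above might look like this (an untested, hypothetical sketch; a scalar tail is still needed when the number of output columns is odd):

for (int j = 1; j + 2 < width; j += 2) {              // two output pixels per iteration
    short p[3][4];                                     // 3x4 input block around columns j, j+1
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 4; ++c)
            p[r][c] = data_in[lay * size + (i + r - 1) * width + (j + c - 1)];
    for (int t = 0; t < 2; ++t) {                      // t = 0 -> pixel j, t = 1 -> pixel j+1
        short sumh = p[0][t+2] - p[0][t] + 2 * (p[1][t+2] - p[1][t]) + p[2][t+2] - p[2][t];
        short sumv = p[2][t] + 2 * p[2][t+1] + p[2][t+2] - (p[0][t] + 2 * p[0][t+1] + p[0][t+2]);
        short temp = (abs(sumh) >> 3) + (abs(sumv) >> 3);
        data_out[lay * size + i * width + j + t] = (temp > 255 ? 255 : temp);
    }
}

The shared input columns are converted from char to short only once and reused for both output pixels.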
I have been following this Coursera course, and at some point the code below is given. The instructor claims that vectorization is done by including #pragma omp simd between the inner and outer for loops, since guided vectorization is hard. How can I vectorize the code used in the course on my own, and is there a way to achieve better performance than if I simply add #pragma omp simd and move on?
template<typename P>
void ApplyStencil(ImageClass<P> & img_in, ImageClass<P> & img_out) {
    const int width  = img_in.width;
    const int height = img_in.height;
    P * in  = img_in.pixel;
    P * out = img_out.pixel;
    for (int i = 1; i < height-1; i++)
        for (int j = 1; j < width-1; j++) {
            P val = -in[(i-1)*width + j-1] -   in[(i-1)*width + j] - in[(i-1)*width + j+1]
                    -in[(i  )*width + j-1] + 8*in[(i  )*width + j] - in[(i  )*width + j+1]
                    -in[(i+1)*width + j-1] -   in[(i+1)*width + j] - in[(i+1)*width + j+1];
            val = (val < 0   ? 0   : val);
            val = (val > 255 ? 255 : val);
            out[i*width + j] = val;
        }
}

template void ApplyStencil<float>(ImageClass<float> & img_in, ImageClass<float> & img_out);
I am compiling using gcc with the -march=native -fopenmp flags for AVX512 support on a skylake processor.
❯ gcc -march=native -Q --help=target|grep march
-march= skylake
❯ gcc -march=knl -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512CD__ 1
#define __AVX512ER__ 1
#define __AVX512F__ 1
#define __AVX512PF__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1
Here is some untested proof-of-concept implementation which uses 4 adds, 1 fmsub and 3 loads per packet (instead of 9 loads, 7 adds, 1 fmsub for a straight-forward implementation). I left out the clamping (which for float images looks unusual at least, and for uint8 it does nothing, unless you change P val = ... to auto val = ..., as Peter noticed in the comments) -- but you can easily add that yourself.
The idea of this implementation is to sum up the pixels left and right (x0_2) as well as all 3 (x012), add these from 3 consecutive rows (a012 + b0_2 + c012), and then subtract that from the middle pixel multiplied by 8.
At the end of each loop iteration, drop the contents of a012 and move the bX values to aX and the cX values to bX for the next iteration.
The applyStencil function simply applies the first function to each column of 16 pixels (starting at col = 1 and, at the end, just performing a possibly overlapping computation for the last 16 columns). If your input image has fewer than 18 columns you need to handle that differently (possibly with masked loads/stores).
#include <immintrin.h>

void applyStencilColumn(float const *in, float *out, size_t width, size_t height)
{
    if (height < 3) return; // sanity check
    float const* last_in = in + height*width;
    __m512 a012, b012, b0_2, b1;
    __m512 const eight = _mm512_set1_ps(8.0);

    {
        // initialize first rows:
        __m512 a0 = _mm512_loadu_ps(in-1);
        __m512 a1 = _mm512_loadu_ps(in+0);
        __m512 a2 = _mm512_loadu_ps(in+1);
        a012 = _mm512_add_ps(_mm512_add_ps(a0,a2),a1);
        in += width;
        __m512 b0 = _mm512_loadu_ps(in-1);
        b1 = _mm512_loadu_ps(in+0);
        __m512 b2 = _mm512_loadu_ps(in+1);
        b0_2 = _mm512_add_ps(b0,b2);
        b012 = _mm512_add_ps(b0_2,b1);
        in += width;
    }
    // skip first row for output:
    out += width;

    for (; in < last_in; in += width, out += width)
    {
        // precalculate sums for next row:
        __m512 c0 = _mm512_loadu_ps(in-1);
        __m512 c1 = _mm512_loadu_ps(in+0);
        __m512 c2 = _mm512_loadu_ps(in+1);
        __m512 c0_2 = _mm512_add_ps(c0,c2);
        __m512 c012 = _mm512_add_ps(c0_2, c1);

        __m512 outer = _mm512_add_ps(_mm512_add_ps(a012,b0_2), c012);
        __m512 result = _mm512_fmsub_ps(eight, b1, outer);
        _mm512_storeu_ps(out, result);

        // shift/rename registers (with some unrolling this can be avoided entirely)
        a012 = b012;
        b0_2 = c0_2; b012 = c012; b1 = c1;
    }
}

void applyStencil(float const *in, float *out, size_t width, size_t height)
{
    if (width < 18) return; // assert("special case of narrow image not implemented");
    for (size_t col = 1; col < width - 18; col += 16)
    {
        applyStencilColumn(in + col, out + col, width, height);
    }
    applyStencilColumn(in + width - 18, out + width - 18, width, height);
}
Possible improvements (left as an exercise):
The applyStencilColumn could act on columns of 32, 48, 64, ... pixels for better cache locality (as long as you have sufficient registers). This makes implementing both functions slightly more complicated, of course.
If you unroll 3 (or 6, 9, ...) iterations of the for(; in<last_in; in+=width) loop, there would be no need to actually move registers (plus the general benefit of unrolling).
If your width is a multiple of 16, you could ensure that at least the stores are mostly aligned (except for the first and last columns).
You could iterate over just a small number of rows at the same time (by adding another outer loop to the main function and calling applyStencilColumn with a fixed height). Make sure to have the right amount of overlap between row-sets; the ideal number of rows likely depends on the size of your image.
You could also always add 3 consecutive pixels but multiply the center pixel by 9 instead (9*b1-outer). Then (with some book-keeping effort) you could add row0+(row1+row2) and (row1+row2)+row3 to get the row1 and row2 intermediate results (having 3 instead of 4 additions). Doing the same horizontally looks more complicated, though.
Of course, you should always test and benchmark any custom SIMD implementation vs what your compiler generates from the generic implementation.
I'm doing my homework, implementing Conway's Game of Life using intrinsic functions, and I found the working code below, but I cannot understand the main part of it.
This implementation first calculates the number of alive neighbors for each cell and stores the result in an array counts, while the array of cells (the world) is states. I cannot really see how the newstate is generated here. I understand how left shift works and how bitwise OR works, but I cannot understand why they're used like this, why shufmask looks like this, and how the shuffle works. I also cannot understand why _mm256_slli_epi16 is used if the type of the array elements is uint8_t. So my question is all about this line:
__m256i newstate = _mm256_shuffle_epi8(shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));
Could you please explain to me, in as much detail as possible, how it works?
void gameoflife8vec(uint8_t *counts, uint8_t *states, size_t width, size_t height) {
    assert(width % (sizeof(__m256i)) == 0);
    size_t awidth = width + 2;
    computecounts8vec(counts, states, width, height);
    __m256i shufmask =
        _mm256_set_epi8(
            0, 0, 0, 0, 0, 1, 1, 0,
            0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 1, 1, 0,
            0, 0, 0, 0, 0, 1, 0, 0
        );
    for (size_t i = 0; i < height; i++) {
        for (size_t j = 0; j < width; j += sizeof(__m256i)) {
            __m256i c = _mm256_lddqu_si256(
                (const __m256i *)(counts + (i + 1) * awidth + j + 1));
            c = _mm256_subs_epu8(
                c, _mm256_set1_epi8(1)); // max was 8 = 0b1000, make it 7, 1 becomes 0, 0 remains 0
            __m256i oldstate = _mm256_lddqu_si256(
                (const __m256i *)(states + (i + 1) * awidth + j + 1));
            __m256i newstate = _mm256_shuffle_epi8(
                shufmask, _mm256_or_si256(c, _mm256_slli_epi16(oldstate, 3)));
            _mm256_storeu_si256((__m256i *)(states + (i + 1) * awidth + (j + 1)),
                                newstate);
        }
    }
}
The memory for the arrays is allocated in this way:
uint8_t *states = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));
uint8_t *counts = (uint8_t *)malloc((N + 2) * (N + 2) * sizeof(uint8_t));
Also the source code can be found here https://github.com/lemire/SIMDgameoflife
shuffle_epi8 is being used here as a parallel table-lookup, with a constant first operand and a variable 2nd operand.
Daniel's code does some calculations that produce a 4-bit integer for every byte in the vector, then uses _mm256_shuffle_epi8 to map those integers to 0 / 1 alive-or-dead new states.
Notice that the low and high lanes of shufmask are identical: it's the same lookup table for both lanes. (It's not a lane-crossing shuffle; it's 32 parallel lookups from two 16-byte tables, using the low 4 bits of each element, and the high bit zeroes the result if set.) See the intrinsic and asm instruction documentation.
shufmask is a poor choice of variable name. It's not the shuffle-control vector. alivetable might be a better choice.
Using [v]pshufb to implement a 16-entry LUT is a (fairly) well-known technique. For example, it's one way to implement a popcnt for large arrays that's faster than scalar, splitting bytes into low/high nibble and looking up the 4-bit popcnt results. See Counting 1 bits (population count) on large data using AVX-512 or AVX-2, specifically https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx2-lookup.cpp
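To make the lookup concrete, here is a scalar sketch of what each byte lane effectively computes (my own reading of the code, untested); the 16-entry table is simply the low lane of shufmask written in memory order. As for the _mm256_slli_epi16 part of the question: AVX2 has no byte-granularity shift, and shifting 16-bit lanes by 3 is safe here because each state byte is 0 or 1, so no set bits cross into the neighbouring byte.

#include <stdint.h>

// index = (oldstate << 3) | saturating(count - 1), always in the range 0..15
static const uint8_t alivetable[16] = {
    // oldstate == 0 (dead): becomes alive only when count - 1 == 2, i.e. exactly 3 neighbours
    0, 0, 1, 0, 0, 0, 0, 0,
    // oldstate == 1 (alive): survives when count - 1 is 1 or 2, i.e. 2 or 3 neighbours
    0, 1, 1, 0, 0, 0, 0, 0
};

uint8_t next_state(uint8_t oldstate, uint8_t count)    // count = live neighbours, 0..8
{
    uint8_t c = count > 0 ? count - 1 : 0;             // the _mm256_subs_epu8(c, 1) step
    return alivetable[(oldstate << 3) | c];            // the _mm256_shuffle_epi8 lookup
}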
I have the problem that my code returns different results when comparing debug to release. I checked that both modes use /fp:precise, so that should not be the problem. The main issue I have with this is that the complete image analysis (it's an image understanding project) is completely deterministic; there's absolutely nothing random in it.
Another issue with this is the fact that my release build actually always returns the same result (23.014 for the image), while debug returns some random value between 22 and 23, which just should not happen. I've already checked whether it may be thread related, but the only part of the algorithm that is multi-threaded returns precisely the same result for both debug and release.
What else may be happening here?
Update1: The code I now found responsible for this behaviour:
float PatternMatcher::GetSADFloatRel(float* sample, float* compared, int sampleX, int compX, int offX)
{
    if (sampleX != compX)
    {
        return 50000.0f;
    }
    float result = 0;
    float* pTemp1 = sample;
    float* pTemp2 = compared + offX;

    float w1 = 0.0f;
    float w2 = 0.0f;
    float w3 = 0.0f;

    for (int j = 0; j < sampleX; j++)
    {
        w1 += pTemp1[j] * pTemp1[j];
        w2 += pTemp1[j] * pTemp2[j];
        w3 += pTemp2[j] * pTemp2[j];
    }

    float a = w2 / w3;
    result = w3 * a * a - 2 * w2 * a + w1;
    return result / sampleX;
}
Update2:
This is not reproducible with 32-bit code. While debug and release always produce the same value in 32-bit builds, it is still different from the 64-bit release version, and the 64-bit debug build still returns completely random values.
Update3:
Okay, I found it is definitely caused by OpenMP. When I disable it, it works fine (both Debug and Release use the same code, and both have OpenMP activated).
Following is the code giving me trouble:
#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
for (int r = 0; r < 53; ++r)
{
    for (int k = 0; k < 3; ++k)
    {
        for (int c = 0; c < 30; ++c)
        {
            for (int o = -1; o <= 1; ++o)
            {
                /*
                    r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
                    c: 0-29, in steps of 1, representing the absorption value (collagene)
                    iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
                    o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
                */
                int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;
                if (offset < 0 || offset == fSamples.size())
                {
                    continue;
                }
                last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
                if (bestHit > last)
                {
                    bestHit = last;
                    rad = (r+8)*0.25f;
                    cVal = c * 2;
                    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
                    if (fabs(veneOffset) < 0.001)
                        veneOffset = 0.0f;
                }
                last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
                if (bestHit > last)
                {
                    bestHit = last;
                    rad = (r+8)*0.25f;
                    cVal = c * 2;
                    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
                    if (fabs(veneOffset) < 0.001)
                        veneOffset = 0.0f;
                }
            }
        }
    }
}
Note: in Release mode with OpenMP activated I get the same result as with OpenMP deactivated. In Debug mode, OpenMP activated gives a different result, while OpenMP deactivated gives the same result as Release.
At least two possibilities:
Turning on optimization may result in the compiler reordering operations. This can introduce small differences in floating-point calculations when compared to the order executed in debug mode, where operation reordering does not occur. This may account for numerical differences between debug and release, but does not account for numerical differences from one run to the next in debug mode.
You have a memory-related bug in your code, such as reading/writing past the bounds of an array, using an uninitialized variable, using an unallocated pointer, etc. Try running it through a memory checker, such as the excellent Valgrind, to identify such problems. Memory related errors may account for non-deterministic behavior.
If you are on Windows, then Valgrind isn't available (pity), but you can look here for a list of alternatives.
To elaborate on my comment, this is the code that is most probably the root of your problem:
#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
{
...
last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
if(bestHit > last)
{
last is only assigned to before it is read again so it is a good candidate for being a lastprivate variable, if you really need the value from the last iteration outside the parallel region. Otherwise just make it private.
Access to bestHit, cVal, rad, and veneOffset should be synchronised by a critical region:
#pragma omp critical
if (bestHit > last)
{
    bestHit = last;
    rad = (r+8)*0.25f;
    cVal = c * 2;
    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
    if (fabs(veneOffset) < 0.001)
        veneOffset = 0.0f;
}
Note that by default all variables, except the counters of parallel for loops and those defined inside the parallel region, are shared, i.e. the shared clause in your case does nothing unless you also apply the default(none) clause.
Another thing that you should be aware of is that in 32-bit mode Visual Studio uses x87 FPU math while in 64-bit mode it uses SSE math by default. x87 FPU does intermediate calculations using 80-bit floating point precision (even for calculations involving float only) while the SSE unit supports only the standard IEEE single and double precisions. Introducing OpenMP or any other parallelisation technique to a 32-bit x87 FPU code means that at certain points intermediate values should be converted back to the single precision of float and if done sufficiently many times a slight or significant difference (depending on the numerical stability of the algorithm) could be observed between the results from the serial code and the parallel one.
Based on your code, I would suggest that the following modified code would give you good parallel performance because there is no synchronisation at each iteration:
#pragma omp parallel private(last)
{
    int rBest = 0, kBest = 0, cBest = 0;
    float myBestHit = bestHit;

    #pragma omp for
    for (int r = 0; r < 53; ++r)
    {
        for (int k = 0; k < 3; ++k)
        {
            for (int c = 0; c < 30; ++c)
            {
                for (int o = -1; o <= 1; ++o)
                {
                    /*
                        r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
                        c: 0-29, in steps of 1, representing the absorption value (collagene)
                        iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
                        o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
                    */
                    int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;
                    if (offset < 0 || offset == fSamples.size())
                    {
                        continue;
                    }
                    last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
                    if (myBestHit > last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                    last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
                    if (myBestHit > last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                }
            }
        }
    }

    #pragma omp critical
    if (bestHit > myBestHit)
    {
        bestHit = myBestHit;
        rad = (rBest+8)*0.25f;
        cVal = cBest * 2;
        veneOffset = (-0.5f + (1.0f / 3.0f) * kBest + (1.0f / 3.0f) / 2.0f);
        if (fabs(veneOffset) < 0.001)
            veneOffset = 0.0f;
    }
}
It only stores the values of the parameters that give the best hit in each thread and then at the end of the parallel region it computes rad, cVal and veneOffset based on the best values. Now there is only one critical region, and it is at the end of code. You can get around it also, but you would have to introduce additional arrays.
One thing to double check is that all variables are initialized. Many times un-optimized code (Debug mode) will initialize memory.
I would have said variable initialization in debug vs not there in release. But your results would not back this up (reliable result in release).
Does your code rely on any specific offsets or sizes? A debug build would place guard bytes around some allocations.
Could it be floating point related?
The debug floating point stack is different to the release which is built for more efficiency.
Look here: http://thetweaker.wordpress.com/2009/08/28/debugrelease-numerical-differences/
Just about any undefined behavior can account for this: uninitialized variables, rogue pointers, multiple modifications of the same object without an intervening sequence point, etc. The fact that the results are at times unreproducible argues somewhat for an uninitialized variable, but it can also occur from pointer problems or bounds errors.
Be aware that optimization can change results, especially on an Intel. Optimization can change which intermediate values spill to memory, and if you've not carefully used parentheses, even the order of evaluation in an expression. (And as we all know, in machine floating point, (a + b) + c != a + (b + c).) Still, the results should be deterministic: you will get different results according to the degree of optimization, but for any set of optimization flags, you should get the same results.
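A tiny illustration of that non-associativity (the values here are my own, chosen for effect, but the behaviour is standard IEEE single precision):

#include <cstdio>

int main()
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    // (a + b) cancels exactly to 0, so the left form yields 1;
    // in (b + c), the 1 is absorbed by rounding, so the right form yields 0
    std::printf("%g vs %g\n", (a + b) + c, a + (b + c));  // prints "1 vs 0"
}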