C/C++ fast absolute difference between two series - c++

I am interested in generating efficient C/C++ code to compute the differences between two time series.
More precisely: the time series values are stored as uint16_t arrays with a fixed, equal length of 128.
I am fine with a pure C as well as a pure C++ implementation. My code examples are in C++.
My intention is:
Let A, B and C be discrete time series of length l with a value type of uint16_t.
For all n < l: C[n] = |A[n] - B[n]|
What I can think of, in pseudo code:
for index i:
if a[i] > b[i]:
c[i] = a[i] - b[i]
else:
c[i] = b[i] - a[i]
Or in C/C++:
for (uint8_t i = 0; i < 128; i++) {
    c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}
But I really don't like the if/else statement in the loop.
I am okay with keeping the loop - it can be unrolled by the compiler.
Somewhat like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }
}
What I am looking for is a 'magic code' which speeds up the if/else and gets me the absolute difference between the two unsigned values.
I am okay with +/- 1 precision (in case that would allow some bit-magic to happen). I am also okay with changing the data type to get faster results. And I am also okay with dropping the loop for something else.
So something like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = magic_code_for_abs_diff(a[i], b[i]);
    }
}
I did try XORing the two values, but that gives the proper result only for one of the two cases.
EDIT 2:
I did a quick test of the different approaches on my laptop.
For 250000000 entries this is the performance (256 rounds):
c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];                            ~500 ms
c[i] = std::abs(a[i] - b[i]);                                              ~800 ms
c[i] = ((a[i] - b[i]) + ((a[i] - b[i]) >> 15)) ^ ((a[i] - b[i]) >> 15);    ~425 ms
uint16_t tmp = (a[i] - b[i]); c[i] = tmp * ((tmp > 0) - (tmp < 0));        ~600 ms
uint16_t ret[2] = { a[i] - b[i], b[i] - a[i] }; c[i] = ret[a[i] < b[i]];   ~900 ms
c[i] = ((a[i] - b[i]) >> 31 | 1) * (a[i] - b[i]);                          ~375 ms
c[i] = (a[i] - b[i]) ^ ((a[i] - b[i]) >> 15);                              ~425 ms
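For completeness, this is the kind of harness such numbers can come from; a minimal sketch only (the buffer length, fill values and use of std::chrono are my assumptions, not the original benchmark):
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative timing harness: runs one candidate expression for 256 rounds.
int main() {
    const size_t n = 1u << 20;                           // placeholder length
    std::vector<uint16_t> a(n, 3), b(n, 7), c(n);
    auto t0 = std::chrono::steady_clock::now();
    for (int round = 0; round < 256; ++round)
        for (size_t i = 0; i < n; ++i)
            c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms (checksum %u)\n", (long long)ms, (unsigned)c[n / 2]); // keep c observable
    return 0;
}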

Your problem is a good candidate for SIMD. GCC can do it automatically, here is a simplified example: https://godbolt.org/z/36nM8bYYv
#include <cstdint>

void absDiff(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (uint8_t i = 0; i < 16; i++)
        c[i] = a[i] - b[i];
}
Note that I added __restrict__ to enable autovectorization, otherwise the compiler has to assume your arrays may overlap and it isn't safe to use SIMD (because some writes could change future reads in the loop).
I simplified it to just 16 at a time, and removed the absolute value for the sake of illustration. The generated assembly is:
vld1.16 {q9}, [r0]!
vld1.16 {q11}, [r1]!
vld1.16 {q8}, [r0]
vld1.16 {q10}, [r1]
vsub.i16 q9, q9, q11
vsub.i16 q8, q8, q10
vst1.16 {q9}, [r2]!
vst1.16 {q8}, [r2]
bx lr
That means it loads 8 integers at once from a, then from b, repeats that once, then does 8 subtracts at once, then again, then stores 8 values twice into c. Many fewer instructions than without SIMD.
Of course it requires benchmarking to see if this is actually faster on your system (after you add back the absolute value part, I suggest using your ?: approach which does not defeat autovectorization), but I expect it will be significantly faster.
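For reference, here is a sketch of the full 128-element routine with the absolute value put back in, using the ?: form recommended above (my addition; whether it actually vectorizes to absolute-difference instructions depends on the compiler and flags):
#include <cstdint>

// Same shape as the example above, extended to 128 elements and with the
// absolute difference restored via the ?: pattern.
void absDiff128(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (int i = 0; i < 128; i++)
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}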

Try to let the compiler see the conditional lane-selection pattern for SIMD instructions like this (pseudo code):
// store a, b in SIMD registers
for (0 to 32)
    a[...] = input[...]
    b[...] = input2[...]
// single-type operation, easily parallelizable
for (0 to 32)
    vector1[...] = a[...] - b[...]
// single-type operation, easily parallelizable
// maybe better to compute b - a here instead, to reduce the dependency on the first step,
// since a and b are already in SIMD registers
for (0 to 32)
    vector2[...] = -vector1[...]
// single-type operation, easily parallelizable
// re-uses the a, b registers again
for (0 to 32)
    vector3[...] = a[...] < b[...]
// the x86 architecture has SIMD instructions for this select
// the operation is simple: no other calculation inside, just 3 inputs, 1 output
// all operands are registers (at least they should be, if the compiler does its job)
for (0 to 32)
    vector4[...] = vector3[...] ? vector2[...] : vector1[...]
If you post your benchmark code, I can compare this with the other solutions. But it shouldn't matter for good compilers (or good compiler flags) that already do the same thing automatically for the first benchmark snippet in the question.
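For comparison, here is a hand-written SSE2 sketch (my addition; x86 with SSE2 assumed) that sidesteps the compare-and-select entirely by using saturating unsigned subtraction: _mm_subs_epu16 clamps a negative difference to zero, so ORing the two directions yields |a - b|:
#include <immintrin.h>
#include <cstdint>

// |a[i] - b[i]| for unsigned 16-bit lanes, 8 at a time.
// Assumes n is a multiple of 8 and the buffers do not alias.
void absDiffSse2(const uint16_t* a, const uint16_t* b, uint16_t* c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i d1 = _mm_subs_epu16(va, vb);   // a - b where a > b, else 0
        __m128i d2 = _mm_subs_epu16(vb, va);   // b - a where b > a, else 0
        _mm_storeu_si128((__m128i*)(c + i), _mm_or_si128(d1, d2));
    }
}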

Fast abs (on two's-complement integers) can be implemented as (x + (x >> N)) ^ (x >> N), where N is the bit width of the type minus 1, i.e. 15 in your 16-bit case. That is a possible implementation of std::abs. Still, you can try it explicitly.
– answer by freakish
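Written out as a helper (my sketch, not from the answer): here the subtraction is done in int, so the shift count is 31, the width of that intermediate type minus 1; the same pattern with a 16-bit intermediate and a shift of 15 stays exact only while the difference fits in 15 bits plus sign. It also assumes >> on a negative value is an arithmetic shift, which is implementation-defined but holds on mainstream compilers:
#include <cstdint>

static inline uint16_t abs_diff(uint16_t a, uint16_t b)
{
    int d = a - b;                  // promoted to int, so the true signed difference
    int m = d >> 31;                // 0 if d >= 0, -1 (all ones) if d < 0
    return (uint16_t)((d + m) ^ m); // branch-free |d|, narrowed back to 16 bits
}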

Since you write "I am okay with a +/- 1 precision", you can use an XOR-only solution: instead of abs(x), compute x ^ (x >> 15). This gives an off-by-one result for negative values.
If you want the exact result even for negative values, use the other answer (with the added + (x >> 15) correction term).
In any case, this XOR trick only works if overflow is impossible, which is also why the compiler cannot apply it to abs on its own.
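The +/- 1 variant then simply drops the addition (again my sketch, with the same arithmetic-shift assumption); for a negative difference it returns |d| - 1:
#include <cstdint>

static inline uint16_t abs_diff_approx(uint16_t a, uint16_t b)
{
    int d = a - b;
    int m = d >> 31;
    return (uint16_t)(d ^ m);       // |d| when d >= 0, |d| - 1 when d < 0
}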

Related

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where "np" is much larger than "nc"). I present that part of the code here. The rest of the code is not very significant in terms of computational time, so I prefer to keep it clean. However, the critical cycle of length "np" is a pretty simple piece of code and it can be parallelized. So it will not hurt to rewrite this part into a more effective, even if less clear, version (maybe using SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell algorithm (a basic variant of it). I'm trying to learn code optimization on this version (so my goal is not only to have an effective PIC algorithm, which has already been written in a thousand variants, but also to produce a demonstrative example of code optimization). I have done some work already, but I am not very sure that I have handled all the optimization issues correctly.
const int NT = ...;             // number of threads (in two versions: about 6 or about 30)
const int np = 10000000;        // np is commonly about 1000-10000 times larger than nc
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int i, k, p, num;
for (i = 0; i < step; i++) {
    // ***
    // *** some not very time consuming code for calculating
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for (k = np; --k; ) {
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0)  { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is the index of each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...); a minimal sketch of this summation pass is shown after the code below. The reason is to avoid two different threads writing into the same part of an array. I am not very sure whether this is an effective solution (note that "nc" is relatively small, so the number of operations over "np" is probably still the dominant cost).
2) (also important) I need to read x[k] many times and it is also changed many times. Maybe it is better to read this value into a local variable (a register), forget the whole x array during the calculation, and only store the obtained value back into x[k] at the end. I believe the compiler does this work for me, but I am not very sure, because I modify x[k] in the middle of the algorithm. The compiler probably does something effective on its own, but in this version it may load and store the value more often than necessary, because I keep switching between reading and writing it.
3) (probably not relevant) The code works with the integer part and the fractional part (the remainder below the decimal point). It needs both of these values. I extract the integer part as p = (int) x and the remainder as x - p. I compute this both at the beginning and at the end of the loop body. One can see that this splitting could be stored somewhere and reused at the next step (I mean the step over the i index). Do you think the following version is better? Here I store the integer and remainder parts in separate arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for (k = np; --k; ) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k];   // *** This part is added to the code to simplify the rest.
    x_int[k] += p;        // *** And maybe there is a better way to realize
    x_rem[k] -= p;        // *** this "pushing correction".
    if (x_int[k] < 0)  { x_int[k] += nc; } else
    if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
} // closes the outer step loop (not shown here)
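For reference, the summation pass from point 1 could look like this (a minimal sketch, reusing the declarations above):
// After the parallel region: fold the per-thread copies into thread 0's copy.
for (int t = 1; t < NT; t++)
    for (int c = 0; c < nc; c++) {
        rho_full[0][c] += rho_full[t][c];
        rho_diff[0][c] += rho_diff[t][c];
    }
// rho_full[0][...] and rho_diff[0][...] now hold the combined densities.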
You can use OMP reduction for your for loop:
int result = 0;
#pragma omp for nowait reduction(+:result)
for (k = np; --k; ) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0)  { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}

Most efficient way to test a 256-bit YMM AVX register element for equal or less than zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle.
The particle system is ordered in a SOA-pattern like this:
class ParticleSystem
{
private:
float* mXPosition;
float* mYPosition;
float* mZPosition;
.... Rest of code not important for this question
My initial approach was just to iterate through the mYPosition array and check for the case stated at the beginning. Perhaps some performance improvements could be made with this approach?
The question, however, is whether there is an efficient way to implement this using the AVX intrinsics? Thank you!
If the elements which are <= 0 are relatively sparse then one simple approach is to test 8 at a time using AVX and then drop into scalar code when you identify a vector which contains one or more such elements, e.g.:
#include <immintrin.h>                                     // AVX intrinsics

const __m256 vk0 = _mm256_setzero_ps();                    // const vector of zeros

int i = 0;                                                 // declared outside so the remainder loop can use it
for (; i + 8 <= n; i += 8)
{
    __m256 vy = _mm256_loadu_ps(&mYPosition[i]);           // load 8 x floats
    __m256 vcmp = _mm256_cmp_ps(vy, vk0, _CMP_LE_OS);      // compare for <= 0
    int mask = _mm256_movemask_ps(vcmp);                   // get MS bits from comparison result
    if (mask != 0)                                         // if any bits set
    {                                                      // then we have 1 or more elements <= 0
        for (int k = 0; k < 8; ++k)                        // test each element in vector
        {                                                  // using scalar code...
            if ((mask & 1) != 0)
            {
                // found element at index i + k
                // do something with it...
            }
            mask >>= 1;
        }
    }
}
// deal with any remaining elements in case n is not a multiple of 8
for (int j = i; j < n; ++j)
{
    if (mYPosition[j] <= 0.0f)
    {
        // found element at index j
        // do something with it...
    }
}
Of course if the matching elements are not sparse, i.e. if you are typically finding one or more in every vector of 8, then this isn't going to buy you any performance gain. However if the elements are sparse, such that most vectors can be skipped, then you should see a significant benefit.
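If the reset itself can be expressed as a blend (for example, putting the particle back at a spawn height), the scalar fallback can be avoided altogether. A sketch of that idea (my addition; mYPosition is from the question, the spawn value is a hypothetical placeholder):
const __m256 vzero  = _mm256_setzero_ps();
const __m256 vspawn = _mm256_set1_ps(100.0f);              // hypothetical respawn height
for (int i = 0; i + 8 <= n; i += 8)
{
    __m256 vy   = _mm256_loadu_ps(&mYPosition[i]);         // load 8 Y positions
    __m256 vcmp = _mm256_cmp_ps(vy, vzero, _CMP_LE_OS);    // lanes with y <= 0
    vy = _mm256_blendv_ps(vy, vspawn, vcmp);               // reset only those lanes
    _mm256_storeu_ps(&mYPosition[i], vy);
}
A real reset would presumably rewrite the X/Z positions and velocities the same way; the remainder loop is handled as in the code above.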

Multithread C++ program to speed up a summatory loop

I have a loop that iterates from 1 to N and takes a modular sum over time. However N is very large, so I am wondering if there is a way to modify it to take advantage of multithreading.
To give sample program
for (long long i = 1; i < N; ++i)
total = (total + f(i)) % modulus;
f(i) in my case isn't an actual function, but a long expression that would take up room here. Putting it there to illustrate purpose.
Yes, try this:
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
total %= modulus;   // the per-thread partial sums are each < modulus, but their sum may not be
Compile with:
g++ -fopenmp your_program.c
It's that simple! No headers are required. The #pragma line automatically spins up a team of threads, divides the iterations of the loop among them, and then recombines everything after the loop. Note, though, that you must know the number of iterations beforehand.
This code uses OpenMP, which provides easy-to-use parallelism that's quite suitable to your case. OpenMP is even built into the GCC and MSVC compilers.
This page shows some of the other reduction operations that are possible.
If you need nested for loops, you can just write
long long total = 0;
#pragma omp parallel for reduction(+:total)
for (long long i = 1; i < N; ++i)
    for (long long j = 1; j < N; ++j)
        total = (total + f(i)*j) % modulus;
And the outer loop will be parallelised, with each thread running its own copy of the inner loop.
But you could also use the collapse directive:
#pragma omp parallel for reduction(+:total) collapse(2)
and then the iterations of both loops will be automagically divvied up.
If each thread needs its own copy of a variable defined prior to the loop, use the private clause (or firstprivate, if each copy should start with the value the variable had before the loop):
long long total = 0;
double cheese = 4;
#pragma omp parallel for reduction(+:total) private(cheese)
for (long long i = 1; i < N; ++i)
    total = (total + f(i)) % modulus;
Note that you don't need to use private(total) because this is implied by reduction.
Presumably the f(i) are independent and take roughly the same time to run, so you could create four threads yourself, have each sum up a quarter of the range, return each partial sum, and join them; a sketch of this fixed split follows below. This isn't a very flexible method, especially if the running times of the f(i) can vary randomly.
You might also want to consider a thread pool, and make each thread calculate f(i) then get the next i to sum.
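A minimal sketch of that fixed split with std::thread (my addition; it assumes f is callable from multiple threads and that N and modulus are visible here):
#include <thread>
#include <vector>

long long parallel_modular_sum(long long N, long long modulus)
{
    const int kThreads = 4;
    std::vector<long long> partial(kThreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            // Each worker sums a strided slice of [1, N).
            long long acc = 0;
            for (long long i = 1 + t; i < N; i += kThreads)
                acc = (acc + f(i)) % modulus;
            partial[t] = acc;
        });
    }
    long long total = 0;
    for (int t = 0; t < kThreads; ++t) {
        workers[t].join();
        total = (total + partial[t]) % modulus;
    }
    return total;
}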
Try OpenMP's parallel for with the reduction clause for your total: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause
If f(long long int) is a function that solely relies on its input and no global state and the abelian properties of addition hold, you can gain a significant advantage like this:
long long total1 = 0, total2 = 0;
for (long long i = 0, j = 1; i < N; i += 2, j += 2)
{
    total1 = (total1 + f(i)) % modulus;
    total2 = (total2 + f(j)) % modulus;
}
total = (total1 + total2) % modulus;
Breaking this out like that should help by allowing the compiler to improve code generation and the CPU to use more resources (the two operations can be handled in parallel) and pump more data out and reduce stalls. [I am assuming an x86 architecture here]
Of course, without knowing what f really looks like, it's hard to be sure if this is possible or if it will really help or make a measurable difference.
There may be other similar tricks that you can exploit special knowledge of your input and your platform - for example, SSE instructions could allow you to do even more. Platform-specific functionality might also be useful. For example, a modulo operation may not be required at all and your compiler may provide a special intrinsic function to perform addition modulo N.
I must ask, have you profiled your code and found this to be a hotspot?
You could use Threading Building Blocks. Note that the accumulation has to be expressed as a reduction: a plain parallel_for writing to a shared total would be a data race (and capturing total by value would discard the sum entirely). With tbb::parallel_reduce:
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

long long total = tbb::parallel_reduce(
    tbb::blocked_range<long long>(1, N), 0LL,
    [&](const tbb::blocked_range<long long>& r, long long acc) {
        for (long long i = r.begin(); i != r.end(); ++i)
            acc = (acc + f(i)) % modulus;
        return acc;
    },
    [](long long a, long long b) { return (a + b) % modulus; });
Or, if overflow cannot occur, drop the % modulus from the loop and the combiner and finish with a single total %= modulus;.

Am I using the SSE resources most efficiently?

I have this piece of code to find how many pixels of key lie in the low/high range.
The low and high matrices are generated from a big input matrix. I have to output the coordinates in low/high where the number of matching pixels is greater than 150 (out of 256).
#include <emmintrin.h>   // SSE2 intrinsics

int8_t high[8192][8192];
int8_t low[8192][8192];
int8_t key[16][16];

for (int i = 0; i <= 8192 - 16; i++)
    for (int j = 0; j <= 8192 - 16; j++)
    {
        for (int ii = 0; ii < 16; ii++)    // one row of the 16x16 window at a time
        {
            int8_t *kLoc = key[ii];
            int8_t *lLoc = low[i + ii] + j;
            int8_t *hLoc = high[i + ii] + j;
            __m128i vHigh, vLow, vKey;     // renamed so they don't shadow the arrays
            vLow  = _mm_loadu_si128((__m128i*)lLoc);
            vHigh = _mm_loadu_si128((__m128i*)hLoc);
            vKey  = _mm_loadu_si128((__m128i*)kLoc);
            // Snip
        }
    }
Can this be made better?
I understand there are 8 128-bit XMM registers and also the MMX registers, whereas I am using just 3 of the available XMM registers. Can I optimize the code to make use of all the registers?
When it comes to optimization, don't guess: measure, and measure on the target/production environment. For simple arithmetic the bottleneck is usually memory bandwidth, so you may consider doing other work between the loads and the stores. You may also be able to reduce the work to one compare instead of two compares plus a merge of the results, by reordering and interleaving the loop like this:
num1  = loadu(hLoc); hLoc++;
num2  = loadu(hLoc); hLoc++;
low1  = loadu(kLoc); kLoc++;
low2  = loadu(kLoc); kLoc++;
high1 = loadu(lLoc); lLoc++;
high2 = loadu(lLoc); lLoc++;
num1 = mm_sub(num1, low1)
num2 = mm_sub(num2, low2)
cmp num1, high1
cmp num2, high2
store num1
store num2
You may also want to move kLoc, lLoc and hLoc outside the loop and do the increments (i.e. kLoc++) as above; some compilers are not smart enough and generate code that recalculates the address on every iteration.

Optimizing this code block

for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 5000; j++)
    {
        for (int ii = 0; ii < 20; ii++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                // Extract the range from this.
                int low = num & 0xff;
                int high = num >> 8;
                if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
                {
                    // match found
                }
            }
    }
The machine is x86_64 with 32 KB L1 cache and 256 KB L2 cache.
Any pointers on how I can possibly optimize this code?
EDIT: Some background to the original problem: Fastest way to Find a m x n submatrix in M X N matrix
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops, than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
There are some higher-level things you could consider with some early analysis of smaller: for example, if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Some basic advice:
Profile it, so you can learn where the hot-spots are.
Think about cache locality, and the addresses resulting from your loop order.
Use more const in the innermost scope, to hint more to the compiler.
Try breaking it up so you don't compute high if the low test is failing.
Try maintaining the offsets into matBigger and matSmaller explicitly, so that the innermost step becomes a simple increment.
The best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
if you are just interested in whether a matching entry exists, make sure to break out of all the enclosing loops at the position of // match found.
make sure the data is stored in an optimal way. It all depends on your problem, but e.g. it could be more efficient to have just one array of size 5000*5000*20 and overload operator()(int,int,int) for accessing elements.
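As an illustration of the last point, a flat storage wrapper with an overloaded operator()(int,int,int) might look like this (my sketch; the element type and dimensions are placeholders):
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous allocation indexed as (i, j, k).
struct Flat3D {
    std::vector<int16_t> data;
    int n1, n2;   // extents of the 2nd and 3rd dimensions
    Flat3D(int n0, int n1_, int n2_)
        : data((std::size_t)n0 * n1_ * n2_), n1(n1_), n2(n2_) {}
    int16_t& operator()(int i, int j, int k)
    {
        return data[((std::size_t)i * n1 + j) * n2 + k];
    }
};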
What are matSmaller and matBigger?
Try changing the accesses to a flat layout: matBigger[(i+ii) * COL_COUNT + (j+jj)]
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in register. By changing the array access to be a simple pointer increment, you'll make the compiler's job easier to produce efficient code.
Edit: I just noticed @unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test comes first.
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see there are locations that are accessed many times.
Also, multiplying the iteration counts indicates how often the inner content is executed: 20 * 20 * 5000 * 5000, or 10000000000 (1e10) times. That's a lot!
So rather than trying to speed up the execution of 1e10 instructions (through instruction (pipeline) cache or data cache optimization, say), try reducing the number of iterations.
The code is searching the matrix for numbers that are within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
Find and remember all coordinates where the search value is greater than the low value. Let us call these anchor points.
For each anchor point, find the coordinates of the first value after the anchor point that is outside the range.
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one pass scan and allow other decisions such as finding a range or determining an MxN matrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of accesses to the matrices.
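A sketch of that {value -> coordinate list} index (my addition; the container choice and types are illustrative):
#include <unordered_map>
#include <utility>
#include <vector>

// Build the index once over the small matrix, then look values up directly
// instead of rescanning matSmaller.
std::unordered_map<int, std::vector<std::pair<int,int>>> valueIndex;
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
        valueIndex[matSmaller[ii][jj]].push_back({ii, jj});
// valueIndex[26] would then list every (row, col) position in matSmaller that holds 26.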