Optimisation of IIR filter - C++

Quick question related to IIR filter coefficients. Here is a very typical implementation of a direct form II biquad IIR processor that I found online.
// b0, b1, b2, a1, a2 are filter coefficients
// m1, m2 are the memory locations
// dn is the de-denormal coeff (=1.0e-20f)
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
register float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
I understand that the "register" is somewhat unnecessary given how smart modern compilers are about this kind of thing. My question is, are there any potential performance benefits to storing the filter coefficients in individual variables rather than using arrays and dereferencing the values? Would the answer to this question depend on the target platform?
i.e.
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
versus
out[i] = b1*m1 + b2*m2 + b0*w;

It really depends on your compiler and the optimization options. Here is my take:
Any modern compiler will simply ignore register; it is only a hint, and modern compilers just don't use it.
Accessing constant indices in a loop is usually optimized away when compiling with optimizations on, so using individual variables or an array as you showed makes no difference.
Always, always run benchmarks and look at the generated code for the performance-critical sections of your code.
EDIT: OK, just out of curiosity I wrote a small program and got "identical" code generated when using full optimization with VS2010. Here is what I get inside the loop for the expression in question (exactly identical for both cases):
0128138D fmul dword ptr [eax+0Ch]
01281390 faddp st(1),st
01281392 fld dword ptr [eax+10h]
01281395 fld dword ptr [w]
01281398 fld st(0)
0128139A fmulp st(2),st
0128139C fxch st(2)
0128139E faddp st(1),st
012813A0 fstp dword ptr [ecx+8]
Notice that I added a few lines to output the results, so that I could make sure the compiler does not just optimize everything away. Here is the code:
#include <iostream>
#include <iterator>
#include <algorithm>
class test1
{
float a1, a2, b0, b1, b2;
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a1*m1 - a2*m2 + dn;
out[i] = b1*m1 + b2*m2 + b0*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
class test2
{
float a[2], b[3];
float dn;
float m1, m2;
public:
void processBiquad(const float* in, float* out, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
float w = in[i] - a[0]*m1 - a[1]*m2 + dn;
out[i] = b[0]*m1 + b[1]*m2 + b[2]*w;
m2 = m1; m1 = w;
}
dn = -dn;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
test1 t1;
test2 t2;
float a[1000];
float b[1000];
t1.processBiquad(a, b, 1000);
t2.processBiquad(a, b, 1000);
std::copy(b, b+1000, std::ostream_iterator<float>(std::cout, " "));
return 0;
}

I am not sure, but this:
out[i] = b[1]*m[1] + b[2]*m[2] + b[0]*w;
might be worse, because it could compile to indirect access, which is worse than direct access performance-wise.
The only way to actually see, is to check the compiled assembler and profile the code.

You will likely get a benefit if you can declare the coefficients b0, b1, b2 as const. Code will be more efficient if any of your operands are known and fixed at compile time.
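For illustration, here is a minimal sketch of that idea; the FixedBiquad struct and its coefficient values are made up for this example, not taken from the question. With the coefficients known at compile time, the optimizer can fold them straight into the generated multiplies.
struct FixedBiquad
{
    // Arbitrary example coefficients, known at compile time.
    static constexpr float b0 = 0.2929f, b1 = 0.5858f, b2 = 0.2929f;
    static constexpr float a1 = 0.0f, a2 = 0.1716f;
    float m1 = 0.0f, m2 = 0.0f;   // memory locations
    float dn = 1.0e-20f;          // de-denormal coefficient
    void process(const float* in, float* out, unsigned length)
    {
        for (unsigned i = 0; i < length; ++i)
        {
            // Same recurrence as above, but every coefficient is a
            // compile-time constant the optimizer can propagate.
            float w = in[i] - a1 * m1 - a2 * m2 + dn;
            out[i] = b1 * m1 + b2 * m2 + b0 * w;
            m2 = m1;
            m1 = w;
        }
        dn = -dn;
    }
};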

Related

(How) Can I vectorize `std::complex<double>` using openmp?

I want to optimize my application using vectorization. More specifically, I want to vectorize the mathematical operations on the std::complex<double> type. However, this seems to be quite difficult. Consider the following example:
#define TEST_LEN 100
#include <algorithm>
#include <complex>
typedef std::complex<double> cmplx;
using namespace std::complex_literals;
#pragma omp declare simd
cmplx add(cmplx a, cmplx b)
{
return a + b;
}
#pragma omp declare simd
cmplx mult(cmplx a, cmplx b)
{
return a * b;
}
void k(cmplx *x, cmplx *&y, int i0, int N)
{
#pragma omp for simd
for (int i = i0; i < N; i++)
y[i] = add(mult(-(1i + 1.0), x[i]), 1i);
}
int main(int argc, char **argv)
{
cmplx *x = new cmplx[TEST_LEN];
cmplx *y = new cmplx[TEST_LEN];
for (int i = 0; i < TEST_LEN; i++)
x[i] = 0;
for (int i = 0; i < TEST_LEN; i++)
{
int N = std::min(4, TEST_LEN - i);
k(x, y, i, N);
}
delete[] x;
delete[] y;
return 1;
}
I am using the g++ compiler. For this code the compiler gives the following warning:
warning: unsupported return type 'cmplx' {aka 'std::complex'} for simd
for the lines containing the mult and add functions.
It seems like it is not possible to vectorize the std::complex<double> type like this.
Is there a different way this can be achieved?
Not easily. SIMD works quite well when you have values in the next N steps that behave the same way. So consider for example an array of 2D vectors:
X Y X Y X Y X Y
If we were to do a vector addition operation here,
X Y X Y X Y X Y
+ + + + + + + +
X Y X Y X Y X Y
The compiler will nicely vectorise that operation. If however we were to want to do something different for the X and Y values, the memory layout becomes problematic for SIMD:
X Y X Y X Y X Y
+ / + / + / + /
X Y X Y X Y X Y
If you consider for example the multiplication case:
(a + bi)(c + di) = (ac - bd) + (ad + bc)i
Suddenly the operations are jumping between SIMD lanes, which is pretty much going to kill any decent vectorization.
Take a quick look at this godbolt: https://godbolt.org/z/rnVVgl
Addition boils down to some vaddps instructions (working on 8 floats at a time).
Multiply ends up using vfmadd231ss and vmulss (which both work on 1 float at a time).
The only easy way to automatically vectorise your complex code would be to separate out the real and imaginary parts into 2 arrays:
struct ComplexArray {
float* real;
float* imaginary;
};
Within this godbolt you can see that the compiler is now using vfmadd213ps instructions (so again back to working on 8 floats at a time).
https://godbolt.org/z/Ostaax
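For example, a plain loop over that split layout looks something like the sketch below (my own illustration, not the code from the godbolt link). Because every element needs exactly the same arithmetic, the compiler can keep each SIMD lane busy with no cross-lane shuffles.
// Complex multiply over the split real/imaginary layout above:
// (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i
void multiply(const ComplexArray& a, const ComplexArray& b,
              ComplexArray& out, int n)
{
    for (int i = 0; i < n; ++i)
    {
        float ar = a.real[i], ai = a.imaginary[i];
        float br = b.real[i], bi = b.imaginary[i];
        out.real[i] = ar * br - ai * bi;
        out.imaginary[i] = ar * bi + ai * br;
    }
}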

SIMD Program slow runtime

I'm starting out with SIMD programming, but I don't know what to do at this point. I'm trying to reduce the runtime, but it's going the other way.
This is my basic code:
https://codepaste.net/a8ut89
void blurr2(double * u, double * r) {
int i;
double dos[2] = { 2.0, 2.0 };
for (i = 0; i < SIZE - 1; i++) {
r[i] = u[i] + u[i + 1];
}
}
blurr2: 0.43s
int contarNegativos(double * u) {
int i;
int contador = 0;
for (i = 0; i < SIZE; i++) {
if (u[i] < 0) {
contador++;
}
}
return contador;
}
negativeCount: 1.38s
void ord(double * v, double * u, double * r) {
int i;
for (i = 0; i < SIZE; i += 2) {
r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
}
}
ord: 0.33
And this is my SIMD code:
https://codepaste.net/fbg1g5
void blurr2(double * u, double * r) {
__m128d rp2;
__m128d rdos;
__m128d rr;
int i;
int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
double dos[2] = { 2.0, 2.0 };
rdos = *(__m128d*)dos;
for (i = 0; i < sizeAux; i += 2) {
rp2 = *(__m128d*)&u[i + 1];
rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
*((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
}
}
blurr2: 0.42s
int contarNegativos(double * u) {
__m128d rcero;
__m128d rr;
int i;
double cero[2] = { 0.0, 0.0 };
int contador = 0;
rcero = *(__m128d*)cero;
for (i = 0; i < SIZE; i += 2) {
rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
if (((__int64 *)&rr)[0]) {
contador++;
};
if (((__int64 *)&rr)[1]) {
contador++;
};
}
return contador;
}
negativeCount: 1.42s
void ord(double * v, double * u, double * r) {
__m128d rr;
int i;
for (i = 0; i < SIZE; i += 2) {
*((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
}
}
ord: 0.35s
Different solutions.
Can you explain what I'm doing wrong? I'm a bit lost...
Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. On gcc/clang, dereferencing a __m128d* assumes 16-byte alignment, so your code will segfault there if the data is not aligned.
blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
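A minimal sketch of what that looks like for blurr2 (my code, with the size passed explicitly rather than using the question's SIZE constant): unaligned loads instead of the pointer casts, and a multiply by 0.5 in place of the divide.
#include <emmintrin.h>
void blurr2_sse(const double* u, double* r, int n)   // n == SIZE
{
    const __m128d half = _mm_set1_pd(0.5);
    int i;
    for (i = 0; i + 2 <= n - 1; i += 2) {
        __m128d a = _mm_loadu_pd(&u[i]);      // u[i],   u[i+1]
        __m128d b = _mm_loadu_pd(&u[i + 1]);  // u[i+1], u[i+2]
        _mm_storeu_pd(&r[i], _mm_mul_pd(_mm_add_pd(a, b), half));
    }
    for (; i < n - 1; ++i)                    // scalar tail
        r[i] = (u[i] + u[i + 1]) * 0.5;
}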
negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).
#include <immintrin.h>
#include <stdint.h>
static const size_t SIZE = 1024;
uint64_t countNegative(double * u) {
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < SIZE; i += 2) {
__m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
}
//return counts[0] + counts[1]; // GNU C only, and less efficient
// horizontal sum
__m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
counts = _mm_add_epi64(counts, hi64);
uint64_t scalarcount = _mm_cvtsi128_si64(counts);
return scalarcount;
}
To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.
(source + asm on the Godbolt compiler explorer)
From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:
$LL4#countNegat:
movups xmm0, XMMWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+16]
cmpltpd xmm0, xmm2
psubq xmm1, xmm0
sub rax, 1
jne SHORT $LL4#countNegat
It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was actually branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.
Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work. If it does, you got lucky. Compilers usually generate similar code for that or intrinsics, and usually not worse for proper intrinsics.
Do you know that the ord function with SIMD is not 1:1 equivalent to the ord function without SIMD instructions?
In the ord function without SIMD, the result of the OR operation is calculated only for the even indexes:
r[0] = v[0] | u[0],
r[2] = v[2] | u[2],
r[4] = v[4] | u[4]
What about the odd indexes? Maybe, if the OR operations were calculated for all indexes, it would take more time than it does now.
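For what it's worth, a scalar loop that really matches what the SIMD ord computes (ORing the raw bit patterns of every element, not just the even indexes) would look roughly like this sketch; the size is passed explicitly here, and memcpy is used to sidestep the strict-aliasing issue mentioned above.
#include <cstring>
void ord_all(const double* v, const double* u, double* r, int n)
{
    for (int i = 0; i < n; ++i) {
        unsigned long long vb, ub, bits;
        std::memcpy(&vb, &v[i], sizeof vb);
        std::memcpy(&ub, &u[i], sizeof ub);
        bits = vb | ub;
        std::memcpy(&r[i], &bits, sizeof bits);   // store the raw OR'd bits
    }
}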

Why is an AVX dot product slower than native C++ code?

I have the following AVX and Native codes:
__forceinline double dotProduct_2(const double* u, const double* v)
{
_mm256_zeroupper();
__m256d xy = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m256d temp = _mm256_hadd_pd(xy, xy);
__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));
return dotproduct.m128d_f64[0];
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
And respective test scripts:
std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "NAIVE : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
{
std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < (1 << 30); ++i)
{
zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
}
std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
std::cout << "AVX : " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
}
std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;
Well, all of the data is 32-byte aligned (D3 with __declspec... and the aVx array with _mm_malloc()).
As far as I can see, the native variant is equal to or faster than the AVX variant. I can't understand why; is this normal behaviour? I thought AVX was supposed to be 'super fast'... If not, how can I optimize it? I compile with MSVC 2015 (x64) with /arch:AVX. My hardware is an Intel i7 4750HQ (Haswell).
Simple profiling with basic loops isn't a great idea - it usually just means you are memory bandwidth limited, so the tests end up coming out at about the same speed (memory is typically slower than the CPU, and that's basically all you are testing here).
As others have said, your code example isn't great, because you are constantly going across the lanes (which I assume is just to find the fastest dot product, and not specifically because a sum of all the dot products is the desired result?). To be honest, if you really need a fast dot product (for AOS data as presented here), I think I would prefer to replace the VHADDPD with a VADDPD + VPERMILPD (trading an additional instruction for twice the throughput, and a lower latency)
double dotProduct_3(const double* u, const double* v)
{
__m256d dp = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
__m128d a = _mm256_extractf128_pd(dp, 0);
__m128d b = _mm256_extractf128_pd(dp, 1);
__m128d c = _mm_add_pd(a, b);
__m128d yy = _mm_unpackhi_pd(c, c);
__m128d dotproduct = _mm_add_pd(c, yy);
return _mm_cvtsd_f64(dotproduct);
}
asm:
dotProduct_3(double const*, double const*):
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vextractf128 xmm1,ymm0,0x1
vaddpd xmm0,xmm1,xmm0
vpermilpd xmm1,xmm0,0x3
vaddpd xmm0,xmm1,xmm0
vzeroupper
ret
Generally speaking, if you are using horizontal adds, you're doing it wrong! Whilst a 256bit register may seem ideal for a Vector4d, it's not actually a particularly great representation (especially if you consider that AVX512 is now available!). A very similar question to this came up recently: For C++ Vector3 utility class implementations, is array faster than struct and class?
If you want performance, then structure-of-arrays is the best way to go.
struct HybridVec4SOA
{
__m256d x;
__m256d y;
__m256d z;
__m256d w;
};
__m256d dot(const HybridVec4SOA& a, const HybridVec4SOA& b)
{
return _mm256_fmadd_pd(a.w, b.w,
_mm256_fmadd_pd(a.z, b.z,
_mm256_fmadd_pd(a.y, b.y,
_mm256_mul_pd(a.x, b.x))));
}
asm:
dot(HybridVec4SOA const&, HybridVec4SOA const&):
vmovapd ymm1,YMMWORD PTR [rdi+0x20]
vmovapd ymm2,YMMWORD PTR [rdi+0x40]
vmovapd ymm3,YMMWORD PTR [rdi+0x60]
vmovapd ymm0,YMMWORD PTR [rsi]
vmulpd ymm0,ymm0,YMMWORD PTR [rdi]
vfmadd231pd ymm0,ymm1,YMMWORD PTR [rsi+0x20]
vfmadd231pd ymm0,ymm2,YMMWORD PTR [rsi+0x40]
vfmadd231pd ymm0,ymm3,YMMWORD PTR [rsi+0x60]
ret
If you compare the latencies (and more importantly throughput) of load/mul/fmadd compared to hadd and extract, and then consider that the SOA version is computing 4 dot products at a time (instead of 1), you'll start to understand why it's the way to go...
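If your data currently lives in AOS form, you would need a transpose to get it into that SOA shape. A hypothetical packing helper (my sketch, assuming four contiguous 32-byte-aligned Vector4d values) could look like this; it is exactly the shuffle work you avoid by storing SOA in the first place.
#include <immintrin.h>
HybridVec4SOA pack4(const double* v)   // v points at 4 vec4s = 16 doubles
{
    __m256d r0 = _mm256_load_pd(v + 0);    // x0 y0 z0 w0
    __m256d r1 = _mm256_load_pd(v + 4);    // x1 y1 z1 w1
    __m256d r2 = _mm256_load_pd(v + 8);    // x2 y2 z2 w2
    __m256d r3 = _mm256_load_pd(v + 12);   // x3 y3 z3 w3
    __m256d t0 = _mm256_unpacklo_pd(r0, r1);   // x0 x1 z0 z1
    __m256d t1 = _mm256_unpackhi_pd(r0, r1);   // y0 y1 w0 w1
    __m256d t2 = _mm256_unpacklo_pd(r2, r3);   // x2 x3 z2 z3
    __m256d t3 = _mm256_unpackhi_pd(r2, r3);   // y2 y3 w2 w3
    HybridVec4SOA out;
    out.x = _mm256_permute2f128_pd(t0, t2, 0x20);   // x0 x1 x2 x3
    out.y = _mm256_permute2f128_pd(t1, t3, 0x20);   // y0 y1 y2 y3
    out.z = _mm256_permute2f128_pd(t0, t2, 0x31);   // z0 z1 z2 z3
    out.w = _mm256_permute2f128_pd(t1, t3, 0x31);   // w0 w1 w2 w3
    return out;
}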
You add too much overhead with the vzeroupper and hadd instructions. A good way to write it is to do all the multiplies in a loop and aggregate the result just once at the end. Imagine you unroll the original loop 4 times and use 4 accumulators:
for(i=0; i < (1<<30); i+=4) {
s0 += a[i+0] * b[i+0];
s1 += a[i+1] * b[i+1];
s2 += a[i+2] * b[i+2];
s3 += a[i+3] * b[i+3];
}
return s0+s1+s2+s3;
And now just replace the unrolled loop with SIMD mul and add (or even an FMA intrinsic, if available).
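A sketch of that idea with AVX intrinsics (my code, assuming n is a multiple of 4 and the arrays are 32-byte aligned, as in the question). It keeps a single vector accumulator and does the horizontal reduction exactly once after the loop; unrolling with several accumulators, as suggested above, would hide the add/FMA latency further.
#include <immintrin.h>
#include <cstddef>
double dotProduct_loop(const double* a, const double* b, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4)
    {
        __m256d x = _mm256_load_pd(a + i);
        __m256d y = _mm256_load_pd(b + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(x, y));  // or _mm256_fmadd_pd
    }
    // Horizontal sum, done once at the end instead of per element.
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d sum = _mm_add_pd(lo, hi);
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
    return _mm_cvtsd_f64(sum);
}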

C++ nested loop invariant code motion

I would like to know under which conditions invariant parts of nested loops can be optimized.
For doing so, I wrote two functions one of which implements the factorization of three nested loops while the other doesn't.
The non-factorized function looks like:
template<int k>
double __attribute__ ((noinline)) evaluate(const double u[], const double phi[])
{
double f = 0.;
for (int i3 = 0;i3<k;++i3)
for (int i2 = 0;i2<k;++i2)
for (int i1 = 0;i1<k;++i1)
f += u[i1+k*(i2+k*i3)] * phi[i1] * phi[i2] * phi[i3];
return f;
}
While the factorized function is:
template<int k>
double __attribute__ ((noinline)) evaluate_fact(const double u[], const double phi[])
{
double f3 = 0.;
for (int i3 = 0;i3<k;++i3)
{
double f2 = 0.;
for (int i2 = 0;i2<k;++i2)
{
double f1 = 0.;
for (int i1 = 0;i1<k;++i1)
{
f1 += u[i1+k*(i2+k*i3)] * phi[i1];
}
f2 += f1 * phi[i2];
}
f3 += f2 * phi[i3];
}
return f3;
}
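In other words, evaluate_fact just applies the distributive law by hand:
f = sum_{i3} sum_{i2} sum_{i1} u[i1 + k*(i2 + k*i3)] * phi[i1] * phi[i2] * phi[i3]
  = sum_{i3} phi[i3] * ( sum_{i2} phi[i2] * ( sum_{i1} u[i1 + k*(i2 + k*i3)] * phi[i1] ) ),
which cuts the number of multiplications from roughly 3*k^3 down to about k^3 + k^2 + k.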
I call both functions with the following main:
int main()
{
const static unsigned int k=20;
double u[k*k*k];
double phi[k];
phi[0] = 1.;
for (unsigned int i=1;i<k;++i)
phi[i] = phi[i-1]*.333;
double e = 0.;
for (unsigned int i=0;i<1000;++i)
{
e += evaluate<k>(u, phi);
//e += evaluate_fact<k>(u, phi);
}
std::cout << "Evaluate " << e << std::endl;
}
For small k both functions generate the same assembly code, but beyond a certain size (k ~ 10) the assembly no longer looks the same, and callgrind shows more operations being performed in the non-factorized version.
How should I write my code (if it is possible at all), or what should I tell GCC, so that evaluate() is optimized to evaluate_fact()?
I am using GCC 7.1.0 with the flags -Ofast -fmove-loop-invariants.
Using -funroll-loops does not help unless I also add --param max-completely-peeled-insns=10000 --param max-completely-peel-times=10000, but that is a completely different thing, because it basically unrolls everything; the assembly is extensive.
Using -fassociative-math doesn't help either.
This paper claims that: "Traditional loop-invariant code motion, which is commonly applied by general-purpose compilers, only checks invariance with respect to the innermost loop." Does that apply to my code?
Thanks!

_mm_load_ps causes a segmentation fault

I have a code snippet that just loads 2 arrays and calculates the dot product between them using SSE.
Code here:
using namespace std;
long long size = 3200000;
float* _random()
{
unsigned int seed = 123;
// float *t = malloc(size*sizeof(float));
float *t = new float[size];
int i;
float num = 0.0;
for(i=0; i < size; i++) {
num = rand()/(RAND_MAX+1.0);
t[i] = num;
}
return t;
}
float _dotProductVectorSSE(float *s1, float *s2)
{
float prod;
int i;
__m128 X, Y, Z;
for(i=0; i<size; i+=4)
{
X = _mm_load_ps(&s1[i]);
Y = _mm_load_ps(&s2[i]);
X = _mm_mul_ps(X, Y);
Z = _mm_add_ps(X, Z);
}
float *v = new float[4];
_mm_store_ps(v,Z);
for(i=0; i<4; i++)
{
// prod += Z[i];
std::cout << v[i] << endl;
}
return prod;
}
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
time_t start, stop;
double avg_time = 0;
double cur_time;
float* s1 = NULL;
float* s2 = NULL;
for(int i = 0; i < 100; i++)
{
s1 = _random();
s2 = _random();
start = clock();
float sse_product = _dotProductVectorSSE(s1, s2);
stop = clock();
cur_time = ((double) stop-start) / CLOCKS_PER_SEC;
avg_time += cur_time;
}
std::cout << "Averagely used " << avg_time/100 << " seconds." << endl;
return a.exec();
}
When I run it, I get a segmentation fault. Here is the backtrace:
(gdb) bt
0 0x0804965f in _mm_load_ps (__P=0xb6b56008) at /usr/lib/gcc/i586-suse-linux/4.6/include/xmmintrin.h:899
1 _dotProductVectorSSE (s1=0xb6b56008, s2=0xb5f20008) at ../simd/simd.cpp:37
2 0x0804987f in main (argc=1, argv=0xbfffee84) at ../simd/simd.cpp:80
Disassembly:
0x8049b30 push %ebp
0x8049b31 <+0x0001> push %edi
0x8049b32 <+0x0002> push %esi
0x8049b33 <+0x0003> push %ebx
0x8049b34 <+0x0004> sub $0x2c,%esp
0x8049b37 <+0x0007> mov 0x804c0a4,%esi
0x8049b3d <+0x000d> mov 0x40(%esp),%edx
0x8049b41 <+0x0011> mov 0x44(%esp),%ecx
0x8049b45 <+0x0015> mov 0x804c0a0,%ebx
0x8049b4b <+0x001b> cmp $0x0,%esi
0x8049b4e <+0x001e> jl 0x8049b7a <_Z20_dotProductVectorSSEPfS_+74>
0x8049b50 <+0x0020> jle 0x8049c10 <_Z20_dotProductVectorSSEPfS_+224>
0x8049b56 <+0x0026> add $0xffffffff,%ebx
0x8049b59 <+0x0029> adc $0xffffffff,%esi
0x8049b5c <+0x002c> xor %eax,%eax
0x8049b5e <+0x002e> shrd $0x2,%esi,%ebx
0x8049b62 <+0x0032> add $0x1,%ebx
0x8049b65 <+0x0035> shl $0x2,%ebx
**0x8049b68 <+0x0038> movaps (%edx,%eax,4),%xmm0**
0x8049b6c <+0x003c> mulps (%ecx,%eax,4),%xmm0
0x8049b70 <+0x0040> add $0x4,%eax
0x8049b73 <+0x0043> cmp %ebx,%eax
0x8049b75 <+0x0045> addps %xmm0,%xmm1
0x8049b78 <+0x0048> jne 0x8049b68 <_Z20_dotProductVectorSSEPfS_+56>
0x8049b7a <+0x004a> movaps %xmm1,0x10(%esp)
0x8049b7f <+0x004f> xor %ebx,%ebx
I am using QtCreator and defined in .pro file:
QMAKE_CXXFLAGS += -msse -msse2
DEFINES += __SSE__
DEFINES += __SSE2__
DEFINES += __MMX__
Please tell me how to fix this problem!
You are not ensuring that your data is 16-byte aligned (malloc/new are not sufficient in general) - you will either need to use _mm_loadu_ps instead of _mm_load_ps to deal with your potentially misaligned data, or preferably use a suitable method to allocate aligned memory (e.g. posix_memalign on Linux).
Note that you should use _mm_load_ps and 16-byte-aligned memory if you possibly can; otherwise use _mm_loadu_ps, but note that this may reduce performance significantly on some (older) CPUs.
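A minimal sketch of the aligned-allocation route (my code, using posix_memalign as mentioned above for Linux), so that _mm_load_ps is safe on the returned buffer:
#include <stdlib.h>
// Returns a 16-byte-aligned buffer of 'count' floats, or NULL on failure.
// Release it with free() as usual.
float* alloc_aligned_floats(size_t count)
{
    void* p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return (float*)p;
}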
Try the link below.
http://flyeater.wordpress.com/2010/11/29/memory-allocation-and-data-alignment-custom-mallocfree/
You basically allocate a bit more memory than you need, then round the address up to a multiple of 16 and use the memory beginning at that address to load/store data.
Take care of pointer arithmetic.
Most of the code here (ideone.com/fXKQhR) is taken from the above link; it shows sample usage.
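A sketch of that over-allocate-and-round-up trick (the helper name is made up; keep the original pointer around so it can still be passed to free()):
#include <stdint.h>
#include <stdlib.h>
void* aligned_malloc16(size_t bytes, void** original)
{
    void* raw = malloc(bytes + 15);            // allocate a bit extra
    if (!raw) return NULL;
    *original = raw;                           // needed later for free()
    uintptr_t addr = ((uintptr_t)raw + 15) & ~(uintptr_t)15;
    return (void*)addr;                        // 16-byte-aligned pointer
}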
I think _mm_malloc may be helpful to you.