What I am trying to do ultimately is multiplying two complex numbers like this:
z1 = R1 + I1*j
z2 = R2 + I2*j
z3 = z1 * z2 = (R1*R2 - I1*I2) + (R1*I2 + R2*I1)*j;
But what I have are two separate vectors for the real and imaginary parts of both of those complex numbers. So something like this:
v1 = [R1, R2, R3, R4 ... Rn] real parts of z1
v2 = [I1, I2, I3, I4 ... In] imaginary parts of z1
v3 = [R1, R2, R3, R4 ... Rn] real parts of z2
v4 = [I1, I2, I3, I4 ... In] imaginary parts of z2
So when I am trying to calculate z3 now, I do this:
void foo(std::vector<double> real1, std::vector<double> imag1,
         std::vector<double> real2, std::vector<double> imag2)
{
    std::vector<double> realResult;
    std::vector<double> imagResult;
    for (size_t i = 0; i < real1.size(); i++)
    {
        realResult.push_back(real1[i]*real2[i] - imag1[i]*imag2[i]);
        imagResult.push_back(real1[i]*imag2[i] + real2[i]*imag1[i]);
    }
    // And so on
}
Now, this function is eating a lot of time. Is there a faster way of doing this? Can you think of something I could use?
You might be able to make use of std::complex. It probably implements the operations you require at least close to as well as they can be implemented.
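For example, the element-wise multiply could be written with std::complex like this (a minimal sketch; the multiplyComplex helper name is illustrative, not from the question):

```cpp
#include <complex>
#include <vector>

// Multiply two vectors of complex numbers element-wise.
// std::complex's operator* computes (R1*R2 - I1*I2) + (R1*I2 + R2*I1)*j for us.
std::vector<std::complex<double>> multiplyComplex(
    const std::vector<std::complex<double>>& z1,
    const std::vector<std::complex<double>>& z2)
{
    std::vector<std::complex<double>> z3;
    z3.reserve(z1.size());  // avoid reallocations while pushing back
    for (size_t i = 0; i < z1.size(); ++i)
        z3.push_back(z1[i] * z2[i]);
    return z3;
}
```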
EDIT (In reply to comment):
I would do this:
size_t num_items = real1.size();
std::vector<double> realResult;
realResult.reserve(num_items);
std::vector<double> imagResult;
imagResult.reserve(num_items);
for (size_t i = 0; i < num_items; ++i) {
    // lalala, not re-sizing any vectors, yay!
    realResult.push_back(real1[i] * real2[i] - imag1[i] * imag2[i]);
    imagResult.push_back(real1[i] * imag2[i] + real2[i] * imag1[i]);
}
Otherwise if you have a large input array and you are doing a lot of multiplication on doubles I'm afraid that might just be slow. Best you can do is mess around with getting things contiguous in memory for bonus cache points. Impossible to really say without profiling the code exactly what might work best.
Pass the parameters as const std::vector<double>& to avoid unnecessary copies.
You may also consider computing each multiplication in parallel; if N is big enough, the overhead of parallel computing is worthwhile.
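A rough sketch of that parallel idea, splitting the loop across two threads (the function names are illustrative; for real use you'd pick the thread count and split based on N and your hardware):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Compute the complex product for indices [begin, end) into pre-sized outputs.
void multiplyRange(const std::vector<double>& r1, const std::vector<double>& i1,
                   const std::vector<double>& r2, const std::vector<double>& i2,
                   std::vector<double>& rOut, std::vector<double>& iOut,
                   size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        rOut[i] = r1[i] * r2[i] - i1[i] * i2[i];
        iOut[i] = r1[i] * i2[i] + r2[i] * i1[i];
    }
}

// Split the work in half: one worker thread plus the calling thread.
void multiplyParallel(const std::vector<double>& r1, const std::vector<double>& i1,
                      const std::vector<double>& r2, const std::vector<double>& i2,
                      std::vector<double>& rOut, std::vector<double>& iOut)
{
    size_t n = r1.size(), mid = n / 2;
    std::thread t(multiplyRange, std::cref(r1), std::cref(i1), std::cref(r2),
                  std::cref(i2), std::ref(rOut), std::ref(iOut), size_t(0), mid);
    multiplyRange(r1, i1, r2, i2, rOut, iOut, mid, n);  // second half on this thread
    t.join();
}
```

Because each output index is written by exactly one thread and the outputs are pre-sized, no locking is needed.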
Use std::valarray of std::complex. It is simple and optimized for arithmetic operations
void foo(std::valarray<std::complex<double>> & z1,
         std::valarray<std::complex<double>> & z2)
{
    auto z3 = z1 * z2; // applies to each element of the two valarrays, or a valarray and a value
    // . . .
}
EDIT: Convert vectors to valarray
std::valarray<std::complex<double>> z1(real1.size());
for (size_t i = 0; i < z1.size(); ++i)
z1[i] = std::complex<double>(real1[i], imag1[i]);
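Putting the conversion and the multiply together, an end-to-end sketch might look like this (the multiplyViaValarray name is illustrative):

```cpp
#include <complex>
#include <valarray>
#include <vector>

// Pack the split real/imag vectors into valarrays of std::complex,
// multiply element-wise, then unpack the result.
void multiplyViaValarray(const std::vector<double>& real1, const std::vector<double>& imag1,
                         const std::vector<double>& real2, const std::vector<double>& imag2,
                         std::vector<double>& realOut, std::vector<double>& imagOut)
{
    size_t n = real1.size();
    std::valarray<std::complex<double>> z1(n), z2(n);
    for (size_t i = 0; i < n; ++i) {
        z1[i] = std::complex<double>(real1[i], imag1[i]);
        z2[i] = std::complex<double>(real2[i], imag2[i]);
    }
    std::valarray<std::complex<double>> z3 = z1 * z2;  // element-wise multiply
    realOut.resize(n);
    imagOut.resize(n);
    for (size_t i = 0; i < n; ++i) {
        realOut[i] = z3[i].real();
        imagOut[i] = z3[i].imag();
    }
}
```

Note the packing/unpacking costs two extra passes over the data, so whether this beats the plain loop is something to measure.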
I have quite large (degree 20-40), sometimes slowly-converging floating-point polynomials. I'd like to optimize their evaluation using SIMD (SSE2, AVX1, AVX-512). I need both float-32 and double-64 solutions.
The values of the coefficients are constants given in advance, and the value of X to evaluate the poly at is given as a function argument.
Important note - I have just a single input X for my function, so I can't do vertical optimization by computing the poly for 8-16 Xs at the same time. That means I need some horizontal optimization within the evaluation for a single X.
I created a related question that helps me compute the powers of X (e.g. X^1, X^2, ..., X^8) that are needed for SIMD evaluation.
It is obvious that SIMD should be used only above some threshold of polynomial degree; for quite small polys a straight generic (non-SIMD) Horner's (or Estrin's) based method can be used, like here. Also the SIMD width (128, 256 or 512) should be chosen based on the poly degree.
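For reference, the plain scalar Horner baseline mentioned above can be sketched like this (my own sketch, not the benchmarked code; coefficients stored lowest degree first, as in the SIMD version below):

```cpp
#include <array>
#include <cstddef>

// Plain Horner's method: p(x) = a[0] + x*(a[1] + x*(a[2] + ...)).
// Coefficients are stored lowest-degree-first.
template <size_t N>
float evalPolyHorner(float x, const std::array<float, N>& a)
{
    float r = a[N - 1];
    for (size_t i = N - 1; i > 0; --i)
        r = r * x + a[i - 1];  // fold in the next lower coefficient
    return r;
}
```

Horner's form is a serial chain of fused multiply-adds, which is exactly why the SIMD variant below restructures it to process 8 coefficients per step.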
Below I implemented an AVX-256-Float32 variant using a kind of modified Horner's method suited for SIMD (multiplying by x^8 instead of x^1). Credits to @PeterCordes for the fast horizontal sum tutorial. Click on the try-it-online link; it contains larger code which also has a reference simple evaluation for comparison, plus time measurements:
Try it online!
template <size_t S, size_t I, typename MT = __m256, size_t Cnt>
inline MT EvalPoly8xF32Helper(MT xhi,
        std::array<float, Cnt> const & A, MT r = _mm256_undefined_ps()) {
    size_t constexpr K = 8;
    if constexpr(I + K >= S)
        r = _mm256_load_ps(&A[I]);
    else {
#ifdef __FMA__
        r = _mm256_fmadd_ps(r, xhi, _mm256_load_ps(&A[I]));
#else
        r = _mm256_add_ps(_mm256_mul_ps(r, xhi),
            _mm256_load_ps(&A[I]));
#endif
    }
    if constexpr(I < K)
        return r;
    else
        return EvalPoly8xF32Helper<S, I - K>(xhi, A, r);
}
inline float _mm_fast_hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
template <size_t S, size_t Cnt>
inline float EvalPoly8xF32(
        float x, std::array<float, Cnt> const & A) {
    auto constexpr K = 8;
    auto const x2 = x * x, x4 = x2 * x2, x8 = x4 * x4, x3 = x2 * x;
    auto const powsx = _mm256_setr_ps(
        1, x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3);
    auto r0 = EvalPoly8xF32Helper<S, (S - 1) / K * K>(
        _mm256_set1_ps(x8), A);
    r0 = _mm256_mul_ps(r0, powsx);
    return _mm_fast_hsum_ps(_mm_add_ps(
        _mm256_castps256_ps128(r0), _mm256_extractf128_ps(r0, 1)));
}
As one can see, the SIMD version gives quite a large speedup compared to the reference simple implementation. For the AVX1-256-float32, degree-32 case it gives around a 4.5x speedup (for degree 16 it gives a 1.8x speedup, which is also good)! Obviously, even just using FMA instructions inside the reference implementation would already improve computation speed noticeably.
My question is whether you can suggest a faster method of evaluating polynomial, or even some ready-made code or library, or any optimizations to my code.
The most common target CPU that will be used is the Intel Xeon Gold 6230, which has AVX-512, so I need to optimize the code for it.
I am upgrading some code from SSE to AVX2. In general I can see that gather instructions are quite useful and benefit performance. However I encountered a case where gather instructions are less efficient than decomposing the gather operations into simpler ones.
In the code below, I have a vector of int32 b, a vector of double xi and 4 int32 indices packed in a 128-bit register bidx. I need to gather first from vector b, then from vector xi. I.e., in pseudo-code, I need to do:
__m128i i = b[idx];
__m256d x = xi[i];
In the function below, I implement this in two ways using an #ifdef: via gather instructions, yielding a throughput of 290 Miter/sec, and via elementary operations, yielding a throughput of 325 Miter/sec.
Can somebody explain what is going on? Thanks.
inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
                   , const int32_t *b, const double *xi, int32_t* ri )
{
    __m256d x;
    __m128i i;
#if 0 // this code uses two gather instructions in sequence
    i = _mm_i32gather_epi32(b, bidx, 4);  // i = b[bidx]
    x = _mm256_i32gather_pd(xi, i, 8);    // x = xi[i]
#else // this code does not use gather instructions
    union {
        __m128i vec;
        int32_t i32[4];
    } u;
    x = _mm256_set_pd
        ( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
        , xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
        , xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
        , xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx) ])]
        );
    i = u.vec;
#endif
    // here we use x and i
    __m256 ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
    __m128 lo128 = _mm256_castps256_ps128(ps256);
    __m128 hi128 = _mm256_extractf128_ps(ps256, 1);
    __m128 blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
    __m128i lt = _mm_castps_si128(blend); // this is 0 or -1
    i = _mm_add_epi32(i, lt);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}
Since your resolve function is marked as inline, I suppose it's called in a high-frequency loop. Then you might also have a look at the dependencies of the input parameters on each other outside the resolve function. The compiler might be able to optimize the inlined code better across loop boundaries when using the scalar code variant.
I was looking at some pseudocode for boids and wrote it in C++. However, I am finding that boids will occasionally collide with each other. I thought that I had programmed it correctly, given how simple the pseudocode is. Yet, when I display the locations of all the boids, some of them have the same coordinates.
The pseudocode from the link:
PROCEDURE rule2(boid bJ)
Vector c = 0;
FOR EACH BOID b
IF b != bJ THEN
IF |b.position - bJ.position| < 100 THEN
c = c - (b.position - bJ.position)
END IF
END IF
END
RETURN c
END PROCEDURE
my code is:
std::pair <signed int, signed int> keep_distance(std::vector <Boid> & boids, Boid & boid){
    signed int dx = 0;
    signed int dy = 0;
    for(Boid & b : boids){
        if (boid != b){ // this checks an "id" number, not location
            if (b.dist(boid) < MIN_DIST){
                dx -= b.get_x() - boid.get_x();
                dy -= b.get_y() - boid.get_y();
            }
        }
    }
    return std::pair <signed int, signed int> (dx, dy);
}
with
MIN_DIST = 100;
unsigned int Boid::dist(const Boid & b){
    return (unsigned int) sqrt((b.x - x) * (b.x - x) + (b.y - y) * (b.y - y));
}
The only major difference between these two pieces of code should be that instead of the vector c, I'm using its components.
The order of functions I am using to move each boid around is:
center_of_mass(boids, new_boids[i]); // rule 1
match_velocity(boids, new_boids[i]); // rule 3
keep_within_bound(new_boids[i]);
tendency_towards_place(new_boids[i], mouse_x, mouse_y);
keep_distance(boids, new_boids[i]); // rule 2
Is there something obvious I'm not seeing? Maybe some silly vector arithmetic I did wrong?
The rule doesn't say that boids cannot collide. They just don't want to. :)
As you can see in this snippet:
FOR EACH BOID b
v1 = rule1(b)
v2 = rule2(b)
v3 = rule3(b)
b.velocity = b.velocity + v1 + v2 + v3
b.position = b.position + b.velocity
END
There is no check to make sure they don't collide. If the numbers come out unfavorably they will still collide.
That being said, getting the exact same position for multiple boids is still very unlikely; that would point to a programming error.
Later in the article he has this code:
PROCEDURE move_all_boids_to_new_positions()
Vector v1, v2, v3, ...
Integer m1, m2, m3, ...
Boid b
FOR EACH BOID b
v1 = m1 * rule1(b)
v2 = m2 * rule2(b)
v3 = m3 * rule3(b)
b.velocity = b.velocity + v1 + v2 + v3 + ...
b.position = b.position + b.velocity
END
END PROCEDURE
(Though realistically I would make m1 a double rather than an Integer.) If rule1 is the poorly named rule that makes boids attempt to avoid each other, simply increase the value of m1 and they will turn away from each other faster. Also, increasing MIN_DIST will cause them to notice that they're about to run into each other sooner, and decreasing their maximum velocity (vlim in the function limit_velocity) will allow them to react more sanely to near collisions.
As others mentioned, there's nothing that 100% guarantees collisions don't happen, but these tweaks will make collisions less likely.
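In the asker's component-based style, the weighted combination from that pseudocode might look like this (the weights and rule outputs here are illustrative, not taken from the question's code):

```cpp
#include <utility>

// Fold the weighted rule outputs (dx, dy pairs) into the boid's velocity,
// following the article's pseudocode. m1..m3 are the per-rule weights.
std::pair<double, double> combineRules(std::pair<double, double> velocity,
                                       std::pair<signed int, signed int> v1,
                                       std::pair<signed int, signed int> v2,
                                       std::pair<signed int, signed int> v3,
                                       double m1, double m2, double m3)
{
    velocity.first  += m1 * v1.first  + m2 * v2.first  + m3 * v3.first;
    velocity.second += m1 * v1.second + m2 * v2.second + m3 * v3.second;
    return velocity;
}
```

Raising the weight of the separation rule relative to the others is then a one-constant change.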
I want to compute, for k = 0 to k = 100:
A[j][k]=((A[j][k]-con*A[r][k])%2);
For that I am storing (con*A[r][k]) in some int temp[5], and then doing A[j][k]-temp[] in SIMD. What's wrong in the code below? It gives a segmentation fault at the line __m128i m5=_mm_sub_epi32(*m3,*m4);
while((k+4)<100)
{
    __m128i *m3 = (__m128i*)A[j+k];
    temp[0]=con*A[r][k];
    temp[1]=con*A[r][k+1];
    temp[2]=con*A[r][k+2];
    temp[3]=con*A[r][k+3];
    __m128i *m4 = (__m128i*)temp;
    __m128i m5 =_mm_sub_epi32(*m3,*m4);
    (temp_ptr)=(int*)&m5;
    printf("%ld,%d,%ld\n",A[j][k],con,A[r][k]);
    A[j][k]  =temp_ptr[0]%2;
    A[j][k+1]=temp_ptr[1]%2;
    A[j][k+2]=temp_ptr[2]%2;
    A[j][k+3]=temp_ptr[3]%2;
    k=k+4;
}
Most likely, you didn't take care of the alignment. SIMD instructions require 16-byte alignment (see this article). Otherwise, your program will crash.
Either it's alignment, or you have wrong indexes somewhere and are accessing the wrong memory.
Without the possible values of j, k, and r it's hard to tell why, but most likely you are over-indexing one of your arrays.
If you want to implement:
for (k = 0; k < 100; k++)
{
    A[j][k] = (A[j][k] - con * A[r][k]) % 2;
}
and you want to see some benefit from SIMD, then you need to do it all in SIMD, i.e. don't mix SIMD and scalar code.
For example (untested):
const __m128i vcon = _mm_set1_epi32(con);
const __m128i vk1 = _mm_set1_epi32(1);
for (k = 0; k < 100; k += 4)
{
    __m128i v1 = _mm_loadu_si128((__m128i const *)&A[j][k]); // load v1 from A[j][k..k+3] (misaligned)
    __m128i v2 = _mm_loadu_si128((__m128i const *)&A[r][k]); // load v2 from A[r][k..k+3] (misaligned)
    v2 = _mm_mullo_epi32(v2, vcon);            // v2 = con * A[r][k..k+3]
    v1 = _mm_sub_epi32(v1, v2);                // v1 = A[j][k..k+3] - con * A[r][k..k+3]
    v1 = _mm_and_si128(v1, vk1);               // v1 = (A[j][k..k+3] - con * A[r][k..k+3]) % 2
    _mm_storeu_si128((__m128i *)&A[j][k], v1); // store v1 back to A[j][k..k+3] (misaligned)
}
Note: if you can guarantee that each row of A is 16-byte aligned then you can change the misaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128) to aligned loads/stores (_mm_load_si128/_mm_store_si128) - this will help performance somewhat, depending on what CPU you are targeting.
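One way to get that per-row alignment in C++11 is to over-align the row type (a sketch; the Row name is illustrative, and the 100-column width is taken from the question's loop bound):

```cpp
#include <cstdint>

// Over-align each row so that &row.v[0] is always 16-byte aligned,
// making _mm_load_si128/_mm_store_si128 on the row safe.
// 100 int32s = 400 bytes, already a multiple of 16, so no padding is added.
struct alignas(16) Row {
    int32_t v[100];
};
```

An array of Row then gives every row the required alignment, since sizeof(Row) is a multiple of its alignment.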
We have a situation where we want to do a sort of weighted average of two values w1 & w2, based on how far two other values v1 & v2 are away from zero... for example:
If v1 is zero, it doesn't get weighted at all so we return w2
If v2 is zero, it doesn't get weighted at all so we return w1
If both values are equally far from zero, we do a mean average and return (w1 + w2)/2
I've inherited code like:
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    v1 = fabs(v1);
    v2 = fabs(v2);
    return (v1/(v1+v2))*w1 + (v2/(v1+v2))*w2;
}
For a bit of background, v1 & v2 represent how far two different knobs are turned, the weighting of their individual resultant effects only depends how much they are turned, not in which direction.
Clearly, this has a problem when v1==v2==0, since we end up with return (0/0)*w1 + (0/0)*w2 and you can't do 0/0. Putting a special test in for v1==v2==0 sounds horrible mathematically, even if it wasn't bad practice with floating-point numbers.
So I wondered if
there was a standard library function to handle this
there's a neater mathematical representation
You're trying to implement this mathematical function:
F(x, y) = (W1 * |x| + W2 * |y|) / (|x| + |y|)
This function is discontinuous at the point x = 0, y = 0. Unfortunately, as R. stated in a comment, the discontinuity is not removable - there is no sensible value to use at this point.
This is because the "sensible value" changes depending on the path you take to get to x = 0, y = 0. For example, consider following the path F(0, r) from r = R1 to r = 0 (this is equivalent to having the X knob at zero, and smoothly adjusting the Y knob down from R1 to 0). The value of F(x, y) will be constant at W2 until you get to the discontinuity.
Now consider following F(r, 0) (keeping the Y knob at zero and adjusting the X knob smoothly down to zero) - the output will be constant at W1 until you get to the discontinuity.
Now consider following F(r, r) (keeping both knobs at the same value, and adjusting them down simultaneously to zero). The output here will be constant at (W1 + W2) / 2 until you get to the discontinuity.
This implies that any value between W1 and W2 is equally valid as the output at x = 0, y = 0. There's no sensible way to choose between them. (And further, always choosing 0 as the output is completely wrong - the output is otherwise bounded to be on the interval W1..W2 (ie, for any path you approach the discontinuity along, the limit of F() is always within that interval), and 0 might not even lie in this interval!)
You can "fix" the problem by adjusting the function slightly - add a constant (eg 1.0) to both v1 and v2 after the fabs(). This will make it so that the minimum contribution of each knob can't be zero - just "close to zero" (the constant defines how close).
It may be tempting to define this constant as "a very small number", but that will just cause the output to change wildly as the knobs are manipulated close to their zero points, which is probably undesirable.
This is the best I could come up with quickly
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    float a1 = 0.0;
    float a2 = 0.0;
    if (v1 != 0)
    {
        a1 = v1/(v1+v2) * w1;
    }
    if (v2 != 0)
    {
        a2 = v2/(v1+v2) * w2;
    }
    return a1 + a2;
}
I don't see what would be wrong with just doing this:
float calcWeightedAverage( float v1, float v2, float w1, float w2 ) {
    static const float eps = FLT_MIN; // Or some other suitably small value.
    v1 = fabs( v1 );
    v2 = fabs( v2 );
    if( v1 + v2 < eps )
        return (w1+w2)/2.0f;
    else
        return (v1/(v1+v2))*w1 + (v2/(v1+v2))*w2;
}
Sure, no "fancy" stuff to figure out your division, but why make it harder than it has to be?
Personally I don't see anything wrong with an explicit check for divide by zero. We all do them, so it could be argued that not having it is uglier.
However, it is possible to turn off the IEEE divide-by-zero exceptions. How you do this depends on your platform. I know on Windows it has to be done process-wide, so you can inadvertently mess with other threads (and they with you) by doing it if you aren't careful.
However, if you do that, your result value will be NaN, not 0. I highly doubt that's what you want. If you are going to have to put a special check in there anyway, with different logic when you get NaN, you might as well just check for 0 in the denominator up front.
So with a weighted average, you need to look at the special case where both are zero. In that case you want to treat it as 0.5 * w1 + 0.5 * w2, right? How about this?
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    v1 = fabs(v1);
    v2 = fabs(v2);
    if (v1 == v2) {
        v1 = 0.5;
    } else {
        v1 = v1 / (v1 + v2); // v1 is between 0 and 1
    }
    v2 = 1 - v1; // avoid addition and division because they should add to 1
    return v1 * w1 + v2 * w2;
}
You could test for fabs(v1)+fabs(v2)==0 (this seems to be the fastest, given that you've already computed them), and return whatever value makes sense in this case ((w1+w2)/2?). Otherwise, keep the code as-is.
However, I suspect the algorithm itself is broken if v1==v2==0 is possible. This kind of numerical instability when the knobs are "near 0" hardly seems desirable.
If the behavior actually is right and you want to avoid special-cases, you could add the minimum positive floating point value of the given type to v1 and v2 after taking their absolute values. (Note that DBL_MIN and friends are not the correct value because they're the minimum normalized values; you need the minimum of all positive values, including subnormals.) This will have no effect unless they're already extremely small; the additions will just yield v1 and v2 in the usual case.
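A sketch of that suggestion: std::numeric_limits<float>::denorm_min() gives the smallest positive subnormal float, so adding it after fabs() guarantees a nonzero denominator while leaving ordinary-sized inputs effectively unchanged.

```cpp
#include <cmath>
#include <limits>

// Bias both magnitudes by the smallest positive (subnormal) float.
// The denominator can then never be exactly zero, and for inputs of any
// normal magnitude the bias disappears in rounding.
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    const float tiny = std::numeric_limits<float>::denorm_min();
    v1 = std::fabs(v1) + tiny;
    v2 = std::fabs(v2) + tiny;
    return (v1 / (v1 + v2)) * w1 + (v2 / (v1 + v2)) * w2;
}
```

At v1 == v2 == 0 the biases are equal, so the function falls out to the mean of w1 and w2 with no special case.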
The problem with using an explicit check for zero is that you can end up with discontinuities in behaviour unless you are careful, as outlined in caf's response (and if it's in the core of your algorithm, the if can be expensive - but don't worry about that until you measure...).
I tend to use something that just smooths out the weighting near zero instead.
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    const float eps = 1e-7f; // Or whatever you like...
    v1 = fabs(v1) + eps;
    v2 = fabs(v2) + eps;
    return (v1/(v1+v2))*w1 + (v2/(v1+v2))*w2;
}
Your function is now smooth, with no asymptotes or division by zero, and so long as one of v1 or v2 is above 1e-7 by a significant amount it will be indistinguishable from a "real" weighted average.
If the denominator is zero, how do you want it to default? You can do something like this:
static inline float divide_default(float numerator, float denominator, float default_value) {
    return (denominator == 0) ? default_value : (numerator / denominator);
}

float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    v1 = fabs(v1);
    v2 = fabs(v2);
    return w1 * divide_default(v1, v1 + v2, 0.0f) + w2 * divide_default(v2, v1 + v2, 0.0f);
}
Note that the function definition and use of static inline should really let the compiler know that it can inline.
This should work
#include <float.h>
float calcWeightedAverage(float v1, float v2, float w1, float w2)
{
    v1 = fabs(v1);
    v2 = fabs(v2);
    return (v1/(v1+v2+FLT_EPSILON))*w1 + (v2/(v1+v2+FLT_EPSILON))*w2;
}
edit:
I saw there may be problems with precision, so instead of using FLT_EPSILON, use DBL_EPSILON for more accurate results (I guess you will return a float value).
I'd do like this:
float calcWeightedAverage(double v1, double v2, double w1, double w2)
{
    v1 = fabs(v1);
    v2 = fabs(v2);
    /* if both values are equally far from 0 */
    if (fabs(v1 - v2) < 0.000000001) return (w1 + w2) / 2;
    return (v1*w1 + v2*w2) / (v1 + v2);
}