AVX2 barycentric interpolation of a vertex component - C++

I'm just starting out on the path of using SIMD intrinsics. My profiler has shown that a significant amount of time is being spent on vertex interpolation. I am targeting AVX2 and am trying to find an optimization for the following: given that I have three Vector2s that need interpolation, I imagine I should be able to load them into a single __m256 and do the multiply and add efficiently. Here is the code I am trying to convert. Is it worth doing as a 256-bit operation? The vectors are unaligned.
struct Vector2 { float x; float y; };
struct Vector3 { float x; float y; float z; };

// (assumes Vector2 has operator* and operator+= defined)
Vector2 Interpolate(Vector3 uvw, Vector2 v0, Vector2 v1, Vector2 v2)
{
    Vector2 out;
    out  = v0 * uvw.x;
    out += v1 * uvw.y;
    out += v2 * uvw.z;
    return out;
}
My question is this: how do I load three unaligned Vector2s into a single 256-bit register so I can do the multiply and add?
I am using VS2013.

I was bored, so I wrote it. Not tested, but it compiles, and both Clang and GCC make reasonable code from this.
#include <immintrin.h>

void interpolateAll(int n, float* scales, float* vin, float* vout)
{
    // preconditions:
    //   (n & 7) == 0 (not really, but vout must be padded)
    //   (scales & 31) == 0
    //   (vin & 31) == 0
    //   (vout & 31) == 0
    // vin format:
    //   float v0x[8], float v0y[8],
    //   float v1x[8], float v1y[8],
    //   float v2x[8], float v2y[8], ...
    // scales format:
    //   float scale0[8], float scale1[8], float scale2[8], ...
    // vout format:
    //   float vx[8], float vy[8], ...
    for (int i = 0; i < n; i += 8) {
        __m256 scale_0 = _mm256_load_ps(scales + i * 3);
        __m256 scale_1 = _mm256_load_ps(scales + i * 3 + 8);
        __m256 scale_2 = _mm256_load_ps(scales + i * 3 + 16);
        __m256 v0x = _mm256_load_ps(vin + i * 6);
        __m256 v0y = _mm256_load_ps(vin + i * 6 + 8);
        __m256 v1x = _mm256_load_ps(vin + i * 6 + 16);
        __m256 v1y = _mm256_load_ps(vin + i * 6 + 24);
        __m256 v2x = _mm256_load_ps(vin + i * 6 + 32);
        __m256 v2y = _mm256_load_ps(vin + i * 6 + 40);
        __m256 x = _mm256_mul_ps(scale_0, v0x);
        __m256 y = _mm256_mul_ps(scale_0, v0y);
        x = _mm256_fmadd_ps(scale_1, v1x, x);
        y = _mm256_fmadd_ps(scale_1, v1y, y);
        x = _mm256_fmadd_ps(scale_2, v2x, x);
        y = _mm256_fmadd_ps(scale_2, v2y, y);
        _mm256_store_ps(vout + i * 2, x);
        _mm256_store_ps(vout + i * 2 + 8, y);
    }
}
Uses Z boson's format, if I understood him correctly. In any case it's a nice format from a SIMD perspective, if slightly inconvenient from a C++ perspective.
The FMAs do serialize the multiplies unnecessarily, but that shouldn't matter since they're not part of a loop-carried dependency.
The predicted throughput of this (assuming a small enough array) is 2 iterations per 9 cycles, bottlenecked by the loads. In practice it's probably slightly worse; there was some talk about simple stores occasionally stealing port 2 or port 3, that sort of thing, I'm not really sure. Anyway, that's enough time for 18 "FMAs", but there are only 12 (8 FMAs and 4 mulps), so it may be useful to move some extra computation in here if there is any.
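If you do want the literal thing the question asks for (three unaligned Vector2s in one __m256 for a single interpolation), here is a minimal sketch (untested, assuming the Vector2/Vector3 structs from the question). The scalar inserts and shuffles eat most of the benefit, which is why the SoA layout above is the better plan when this dominates the profile:

#include <immintrin.h>

Vector2 InterpolateAVX(Vector3 uvw, Vector2 v0, Vector2 v1, Vector2 v2)
{
    // lanes: [v0.x v0.y v1.x v1.y | v2.x v2.y 0 0]
    __m128 lo = _mm_setr_ps(v0.x, v0.y, v1.x, v1.y);
    __m128 hi = _mm_setr_ps(v2.x, v2.y, 0.0f, 0.0f);
    __m256 v  = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    // matching weights: [wx wx wy wy | wz wz 0 0]
    __m256 w  = _mm256_setr_ps(uvw.x, uvw.x, uvw.y, uvw.y,
                               uvw.z, uvw.z, 0.0f, 0.0f);
    __m256 p  = _mm256_mul_ps(v, w);
    // add the high lane onto the low lane, then fold the remaining two pairs
    __m128 s  = _mm_add_ps(_mm256_castps256_ps128(p),
                           _mm256_extractf128_ps(p, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s)); // now [out.x, out.y, ...]
    Vector2 out;
    _mm_store_ss(&out.x, s);
    _mm_store_ss(&out.y, _mm_shuffle_ps(s, s, 1));
    return out;
}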


Efficient floating point scaling in C++

I'm working on my fast (and accurate) sin implementation in C++, and I have a problem regarding efficient scaling of the angle into the ±pi/2 range.
My sin function for ±pi/2 using a Taylor series is the following
(note: FLOAT is a macro that expands to float or double, just for the benchmark):
/**
 * Sin for 'small' angles, accurate on [-pi/2, pi/2], fairly accurate on [-pi, pi]
 */
// To switch between float and double
#define FLOAT float

FLOAT
my_sin_small(FLOAT x)
{
    constexpr FLOAT C1 = 1. / (7. * 6. * 5. * 4. * 3. * 2.);
    constexpr FLOAT C2 = -1. / (5. * 4. * 3. * 2.);
    constexpr FLOAT C3 = 1. / (3. * 2.);
    constexpr FLOAT C4 = -1.;
    // Correction for sin(pi/2) = 1, due to the ignored Taylor terms
    constexpr FLOAT corr = -1. / 0.9998431013994987;
    const FLOAT x2 = x * x;
    return corr * x * (x2 * (x2 * (x2 * C1 + C2) + C3) + C4);
}
So far so good... The problem comes when I try to scale an arbitrary angle into the ±pi/2 range. My current solution is:
FLOAT
my_sin(FLOAT x)
{
    constexpr FLOAT pi = 3.141592653589793238462;
    constexpr FLOAT rpi = 1 / pi;
    // convert to the +-pi/2 range (needs <cmath> for std::nearbyint)
    int n = std::nearbyint(x * rpi);
    // (2 * (n & 1) - 1) is a sign correction (see below)
    FLOAT xbar = (n * pi - x) * (2 * (n & 1) - 1);
    return my_sin_small(xbar);
}
I made a benchmark, and I'm losing a lot of time on the ±pi/2 scaling.
Tricking with int(angle/pi + 0.5) is a no-go, since it is limited to int precision, and it also requires +/- branching, which I try to avoid...
What should I try to improve the performance of this scaling? I'm out of ideas.
[Image omitted: benchmark results for float. In the benchmark the angle could be outside the validity range of my_sin_small, but for the bench I don't care about that.]
[Image omitted: benchmark results for double.]
[Image omitted: sign correction for xbar in my_sin().]
[Image omitted: algorithm accuracy compared to Python's sin() function.]
Candidate improvements
Convert the radians x to rotations by dividing by 2*pi.
Retain only the fraction, so we have an angle in (-1.0 ... 1.0). This simplifies the OP's modulo step to a simple "drop the whole number" step. Going forward, using different angle units simply involves a change of coefficient set; there is no need to scale back to radians.
For positive values, subtract 0.5 so we have (-0.5 ... 0.5), and then flip the sign. This centers the possible values about 0.0 and makes for better convergence of the approximating polynomial compared to the sine function. For negative values, see below.
Call my_sin_small1(), which uses this (-0.5 ... 0.5) rotations range rather than [-pi ... +pi] radians.
In my_sin_small1(), fold the constants together to drop the corr * step.
Rather than the truncated Taylor series, use a more optimal coefficient set. IMO, this will provide better answers, especially near +/-pi.
Notes: no int to/from float code. With more analysis, it is possible to get a better set of coefficients that brings my_sin(+/-pi) closer to 0.0. This is just a quick set of code to demo fewer FP steps and good potential results.
C-like code for the OP to port to C++:
FLOAT my_sin_small1(FLOAT x) {
    static const FLOAT A1 = -5.64744881E+01;
    static const FLOAT A2 = +7.81017968E+01;
    static const FLOAT A3 = -4.11145353E+01;
    static const FLOAT A4 = +6.27923581E+00;
    const FLOAT x2 = x * x;
    return x * (x2 * (x2 * (x2 * A1 + A2) + A3) + A4);
}

FLOAT my_sin1(FLOAT x) {
    static const FLOAT pi = 3.141592653589793238462;
    static const FLOAT pi2i = 1 / (pi * 2);
    x *= pi2i;                                  // radians to rotations
    FLOAT xfraction = 0.5f - (x - truncf(x));   // fold into (-0.5 ... 0.5]
    return my_sin_small1(xfraction);
}
For negative values, use -my_sin1(-x) or similar code to flip the sign, or add 0.5 instead of subtracting it in the step above.
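For instance, a minimal sign-handling wrapper (my own naming, untested):

FLOAT my_sin1_signed(FLOAT x) {
    /* my_sin1() as written expects x >= 0; flip the sign for negatives */
    return x < 0 ? -my_sin1(-x) : my_sin1(x);
}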
Test
#include <math.h>
#include <stdio.h>

// assumes the FLOAT macro and my_sin1() from above
int main(void) {
    for (int d = 0; d <= 360; d += 20) {
        FLOAT x = d / 180.0 * M_PI;
        FLOAT y = my_sin1(x);
        printf("%12.6f %11.8f %11.8f\n", x, sin(x), y);
    }
}
Output
0.000000 0.00000000 -0.00022483
0.349066 0.34202013 0.34221691
0.698132 0.64278759 0.64255589
1.047198 0.86602542 0.86590189
1.396263 0.98480775 0.98496443
1.745329 0.98480775 0.98501128
2.094395 0.86602537 0.86603642
2.443461 0.64278762 0.64260530
2.792527 0.34202022 0.34183803
3.141593 -0.00000009 0.00000000
3.490659 -0.34202016 -0.34183764
3.839724 -0.64278757 -0.64260519
4.188790 -0.86602546 -0.86603653
4.537856 -0.98480776 -0.98501128
4.886922 -0.98480776 -0.98496443
5.235988 -0.86602545 -0.86590189
5.585053 -0.64278773 -0.64255613
5.934119 -0.34202036 -0.34221727
6.283185 0.00000017 -0.00022483
The alternate code below makes for better results near 0.0, yet might cost a tad more time. The OP seems more inclined toward speed.
FLOAT xfraction = 0.5f - (x - truncf(x));
// vs.
FLOAT xfraction = x - truncf(x);
if (xfraction >= 0.5f) xfraction -= 1.0f;
[Edit]
Below is a better coefficient set, with about 10% reduced error:
static const FLOAT A1 = -56.0833765f;
static const FLOAT A2 = +77.92947047f;
static const FLOAT A3 = -41.0936875f;
static const FLOAT A4 = +6.278635918f;
Yet another approach:
Spend more time (code) to reduce the range to ±pi/4 (±45 degrees); then it is possible to use only 3 or 2 terms of a polynomial that is like the usual Taylor series.
float sin_quick_small(float x) {
    const float x2 = x * x;
#if 0
    // max error about 7e-7
    static const float A2 = +0.00811656036940792f;
    static const float A3 = -0.166597759850666f;
    static const float A4 = +0.999994132743861f;
    return x * (x2 * (x2 * A2 + A3) + A4);
#else
    // max error about 0.00016
    static const float A3 = -0.160343346851626f;
    static const float A4 = +0.999031566686144f;
    return x * (x2 * A3 + A4);
#endif
}
float cos_quick_small(float x) {
    return cosf(x); // TBD code.
}

float sin_quick(float x) {
    if (x < 0.0) {
        return -sin_quick(-x);
    }
    int quo;
    float x90 = remquof(fabsf(x), 3.141592653589793238462f / 2, &quo);
    switch (quo % 4) {
        case 0:
            return sin_quick_small(x90);
        case 1:
            return cos_quick_small(x90);
        case 2:
            return sin_quick_small(-x90);
        case 3:
            return -cos_quick_small(x90);
    }
    return 0.0;
}
int main() {
    float max_x = 0.0;
    float max_error = 0.0;
    for (int d = -45; d <= 45; d += 1) {
        FLOAT x = d / 180.0 * M_PI;
        FLOAT y = sin_quick(x);
        double err = fabs(y - sin(x));
        if (err > max_error) {
            max_x = x;
            max_error = err;
        }
        printf("%12.6f %11.8f %11.8f err:%11.8f\n", x, sin(x), y, err);
    }
    printf("x:%.6f err:%.6f\n", max_x, max_error);
    return 0;
}

Eigen: Why is Map slower than Vector3d for this template expression?

I have a cloud of points in a std::vector<double> in an x, y, z pattern, and a std::vector<int> of indices where each triplet of consecutive integers is the connectivity of a face. Basically, a simple triangular mesh data structure.
I have to compute the areas of all the faces and I am benchmarking several methods:
I can wrap chunks of data in an Eigen::Map<const Eigen::Vector3d> like this:
static void face_areas_eigenmap(const std::vector<double>& V,
                                const std::vector<int>& F,
                                std::vector<double>& FA) {
  // Number of faces is size / 3.
  for (std::size_t f = 0; f < F.size() / 3; ++f) {
    // Get vertex indices of face f.
    auto v0 = F[f * 3];
    auto v1 = F[f * 3 + 1];
    auto v2 = F[f * 3 + 2];
    // View memory at each vertex position as a vector.
    Eigen::Map<const Eigen::Vector3d> x0{&V[v0 * 3]};
    Eigen::Map<const Eigen::Vector3d> x1{&V[v1 * 3]};
    Eigen::Map<const Eigen::Vector3d> x2{&V[v2 * 3]};
    // Compute and store face area.
    FA[f] = 0.5 * (x1 - x0).cross(x2 - x0).norm();
  }
}
Or I can choose to create Eigen::Vector3d like this:
static void face_areas_eigenvec(const std::vector<double>& V,
                                const std::vector<int>& F,
                                std::vector<double>& FA) {
  for (std::size_t f = 0; f < F.size() / 3; ++f) {
    auto v0 = F[f * 3];
    auto v1 = F[f * 3 + 1];
    auto v2 = F[f * 3 + 2];
    // This is the only change: swap Map for Vector3d.
    Eigen::Vector3d x0{&V[v0 * 3]};
    Eigen::Vector3d x1{&V[v1 * 3]};
    Eigen::Vector3d x2{&V[v2 * 3]};
    FA[f] = 0.5 * (x1 - x0).cross(x2 - x0).norm();
  }
}
Finally I am also considering the hardcoded version with the explicit cross product and norm:
static void face_areas_ptr(const std::vector<double>& V,
                           const std::vector<int>& F,
                           std::vector<double>& FA) {
  for (std::size_t f = 0; f < F.size() / 3; ++f) {
    const auto* x0 = &V[F[f * 3] * 3];
    const auto* x1 = &V[F[f * 3 + 1] * 3];
    const auto* x2 = &V[F[f * 3 + 2] * 3];
    std::array<double, 3> s0{x1[0] - x0[0], x1[1] - x0[1], x1[2] - x0[2]};
    std::array<double, 3> s1{x2[0] - x0[0], x2[1] - x0[1], x2[2] - x0[2]};
    std::array<double, 3> c{s0[1] * s1[2] - s0[2] * s1[1],
                            s0[2] * s1[0] - s0[0] * s1[2],
                            s0[0] * s1[1] - s0[1] * s1[0]};
    FA[f] = 0.5 * std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
  }
}
I have benchmarked these methods, and the version using Eigen::Map is always the slowest despite doing exactly the same thing as the one using Eigen::Vector3d. I was expecting no difference in performance, since a map is basically a pointer.
-----------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
-----------------------------------------------------------------
BM_face_areas_eigenvec    59757936 ns     59758018 ns           11
BM_face_areas_ptr         58305018 ns     58304436 ns           11
BM_face_areas_eigenmap    62356850 ns     62354710 ns           10
I have tried replacing the Eigen template expression in the map version with the same code as in the pointer version:
std::array<double, 3> s0{x1[0] - x0[0], x1[1] - x0[1], x1[2] - x0[2]};
std::array<double, 3> s1{x2[0] - x0[0], x2[1] - x0[1], x2[2] - x0[2]};
std::array<double, 3> c{s0[1] * s1[2] - s0[2] * s1[1],
                        s0[2] * s1[0] - s0[0] * s1[2],
                        s0[0] * s1[1] - s0[1] * s1[0]};
FA[f] = 0.5 * std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
And magically the timings become comparable:
-----------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
-----------------------------------------------------------------
BM_face_areas_array       58967864 ns     58967891 ns           11
BM_face_areas_ptr         60034545 ns     60034682 ns           11
BM_face_areas_eigenmap    60382482 ns     60382027 ns           11
Is there something wrong with Eigen::Map in Eigen expressions to be aware of?
Looking at the compiler output, it seems that the second version makes the compiler emit fewer memory loads by aggregating some of them into vector loads.
https://godbolt.org/z/qs38P41eh
Eigen's code for cross() does not contain any explicit vectorization; it depends on the compiler doing a good job with it. And because you call cross() on expressions (the subtractions), the compiler gives up a little too soon. Basically, it is the compiler's fault for not finding the same optimization.
Your third version performs the same as the second because the compiler recognizes the subtractions (the creation of s0 and s1) as something it can vectorize, resulting in equivalent code. You can achieve the same with Eigen if you do it like this:
Eigen::Map<const Eigen::Vector3d> x0{&V[v0 * 3]};
Eigen::Map<const Eigen::Vector3d> x1{&V[v1 * 3]};
Eigen::Map<const Eigen::Vector3d> x2{&V[v2 * 3]};
Eigen::Vector3d s0 = x1 - x0;
Eigen::Vector3d s1 = x2 - x0;
// Compute and store face area.
FA[f] = 0.5 * s0.cross(s1).norm();
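Alternatively (an untested sketch), you should be able to keep the one-liner and just force the subtraction expressions to materialize with .eval():

Eigen::Map<const Eigen::Vector3d> x0{&V[v0 * 3]};
Eigen::Map<const Eigen::Vector3d> x1{&V[v1 * 3]};
Eigen::Map<const Eigen::Vector3d> x2{&V[v2 * 3]};
// .eval() turns each subtraction into a concrete Vector3d before cross()
FA[f] = 0.5 * (x1 - x0).eval().cross((x2 - x0).eval()).norm();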

B-spline curve with graphics.h - C++

I am trying to draw a curve with a B-spline. I did my research on what a B-spline is and how to use it in a program. After all that, I finally found some code on Stack Overflow that does this. I made some changes to it and tried to use it in my program. It works, but I have two problems with it.
Firstly, the curve has the right shape but not the right position. It's something like 20-40 pixels away from where it should be.
Secondly, in the last part of my function I divide the x and y results by a number, but that divisor seems to need to change depending on the circumstances.
And finally, it works for 6 coordinates, as you can see.
How can I tie the divisor to the number of coordinates, and fix the offset of the spline?
PS: I need to write this code in C.
Here are my code's functions:
1. This is my B-spline calculation function:
void BSplineCurve(const Dot& point1,
                  const Dot& point2,
                  const Dot& point3,
                  const Dot& point4,
                  Dot& result,
                  const double t)
{
    const double t2 = t * t;
    const double t3 = t2 * t;
    const double mt = 1.0 - t;
    const double mt3 = mt * mt * mt;
    const double bi3 = mt3;
    const double bi2 = 3 * t3 - 6 * t2 + 4;
    const double bi1 = -3 * t3 + 3 * t2 + 3 * t + 1;
    const double bi = t3;
    result.x = point1.x * bi3 + point2.x * bi2 + point3.x * bi1 + point4.x * bi;
    result.x /= 4;
    result.y = point1.y * bi3 + point2.y * bi2 + point3.y * bi1 + point4.y * bi;
    result.y /= 4;
}
2. This is my draw function:
Dot points[6] = {ControlPoint1, ControlPoint2, ControlPoint3,
                 ControlPoint4, ControlPoint5, ControlPoint6};
for (double t = 5.9999; t > 2.0; t -= 0.001)
{
    const int start = static_cast<int>(t) + 1;
    BSplineCurve(points[start - 3],
                 points[start - 2],
                 points[start - 1],
                 points[start],
                 DrawCurve,
                 start - t);
    Draw1Dot(DrawCurve, points[0], distanceToEdges);
}
3. And finally, my draw-pixel function:
// Koor = coordinate, mesafe = distance/offset, ortala = centering amount
void Draw1Dot(Dot Koor, Dot mesafe, int ortala)
{
    putpixel(mesafe.x + Koor.x + ortala, mesafe.y + Koor.y + ortala, 3);
}
Can you help me understand what I'm doing wrong?
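For what it's worth, two likely culprits. First, the uniform cubic B-spline basis is normalized by 6, not 4: the four weights bi3, bi2, bi1, bi sum to exactly 6 for every t, so dividing by 4 scales the whole curve by 6/4, which shows up as a position error that no divisor tied to the point count will fix. A sketch of the corrected tail of BSplineCurve():

/* The basis weights sum to 6 for any t, so normalize by 6 (not 4). */
result.x = point1.x * bi3 + point2.x * bi2 + point3.x * bi1 + point4.x * bi;
result.x /= 6;
result.y = point1.y * bi3 + point2.y * bi2 + point3.y * bi1 + point4.y * bi;
result.y /= 6;

Second, 6 control points give only three cubic segments, so the loop should start near t = 4.9999; at t = 5.9999, start becomes 6 and points[start] reads one past the end of the array.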

How to speed up bilinear interpolation of image?

I'm trying to rotate an image with interpolation, but it's too slow for real time with big images.
The code is something like this:
for (int y = 0; y < dst_h; ++y)
{
    for (int x = 0; x < dst_w; ++x)
    {
        // do inverse transform
        fPoint pt(Transform(Point(x, y)));
        // in coordinates of src
        int x1 = (int)floor(pt.x);
        int y1 = (int)floor(pt.y);
        int x2 = x1 + 1;
        int y2 = y1 + 1;
        if ((x1 >= 0 && x1 < src_w && y1 >= 0 && y1 < src_h) &&
            (x2 >= 0 && x2 < src_w && y2 >= 0 && y2 < src_h))
        {
            Mask[y][x] = 1; // show pixel
            float dx1 = pt.x - x1;
            float dx2 = 1 - dx1;
            float dy1 = pt.y - y1;
            float dy2 = 1 - dy1;
            // bilinear
            pd[x].blue  = (dy2 * (ps[y1 * src_w + x1].blue  * dx2 + ps[y1 * src_w + x2].blue  * dx1) +
                           dy1 * (ps[y2 * src_w + x1].blue  * dx2 + ps[y2 * src_w + x2].blue  * dx1));
            pd[x].green = (dy2 * (ps[y1 * src_w + x1].green * dx2 + ps[y1 * src_w + x2].green * dx1) +
                           dy1 * (ps[y2 * src_w + x1].green * dx2 + ps[y2 * src_w + x2].green * dx1));
            pd[x].red   = (dy2 * (ps[y1 * src_w + x1].red   * dx2 + ps[y1 * src_w + x2].red   * dx1) +
                           dy1 * (ps[y2 * src_w + x1].red   * dx2 + ps[y2 * src_w + x2].red   * dx1));
            // nearest neighbour
            // pd[x] = ps[((int)pt.y) * src_w + (int)pt.x];
        }
        else
            Mask[y][x] = 0; // transparent pixel
    }
    pd += dst_w;
}
How can I speed up this code? I tried to parallelize it, but there seems to be no speedup, possibly because of the memory access pattern(?).
The key is to do most of your computations as ints. The only thing that is necessary to do as a float is the weighting. See here for a good resource.
From that same resource:
int px = (int)x; // floor of x
int py = (int)y; // floor of y
const int stride = img->width;
const Pixel* p0 = img->data + px + py * stride; // pointer to first pixel
// load the four neighboring pixels
const Pixel& p1 = p0[0 + 0 * stride];
const Pixel& p2 = p0[1 + 0 * stride];
const Pixel& p3 = p0[0 + 1 * stride];
const Pixel& p4 = p0[1 + 1 * stride];
// Calculate the weights for each pixel
float fx = x - px;
float fy = y - py;
float fx1 = 1.0f - fx;
float fy1 = 1.0f - fy;
int w1 = fx1 * fy1 * 256.0f;
int w2 = fx * fy1 * 256.0f;
int w3 = fx1 * fy * 256.0f;
int w4 = fx * fy * 256.0f;
// Calculate the weighted sum of pixels (for each color channel)
int outr = p1.r * w1 + p2.r * w2 + p3.r * w3 + p4.r * w4;
int outg = p1.g * w1 + p2.g * w2 + p3.g * w3 + p4.g * w4;
int outb = p1.b * w1 + p2.b * w2 + p3.b * w3 + p4.b * w4;
int outa = p1.a * w1 + p2.a * w2 + p3.a * w3 + p4.a * w4;
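One detail worth noting: the integer weights sum to roughly 256, so the accumulated channel sums still carry that fixed-point scale and need a final shift. A sketch, assuming a destination Pixel named out (not part of the original snippet):

// divide by 256 (the weight scale) to bring each channel back into range
out.r = outr >> 8;
out.g = outg >> 8;
out.b = outb >> 8;
out.a = outa >> 8;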
Wow, you are doing a lot inside the innermost loop, for example:
1. float-to-int conversions. You could do it all in floats; they are pretty fast these days, and the conversion is what is killing you. Also, you are mixing floats and ints together (if I see it right), which costs the same conversions.
2. Transform(x, y). Any unnecessary call adds overhead and slows things down. Instead, add two variables xx, yy and interpolate them inside your for loops (see the sketch below).
3. The if. Why the heck are you adding an if? Limit the for ranges before the loop, not inside it; the background can be filled with other loops before or afterwards.
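To illustrate point 2: the inverse of a rotation is affine, so the source coordinates move by a constant step per destination pixel, and the per-pixel Transform() call can be replaced by two additions. A minimal sketch (untested, assuming the types from the question):

for (int y = 0; y < dst_h; ++y)
{
    // one Transform call per row start; the per-pixel step is constant
    // across the row for an affine (e.g. rotation) mapping
    fPoint pt = Transform(Point(0, y));
    fPoint nx = Transform(Point(1, y));
    float step_x = nx.x - pt.x;
    float step_y = nx.y - pt.y;
    for (int x = 0; x < dst_w; ++x)
    {
        // ... bilinear sample at (pt.x, pt.y) exactly as before ...
        pt.x += step_x;
        pt.y += step_y;
    }
    pd += dst_w;
}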

Rotate a vector about another vector

I am writing a 3d vector class for OpenGL. How do I rotate a vector v1 about another vector v2 by an angle A?
You may find quaternions to be a more elegant and efficient solution.
After seeing this answer bumped recently, I thought I'd provide a more robust answer: one that can be used without necessarily understanding the full mathematical implications of quaternions. I'm going to assume (given the C++ tag) that you have something like a Vector3 class with 'obvious' functions like inner, cross, and *= scalar operators, etc...
#include <cfloat>
#include <cmath>
...
void make_quat (float quat[4], const Vector3 & v2, float angle)
{
    // BTW: there's no reason you can't use 'doubles' for angle, etc.
    // there's not much point in applying a rotation outside of [-PI, +PI],
    // as that covers the practical 2.PI range.
    // any time graphics / floating point overlap, we have to think hard
    // about degenerate cases that can arise quite naturally (think of
    // pathological cancellation errors that are *possible* in seemingly
    // benign operations like inner products - and other running sums).
    Vector3 axis (v2);
    float rl = sqrt(inner(axis, axis));
    if (rl < FLT_EPSILON) // we'll handle this as no rotation:
    {
        quat[0] = 0.0, quat[1] = 0.0, quat[2] = 0.0, quat[3] = 1.0;
        return; // the 'identity' unit quaternion.
    }
    float ca = cos(angle);
    // we know a maths library is never going to yield a value outside
    // of [-1.0, +1.0] right? Well, maybe we're using something else -
    // like an approximating polynomial, or a faster hack that's a little
    // rough 'around the edge' cases? let's *ensure* a clamped range:
    ca = (ca < -1.0f) ? -1.0f : ((ca > +1.0f) ? +1.0f : ca);
    // now we find cos / sin of a half-angle. we can use a faster identity
    // for this, secure in the knowledge that 'sqrt' will be valid....
    // (caveat: deriving sq from ca alone loses the sign of sin(angle / 2),
    // so this effectively treats a negative angle as a positive one.)
    float cq = sqrt((1.0f + ca) / 2.0f); // cos(acos(ca) / 2.0);
    float sq = sqrt((1.0f - ca) / 2.0f); // sin(acos(ca) / 2.0);
    axis *= sq / rl; // i.e., scaling each element, and finally:
    quat[0] = axis[0], quat[1] = axis[1], quat[2] = axis[2], quat[3] = cq;
}
Thus float quat[4] holds a unit quaternion representing the axis and angle of rotation, given the original arguments (v2, A).
Here's a routine for quaternion multiplication. SSE/SIMD can probably speed this up, but complicated transform & lighting are typically GPU-driven in most scenarios. If you remember complex number multiplication as a little weird, quaternion multiplication is more so. Complex number multiplication is a commutative operation: a*b = b*a. Quaternions don't even preserve this property, i.e., q*p != p*q :
static inline void
qmul (float r[4], const float q[4], const float p[4])
{
    // quaternion multiplication: r = q * p
    float w0 = q[3], w1 = p[3];
    float x0 = q[0], x1 = p[0];
    float y0 = q[1], y1 = p[1];
    float z0 = q[2], z1 = p[2];
    r[3] = w0 * w1 - x0 * x1 - y0 * y1 - z0 * z1;
    r[0] = w0 * x1 + x0 * w1 + y0 * z1 - z0 * y1;
    r[1] = w0 * y1 + y0 * w1 + z0 * x1 - x0 * z1;
    r[2] = w0 * z1 + z0 * w1 + x0 * y1 - y0 * x1;
}
Finally, rotating a 3D 'vector' v (or, if you prefer, the 'point' v that the question has named v1), using the quaternion float q[4], has a somewhat strange formula: v' = q * v * conjugate(q). Quaternions have conjugates, similar to complex numbers. Here's the routine:
static inline void
qrot (float v[3], const float q[4])
{
    // 3D vector rotation: v = q * v * conj(q)
    float r[4], p[4];
    r[0] = + v[0], r[1] = + v[1], r[2] = + v[2], r[3] = +0.0;
    qmul(r, q, r);
    p[0] = - q[0], p[1] = - q[1], p[2] = - q[2], p[3] = q[3];
    qmul(r, r, p);
    v[0] = r[0], v[1] = r[1], v[2] = r[2];
}
Putting it all together. Obviously you can make use of the static keyword where appropriate. Modern optimising compilers may ignore the inline hint depending on their own code generation heuristics. But let's just concentrate on correctness for now:
How do I rotate a vector v1 about another vector v2 by an angle A?
Assuming some sort of Vector3 class, and (A) in radians, we want the quaternion representing the rotation by the angle (A) about the axis v2, and we want to apply that quaternion rotation to v1 for the result:
float q[4]; // we want to find the unit quaternion for `v2` and `A`...
make_quat(q, v2, A);
// what about `v1`? can we access elements with `operator [] (int)` (?)
// if so, let's assume the memory: `v1[0] .. v1[2]` is contiguous.
// you can figure out how you want to store and manage your Vector3 class.
qrot(& v1[0], q);
// `v1` has been rotated by `(A)` radians about the direction vector `v2` ...
Is this the sort of thing that folks would like to see expanded upon in the Beta Documentation site? I'm not altogether clear on its requirements, expected rigour, etc.
This may prove useful:
// note: assumes v2 is a unit axis
double c = cos(A);
double s = sin(A);
double C = 1.0 - c;
double Q[3][3];
Q[0][0] = v2[0] * v2[0] * C + c;
Q[0][1] = v2[1] * v2[0] * C + v2[2] * s;
Q[0][2] = v2[2] * v2[0] * C - v2[1] * s;
Q[1][0] = v2[1] * v2[0] * C - v2[2] * s;
Q[1][1] = v2[1] * v2[1] * C + c;
Q[1][2] = v2[2] * v2[1] * C + v2[0] * s;
Q[2][0] = v2[0] * v2[2] * C + v2[1] * s;
Q[2][1] = v2[2] * v2[1] * C - v2[0] * s;
Q[2][2] = v2[2] * v2[2] * C + c;
// apply Q to v1, using temporaries so no component is overwritten
// while it is still needed
double r0 = v1[0] * Q[0][0] + v1[1] * Q[0][1] + v1[2] * Q[0][2];
double r1 = v1[0] * Q[1][0] + v1[1] * Q[1][1] + v1[2] * Q[1][2];
double r2 = v1[0] * Q[2][0] + v1[1] * Q[2][1] + v1[2] * Q[2][2];
v1[0] = r0;
v1[1] = r1;
v1[2] = r2;
Use a 3D rotation matrix.
The easiest-to-understand way would be to rotate the coordinate system so that vector v2 aligns with the Z axis, then rotate by A around the Z axis, and then rotate back so that the Z axis aligns with v2 again.
When you have written down the rotation matrices for the three operations, you'll probably notice that you apply three matrices after each other. To reach the same effect, you can multiply the three matrices.
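In sketch form (notation mine): if T is the rotation that takes v2 onto the Z axis, the combined matrix is R = T^-1 * Rz(A) * T, and the rotated vector is v1' = R * v1.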
I found this here:
http://steve.hollasch.net/cgindex/math/rotvec.html
let
      [v] = [vx, vy, vz]      be the vector to be rotated,
      [l] = [lx, ly, lz]      the axis to rotate around,

            | 1  0  0 |
      [i] = | 0  1  0 |       the identity matrix,
            | 0  0  1 |

            |   0   lz  -ly |
      [L] = | -lz    0   lx |
            |  ly  -lx    0 |

      d = sqrt(lx*lx + ly*ly + lz*lz),
      a = the angle of rotation.

Then the matrix operations give the rotated vector:

      [v'] = [v] x {[i] + sin(a)/d * [L] + ((1 - cos(a))/(d*d)) * ([L] x [L])}
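Expanded into plain vector form, as a self-contained sketch of my own (untested): with unit axis k = [l]/d, the formula above reduces to the Rodrigues identity v' = v*cos(a) + (k x v)*sin(a) + k*(k . v)*(1 - cos(a)):

#include <cmath>

struct Vec3 { double x, y, z; }; // stand-in type for this sketch

Vec3 rotate(Vec3 v, Vec3 l, double a)
{
    double d = std::sqrt(l.x * l.x + l.y * l.y + l.z * l.z);
    Vec3 k = { l.x / d, l.y / d, l.z / d };          // unit axis
    double c = std::cos(a), s = std::sin(a);
    double kv = k.x * v.x + k.y * v.y + k.z * v.z;   // k . v
    Vec3 cr = { k.y * v.z - k.z * v.y,               // k x v
                k.z * v.x - k.x * v.z,
                k.x * v.y - k.y * v.x };
    return { v.x * c + cr.x * s + k.x * kv * (1 - c),
             v.y * c + cr.y * s + k.y * kv * (1 - c),
             v.z * c + cr.z * s + k.z * kv * (1 - c) };
}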
I wrote my own Matrix3 class and Vector3Library that implement this vector rotation. It works absolutely perfectly. I use it to avoid drawing models outside the field of view of the camera.
I suppose this is the "use a 3D rotation matrix" approach. I took a quick look at quaternions, but I've never used them, so I stuck to something I could wrap my head around.