Does anyone know of any C++/OpenGL source code demos for 2D rigid body physics using Runge-Kutta?
I want to build a physics engine but I need some reference code to understand better how others have implemented this.
There are a lot of things you have to take care of to do this nicely. I will focus on the integrator implementation and what I have found works well for me.
For all the degrees of freedom in your system implement a function to return the accelerations a as a function of time t, positions x and velocities v. This should operate on arrays or vectors of quantities and not just scalars.
a = accel(t,x,v);
After each RK step evaluate the acceleration to be ready for the next step. In the loop then do this:
// assume t,x[],v[], a[] are known
// step time t -> t+h and calc new values
float h2=h/2;
vec q1 = v + h2*a;
vec k1 = accel(t+h2, x+h2*v, q1);
vec q2 = v + h2*k1;
vec k2 = accel(t+h2, x+h2*q1, q2);
vec q3 = v + h*k2;
vec k3 = accel(t+h, x+h*q2, q3);
float h6 = h/6;
t = t + h;
x = x + h*(v+h6*(a+k1+k2));
v = v + h6*(a+2*k1+2*k2+k3);
a = accel(t,x,v);
Why? The standard RK method requires you to build a state vector of 2N elements, but the derivatives of the first N elements are equal to the last N elements. If you split the problem into two N-element state vectors and simplify a little, you arrive at the above RK scheme for second-order equations.
I have done this and the results are identical to commercial software for a planar system with N=6 degrees of freedom.
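For reference, here is a minimal self-contained sketch of the scheme above, checked against x'' = -x. The names Vec, Accel, State, axpy, rk4_step_split and integrate_oscillator are made up for this illustration, and the harmonic oscillator is only an assumed test problem, not part of the original setup.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
// Hypothetical signature: accelerations as a function of time, positions, velocities.
using Accel = std::function<Vec(double, const Vec&, const Vec&)>;

// Returns y + s*z element-wise.
Vec axpy(const Vec& y, double s, const Vec& z) {
    assert(y.size() == z.size());
    Vec r(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) r[i] = y[i] + s * z[i];
    return r;
}

struct State { double t; Vec x, v, a; };

// One RK4 step of size h using the split position/velocity form.
// s.a must hold accel(s.t, s.x, s.v) on entry; it is refreshed on exit.
void rk4_step_split(State& s, double h, const Accel& accel) {
    const double h2 = h / 2, h6 = h / 6;
    Vec q1 = axpy(s.v, h2, s.a);
    Vec k1 = accel(s.t + h2, axpy(s.x, h2, s.v), q1);
    Vec q2 = axpy(s.v, h2, k1);
    Vec k2 = accel(s.t + h2, axpy(s.x, h2, q1), q2);
    Vec q3 = axpy(s.v, h, k2);
    Vec k3 = accel(s.t + h, axpy(s.x, h, q2), q3);
    for (std::size_t i = 0; i < s.x.size(); ++i) {
        // Position first: it needs the old velocity.
        s.x[i] += h * (s.v[i] + h6 * (s.a[i] + k1[i] + k2[i]));
        s.v[i] += h6 * (s.a[i] + 2 * k1[i] + 2 * k2[i] + k3[i]);
    }
    s.t += h;
    s.a = accel(s.t, s.x, s.v);
}

// Integrates x'' = -x from x=1, v=0 over t in [0, 1] and returns the final state.
State integrate_oscillator(int steps) {
    Accel accel = [](double, const Vec& x, const Vec&) { return Vec{-x[0]}; };
    State s{0.0, {1.0}, {0.0}, {-1.0}};
    const double h = 1.0 / steps;
    for (int i = 0; i < steps; ++i) rk4_step_split(s, h, accel);
    return s;
}
```

With 100 steps the result should agree with cos(1) and -sin(1) to well below 1e-7, which is a quick sanity check that the coefficients were typed in correctly.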
I have implemented a Gauss-Newton optimization process which involves calculating the increment by solving a linearized system Hx = b. The H matrix is calculated by H = J.transpose() * W * J and b is calculated from b = J.transpose() * (W * e), where e is the error vector. The Jacobian here is an n-by-6 matrix, where n is in the thousands, and stays unchanged across iterations. W is an n-by-n diagonal weight matrix which will change across iterations (some diagonal elements will be set to zero). However, I encountered a speed issue.
When I do not add the weight matrix W, namely H = J.transpose()*J and b = J.transpose()*e, my Gauss-Newton process can run very fast in 0.02 sec for 30 iterations. However when I add the W matrix which is defined outside the iteration loop, it becomes so slow (0.3~0.7 sec for 30 iterations) and I don't understand if it is my coding problem or it normally takes this long.
All of these are Eigen matrices and vectors.
I defined my W matrix using the .asDiagonal() function in the Eigen library from a vector of inverse variances, then just used it in the calculation of H and b. Then it gets very slow. I would like some hints about the potential reasons for this huge slowdown.
There are only two matrices. Jacobian is definitely dense. Weight matrix is generated from a vector by the function vec.asDiagonal() which comes from the dense library so I assume it is also dense.
The code is really simple and the only difference that's causing the time change is the addition of the weight matrix. Here is a code snippet:
for (int iter=0; iter<max_iter; ++iter) {
// obtain error vector
error = ...
// calculate H and b - the fast one
Eigen::MatrixXf H = J.transpose() * J;
Eigen::VectorXf b = J.transpose() * error;
// calculate H and b - the slow one
Eigen::MatrixXf H = J.transpose() * weight_ * J;
Eigen::VectorXf b = J.transpose() * (weight_ * error);
// obtain delta and update state
del = H.ldlt().solve(b);
// T <- T(del): pseudocode, meaning update T with del
}
It is in a function in a class, and weight matrix now for debug purposes is defined as a class variable that can be accessed by the function and is defined before the function is called.
I guess that weight_ is declared as a dense MatrixXf? If so, then replace it by w.asDiagonal() everywhere you use weight_, or make the latter an alias to the asDiagonal expression:
auto weight = w.asDiagonal();
This way Eigen will know that weight is a diagonal matrix and computations will be optimized as expected.
Because the weight matrix is diagonal, you can replace the matrix product with coefficient-wise multiplication, like so:
MatrixXd m;
VectorXd w;
w.setLinSpaced(5, 2, 6);
m.setOnes(5, 5);  // placeholder values; any matrix with 5 columns works
std::cout << (m.array().rowwise() * w.array().transpose()).matrix() << "\n";
Likewise, the matrix vector product can be written as:
(w.array() * error.array()).matrix()
This avoids the zero elements in the matrix. Without an MCVE for me to base this on, YMMV...
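To see where the time goes, here is a rough library-free sketch of the same computation. With a dense n-by-n W, the product W * J alone costs on the order of n^2 * 6 operations, whereas using only the diagonal costs on the order of n * 6^2. The Matrix type and the function names weighted_normal_matrix and weighted_rhs are hypothetical helpers for this illustration, not Eigen API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix stored as a flat array (hypothetical helper type).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// H = J^T * diag(w) * J, computed in O(n * m^2) without forming the n-by-n W.
Matrix weighted_normal_matrix(const Matrix& J, const std::vector<double>& w) {
    assert(w.size() == J.rows);
    Matrix H(J.cols, J.cols);
    for (std::size_t k = 0; k < J.rows; ++k) {
        if (w[k] == 0.0) continue;  // zero-weight rows contribute nothing
        for (std::size_t i = 0; i < J.cols; ++i)
            for (std::size_t j = 0; j < J.cols; ++j)
                H(i, j) += w[k] * J(k, i) * J(k, j);
    }
    return H;
}

// b = J^T * (w .* e), again touching only the diagonal of W.
std::vector<double> weighted_rhs(const Matrix& J, const std::vector<double>& w,
                                 const std::vector<double>& e) {
    assert(w.size() == J.rows && e.size() == J.rows);
    std::vector<double> b(J.cols, 0.0);
    for (std::size_t k = 0; k < J.rows; ++k)
        for (std::size_t i = 0; i < J.cols; ++i)
            b[i] += J(k, i) * w[k] * e[k];
    return b;
}
```

In Eigen itself you would normally let the asDiagonal() expression or the coefficient-wise form do this for you; the loops are only meant to show the operation count.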
I'm trying to implement the Runge Kutta method in Fortran and am facing a convergence problem. I don't know how much of the code I should show, so I'll describe the problem in detail, and please guide me as to what I should add/remove to/from the post to make it answerable.
I have a 6-dimensional vector of position and velocity of a ball, and a corresponding system of diff. eqs. that describe the equations of motions, from which I want to calculate the trajectory of the ball, and compare results for different orders of the RK method.
Let's focus on 3rd order RK. The model I use is implemented as follows:
k1 = h * f(vec_old,omega,phi)
k2 = h * f(vec_old + 0.5d0 * k1,omega,phi)
k3 = h * f(vec_old + 2d0 * k2 - k1,omega,phi)
vec = vec_old + (k1 + 4d0 * k2 + k3) / 6d0
Where f is the function that constitutes the equations of motion (or equivalently the RHS of my system of diff. eqs). Note that f is time independent, so apart from the parameters omega and phi it takes only the state vector as an argument. h takes the role of a small time step dt.
If we wish to calculate the trajectory of the ball for a finite time total_time, and allow for a total error of epsilon, then we need to ensure each step takes a proportional fraction of the error. For the first step, I then did the following:
vec1 = solve(3,vec_old,h,omega,phi)
vec2 = solve(3,vec_old,h/2d0,omega,phi)
do while (maxval((/(abs(vec1(i) - vec2(i)),i=1,6)/)) > eps * h / (tot_time - current_time))
h = h / 2d0
vec1 = solve(3,vec_old,h,omega,phi)
vec2 = solve(3,vec_old,h/2d0,omega,phi)
end do
vec = (8d0/7d0) * vec2 - (1d0/7d0) * vec1
Where solve(3,vec_old,h,omega,phi) is the function that calculates the single RK step described above. 3 denotes the RK order we are using, vec_old is the current state of the position-velocity vector, h, h/2d0 both represent the time step being used, and omega,phi are just some extra parameters for f. Finally, for the first step we set current_time = 0d0.
The point is that if we use a 3rd order RK, the error should be $O(h^3)$ and thus fall off faster than linearly in h. Therefore, we should expect the while loop to eventually come to a halt for small enough h.
My problem is that the loop doesn't converge, not even close. The ratio
maxval(...) / (eps * (...))
remains pretty much constant, all the way until eps * h / (tot_time - current_time) becomes zero due to finite precision.
For completeness, this is my definition for f:
function f(vec_old,omega,phi) result(vec)
real(8),intent(in) :: vec_old(6),omega,phi
real(8) :: vec(6)
real(8) :: v,Fv
v = sqrt(vec_old(4)**2+vec_old(5)**2+vec_old(6)**2)
Fv = 0.0039d0 + 0.0058d0 / (1d0 + exp((v-35d0)/5d0))
vec(1) = vec_old(4)
vec(2) = vec_old(5)
vec(3) = vec_old(6)
vec(4) = -Fv * v * vec_old(4) + 4.1d-4 * omega * (vec_old(6)*sin(phi) - vec_old(5)*cos(phi))
vec(5) = -Fv * v * vec_old(5) + 4.1d-4 * omega * vec_old(4)*cos(phi)
vec(6) = -Fv * v * vec_old(6) - 4.1d-4 * omega * vec_old(4)*sin(phi) - 9.8d0
end function f
Does anyone have any idea as to why the while loop doesn't converge?
If anything else is needed (output, other pieces of code etc.) please tell me and I'll add it. Also, if trimming is required, I'll cut whatever would be considered unnecessary. Thanks!
To compute the step error using the half step method, you need to compute the approximation at t+h in both cases, which means two steps with step size h/2. As it is now you compare the approximation at t+h to the approximation at t+h/2 which gives you an error of size f(vec(t+h/2))*h/2.
Thus change to a 3-step procedure
vec1 = solve(3,vec_old,h,omega,phi)
vec2 = solve(3,vec_old,h/2d0,omega,phi)
vec2 = solve(3,vec2 ,h/2d0,omega,phi)
in both locations. The difference vec2 - vec1 should then be of order h^4.
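As a quick check of that h^4 claim, here is a small sketch in C++ rather than Fortran, using the same third-order tableau as in the question. The test equation y' = -y and the function names rk3_step, step_doubling_error and halving_ratio are assumptions for illustration only; they are not the ball model.

```cpp
#include <cassert>
#include <cmath>

// Test problem (an assumption for illustration): y' = -y.
double f(double y) { return -y; }

// One step of the third-order method with the same coefficients as in the question.
double rk3_step(double y, double h) {
    double k1 = h * f(y);
    double k2 = h * f(y + 0.5 * k1);
    double k3 = h * f(y + 2.0 * k2 - k1);
    return y + (k1 + 4.0 * k2 + k3) / 6.0;
}

// Step-doubling estimate: one step of size h versus two steps of size h/2,
// so that both approximations refer to the same time t + h.
double step_doubling_error(double y, double h) {
    double coarse = rk3_step(y, h);
    double fine = rk3_step(rk3_step(y, h / 2), h / 2);
    return std::fabs(fine - coarse);
}

// Ratio of the estimates for h and h/2; near 2^4 = 16 if the difference is O(h^4).
double halving_ratio(double y, double h) {
    return step_doubling_error(y, h) / step_doubling_error(y, h / 2);
}
```

Calling halving_ratio(1.0, 0.1) should give a value close to 16. If the second half step is left out, the estimate shrinks only linearly with h, which is the behaviour described in the question.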
I would like to ask for a fast way to perform the following operations, either in native Matlab, C++, or using toolboxes/libraries, whichever would give the fastest solutions.
Let M be a tensor of D dimensions: n1 x n2 x... x nD, and let v1, v2,..., vD be D vectors whose dimensions are respectively n1, n2,..., nD.
Compute the product M*vi (1 <= i <= D). The result is a multi-dimensional array of (D-1) dimensions.
Compute the product of M with all vectors, except vi.
For example, with D = 3:
The product of M and v1 is a tensor N of 2 dimensions (i.e. a matrix) where
N[i2][i3] = Sum_over_i1 of M[i1][i2][i3]*v1[i1]
The product of M and v2 is a matrix N where
N[i1][i3] = Sum_over_i2 of M[i1][i2][i3]*v2[i2]
The product of M and v2 and v3 is a vector v where
v[i1] = Sum_over_i2 of (Sum_over_i3 of M[i1][i2][i3]*v2[i2]*v3[i3])
A further question: the above but for sparse tensors.
An example of Matlab code is given below.
Thank you very much in advance for your help!!
n1 = 3;
n2 = 5;
n3 = 4;
M = randn(n1,n2,n3);
v1 = randn(n1,1);
v2 = randn(n2,1);
v3 = randn(n3,1);
%% N = M*v2
N = zeros(n1,n3);
for i1=1:n1
    for i3=1:n3
        for i2=1:n2
            N(i1,i3) = N(i1,i3) + M(i1,i2,i3)*v2(i2);
        end
    end
end
%% v = M*v2*v3
v = zeros(n1,1);
for i1=1:n1
    for i2=1:n2
        for i3=1:n3
            v(i1) = v(i1) + M(i1,i2,i3)*v2(i2)*v3(i3);
        end
    end
end
I've noticed that the operation you are describing takes (D - 1)-dimensional slices of M, scales them by the corresponding entries of vi, and then sums the result over the index of vi. This code seems to work for getting N in your example:
N2 = squeeze(sum(M.*(v2)', 2));
To get v in your code, all you need to do is multiply N by v3:
v2 = N2*v3;
On older versions of MATLAB the element-wise operator .* doesn't work the way I've used it above. One alternative is bsxfun:
N2 = squeeze(sum(bsxfun(@times, M, v2'), 2));
Just checked: In terms of performance, the bsxfun way seems as fast as the .* way for large arrays, at least on R2016b.
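Since C++ was also mentioned as an option, here is a rough sketch of the same two contractions for D = 3 on a dense tensor. Tensor3, contract_mode2 and matvec are hypothetical names for this illustration, and the column-major layout is an assumption chosen to match MATLAB's storage order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense n1 x n2 x n3 tensor, column-major (first index varies fastest, as in MATLAB).
struct Tensor3 {
    std::size_t n1, n2, n3;
    std::vector<double> data;
    Tensor3(std::size_t a, std::size_t b, std::size_t c)
        : n1(a), n2(b), n3(c), data(a * b * c, 0.0) {}
    double& at(std::size_t i1, std::size_t i2, std::size_t i3) {
        return data[i1 + n1 * (i2 + n2 * i3)];
    }
    double at(std::size_t i1, std::size_t i2, std::size_t i3) const {
        return data[i1 + n1 * (i2 + n2 * i3)];
    }
};

// N(i1,i3) = sum over i2 of M(i1,i2,i3) * v2(i2); N is n1 x n3, column-major.
std::vector<double> contract_mode2(const Tensor3& M, const std::vector<double>& v2) {
    assert(v2.size() == M.n2);
    std::vector<double> N(M.n1 * M.n3, 0.0);
    for (std::size_t i3 = 0; i3 < M.n3; ++i3)
        for (std::size_t i2 = 0; i2 < M.n2; ++i2) {
            const double s = v2[i2];
            // Innermost loop runs over the contiguous index i1.
            for (std::size_t i1 = 0; i1 < M.n1; ++i1)
                N[i1 + M.n1 * i3] += M.at(i1, i2, i3) * s;
        }
    return N;
}

// v(i1) = sum over i3 of N(i1,i3) * v3(i3): an ordinary matrix-vector product.
std::vector<double> matvec(const std::vector<double>& N, std::size_t n1, std::size_t n3,
                           const std::vector<double>& v3) {
    assert(N.size() == n1 * n3 && v3.size() == n3);
    std::vector<double> v(n1, 0.0);
    for (std::size_t i3 = 0; i3 < n3; ++i3)
        for (std::size_t i1 = 0; i1 < n1; ++i1)
            v[i1] += N[i1 + n1 * i3] * v3[i3];
    return v;
}
```

As in the MATLAB version, the product with all vectors except v1 is obtained by contracting with v2 first and then multiplying the resulting matrix by v3.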
UPDATE: the (sparse) three-dimensional matrix v in my question below is symmetric: v(i1,i2,i3) = v(j1,j2,j3) where (j1,j2,j3) is any of the 6 permutations of (i1,i2,i3), i.e.
v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1).
Moreover, v(i1,i2,i3) != 0 only when i1 != i2 && i1 != i3 && i2 != i3.
E.g. v(i,i,j) = 0, v(i, k, k) = 0, v(k, j, k) = 0, etc...
I thought that with this additional information, I could already get a significant speed-up by doing the following:
Remark: v contains duplicate values (a triplet (i,j,k) has 6 permutations, and the values of v for these 6 are the same).
So I defined a more compact matrix u that contains only non-duplicates of v. The indices of u are (i1,i2,i3) where i1 < i2 < i3. The length of u is equal to the length of v divided by 6.
I computed the sum by iterating over the new value vector and the new index vectors.
With this, I only got a little speed-up. I realized that instead of iterating N times doing a multiplication each time, I iterated N/6 times doing 6 multiplications each time, and that's pretty much the same as before :(
Hope somebody could come up with a better solution.
--- (Original question) ---
In my program I have an expensive operation that is repeated every iteration.
I have three n-dimensional vectors x1, x2 and x3 that are supposed to change every iteration.
I have four N-dimensional vectors I1, I2, I3 and v that are pre-defined and will not change, where:
I1, I2 and I3 contain the indices of respectively x1, x2 and x3 (the elements in I_i are between 0 and n-1)
v is a vector of values.
For example:
We can see v as a (reshaped) sparse three-dimensional matrix, each index k of v corresponds to a triplet (i1,i2,i3) of indices of x1, x2, x3.
I want to compute at each iteration three n-dimensional vectors y1, y2 and y3 defined by:
y1[i1] = sum_{i2,i3} v(i1,i2,i3)*x2(i2)*x3(i3)
y2[i2] = sum_{i1,i3} v(i1,i2,i3)*x1(i1)*x3(i3)
y3[i3] = sum_{i1,i2} v(i1,i2,i3)*x1(i1)*x2(i2)
More precisely what the program does is:
Compute y1 then update x1 = f(y1)
Compute y2 then update x2 = f(y2)
Compute y3 then update x3 = f(y3)
where f is some external function.
I would like to know if there is a C++ library that helps me to do so as fast as possible. Using for loops is just too slow.
Thank you very much for your help!
Update: Looks like it's not easy to get a better solution than the straight-forward for loops. If the vector of indices I1 above is ordered in non-decreasing order, can we compute y1 faster?
For example: I1 = [0 0 0 0 1 1 2 2 2 3 3 3 ... n n].
The simple answer is no, at least not trivially. Your access pattern (e.g. x2(i2)*x3(i3)) does not (at least at compile time) access contiguous memory, but rather has a layer of indirection. Because of this, SIMD instructions are of little use, as they work on contiguous chunks of memory. What you may want to consider is creating a copy of xM sorted according to iM, removing the layer of indirection. This should reduce the number of cache misses that xM(iM) generates, and since it's accessed N times, that may reduce some of the wall time (assuming N is large).
If maximal accuracy is not critical, you may want to consider using an FFT method instead of the convolution (at least, that's how I understood your question. Feel free to correct me if I'm wrong).
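A rough sketch of what I mean by removing the indirection, for y1 only. The function names gather and compute_y1 are made up for this illustration, and I1, I2, I3, v are assumed to be stored as plain arrays of length N. Whether this actually beats the direct loop depends on the data, so measure before committing to it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies x[idx[k]] into a contiguous array with one entry per nonzero.
std::vector<double> gather(const std::vector<double>& x,
                           const std::vector<std::size_t>& idx) {
    std::vector<double> out(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k) {
        assert(idx[k] < x.size());
        out[k] = x[idx[k]];
    }
    return out;
}

// y1[i1] = sum over nonzeros k with I1[k] == i1 of v[k] * x2[I2[k]] * x3[I3[k]].
std::vector<double> compute_y1(std::size_t n,
                               const std::vector<std::size_t>& I1,
                               const std::vector<std::size_t>& I2,
                               const std::vector<std::size_t>& I3,
                               const std::vector<double>& v,
                               const std::vector<double>& x2,
                               const std::vector<double>& x3) {
    assert(I1.size() == v.size() && I2.size() == v.size() && I3.size() == v.size());
    // Gathered copies are contiguous, so the main loop reads v, g2, g3 linearly.
    const std::vector<double> g2 = gather(x2, I2);
    const std::vector<double> g3 = gather(x3, I3);
    std::vector<double> y1(n, 0.0);
    for (std::size_t k = 0; k < v.size(); ++k) {
        assert(I1[k] < n);
        y1[I1[k]] += v[k] * g2[k] * g3[k];
    }
    return y1;
}
```

If I1 is sorted in non-decreasing order, the writes to y1 are sequential as well, which is about as cache-friendly as this loop can get.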
Assuming you are doing a convolution and the vectors (a and b, same size as in your question) are large, the result (c) can be calculated naïvely as
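// O(n^2) circular convolution (what the FFT version below computes)
for (int i = 0; i < c.size(); i++) {
    c(i) = 0;
    for (int j = 0; j < a.size(); j++)
        c(i) += a(j) * b((i - j + b.size()) % b.size());
}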
Using the convolution theorem, you could take the Fourier transform of both a and b and perform an element wise multiplication and then take the inverse Fourier transform of the result to get c (will probably differ a little):
// O(n log(n)); note A, B, and C are vectors of complex floating point numbers
fft.fwd(a, A);
fft.fwd(b, B);
C = A.array() * B.array();
fft.inv(C, c);
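To make the convolution theorem concrete without depending on any library, here is a small self-contained sketch with a textbook recursive radix-2 FFT. The function names fft_radix2, circular_convolve_fft and circular_convolve_direct are made up for this illustration, and the input length is assumed to be a power of two. Note that multiplying the transforms gives the circular convolution; for a linear convolution you would zero-pad both inputs first.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// In-place recursive radix-2 FFT; a.size() must be a power of two.
// invert = true computes the inverse transform without the 1/n scaling.
void fft_radix2(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    if (n <= 1) return;
    assert(n % 2 == 0);
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft_radix2(even, invert);
    fft_radix2(odd, invert);
    const double pi = std::acos(-1.0);
    const double ang = 2.0 * pi / static_cast<double>(n) * (invert ? 1.0 : -1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const cd w = std::polar(1.0, ang * static_cast<double>(k));
        a[k] = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}

// Circular convolution via the convolution theorem, O(n log n).
std::vector<double> circular_convolve_fft(const std::vector<double>& a,
                                          const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<cd> A(a.begin(), a.end()), B(b.begin(), b.end());
    fft_radix2(A, false);
    fft_radix2(B, false);
    for (std::size_t k = 0; k < n; ++k) A[k] *= B[k];
    fft_radix2(A, true);
    std::vector<double> c(n);
    for (std::size_t k = 0; k < n; ++k) c[k] = A[k].real() / static_cast<double>(n);
    return c;
}

// Direct O(n^2) circular convolution, for comparison.
std::vector<double> circular_convolve_direct(const std::vector<double>& a,
                                             const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<double> c(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i] += a[j] * b[(i + n - j) % n];
    return c;
}
```

The two functions agree up to rounding error, which is the "will probably differ a little" mentioned above.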
I have this Runge-Kutta code. However, someone mentioned that my approach is wrong, and I couldn't really understand why from his explanation. Could anyone here give a hint on why this way is wrong?
Vector3d r = P.GetAcceleration();
Vector3d s = P.GetAcceleration() + 0.5*m_dDeltaT*r;
Vector3d t = P.GetAcceleration() + 0.5*m_dDeltaT*s;
Vector3d u = P.GetAcceleration() + m_dDeltaT*t;
P.Velocity += m_dDeltaT * (r + 2.0 * (s + t) + u) / 6.0;
Vector3d stores the x, y, z coordinates.
GetAcceleration returns the acceleration for each of x, y, and z.
You have some acceleration function
a(p,q) where p=(x,y,z) and q=(vx,vy,vz)
The first-order system that can be solved via RK4 is
dotp = q
dotq = a(p,q)
The stages of the RK method involve an offset of the state vector(s)
k1p = q
k1q = a(p,q)
p1 = p + 0.5*dt*k1p
q1 = q + 0.5*dt*k1q
k2p = q1
k2q = a(p1,q1)
p2 = p + 0.5*dt*k2p
q2 = q + 0.5*dt*k2q
k3p = q2
k3q = a(p2,q2)
etc. You can either adjust the state vectors of the point P for each step, saving the original coordinates, or use a temporary copy of P to compute k2, k3, k4.
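Spelled out in full, the stages could look like the following sketch. It uses std::array instead of Vector3d so that it compiles on its own; Particle, AccelFn, add_scaled, rk4_step and simulate_projectile are hypothetical names, and the constant-gravity acceleration is only an assumed test case.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <functional>

using Vec3 = std::array<double, 3>;
// Hypothetical signature: acceleration as a function of position p and velocity q.
using AccelFn = std::function<Vec3(const Vec3&, const Vec3&)>;

// Returns a + s*b component-wise.
Vec3 add_scaled(const Vec3& a, double s, const Vec3& b) {
    return {a[0] + s * b[0], a[1] + s * b[1], a[2] + s * b[2]};
}

struct Particle { Vec3 p, q; };

// One RK4 step; every stage is evaluated at an offset copy of the state,
// and P itself is only modified at the very end.
void rk4_step(Particle& P, double dt, const AccelFn& a) {
    const Vec3 k1p = P.q;
    const Vec3 k1q = a(P.p, P.q);
    const Vec3 p1 = add_scaled(P.p, 0.5 * dt, k1p);
    const Vec3 q1 = add_scaled(P.q, 0.5 * dt, k1q);
    const Vec3 k2p = q1;
    const Vec3 k2q = a(p1, q1);
    const Vec3 p2 = add_scaled(P.p, 0.5 * dt, k2p);
    const Vec3 q2 = add_scaled(P.q, 0.5 * dt, k2q);
    const Vec3 k3p = q2;
    const Vec3 k3q = a(p2, q2);
    const Vec3 p3 = add_scaled(P.p, dt, k3p);
    const Vec3 q3 = add_scaled(P.q, dt, k3q);
    const Vec3 k4p = q3;
    const Vec3 k4q = a(p3, q3);
    for (int i = 0; i < 3; ++i) {
        P.p[i] += dt / 6.0 * (k1p[i] + 2.0 * k2p[i] + 2.0 * k3p[i] + k4p[i]);
        P.q[i] += dt / 6.0 * (k1q[i] + 2.0 * k2q[i] + 2.0 * k3q[i] + k4q[i]);
    }
}

// Assumed test case: constant gravity, launched from the origin with q = (1, 0, 5).
Particle simulate_projectile(int steps, double dt) {
    AccelFn gravity = [](const Vec3&, const Vec3&) { return Vec3{0.0, 0.0, -9.8}; };
    Particle P{Vec3{0.0, 0.0, 0.0}, Vec3{1.0, 0.0, 5.0}};
    for (int i = 0; i < steps; ++i) rk4_step(P, dt, gravity);
    return P;
}
```

For constant acceleration RK4 is exact up to rounding, so after 10 steps of dt = 0.1 the particle should be at (1, 0, 0.1) with velocity (1, 0, -4.8). The code in the question never offsets the position or velocity between stages, which is the main difference.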
You haven't defined your methods, but the thing that's jumping out at me is you're mixing your results with your inputs. Since Runge-Kutta is a method for calculating y_(n+1) = y_n + h*sum(b_i*k_i), I would expect your solution to keep your _n terms on the right, and your _(n+1) terms on the left. This is NOT what you're doing. Instead, s_(n+1) is dependent on r_(n+1) instead of on r_n, t_(n+1) on s_(n+1), and so on. This smells of an error where you attempted to limit the number of variables being used.
With that in mind, can you indicate the actual intermediate values of the calculations your program generates and compare them with the intended intermediate values?