Sort array using mixed-integer constraints - linear programming

Assume that I have variables
X = X_1, ..., X_n
and I want to build variables Y = Y_1, ..., Y_n using constraints
such that the elements of Y are the sorted elements of X.
For example, when
n = 2
Y_1 = min(X_1, X_2)
Y_2 = max(X_1, X_2)
this can be achieved with big-M notation and one binary variable B:
Y_1 <= X_1
Y_1 <= X_2
Y_1 + B * M >= X_1
Y_1 + (1 - B) * M >= X_2
Y_2 >= X_1
Y_2 >= X_2
Y_2 <= X_1 + (1 - B) * M
Y_2 <= X_2 + B * M
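As a check: with X_1 = 5, X_2 = 3 and M sufficiently large, B = 0 would require Y_1 >= 5 while Y_1 <= 3, so B = 1 is forced; the constraints then pin Y_1 = 3 and Y_2 = 5.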
How can I build such constraints for the general case? Further, how can I do it with a minimum number of binary variables?

Unfortunately, your approach cannot be generalized.
Let's assume x[i] ∈ [L[i],U[i]].
First we introduce some assignment variables:
sum(j, p[i,j]) = 1 ∀i
sum(i, p[i,j]) = 1 ∀j
p[i,j] ∈ {0,1}
Now we can do
y[i] = sum(j, p[i,j]*x[j])
y[i+1] >= y[i]
but this is nonlinear. So we do instead:
Linearization of q[i,j] = p[i,j]*x[j]:
L[j]*p[i,j] <= q[i,j] <= U[j]*p[i,j] ∀i,j
x[j]-U[j]*(1-p[i,j]) <= q[i,j] <= x[j]-L[j]*(1-p[i,j]) ∀i,j
q[i,j] free
(When p[i,j] = 1, the second row forces q[i,j] = x[j]; when p[i,j] = 0, the first row forces q[i,j] = 0.)
Calculate y:
y[i] = sum(j, q[i,j]) ∀i
y[i+1] >= y[i] ∀i but the last
y[i] free
This is just one approach. There are other formulations, none of them really much simpler than this. This approach needs n^2 binary variables.
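For illustration (a sketch added here, not part of the original answer), the construction can be written out mechanically. The following C++ emits the model above in CPLEX LP format, so any MIP solver that reads LP files can check it; the objective is a placeholder, the variable names (p, q, x, y) follow the formulation above, and L[j], U[j] are the known bounds on x[j] (0-based vectors here):
#include <cstdio>
#include <vector>

void emit_sort_model(int n, const std::vector<double>& L, const std::vector<double>& U)
{
    std::printf("Minimize\n obj: y1\nSubject To\n");
    for (int i = 1; i <= n; ++i) {               // sum(j, p[i,j]) = 1
        std::printf(" arow%d:", i);
        for (int j = 1; j <= n; ++j) std::printf(" %sp%d_%d", j > 1 ? "+ " : "", i, j);
        std::printf(" = 1\n");
    }
    for (int j = 1; j <= n; ++j) {               // sum(i, p[i,j]) = 1
        std::printf(" acol%d:", j);
        for (int i = 1; i <= n; ++i) std::printf(" %sp%d_%d", i > 1 ? "+ " : "", i, j);
        std::printf(" = 1\n");
    }
    for (int i = 1; i <= n; ++i)                 // linearization of q[i,j] = p[i,j]*x[j]
        for (int j = 1; j <= n; ++j) {
            // q >= L*p  and  q <= U*p
            std::printf(" lo%d_%d: q%d_%d %+g p%d_%d >= 0\n", i, j, i, j, -L[j-1], i, j);
            std::printf(" up%d_%d: q%d_%d %+g p%d_%d <= 0\n", i, j, i, j, -U[j-1], i, j);
            // q >= x - U*(1-p)  and  q <= x - L*(1-p), rearranged to constants on the right
            std::printf(" lb%d_%d: q%d_%d - x%d %+g p%d_%d >= %g\n", i, j, i, j, j, -U[j-1], i, j, -U[j-1]);
            std::printf(" ub%d_%d: q%d_%d - x%d %+g p%d_%d <= %g\n", i, j, i, j, j, -L[j-1], i, j, -L[j-1]);
        }
    for (int i = 1; i <= n; ++i) {               // y[i] = sum(j, q[i,j])
        std::printf(" ydef%d: y%d", i, i);
        for (int j = 1; j <= n; ++j) std::printf(" - q%d_%d", i, j);
        std::printf(" = 0\n");
    }
    for (int i = 1; i < n; ++i)                  // y[i+1] >= y[i]
        std::printf(" ord%d: y%d - y%d >= 0\n", i, i + 1, i);
    std::printf("Bounds\n");
    for (int j = 1; j <= n; ++j) std::printf(" %g <= x%d <= %g\n", L[j-1], j, U[j-1]);
    for (int i = 1; i <= n; ++i) {
        std::printf(" y%d free\n", i);
        for (int j = 1; j <= n; ++j) std::printf(" q%d_%d free\n", i, j);
    }
    std::printf("Binary\n");
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j) std::printf(" p%d_%d\n", i, j);
    std::printf("End\n");
}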
A small test case:
---- 74 VARIABLE x.L
i1 -72.953, i2 26.640, i3 -7.413, i4 -41.832, i5 -40.446, i6 -60.748, i7 -37.622
i8 35.341, i9 -86.124, i10 -1.560, i11 60.284, i12 -9.823, i13 58.321, i14 23.703
i15 -73.684, i16 19.667, i17 -71.349, i18 -55.050, i19 2.203, i20 -24.663, i21 -43.503
i22 -37.687, i23 -72.174, i24 -72.681, i25 -8.824, i26 39.548, i27 -52.591, i28 0.987
i29 35.956, i30 -30.807, i31 -68.515, i32 -9.459, i33 -66.911, i34 45.721, i35 -43.236
i36 -44.463, i37 -6.925, i38 24.830, i39 7.334, i40 -27.320, i41 -13.336, i42 -76.643
i43 -35.977, i44 -73.598, i45 -25.906, i46 -67.654, i47 4.287, i48 -10.957, i49 21.653
i50 -39.167, i51 17.540, i52 15.509, i53 -2.124, i54 -46.566, i55 -82.363, i56 -67.305
i57 15.262, i58 -14.333, i59 -85.537, i60 36.240, i61 -67.940, i62 -58.333, i63 3.244
i64 13.203, i65 -68.595, i66 -92.701, i67 1.280, i68 -3.644, i69 -23.246, i70 -43.362
i71 -51.336, i72 -43.147, i73 -68.123, i74 53.356, i75 -42.744, i76 31.107, i77 -43.717
i78 -56.794, i79 16.927, i80 -85.527, i81 -69.082, i82 -94.795, i83 -58.025, i84 -24.606
i85 -56.416, i86 -58.833, i87 -49.729, i88 -47.562, i89 -27.919, i90 52.985, i91 63.897
i92 -38.035, i93 -28.051, i94 19.678, i95 -28.997, i96 46.798, i97 -61.927, i98 23.847
i99 -81.919, i100 0.390
---- 74 VARIABLE y.L
i1 -94.795, i2 -92.701, i3 -86.124, i4 -85.537, i5 -85.527, i6 -82.363, i7 -81.919
i8 -76.643, i9 -73.684, i10 -73.598, i11 -72.953, i12 -72.681, i13 -72.174, i14 -71.349
i15 -69.082, i16 -68.595, i17 -68.515, i18 -68.123, i19 -67.940, i20 -67.654, i21 -67.305
i22 -66.911, i23 -61.927, i24 -60.748, i25 -58.833, i26 -58.333, i27 -58.025, i28 -56.794
i29 -56.416, i30 -55.050, i31 -52.591, i32 -51.336, i33 -49.729, i34 -47.562, i35 -46.566
i36 -44.463, i37 -43.717, i38 -43.503, i39 -43.362, i40 -43.236, i41 -43.147, i42 -42.744
i43 -41.832, i44 -40.446, i45 -39.167, i46 -38.035, i47 -37.687, i48 -37.622, i49 -35.977
i50 -30.807, i51 -28.997, i52 -28.051, i53 -27.919, i54 -27.320, i55 -25.906, i56 -24.663
i57 -24.606, i58 -23.246, i59 -14.333, i60 -13.336, i61 -10.957, i62 -9.823, i63 -9.459
i64 -8.824, i65 -7.413, i66 -6.925, i67 -3.644, i68 -2.124, i69 -1.560, i70 0.390
i71 0.987, i72 1.280, i73 2.203, i74 3.244, i75 4.287, i76 7.334, i77 13.203
i78 15.262, i79 15.509, i80 16.927, i81 17.540, i82 19.667, i83 19.678, i84 21.653
i85 23.703, i86 23.847, i87 24.830, i88 26.640, i89 31.107, i90 35.341, i91 35.956
i92 36.240, i93 39.548, i94 45.721, i95 46.798, i96 52.985, i97 53.356, i98 58.321
i99 60.284, i100 63.897

IPOPT does not obey constraints but does not record the violation when using CppAD

I am trying to evaluate the coefficients and time of two fifth-order polynomials (one each for x and y position) that minimize effort and time (the objective function) when connecting an initial position, velocity, and orientation to a desired final position and orientation with zero velocity (equality constraints). Here is the code:
#include <vector>
#include <cppad/cppad.hpp>
#include <cppad/ipopt/solve.hpp>
using CppAD::AD;
typedef struct {
double x, y, theta, linear_velocity;
} Waypoint;
typedef std::vector<Waypoint> WaypointList;
struct TrajectoryConfig {
//! gain on accumulated jerk term in cost function
double Kj;
//! gain on time term in cost function
double Kt;
//! gain on terminal velocity term in cost function
double Kv;
};
class Trajectory {
public:
explicit Trajectory(TrajectoryConfig config);
~Trajectory();
void updateConfigs(TrajectoryConfig config);
void solve(WaypointList waypoints);
private:
//! solution vector
std::vector<double> solution_;
//! gain on accumulated jerk term in cost function
double Kj_;
//! gain on time term in cost function
double Kt_;
//! gain on terminal velocity term in cost function
double Kv_;
};
/*
Trajectory(TrajectoryConfig)
class constructor. Initializes class given configuration struct
*/
Trajectory::Trajectory(TrajectoryConfig config) {
Kj_ = config.Kj;
Kt_ = config.Kt;
Kv_ = config.Kv;
}
Trajectory::~Trajectory() {
std::cerr << "Trajectory Destructor!" << std::endl;
}
enum Indices { A0 = 0, A1, A2, A3, A4, A5, B0, B1, B2, B3, B4, B5, T };
class FGradEval {
public:
size_t M_;
// gains on cost;
double Kj_, Kt_;
// constructor
FGradEval(double Kj, double Kt) {
M_ = 13; // no. of parameters per trajectory segment: 2 x 6 coefficients + 1 time
Kj_ = Kj;
Kt_ = Kt;
}
typedef CPPAD_TESTVECTOR(AD<double>) ADvector;
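//! note: by the CppAD::ipopt::solve convention, fgrad[0] holds the objective
//! and fgrad[1..m] hold the constraint expressions, whose bounds are supplied
//! separately via cons_lb/cons_ub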
void operator()(ADvector& fgrad, const ADvector& vars) {
fgrad[0] = 0;
AD<double> accum_jerk;
AD<double> a0, a1, a2, a3, a4, a5;
AD<double> b0, b1, b2, b3, b4, b5;
AD<double> T, T2, T3, T4, T5;
AD<double> x, y, vx, vy;
size_t offset = 1;
a0 = vars[Indices::A0];
a1 = vars[Indices::A1];
a2 = vars[Indices::A2];
a3 = vars[Indices::A3];
a4 = vars[Indices::A4];
a5 = vars[Indices::A5];
b0 = vars[Indices::B0];
b1 = vars[Indices::B1];
b2 = vars[Indices::B2];
b3 = vars[Indices::B3];
b4 = vars[Indices::B4];
b5 = vars[Indices::B5];
T = vars[Indices::T];
T2 = T*T;
T3 = T*T2;
T4 = T*T3;
T5 = T*T4;
x = a0 + a1*T + a2*T2 + a3*T3 + a4*T4 + a5*T5;
y = b0 + b1*T + b2*T2 + b3*T3 + b4*T4 + b5*T5;
vx = a1 + 2*a2*T + 3*a3*T2 + 4*b4*T3 + 5*a5*T4;
vy = b1 + 2*b2*T + 3*b3*T2 + 4*b4*T3 + 5*b5*T4;
//! cost-terms
//! accum_jerk is the analytic integral of int_0^T (jerk_x^2 + jerk_y^2) dt
accum_jerk = 36 * T * (a3*a3 + b3*b3) + 144 * T2 * (a3*a4 + b3*b4) + T3 * (240*(a3*a5 + b3*b5) + 192*(a4*a4 + b4*b4))
+ 720 * T4 * (a4*a5 + b4*b5) + 720 * T5 * (a5*a5 + b5*b5);
fgrad[0] += Kj_ * accum_jerk;
fgrad[0] += Kt_ * T;
//! initial equality constraints
fgrad[offset] = vars[Indices::A0];
fgrad[1 + offset] = vars[Indices::B0];
fgrad[2 + offset] = vars[Indices::A1];
fgrad[3 + offset] = vars[Indices::B1];
offset += 4;
//! terminal inequality constraints
fgrad[offset] = x;
fgrad[offset + 1] = y;
fgrad[offset + 2] = vx;
fgrad[offset + 3] = vy;
}
};
void Trajectory::solve(WaypointList waypoints) {
if (waypoints.size() != 2) {
std::cerr << "Trajectory::solve - Function requires 2 waypoints." << std::endl;
return;
}
//! status flag for solution
bool ok;
//! typedef for ipopt/cppad
typedef CPPAD_TESTVECTOR(double) Dvector;
//! no. of variables for optimization problem
size_t n_vars = 13;
//! no. of constraints
size_t n_cons = 4 * 2; // the start and final waypoint each contribute 4 constraints (x, y, theta, v) -> (x, y, vx, vy)
//! create vector container for optimizer solution
//! and initialize it to zero
Dvector vars(n_vars);
for (size_t i = 0; i < n_vars; i++) {
vars[i] = 0;
}
//! set initial state (this will only determine the first two coefficients of the initial polynomials)
double v = (fabs(waypoints[0].linear_velocity) < 1e-3)
? 1e-3 : waypoints[0].linear_velocity;
vars[Indices::A0] = waypoints[0].x;
vars[Indices::B0] = waypoints[0].y;
vars[Indices::A1] = v * cos(waypoints[0].theta);
vars[Indices::B1] = v * sin(waypoints[0].theta);
vars[Indices::T] = 0;
//! there are no explicit bounds on vars, so set to something large for the optimizer
//! we could perhaps put bounds on the coeffs corresponding to acc, jerk, snap, ..
Dvector vars_lb(n_vars);
Dvector vars_ub(n_vars);
for (size_t i = 0; i < n_vars; i++) {
vars_lb[i] = -1e10;
vars_ub[i] = 1e10;
}
//! time must be non-negative!
vars_lb[Indices::T] = 0;
//! set the bounds on the constraints
Dvector cons_lb(n_cons);
Dvector cons_ub(n_cons);
//! offset term on index
size_t offset = 0;
//! initial equality constraint - we must start from where we are!
cons_lb[0] = waypoints[0].x;
cons_ub[0] = waypoints[0].x;
cons_lb[1] = waypoints[0].y;
cons_ub[1] = waypoints[0].y;
cons_lb[2] = v * cos(waypoints[0].theta);
cons_ub[2] = v * cos(waypoints[0].theta);
cons_lb[3] = v * sin(waypoints[0].theta);
cons_ub[3] = v * sin(waypoints[0].theta);
offset += 4;
//! terminal point
cons_lb[offset] = waypoints[1].x;
cons_ub[offset] = waypoints[1].x;
cons_lb[offset + 1] = waypoints[1].y;
cons_ub[offset + 1] = waypoints[1].y;
cons_lb[offset + 2] = 1e-3 * cos(waypoints[1].theta);
cons_ub[offset + 2] = 1e-3 * cos(waypoints[1].theta);
cons_lb[offset + 3] = 1e-3 * sin(waypoints[1].theta);
cons_ub[offset + 3] = 1e-3 * sin(waypoints[1].theta);
//! create instance of objective function class
FGradEval fg_eval(Kj_, Kt_);
//! IPOPT INITIALIZATION
std::string options;
options += "Integer print_level 5\n";
options += "Sparse true forward\n";
options += "Sparse true reverse\n";
options += "Integer max_iter 100\n";
// options += "Numeric tol 1e-4\n";
//! compute the solution
CppAD::ipopt::solve_result<Dvector> solution;
//! solve
CppAD::ipopt::solve<Dvector, FGradEval>(
options, vars, vars_lb, vars_ub, cons_lb, cons_ub, fg_eval, solution);
//! check if the solver was successful
ok = solution.status == CppAD::ipopt::solve_result<Dvector>::success;
//! if the solver was unsuccessful, exit
//! this case will be handled by calling method
if (!ok) {
std::cerr << "Trajectory::solve - Failed to find a solution!" << std::endl;
return;
}
//! (DEBUG) output the final cost
std::cout << "Final Cost: " << solution.obj_value << std::endl;
//! populate output with argmin vector
for (size_t i = 0; i < n_vars; i++) {
solution_.push_back(solution.x[i]);
}
return;
}
Where I am having problems is the following:
The initial equality constraints (starting position, velocity, and orientation) are being upheld, while the terminal velocity constraint is not. The algorithm terminates at the correct final (x, y, angle), but the velocity is not zero. I have looked through the code and I cannot understand why the position and orientation at the endpoint would be obeyed while the velocity would not. My suspicion is that my definition of the equality constraints is not what I think it is.
The problem does not converge reliably, even though it seems a fairly simple problem as defined (see the output below):
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************
This is Ipopt version 3.11.9, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).
Number of nonzeros in equality constraint Jacobian...: 30
Number of nonzeros in inequality constraint Jacobian.: 0
Number of nonzeros in Lagrangian Hessian.............: 23
Total number of variables............................: 13
variables with only lower bounds: 0
variables with lower and upper bounds: 13
variables with only upper bounds: 0
Total number of equality constraints.................: 8
Total number of inequality constraints...............: 0
inequality constraints with only lower bounds: 0
inequality constraints with lower and upper bounds: 0
inequality constraints with only upper bounds: 0
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
0 9.9999900e-03 1.00e+00 5.00e-04 -1.0 0.00e+00 - 0.00e+00 0.00e+00 0
1 5.9117705e-02 1.00e+00 1.20e+02 -1.0 5.36e+07 - 1.04e-05 7.63e-06f 18
2 1.1927070e+00 1.00e+00 2.62e+06 -1.0 9.21e+05 -4.0 6.16e-15 2.29e-23H 1
3 2.9689692e-01 1.00e+00 1.80e+05 -1.0 2.24e+13 - 1.83e-07 8.42e-10f 20
4r 2.9689692e-01 1.00e+00 1.00e+03 -0.0 0.00e+00 - 0.00e+00 4.58e-07R 11
5r 2.1005820e+01 9.99e-01 5.04e+02 -0.0 6.60e-02 - 9.90e-01 4.95e-01f 2
6r 7.7118141e+04 9.08e-01 5.18e+03 -0.0 2.09e+00 - 4.21e-01 1.00e+00f 1
7r 1.7923891e+04 7.82e-01 1.54e+03 -0.0 3.63e+00 - 9.90e-01 1.00e+00f 1
8r 5.9690221e+03 5.41e-01 5.12e+02 -0.0 2.92e+00 - 9.90e-01 1.00e+00f 1
9r 4.6855625e+03 5.54e-01 1.95e+02 -0.0 5.14e-01 - 9.92e-01 1.00e+00f 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
10r 8.4901226e+03 5.55e-01 5.18e+01 -0.0 2.24e-01 - 1.00e+00 1.00e+00f 1
Number of Iterations....: 10
(scaled) (unscaled)
Objective...............: 8.4901225582208808e+03 8.4901225582208808e+03
Dual infeasibility......: 6.3613117039244315e+06 6.3613117039244315e+06
Constraint violation....: 5.5503677023620179e-01 5.5503677023620179e-01
Complementarity.........: 9.9999982900301554e-01 9.9999982900301554e-01
Overall NLP error.......: 6.3613117039244315e+06 6.3613117039244315e+06
Number of objective function evaluations = 43
Number of objective gradient evaluations = 6
Number of equality constraint evaluations = 71
Number of inequality constraint evaluations = 0
Number of equality constraint Jacobian evaluations = 12
Number of inequality constraint Jacobian evaluations = 0
Number of Lagrangian Hessian evaluations = 10
Total CPU secs in IPOPT (w/o function evaluations) = 0.006
Total CPU secs in NLP function evaluations = 0.001
EXIT: Maximum Number of Iterations Exceeded.
I am not looking for an answer to my problem specifically. What I am hoping for are some suggestions as to why my problem may not be working as expected. Specifically, do my constraints make sense, as defined? Is the variable initialization done properly?
The problem was in the following lines:
x = a0 + a1*T + a2*T2 + a3*T3 + a4*T4 + a5*T5;
y = b0 + b1*T + b2*T2 + b3*T3 + b4*T4 + b5*T5;
vx = a1 + 2*a2*T + 3*a3*T2 + 4*b4*T3 + 5*a5*T4;
vy = b1 + 2*b2*T + 3*b3*T2 + 4*b4*T3 + 5*b5*T4;
Specifically,
vx = a1 + 2*a2*T + 3*a3*T2 + 4*b4*T3 + 5*a5*T4;
should be
vx = a1 + 2*a2*T + 3*a3*T2 + 4*a4*T3 + 5*a5*T4;
based upon the mapping of a's to the x-coordinate and b's to the y-coordinate.
This fixed the problem of constraint violation.
With regards to the problem of convergence/feasibility, I found that ensuring that the initial guess is in the feasible set (obeys the equality constraints) fixed this problem; measures of optimizer performance (inf_pr and inf_du, etc...) were much smaller after fixing the initial condition.
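As an aside to that last point: one way to build such a feasible initial guess (a sketch added for illustration, not part of the original answer) is to fix a positive guess for T and use the standard closed-form quintic that meets the position/velocity boundary conditions with zero boundary accelerations:
#include <array>

// Quintic p(t) = c0 + c1*t + ... + c5*t^5 with p(0) = p0, p'(0) = v0,
// p''(0) = 0, p(T) = pf, p'(T) = vf, p''(T) = 0 (standard closed form).
std::array<double, 6> quinticGuess(double p0, double v0,
                                   double pf, double vf, double T)
{
    const double d = pf - p0;
    const double T3 = T * T * T, T4 = T3 * T, T5 = T4 * T;
    return { p0, v0, 0.0,
             ( 20.0 * d - ( 8.0 * vf + 12.0 * v0) * T) / (2.0 * T3),
             (-30.0 * d + (14.0 * vf + 16.0 * v0) * T) / (2.0 * T4),
             ( 12.0 * d -  6.0 * (vf + v0) * T)        / (2.0 * T5) };
}
// Possible usage: pick T_guess > 0 (e.g., straight-line distance divided by a
// nominal speed), fill vars[A0..A5] from
// quinticGuess(x_start, v*cos(theta_start), x_end, 1e-3*cos(theta_end), T_guess),
// fill vars[B0..B5] analogously for y, and set vars[T] = T_guess; the eight
// equality constraints are then satisfied exactly at the starting point.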

Scaling issues with OpenMP

I have written a code for a special type of 3D CFD simulation, the Lattice-Boltzmann Method (quite similar to a code supplied with the book "The Lattice Boltzmann Method" by Timm Krüger et al.).
When multithreading the program with OpenMP, I have experienced issues that I can't quite understand: the results prove to be strongly dependent on the overall domain size.
The basic principle is that each cell of a 3D domain gets assigned certain values for 19 distribution functions (0-18) in discrete directions. They are laid out in two linear arrays allocated on the heap (one population, f0, is laid out in a separate array): the 18 populations of a given cell are contiguous in memory, the values of consecutive x-values lie next to each other, and so on (so, sort of row-major: populations->x->y->z).
Those distribution functions redistribute according to certain values within the cell and then get streamed to the neighbouring cells. For this reason I have two populations f1 and f2. The algorithm takes the values from f1, redistributes them and copies them into f2. Then the pointers are swapped and the algorithm starts again.
The code works perfectly fine on a single core, but when I try to parallelise it on multiple cores I get performance that depends on the overall size of the domain: for very small domains (10^3 cells) the algorithm is comparably slow at 15 million cells per second, for quite small domains (30^3 cells) it is quite fast at over 60 million cells per second, and for anything larger than that the performance drops again to about 30 million cells per second. Executing the code on a single core only leads to the same performance of about 15 million cells per second. These results of course vary between different processors, but qualitatively the same issue remains!
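For orientation, index helpers consistent with that layout might look like the following (a sketch; the actual definitions are not shown in the question, and NX, NY, NZ are assumed to be the compile-time domain dimensions):
#include <cstddef>

// Hypothetical index helpers for the described layout: the 18 populations of
// a cell are contiguous, then x varies fastest, then y, then z. The rest
// population f0 lives in its own scalar array.
inline size_t D3Q19_ScalarIndex(unsigned x, unsigned y, unsigned z)
{
    return ((size_t)z * NY + y) * NX + x;
}
inline size_t D3Q19_FieldIndex(unsigned x, unsigned y, unsigned z, unsigned q)
{
    return (((size_t)z * NY + y) * NX + x) * 18u + (q - 1u);
}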
The core of the code boils down to this parallelised loop that is executed over and over again and the pointers to f1 and f2 are swapped:
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static)
for(unsigned int z = 0; z < NZ; ++z)
{
for(unsigned int y = 0; y < NY; ++y)
{
for(unsigned int x = 0; x < NX; ++x)
{
/// temporary populations
double ft0 = f0[D3Q19_ScalarIndex(x,y,z)];
double ft1 = f1[D3Q19_FieldIndex(x,y,z,1)];
double ft2 = f1[D3Q19_FieldIndex(x,y,z,2)];
double ft3 = f1[D3Q19_FieldIndex(x,y,z,3)];
double ft4 = f1[D3Q19_FieldIndex(x,y,z,4)];
double ft5 = f1[D3Q19_FieldIndex(x,y,z,5)];
double ft6 = f1[D3Q19_FieldIndex(x,y,z,6)];
double ft7 = f1[D3Q19_FieldIndex(x,y,z,7)];
double ft8 = f1[D3Q19_FieldIndex(x,y,z,8)];
double ft9 = f1[D3Q19_FieldIndex(x,y,z,9)];
double ft10 = f1[D3Q19_FieldIndex(x,y,z,10)];
double ft11 = f1[D3Q19_FieldIndex(x,y,z,11)];
double ft12 = f1[D3Q19_FieldIndex(x,y,z,12)];
double ft13 = f1[D3Q19_FieldIndex(x,y,z,13)];
double ft14 = f1[D3Q19_FieldIndex(x,y,z,14)];
double ft15 = f1[D3Q19_FieldIndex(x,y,z,15)];
double ft16 = f1[D3Q19_FieldIndex(x,y,z,16)];
double ft17 = f1[D3Q19_FieldIndex(x,y,z,17)];
double ft18 = f1[D3Q19_FieldIndex(x,y,z,18)];
/// microscopic to macroscopic
double r = ft0 + ft1 + ft2 + ft3 + ft4 + ft5 + ft6 + ft7 + ft8 + ft9 + ft10 + ft11 + ft12 + ft13 + ft14 + ft15 + ft16 + ft17 + ft18;
double rinv = 1.0/r;
double u = rinv*(ft1 - ft2 + ft7 + ft8 + ft9 + ft10 - ft11 - ft12 - ft13 - ft14);
double v = rinv*(ft3 - ft4 + ft7 - ft8 + ft11 - ft12 + ft15 + ft16 - ft17 - ft18);
double w = rinv*(ft5 - ft6 + ft9 - ft10 + ft13 - ft14 + ft15 - ft16 + ft17 - ft18);
/// collision & streaming
double trw0 = omega*r*w0; //temporary variables
double trwc = omega*r*wc;
double trwd = omega*r*wd;
double uu = 1.0 - 1.5*(u*u+v*v+w*w);
double bu = 3.0*u;
double bv = 3.0*v;
double bw = 3.0*w;
unsigned int xp = (x + 1) % NX; //calculate x,y,z coordinates of neighbouring cells
unsigned int yp = (y + 1) % NY;
unsigned int zp = (z + 1) % NZ;
unsigned int xm = (NX + x - 1) % NX;
unsigned int ym = (NY + y - 1) % NY;
unsigned int zm = (NZ + z - 1) % NZ;
f0[D3Q19_ScalarIndex(x,y,z)] = bomega*ft0 + trw0*(uu); //redistribute distribution functions and stream to neighbouring cells
double cu = bu;
f2[D3Q19_FieldIndex(xp,y, z, 1)] = bomega*ft1 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bu;
f2[D3Q19_FieldIndex(xm,y, z, 2)] = bomega*ft2 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bv;
f2[D3Q19_FieldIndex(x, yp,z, 3)] = bomega*ft3 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bv;
f2[D3Q19_FieldIndex(x, ym,z, 4)] = bomega*ft4 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bw;
f2[D3Q19_FieldIndex(x, y, zp, 5)] = bomega*ft5 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = -bw;
f2[D3Q19_FieldIndex(x, y, zm, 6)] = bomega*ft6 + trwc*(uu + cu*(1.0 + 0.5*cu));
cu = bu+bv;
f2[D3Q19_FieldIndex(xp,yp,z, 7)] = bomega*ft7 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu-bv;
f2[D3Q19_FieldIndex(xp,ym,z, 8)] = bomega*ft8 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu+bw;
f2[D3Q19_FieldIndex(xp,y, zp, 9)] = bomega*ft9 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bu-bw;
f2[D3Q19_FieldIndex(xp,y, zm,10)] = bomega*ft10 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu+bv;
f2[D3Q19_FieldIndex(xm,yp,z, 11)] = bomega*ft11 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu-bv;
f2[D3Q19_FieldIndex(xm,ym,z, 12)] = bomega*ft12 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu+bw;
f2[D3Q19_FieldIndex(xm,y, zp,13)] = bomega*ft13 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bu-bw;
f2[D3Q19_FieldIndex(xm,y, zm,14)] = bomega*ft14 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bv+bw;
f2[D3Q19_FieldIndex(x, yp,zp,15)] = bomega*ft15 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = bv-bw;
f2[D3Q19_FieldIndex(x, yp,zm,16)] = bomega*ft16 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bv+bw;
f2[D3Q19_FieldIndex(x, ym,zp,17)] = bomega*ft17 + trwd*(uu + cu*(1.0 + 0.5*cu));
cu = -bv-bw;
f2[D3Q19_FieldIndex(x, ym,zm,18)] = bomega*ft18 + trwd*(uu + cu*(1.0 + 0.5*cu));
}
}
}
It would be awesome if someone could give me tips on how to find the reason for this particular behaviour or even has an idea what could cause this problem.
If needed I can supply a full version of the simplified code!
Thanks a lot in advance!
Achieving scaling on shared memory systems (threaded code on a single machine) is quite tricky, and often requires large amounts of tuning. What's likely happening in your code is that part of the domain for each thread fits into cache for the "quite small" problem size, but as the problem size increases in NX and NY, the data per thread stops fitting into cache.
To avoid issues like this, it is better to decompose the domain into fixed size blocks that do not change in size with the domain, but rather in number.
#include <algorithm> // std::min
#include <cmath>     // std::ceil

const unsigned int numBlocksZ = std::ceil(static_cast<double>(NZ) / BLOCK_SIZE);
const unsigned int numBlocksY = std::ceil(static_cast<double>(NY) / BLOCK_SIZE);
const unsigned int numBlocksX = std::ceil(static_cast<double>(NX) / BLOCK_SIZE);
const unsigned int numBlocks = numBlocksX * numBlocksY * numBlocksZ;
#pragma omp parallel for default(none) shared(f0,f1,f2) schedule(static,1)
for(unsigned int block = 0; block < numBlocks; ++block)
{
    // decode the linear block index into the block's start/end coordinates
    const unsigned int startZ = BLOCK_SIZE * (block / (numBlocksX * numBlocksY));
    const unsigned int endZ   = std::min(startZ + BLOCK_SIZE, NZ);
    const unsigned int startY = BLOCK_SIZE * ((block % (numBlocksX * numBlocksY)) / numBlocksX);
    const unsigned int endY   = std::min(startY + BLOCK_SIZE, NY);
    const unsigned int startX = BLOCK_SIZE * (block % numBlocksX);
    const unsigned int endX   = std::min(startX + BLOCK_SIZE, NX);
    for(unsigned int z = startZ; z < endZ; ++z)
    {
        for(unsigned int y = startY; y < endY; ++y)
        {
            for(unsigned int x = startX; x < endX; ++x)
            {
                ...
            }
        }
    }
}
An approach like the above should also increase cache locality by using 3d blocking (assuming this is a 3d stencil operation), and further improve your performance. You'll need to tune BLOCK_SIZE to find what gives you the best performance on a given system (I'd start small and increase in powers of two, e.g., 4, 8, 16...).

Is there a chance to make the bilinear interpolation faster?

First I want to provide you with some context.
I have two kinds of images I need to merge. The first image is the background image with the format 8BppGrey and a resolution of 320x240. The second image is the foreground image with the format 32BppRGBA and a resolution of 64x48.
Update
The github repo with an MVP is at the bottom of the question.
To do it I resize the second image with bilinear interpolation to the same size as the first one and then use blending to merge both into one image. Blending only happens when the alpha value of the second image is greater than 0.
I need to do it as fast as possible so my idea was to combine the resize and merge / blend process.
To achieve this I used the resize function from the writeablebitmapex repository and added merging / blending.
Everything works as expected but I want to decrease the execution time.
These are the current debug timings:
// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
Do I have any chance to increase the performance and lower the execution time of the resize / merge / blend process?
Are there some parts I maybe can parallelize?
Do I maybe have a chance to use some processor features?
A huge performance hit is the nested loop but I have no idea how I could write it better.
I would like to reach 1 or 2 ms for the whole process. Is this even possible?
Here's the modified visual c++ function I use.
pd is the backbuffer of the writeable bitmap I use to display the
result in WPF. The format I use is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource is the size of the pixels image
width and height is the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image
VC++ code:
unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
unsigned int start = clock();
float xs = (float)widthSource / width;
float ys = (float)heightSource / height;
float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
int c, x0, x1, y0, y1;
byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
byte a, r, g, b;
// Bilinear
int srcIdx = 0;
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
sx = x * xs;
sy = y * ys;
x0 = (int)sx;
y0 = (int)sy;
// Calculate coordinates of the 4 interpolation points
fracx = sx - x0;
fracy = sy - y0;
ifracx = 1.0f - fracx;
ifracy = 1.0f - fracy;
x1 = x0 + 1;
if (x1 >= widthSource)
{
x1 = x0;
}
y1 = y0 + 1;
if (y1 >= heightSource)
{
y1 = y0;
}
// Read source color
c = pixels[y0 * widthSource + x0];
c1a = (byte)(c >> 24);
c1r = (byte)(c >> 16);
c1g = (byte)(c >> 8);
c1b = (byte)(c);
c = pixels[y0 * widthSource + x1];
c2a = (byte)(c >> 24);
c2r = (byte)(c >> 16);
c2g = (byte)(c >> 8);
c2b = (byte)(c);
c = pixels[y1 * widthSource + x0];
c3a = (byte)(c >> 24);
c3r = (byte)(c >> 16);
c3g = (byte)(c >> 8);
c3b = (byte)(c);
c = pixels[y1 * widthSource + x1];
c4a = (byte)(c >> 24);
c4r = (byte)(c >> 16);
c4g = (byte)(c >> 8);
c4b = (byte)(c);
// Calculate colors
// Alpha
l0 = ifracx * c1a + fracx * c2a;
l1 = ifracx * c3a + fracx * c4a;
a = (byte)(ifracy * l0 + fracy * l1);
// Write destination
if (a > 0)
{
// Red
l0 = ifracx * c1r + fracx * c2r;
l1 = ifracx * c3r + fracx * c4r;
rf = ifracy * l0 + fracy * l1;
// Green
l0 = ifracx * c1g + fracx * c2g;
l1 = ifracx * c3g + fracx * c4g;
gf = ifracy * l0 + fracy * l1;
// Blue
l0 = ifracx * c1b + fracx * c2b;
l1 = ifracx * c3b + fracx * c4b;
bf = ifracy * l0 + fracy * l1;
// Cast to byte
float alpha = a / 255.0f;
r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
}
else
{
// Alpha, Red, Green, Blue
// read the gray value once: the original indexed baseImage with srcIdx in the
// same expression that incremented it, which is undefined behaviour in C++
byte gray = baseImage[srcIdx];
pd[srcIdx++] = (255 << 24) | (gray << 16) | (gray << 8) | gray;
}
}
}
unsigned int end = clock() - start;
return end;
}
Github repo
One action that may speed up your code is to avoid type conversions from integer to float and vice versa. This can be achieved by using an int value in a suitable range instead of floats in the range 0..1.
Something like this:
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
int sx1 = x * widthSource;
int x0 = sx1 / width;
int fracx = sx1 % width; // range 0..width-1
which turns into something like
l0 = (fracx * c2a + (width - fracx) * c1a) / width;
And so on. A bit tricky but doable.
Thank you for all the help, but the problem was the managed C++ project. I have now transferred the function to my native C++ library and use the managed C++ part only as a wrapper for the C# application.
After compiler optimization the function now finishes in 1 ms.
Edit:
I will mark my own answer as the solution for now, because the optimization from @marom leads to a broken image.
The common way to speed up a resize operation with bilinear interpolation is to:
Exploit the fact that x0 and fracx are independent of the row and that y0 and fracy are independent of the column. Even though you haven't pulled the computation of y0 and fracy out of the x-loop, compiler optimization should take care of that. However, for x0 and fracx, one needs to pre-compute the values for all columns and store them in an array. The complexity of computing x0 and fracx becomes O(width), compared to O(width*height) without pre-computation.
Do the whole processing with integers by replacing floating-point arithmetic with integer arithmetic, thereby using shift operations instead of integer divisions.
For better readability, I did not implement the pre-computation of x0 and fracx in the following code (a sketch of it follows at the end of this answer). Pre-computation is straightforward anyway.
Note that FACTOR = 2048 is the max you can do with 32-bit signed integers here (2048 * 2048 * 255 is just fine). For higher precision, you should switch to int64_t and then increase FACTOR and SHIFT, respectively.
I placed the border check into the inner loop for better readability. For an optimized implementation one should remove it by iterating in both loops just before this case happens and add special handling for the border pixels.
In case someone is wondering what the + (FACTOR * FACTOR / 2) is for, it is for rounding in conjunction with the subsequent division.
Finally note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.
#define FACTOR 2048
#define SHIFT 11
const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);
for (int y = 0; y < height; y++)
{
const int sy = y * ys;
const int y0 = sy >> SHIFT;
const int fracy = sy - (y0 << SHIFT);
for (int x = 0; x < width; x++)
{
const int sx = x * xs;
const int x0 = sx >> SHIFT;
const int fracx = sx - (x0 << SHIFT);
if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
{
// insert special handling here
continue;
}
const int offset = y0 * widthSource + x0;
target[y * width + x] = (unsigned char)
((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
source[offset + 1] * fracx * (FACTOR - fracy) +
source[offset + widthSource] * (FACTOR - fracx) * fracy +
source[offset + widthSource + 1] * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
}
}
For clarification, to match the variables used by the OP: in the case of the alpha channel, for instance, it is:
a = (unsigned char)
((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
c2a * fracx * (FACTOR - fracy) +
c3a * (FACTOR - fracx) * fracy +
c4a * fracx * fracy +
(FACTOR * FACTOR / 2)) >> (2 * SHIFT));
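For completeness, the pre-computation of x0 and fracx mentioned above could look like this (a sketch; the table names x0tab/fracxtab are introduced here):
#include <vector>

// Per-column interpolation data, computed once before the y-loop.
std::vector<int> x0tab(width), fracxtab(width);
for (int x = 0; x < width; x++)
{
    const int sx = x * xs;
    x0tab[x] = sx >> SHIFT;
    fracxtab[x] = sx - (x0tab[x] << SHIFT);
}
// Inside the pixel loop, the per-pixel computation then reduces to lookups:
// const int x0 = x0tab[x];
// const int fracx = fracxtab[x];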

Invert a (small) permutation

I'm using the shuffle function of OpenCL to sort a float3 vector, like this (the last component of the actual 4d vector is ignored):
uint4 mask = (uint4)(0,1,2,3);
mask.xyz = res.x < res.y ? (res.x >= res.z ? mask.yxz : mask.yzx) : (res.y >= res.z ? mask.xyz : mask.xzy);
float4 abcd = shuffle(res,mask);
I then manipulate each component of the vector abcd, and want to reverse the sorting permutation, as follows:
uint4 inv_mask = ... // ???
res = shuffle(abcd,inv_mask); // Inverse the sorting permutation
How do I calculate the inverse mask efficiently?
The number of possibilities is very limited:
x >= y >= z => mask.xyz = (0,1,2), inv_mask = (0,1,2)
x >= z >= y => mask.xyz = (0,2,1), inv_mask = (0,2,1)
y >= x >= z => mask.xyz = (1,0,2), inv_mask = (1,0,2)
y >= z >= x => mask.xyz = (1,2,0), inv_mask = (2,0,1)
z >= x >= y => mask.xyz = (2,0,1), inv_mask = (1,2,0)
z >= y >= x => mask.xyz = (2,1,0), inv_mask = (2,1,0)
Notice that only two of the six possible permutations (the two 3-cycles) contain more than one swap; the remaining four permutations are their own inverses.
Once you have computed mask, you can use the following code to get inv_mask:
inv_mask.xyz = mask.xyz == (uint3)(1,2,0) ? (uint3)(2,0,1) : (mask.xyz == (uint3)(2,0,1) ? (uint3)(1,2,0) : mask.xyz);
Did you mean
uint4 invmask = (uint4)(3,3,3,3) - mask;
?
For a mask (0,3,1,2) this gives you (3-0, 3-3, 3-1, 3-2) = (3,0,2,1)
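More generally (an addition for illustration, not from either answer): the inverse of any permutation p satisfies inv[p[i]] = i, so a scalar loop computes it directly. With OpenCL vectors, whose components cannot be indexed dynamically, the case enumeration above is the practical substitute.
// Invert a permutation p of {0, ..., n-1} by scattering: q[p[i]] = i.
void invert_permutation(const unsigned* p, unsigned* q, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        q[p[i]] = i;
}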

C/C++ Bit Twiddling

In the spirit of graphics.stanford.edu/~seander/bithacks.html, I need to solve the following problem:
int x;
int pow2; // always a positive power of 2
int sgn; // always either 0 or 1
// ...
// ...
if(sgn == 0)
x -= pow2;
else
x += pow2;
Of course I need to avoid the conditional. So far the best I have come up with is
x -= (1|(~sgn+1))*pow2
but that involves a multiplication, which I would also like to avoid. Thanks in advance.
EDIT: Thanks all,
x -= (pow2^-sgn) + sgn
seems to do the trick!
I would try
x -= (pow2 ^ (~sgn+1)) + sgn
or, as suggested by lijie in the comments
x -= (pow2 ^ -sgn) + sgn
If sgn is 0, ~sgn+1 is also 0, so pow2 ^ (~sgn+1) == pow2. If sgn is 1, (~sgn+1) is 0xFFFFFFFF, and (pow2 ^ (~sgn+1)) + sgn == -pow2.
mask = sgn - 1; // generate mask: sgn == 0 => mask = -1, sgn == 1 => mask = 0
x = x + (mask & (-pow2)) + (~mask & (pow2)); // use mask to select +/- pow2 for addition
Off the top of my head:
int subMask = sgn - 1;
x -= pow2 & subMask;
int addMask = -sgn;
x += pow2 & addMask;
No guarantees on whether it works or whether this is smart, this is just a random idea that popped into my head.
EDIT: let's make this a bit less readable (aka more compact):
x += (pow2 & -sgn) - (pow2 & (sgn-1));
I would change the interface and replace the multiplication by a left shift (use the exponent instead of pow2).
You can do something like (from the link)
x += ((pow2 ^ -sgn) + sgn)
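A quick brute-force check of the accepted identity against the original conditional (a test sketch added here, not part of any answer):
#include <cassert>

// Branch-free form from the accepted edit: subtract pow2 when sgn == 0,
// add pow2 when sgn == 1.
int apply(int x, int pow2, int sgn)
{
    return x - ((pow2 ^ -sgn) + sgn);
}

int main()
{
    for (int sgn = 0; sgn <= 1; ++sgn)
        for (int k = 0; k < 16; ++k)
            for (int x = -1000; x <= 1000; ++x)
            {
                const int pow2 = 1 << k;
                const int expected = (sgn == 0) ? x - pow2 : x + pow2;
                assert(apply(x, pow2, sgn) == expected);
            }
    return 0;
}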