How to speed up my bilinear interpolation?

How to speed up my bilinear interpolation? - c++

I have spent countless hours trying to speed up my bilinear interpolation up, with no avail. I even tried a SSE version (a double version and a float version), but that was even slower than this version.
Does anyone have any ideas?
template <typename T>
__forceinline void interp2_mx(const T& x, const T& y,
const T* z,
const int32_t& n,
const int32_t& mm2,
const int32_t& nm2,
T& val,
const T& extrapval = T(0))
{
int64_t xp = (int64_t)(x) - 1; // adjust for MATLAB indexing
int64_t yp = (int64_t)(y) - 1;
if (xp < 0 || xp > nm2 || yp < 0 || yp > mm2)
{
val = extrapval;
}
else
{
const T* line = z + yp * n + xp;
T xf = x - (int64_t)(x);
T yf = y - (int64_t)(y);
T x1mf = (T)1 - xf;
T y1mf = (T)1 - yf;
T v00 = x1mf * y1mf * (*(line));
T v01 = xf * y1mf * (*(line + 1));
T v10 = x1mf * yf * (*(line + n));
T v11 = xf * yf * (*(line + n + 1));
val = v00 + v01 + v10 + v11;
}
}
template <typename T>
void interp2(const T* z,
const int32_t& mz, const int32_t& nz,
const T* xi, const T* yi,
const int32_t& mi, const int32_t& ni,
T* zi,
const T& extrapval = T(0))
{
const int32_t nzm2 = nz - 2;
const int32_t mzm2 = mz - 2;
#pragma omp parallel for
for (int m = 0; m < mi; ++m)
{
T* line_zi = zi + m * ni;
const T* x = xi + m * ni;
const T* y = yi + m * ni;
for (int n = 0; n < ni; ++n, ++x, ++y, ++line_zi)
{
interp2_mx((*x), (*y), z, nz, mzm2, nzm2, (*line_zi));
}
}
}

Your calculation of xf does a float-to-int64_t-to-float conversion. I assume you know the value is in range, otherwise this would be Undefined Behavior (and mathematically pointless). std::modf() may be the better function as it directly calculates the desired value.
I also think that adjacent pixels have rather related xf & x1mf values, yet you recalculate them. I'm not sure about this as your coordinates seem to be stored indirectly ((*x), (*y)). It may very well be more efficient to calculate those on the fly. Since these pointers may alias the output, they can't be prefetched and the reads will be blocking the memory bus.

Related

Arbitrary multiprecision inside the integrand of a GSL monte carlo integration

My aim is to use GSL monte carlo integration for an integrand in which an arbitrary multiprecision library (Boost) is used. I decided to use an arbitrary multiprecision library because the integral had difficulties to reach convergence.
This is the actual mathematical formula describing what I am trying to code. My guess is that I do not reach convergence and therefore NaN because and can get very small values, near zeros.
This is my code:
mp::float128 PDFfunction(double invL, int t, double invtau, double x0, double x, int n_lim) {
const double c = M_PI * (M_PI/4) * ((2 * t) * invtau);
mp::float128 res = 0;
for(int n = 1; n <= n_lim; ++n){
res += exp(-1 * (n * n) * c) * cos((n * M_PI * x) * invL) * cos((n * M_PI * x0) * invL);
}
mp::float128 res_tot = invL + ((2 * invL) * res);
return res_tot;
}
The following lines define the integral that I carry out using GSL:
struct my_f_params {double x0; double xt_pos; double y0; double yt_pos; double invLx; double invLy;
double invtau_x; double invtau_y; int n_lim; double tax_rate;};
double g(double *k, size_t dim, void *p){
struct my_f_params * fp = (struct my_f_params *)p;
mp::float128 temp_pbx = prob1Dbox(fp->invLx, k[0], fp->invtau_x, fp->x0, fp->xt_pos, fp->n_lim);
mp::float128 temp_pby = prob1Dbox(fp->invLy, k[0], fp->invtau_y, fp->y0, fp->yt_pos, fp->n_lim);
mp::float128 AFac = (-2 * k[0] * fp->tax_rate);
mp::float128 res = exp(log(temp_pbx) + log(temp_pby) + AFac);
return res.convert_to<double>();
}
double integrate_integral(const double& x0, const double& xt_pos, const double& y0,
const double& yt_pos, const double& invLx, const double& invLy, const double& invtau_x,
const double& invtau_y, const int& n_lim, const double& tax_rate){
double res, err;
double xl[1] = {0};
double xu[1] = {10000000};
const gsl_rng_type *T;
gsl_rng *r;
gsl_monte_function G;
struct my_f_params params = {x0, xt_pos, y0, yt_pos, invLx, invLy, invtau_x, invtau_y, n_lim, tax_rate};
G.f = &g;
G.dim = 1;
G.params = &params;
size_t calls = 10000;
gsl_rng_env_setup ();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
gsl_monte_vegas_state *s = gsl_monte_vegas_alloc (1);
gsl_monte_vegas_integrate (&G, xl, xu, 1, 10000, r, s,
&res, &err);
do
{
gsl_monte_vegas_integrate (&G, xl, xu, 1, calls/5, r, s,
&res, &err);
}
while (fabs (gsl_monte_vegas_chisq (s) - 1.0) > 0.5);
gsl_monte_vegas_free (s);
gsl_rng_free (r);
return res;
}
When I try to run integral_integrate with x0 = 0; xt_pos = 0; y0 = 0; yt_pos = 10; invLx = invLy = 0.09090909; invtau_x = invtau_y = 0.000661157; n_lim = 1000; tax_rate = 7e-8; I get NaN. Why is this the case? I was not expecting this result since I used Log-Sum-Exp to get rid of possible underflow.

Can MATLAB C generation coder generate C-code that fits embedded system?

I need to convert this code into C code.
Questions:
Will MATLAB Coder generate C code that are memory safe, e.g they not using calloc or malloc. Misra C standard does not allow coder to use dynamical memory allocation. It's dangerous for embedded system due to memory leaks.
Will MATLAB Coder generate C code with dynamical matrix as argument e.g. functions with arguments foo(float* A, int m, int n) or foo(int m, int n, float A[m][n]) or is fix size example foo(float A[3][5]), only available as option?
Will MATLAB Coder generate C code that can be fitted into an embedded system. How about the internal C++ commands in the .m files such as horzcat, size and vertcat? Will they become 100% portable C-code?
Will MATLAB Coder generate functions that have call by reference? Example foo(float* input, float* output) instead of float* output = foo(float* input)
function [U] = mpc (A, B, C, x, N, r, lb)
## Find matrix
PHI = phiMat(A, C, N);
GAMMA = gammaMat(A, B, C, N);
## Solve first with no constraints
U = solve(PHI, GAMMA, x, N, r, 0, 0, false);
## Then use the last U as upper bound
U = solve(PHI, GAMMA, x, N, r, lb, U(end), true);
end
function U = solve(PHI, GAMMA, x, N, r, lb, ub, constraints)
## Set U
U = zeros(N, 1);
## Iterate Gaussian Elimination
for i = 1:N
## Solve u
if(i == 1)
u = (r - PHI(i,:)*x)/GAMMA(i,i)
else
u = (r - PHI(i,:)*x - GAMMA(i,1:i-1)*U(1:i-1) )/GAMMA(i,i)
end
## Constraints
if(constraints == true)
if(u > ub)
u = ub;
elseif(u < lb)
u = lb;
end
end
## Save u
U(i) = u
end
end
function PHI = phiMat(A, C, N)
## Create the special Observabillity matrix
PHI = [];
for i = 1:N
PHI = vertcat(PHI, C*A^i);
end
end
function GAMMA = gammaMat(A, B, C, N)
## Create the lower triangular toeplitz matrix
GAMMA = [];
for i = 1:N
GAMMA = horzcat(GAMMA, vertcat(zeros((i-1)*size(C*A*B, 1), size(C*A*B, 2)),cabMat(A, B, C, N-i+1)));
end
end
function CAB = cabMat(A, B, C, N)
## Create the column for the GAMMA matrix
CAB = [];
for i = 0:N-1
CAB = vertcat(CAB, C*A^i*B);
end
end
My C-code. Yes its working!
/*
* Generalized_Predictive_Control.c
*
* Created on:
* Author:
*/
#include "Generalized_Predictive_Control.h"
/*
* Parameters
*/
int adim;
int ydim;
int rdim;
int horizon;
/*
* Deceleration
*/
static void obsv(float* PHI, const float* A, const float* C);
static void kalman(float* x, const float* A, const float* B, float* u, const float* K, float* y, const float* C);
static void mul(float* A, float* B, float* C, int row_a, int column_a, int column_b);
static void tran(float* A, int row, int column);
static void CAB(float* GAMMA, float* PHI, const float* A, const float* B, const float* C);
static void solve(float* GAMMA, float* PHI, float* x, float* u, float* r, float lb, float ub, int constraintsON);
static void print(float* A, int row, int column);
void GPC(int adim_, int ydim_, int rdim_, int horizon_, const float* A, const float* B, const float* C, const float* D, const float* K, float* u, float* r, float* y, float* x){
/*
* Set the dimensions
*/
adim = adim_;
ydim = ydim_;
rdim = rdim_;
horizon = horizon_;
/*
* Identify the model - Extended Least Square
*/
int n = 5;
float* phi;
float* theta;
//els(phi, theta, n, y, u, P);
/*
* Create a state space model with Observable canonical form
*/
/*
* Create the extended observability matrix
*/
float PHI[horizon*ydim*adim];
memset(PHI, 0, horizon*ydim*adim*sizeof(float));
obsv(PHI, A, C);
/*
* Create the lower triangular toeplitz matrix
*/
float GAMMA[horizon*rdim*horizon*ydim];
memset(GAMMA, 0, horizon*rdim*horizon*ydim*sizeof(float));
CAB(GAMMA, PHI, A, B, C);
/*
* Solve the best input value
*/
solve(GAMMA, PHI, x, u, r, 0, 0, 0);
solve(GAMMA, PHI, x, u, r, 0, *(u), 1);
/*
* Estimate the state vector
*/
kalman(x, A, B, u, K, y, C);
}
/*
* Identify the model
*/
static void els(float* P, float* phi, float* theta, int polyLength, int totalPolyLength, float* y, float* u, float* e){
/*
* move phi with the inputs, outputs, errors one step to right
*/
for(int i = 0; i < polyLength; i++){
*(phi + i+1 + totalPolyLength*0) = *(phi + i + totalPolyLength*0); // Move one to right for the y's
*(phi + i+1 + totalPolyLength*1) = *(phi + i + totalPolyLength*1); // Move one to right for the u's
*(phi + i+1 + totalPolyLength*2) = *(phi + i + totalPolyLength*2); // Move one to right for the e's
}
/*
* Add the current y, u and e
(*phi + totalPolyLength*0) = -*(y + 0); // Need to be negative!
(*phi + totalPolyLength*1) = *(u + 0);
(*phi + totalPolyLength*2) = *(e + 0);
*/
/*
* phi'*theta
*/
float y_est = 0;
for(int i = 0; i < totalPolyLength; i++){
y_est += *(phi + i) * *(theta + i);
}
float epsilon = *(y + 0) - y_est; // In this case, y is only one element array
/*
* phi*epsilon
*/
float phi_epsilon[totalPolyLength];
memset(phi_epsilon, 0, totalPolyLength*sizeof(float));
for(int i = 0; i < totalPolyLength; i++){
*(phi_epsilon + i) = *(phi + i) * epsilon;
}
/*
* P_vec = P*phi_epsilon
*/
float P_vec[totalPolyLength];
memset(P_vec, 0, totalPolyLength*sizeof(float));
mul(P, phi_epsilon, P_vec, totalPolyLength, totalPolyLength, 1);
/*
* Update our estimated vector theta = theta + P_vec
*/
for(int i = 0; i < totalPolyLength; i++){
*(theta + i) = *(theta + i) + *(P_vec + i);
}
/*
* Update P = P - (P*phi*phi'*P)/(1 + phi'*P*phi)
*/
// Create phi'
float phiT[totalPolyLength];
memset(phiT, 0, totalPolyLength*sizeof(float));
memcpy(phiT, phi, totalPolyLength*sizeof(float));
tran(phiT, totalPolyLength, 1);
// phi'*P
float phiT_P[totalPolyLength];
memset(phiT_P, 0, totalPolyLength*sizeof(float));
mul(phiT, P, phiT_P, 1, totalPolyLength, totalPolyLength);
// phi*phi'*P
float phi_phiT_P[totalPolyLength*totalPolyLength];
memset(phi_phiT_P, 0, totalPolyLength*totalPolyLength*sizeof(float));
mul(phi, phiT_P, phi_phiT_P, totalPolyLength, 1, totalPolyLength);
// P*phi*phi'*P
float P_phi_phiT_P[totalPolyLength*totalPolyLength];
memset(P_phi_phiT_P, 0, totalPolyLength*totalPolyLength*sizeof(float));
mul(P, phi_phiT_P, P_phi_phiT_P, totalPolyLength, totalPolyLength, totalPolyLength);
// P*phi
float P_phi[totalPolyLength];
memset(P_phi, 0, totalPolyLength*sizeof(float));
mul(P, phi, P_phi, totalPolyLength, totalPolyLength, 1);
// phi'*P*phi
float phiT_P_phi[1];
memset(phiT_P_phi, 0, 1*sizeof(float));
mul(phiT, P_phi, phiT_P_phi, 1, totalPolyLength, 1);
// P = P - (P_phi_phiT_P) / (1+phi'*P*phi)
for(int i = 0; i < totalPolyLength*totalPolyLength; i++){
*(P + i) = *(P + i) - *(P_phi_phiT_P + i) / (1 + *(phiT_P_phi));
}
}
/*
* This will solve if GAMMA is square!
*/
static void solve(float* GAMMA, float* PHI, float* x, float* u, float* r, float lb, float ub, int constraintsON){
/*
* Now we are going to solve on the form
* Ax=b, where b = (R*r-PHI*x) and A = GAMMA and x = U
*/
/*
* R_vec = R*r
*/
float R_vec[horizon*ydim];
memset(R_vec, 0, horizon*ydim*sizeof(float));
for(int i = 0; i < horizon*ydim; i++){
for (int j = 0; j < rdim; j++) {
*(R_vec + i + j) = *(r + j);
}
i += rdim-1;
}
/*
* PHI_vec = PHI*x
*/
float PHI_vec[horizon*ydim];
memset(PHI_vec, 0, horizon * ydim * sizeof(float));
mul(PHI, x, PHI_vec, horizon*ydim, adim, 1);
/*
* Solve now (R_vec - PHI_vec) = GAMMA*U
* Notice that this is ONLY for Square GAMMA with lower triangular toeplitz matrix e.g SISO case
* This using Gaussian Elimination backward substitution
*/
float U[horizon];
float sum = 0.0;
memset(U, 0, horizon*sizeof(float));
for(int i = 0; i < horizon; i++){
for(int j = 0; j < i; j++){
sum += *(GAMMA + i*horizon + j) * *(U + j);
}
float newU = (*(R_vec + i) - *(PHI_vec + i) - sum) / (*(GAMMA + i*horizon + i));
if(constraintsON == 1){
if(newU > ub)
newU = ub;
if(newU < lb)
newU = lb;
}
*(U + i) = newU;
sum = 0.0;
}
//print(U, horizon, 1);
/*
* Set last U to u
*/
if(constraintsON == 0){
*(u + 0) = *(U + horizon - 1);
}else{
*(u + 0) = *(U + 0);
}
}
/*
* Lower traingular toeplitz of extended observability matrix
*/
static void CAB(float* GAMMA, float* PHI, const float* A, const float* B, const float* C){
/*
* First create the initial C*A^0*B == C*I*B == C*B
*/
float CB[ydim*rdim];
memset(CB, 0, ydim*rdim*sizeof(float));
mul((float*)C, (float*)B, CB, ydim, adim, rdim);
/*
* Take the transpose of CB so it will have dimension rdim*ydim instead
*/
tran(CB, ydim, rdim);
/*
* Create the CAB matrix from PHI*B
*/
float PHIB[horizon*ydim*rdim];
mul(PHI, (float*) B, PHIB, horizon*ydim, adim, rdim); // CAB = PHI*B
tran(PHIB, horizon*ydim, rdim);
/*
* We insert GAMMA = [CB PHI;
* 0 CB PHI;
* 0 0 CB PHI;
* 0 0 0 CB PHI] from left to right
*/
for(int i = 0; i < horizon; i++) {
for(int j = 0; j < rdim; j++) {
memcpy(GAMMA + horizon*ydim*(i*rdim+j) + ydim*i, CB + ydim*j, ydim*sizeof(float)); // Add CB
memcpy(GAMMA + horizon*ydim*(i*rdim+j) + ydim*i + ydim, PHIB + horizon*ydim*j, (horizon-i-1)*ydim*sizeof(float)); // Add PHI*B
}
}
/*
* Transpose of gamma
*/
tran(GAMMA, horizon*rdim, horizon*ydim);
//print(CB, rdim, ydim);
//print(PHIB, rdim, horizon*ydim);
//print(GAMMA, horizon*ydim, horizon*rdim);
}
/*
* Transpose
*/
static void tran(float* A, int row, int column) {
float B[row*column];
float* transpose;
float* ptr_A = A;
for (int i = 0; i < row; i++) {
transpose = &B[i];
for (int j = 0; j < column; j++) {
*transpose = *ptr_A;
ptr_A++;
transpose += row;
}
}
// Copy!
memcpy(A, B, row*column*sizeof(float));
}
/*
* [C*A^1; C*A^2; C*A^3; ... ; C*A^horizon] % Extended observability matrix
*/
static void obsv(float* PHI, const float* A, const float* C){
/*
* This matrix will A^(i+1) all the time
*/
float A_pow[adim*adim];
memset(A_pow, 0, adim * adim * sizeof(float));
float A_copy[adim*adim];
memcpy(A_copy, (float*) A, adim * adim * sizeof(float));
/*
* Temporary matrix
*/
float T[ydim*adim];
memset(T, 0, ydim * adim * sizeof(float));
/*
* Regular T = C*A^(1+i)
*/
mul((float*) C, (float*) A, T, ydim, adim, adim);
/*
* Insert temporary T into PHI
*/
memcpy(PHI, T, ydim*adim*sizeof(float));
/*
* Do the rest C*A^(i+1) because we have already done i = 0
*/
for(int i = 1; i < horizon; i++){
mul((float*) A, A_copy, A_pow, adim, adim, adim); // Matrix power A_pow = A*A_copy
mul((float*) C, A_pow, T, ydim, adim, adim); // T = C*A^(1+i)
memcpy(PHI + i*ydim*adim, T, ydim*adim*sizeof(float)); // Insert temporary T into PHI
memcpy(A_copy, A_pow, adim * adim * sizeof(float)); // A_copy <- A_pow
}
}
/*
* x = Ax - KCx + Bu + Ky % Kalman filter
*/
static void kalman(float* x, const float* A, const float* B, float* u, const float* K, float* y, const float* C) {
/*
* Compute the vector A_vec = A*x
*/
float A_vec[adim*1];
memset(A_vec, 0, adim*sizeof(float));
mul((float*) A, x, A_vec, adim, adim, 1);
/*
* Compute the vector B_vec = B*u
*/
float B_vec[adim*1];
memset(B_vec, 0, adim*sizeof(float));
mul((float*) B, u, B_vec, adim, rdim, 1);
/*
* Compute the vector C_vec = C*x
*/
float C_vec[ydim*1];
memset(C_vec, 0, ydim*sizeof(float));
mul((float*) C, x, C_vec, ydim, adim, 1);
/*
* Compute the vector KC_vec = K*C_vec
*/
float KC_vec[adim*1];
memset(KC_vec, 0, adim*sizeof(float));
mul((float*) K, C_vec, KC_vec, adim, ydim, 1);
/*
* Compute the vector Ky_vec = K*y
*/
float Ky_vec[adim*1];
memset(Ky_vec, 0, adim*sizeof(float));
mul((float*) K, y, Ky_vec, adim, ydim, 1);
/*
* Now add x = A_vec - KC_vec + B_vec + Ky_vec
*/
for(int i = 0; i < adim; i++){
*(x + i) = *(A_vec + i) - *(KC_vec + i) + *(B_vec + i) + *(Ky_vec + i);
}
}
/*
* C = A*B
*/
static void mul(float* A, float* B, float* C, int row_a, int column_a, int column_b) {
// Data matrix
float* data_a = A;
float* data_b = B;
for (int i = 0; i < row_a; i++) {
// Then we go through every column of b
for (int j = 0; j < column_b; j++) {
data_a = &A[i * column_a];
data_b = &B[j];
*C = 0; // Reset
// And we multiply rows from a with columns of b
for (int k = 0; k < column_a; k++) {
*C += *data_a * *data_b;
data_a++;
data_b += column_b;
}
C++; // ;)
}
}
}
/*
* Print matrix or vector - Just for error check
*/
static void print(float* A, int row, int column) {
for (int i = 0; i < row; i++) {
for (int j = 0; j < column; j++) {
printf("%0.18f ", *(A++));
}
printf("\n");
}
printf("\n");
}

Disclaimer: I work on MATLAB Coder
There is a configuration setting to tell MATLAB Coder to generate code without using dynamically allocated memory or issue an error telling you why it can't do so.
cfg = coder.config('lib');
cfg.DynamicMemoryAllocation = 'Off';
codegen -config cfg ...
MATLAB Coder supports generating code with fixed-size arrays, variable-sized arrays, and dynamically allocated arrays. The various generated signature formats are shown in the documentation. For non-dynamically allocated variable-sized arrays, a common signature is something like: foo(x_data[100], x_size[2])
Yes, the generated code is generally portable and independent of MATLAB for the hardware you specify when generating code. The full list of available functions and classes supported for code generation is listed here. In a very small number of cases, the generated code needs to depend on libraries from MATLAB. Those cases will be called out in the documentation. Fundamental operations like horzcat and vertcat produce portable code that is independent of MATLAB.
Yes. For array outputs and MATLAB functions with multiple outputs, the generated code will return outputs by reference. It also supports passing an argument by reference in some cases when the corresponding MATLAB function has the same variable as an input and output: function A = foo(A,B) with a call like: y = foo(y,z); can produce something like void foo(double A[100], const double B[20]); where A is an input and output.

Optimizing custom IDCT calculation

For an image format I have to decode a jpeg like compression. The steps are pretty standard, however the color coding and the produced IDCT values seem to be slightly different (I think, I am no expert). This is why using a standard JPEG library produced invalid results it seems and I had to implement it myself.
Now I am not really happy with the performance. After optimizing the naive approach by doing some caching of data that does not change I need ~150ms to decode a 1024x1024 image. Of that 120ms is spent in the idct function which looks like this:
float idctHelper(const int16_t *inBlock, int32_t u, int32_t v, int32_t blockWidth, int32_t blockHeight) {
glm::vec<4, float, glm::packed_lowp> vec1{};
glm::vec<4, float, glm::packed_lowp> vec2{};
glm::vec<4, float, glm::packed_lowp> vec3{};
float result = 0.0f;
for (auto y = 0; y < blockHeight; ++y) {
for (auto x = 0; x < blockWidth; x += 4) {
const auto idx = (v * 8 + u) * 64 + y * 8 + x;
vec1 = glm::vec<4, float, glm::packed_lowp>(inBlock[y * blockWidth + x], inBlock[y * blockWidth + x + 1], inBlock[y * blockWidth + x + 2], inBlock[y * blockWidth + x + 3]);
vec2 = glm::vec<4, float, glm::packed_lowp>(idctLookup[idx], idctLookup[idx + 1], idctLookup[idx + 2], idctLookup[idx + 3]);
vec3 = vec1 * vec2;
result += vec3.x + vec3.y + vec3.z + vec3.w;
}
}
return result;
}
template<typename T, typename U = T>
U clamp(T value, T min, T max) {
return static_cast<U>(std::min<T>(std::max<T>(value, min), max));
}
void idct(int16_t *outBlock, int16_t *inBlock, bool isLuminance, int32_t blockWidth = 8, int32_t blockHeight = 8) {
for (auto y = 0; y < blockHeight; ++y) {
for (auto x = 0; x < blockWidth; ++x) {
auto value = static_cast<int16_t>(std::round(
0.25f * idctHelper(inBlock, x, y, blockWidth, blockHeight)));
if (isLuminance) {
value = clamp<int16_t>(static_cast<int16_t>(value + 128), 0, 255);
} else {
value = clamp<int16_t>(value, -256, 255);
}
outBlock[y * blockWidth + x] = value;
}
}
}
As you can see I have tried to use (potentially) SIMD and low precision operations to speed up the idctHelper but now I am kinda lost on where I can continue optimizing this. I would really appreciate any idea or hint.

Bilinear Image Sampling Non-Reproducible Access Violation

I have a template 2D image buffer class that can be used with many values types. The values are stored as a 1D dynamic array of T, accessed by a Row method to get a pointer to the correct row.
One of the methods of the class is used to sample a value in the image bilinearly.
The code generally works, but ever so rarely I get an access violation exception in this method in production which I can't seem to recreate, because the crash dump doesn't include the coordinates that were passed to the method.
These are the relevant parts of the code:
T* data;
int width, height;
T* Row(int y) const { return data + width * y; }
T GetValueBilinear(float x, float y) const
{
const float PIXEL_CENTER_OFFSET = 0.5F;
const float cx = clamp(0.0F, width - 1.0F, x - PIXEL_CENTER_OFFSET);
const float cy = clamp(0.0F, height - 1.0F, y - PIXEL_CENTER_OFFSET);
const float tx = fmod(cx, 1.0F);
const float ty = fmod(cy, 1.0F);
const int xInt = (int)cx;
const int yInt = (int)cy;
const T* r0 = Row(yInt);
const T* r1 = ty && yInt < (height - 1) ? Row(yInt + 1) : r0;
//interpolate on Y
const T& c00 = r0[xInt];
const T& c01 = r1[xInt];
T c0 = lerp(c00, c01, ty);
if (tx && xInt < (width - 1))
{
//interpolate on X
const T& c10 = r0[xInt + 1];
const T& c11 = r1[xInt + 1];
T c1 = lerp(c10, c11, ty);
return lerp(c0, c1, tx);
}
else
{
return c0;
}
}
The definitions for clamp, and lerp are:
template <typename T>
inline T clamp(T min, T max, T value) { return value < min ? min : value > max ? max : value; }
template <typename T>
inline T lerp(T a, T b, float t) { return a + (b - a) * t; } //i.e. a(1-t)+bt
Do you see any obvious errors which would cause an access violation for any values of x and y which are not NaN?
You can assume that width, height and data are valid and correct (i.e., positive dimensions - in this particular case 1280x720, data is not dangling pointer).
If it matters, then T is a float in this case.
The fact that this is non-reproducible and generally working 99.9% of the time, makes me feel like it could be an accuracy issue, though I can't see where it would come from.
Alternatively, what debugging techniques could I use to analyze the crash dumps more effectively?

I tested your GetValueBilinear with 1073741824 random values for the pair (x,y) on a 1280x720 data with no access violation.. so I would say it is working fine 99.999999%1 of the time :-) I suspect the problem is not in GetValueBilinear but elsewhere...
#include <cmath>
#include <algorithm>
template <typename T>
inline T clamp(T min, T max, T value) { return value < min ? min : value > max ? max : value; }
template <typename T>
inline T lerp(T a, T b, float t) { return a + (b - a) * t; } //i.e. a(1-t)+bt
template < typename T >
class C
{
public:
C(int w, int h) : height(h), width(w) {
float lower_bound = T(0);
float upper_bound = std::nextafter(T(255), std::numeric_limits<T>::max());
std::uniform_real_distribution<float> unif(lower_bound, upper_bound);
std::default_random_engine re;
data = new T[width*height];// I know... a leak! But... who cares?!
std::generate(data, data + (width*height), [&]() {return unif(re); });
}
T GetValueBilinear(float x, float y) const
{
const float PIXEL_CENTER_OFFSET = 0.5F;
const float cx = clamp(0.0F, width - 1.0F, x - PIXEL_CENTER_OFFSET);
const float cy = clamp(0.0F, height - 1.0F, y - PIXEL_CENTER_OFFSET);
const float tx = fmod(cx, 1.0F);
const float ty = fmod(cy, 1.0F);
const int xInt = (int)cx;
const int yInt = (int)cy;
const T* r0 = Row(yInt);
const T* r1 = ty && yInt < (height - 1) ? Row(yInt + 1) : r0;
//interpolate on Y
const T& c00 = r0[xInt];
const T& c01 = r1[xInt];
T c0 = lerp(c00, c01, ty);
if (tx && xInt < (width - 1))
{
//interpolate on X
const T& c10 = r0[xInt + 1];
const T& c11 = r1[xInt + 1];
T c1 = lerp(c10, c11, ty);
return lerp(c0, c1, tx);
}
else
{
return c0;
}
}
T* data;
int width, height;
T* Row(int y) const { return data + width * y; }
};
#include <random>
#include <iostream>
#include <Windows.h>
float x;
float y;
LONG WINAPI my_filter(_In_ struct _EXCEPTION_POINTERS *ExceptionInfo)
{
std::cout << x << " " << y << "\n";
return EXCEPTION_EXECUTE_HANDLER;
}
int main()
{
auto a = ::SetUnhandledExceptionFilter(my_filter);
float lower_bound = -(1 << 20);
float upper_bound = -lower_bound;
std::uniform_real_distribution<float> unif(lower_bound, upper_bound);
std::default_random_engine re;
float acc = 0;
C<float> img(1280, 720);
img.GetValueBilinear(1.863726958e-043, 1.5612089e-038);
for (size_t i = 0; i < (1 << 30); i++) {
x = unif(re);
y = unif(re);
acc += img.GetValueBilinear(x, y);
}
return static_cast<int>(acc);
}
1Even if no access violation was found I cannot say that the algorithm works well 100%, using a naïve model and this R code:
prop.test(0,1073741824)
I get a confidence interval for the true value of the proportion, the interval is (0.000000e+00, 4.460345e-09) and so the success percentage is (1-4.460345e-09)*100, but... do not trust me, I am not a statistician!

Fast implementation of covariance of two 8-bit arrays

I need to compare a big amount of similar images of small size (up to 200x200).
So I try to implement SSIM (Structural similarity see https://en.wikipedia.org/wiki/Structural_similarity ) algorithm.
SSIM requires calculation of covariance of two 8-bit gray images.
A trivial implementation look like:
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
float sum = 0;
for(size_t i = 0; i < size; ++i)
sum += (x[i] - averageX) * (y[i] - averageY);
return sum / size;
}
But it has poor performance.
So I hope to improve it with using SIMD or CUDA (I heard that it can be done).
Unfortunately I have no experience to do this.
How it will look? And where I have to go?

I have another nice solution!
At first I want to mention some mathematical formulas:
averageX = Sum(x[i])/size;
averageY = Sum(y[i])/size;
And therefore:
Sum((x[i] - averageX)*(y[i] - averageY))/size =
Sum(x[i]*y[i])/size - Sum(x[i]*averageY)/size -
Sum(averageX*y[i])/size + Sum(averageX*averageY)/size =
Sum(x[i]*y[i])/size - averageY*Sum(x[i])/size -
averageX*Sum(y[i])/size + averageX*averageY*Sum(1)/size =
Sum(x[i]*y[i])/size - averageY*averageX -
averageX*averageY + averageX*averageY =
Sum(x[i]*y[i])/size - averageY*averageX;
It allows to modify our algorithm:
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
uint32_t sum = 0; // If images will have size greater then 256x256 than you have to use uint64_t.
for(size_t i = 0; i < size; ++i)
sum += x[i]*y[i];
return sum / size - averageY*averageX;
}
And only after that we can use SIMD (I used SSE2):
#include <emmintrin.h>
inline __m128i SigmaXY(__m128i x, __m128i y)
{
__m128i lo = _mm_madd_epi16(_mm_unpacklo_epi8(x, _mm_setzero_si128()), _mm_unpacklo_epi8(y, _mm_setzero_si128()));
__m128i hi = _mm_madd_epi16(_mm_unpackhi_epi8(y, _mm_setzero_si128()), _mm_unpackhi_epi8(y, _mm_setzero_si128()));
return _mm_add_epi32(lo, hi);
}
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
uint32_t sum = 0;
size_t i = 0, alignedSize = size/16*16;
if(size >= 16)
{
__m128i sums = _mm_setzero_si128();
for(; i < alignedSize; i += 16)
{
__m128i _x = _mm_loadu_si128((__m128i*)(x + i));
__m128i _y = _mm_loadu_si128((__m128i*)(y + i));
sums = _mm_add_epi32(sums, SigmaXY(_x, _y));
}
uint32_t _sums[4];
_mm_storeu_si128(_sums, sums);
sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
}
for(; i < size; ++i)
sum += x[i]*y[i];
return sum / size - averageY*averageX;
}

There is a SIMD implementation of the algorithm (I used SSE4.1):
#include <smmintrin.h>
template <int shift> inline __m128 SigmaXY(const __m128i & x, const __m128i & y, __m128 & averageX, __m128 & averageY)
{
__m128 _x = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(x, shift)));
__m128 _y = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(y, shift)));
return _mm_mul_ps(_mm_sub_ps(_x, averageX), _mm_sub_ps(_y, averageY))
}
float SigmaXY(const uint8_t * x, const uint8_t * y, size_t size, float averageX, float averageY)
{
float sum = 0;
size_t i = 0, alignedSize = size/16*16;
if(size >= 16)
{
__m128 sums = _mm_setzero_ps();
__m128 avgX = _mm_set1_ps(averageX);
__m128 avgY = _mm_set1_ps(averageY);
for(; i < alignedSize; i += 16)
{
__m128i _x = _mm_loadu_si128((__m128i*)(x + i));
__m128i _y = _mm_loadu_si128((__m128i*)(y + i));
sums = _mm_add_ps(sums, SigmaXY<0>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<4>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<8>(_x, _y, avgX, avgY);
sums = _mm_add_ps(sums, SigmaXY<12>(_x, _y, avgX, avgY);
}
float _sums[4];
_mm_storeu_ps(_sums, sums);
sum = _sums[0] + _sums[1] + _sums[2] + _sums[3];
}
for(; i < size; ++i)
sum += (x[i] - averageX) * (y[i] - averageY);
return sum / size;
}
I hope that it will useful for you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to speed up my bilinear interpolation? - c++

Related

Arbitrary multiprecision inside the integrand of a GSL monte carlo integration

Can MATLAB C generation coder generate C-code that fits embedded system?

Optimizing custom IDCT calculation

Bilinear Image Sampling Non-Reproducible Access Violation

Fast implementation of covariance of two 8-bit arrays

Categories

Resources