Bilinear interpolation in C/C++ and CUDA
I want to emulate the behavior of CUDA bilinear interpolation on the CPU, but I found that the value returned by tex2D does not seem to fit the bilinear formula.
I guess that casting the interpolation coefficients from float to the 9-bit fixed point format with 8 bits of fractional value [1] produces different values.
According to the conversion formula [2, line 106], the result of the conversion should be the same as the input float when the coefficient is 1/2^n, with n = 0, 1, ..., 8, but I still (though not always) get unexpected values.
Below I report an example of these unexpected values. In this case they always occur when id = 2*n+1; could anyone tell me why?
Src Array:
Src[0][0] = 38;
Src[1][0] = 39;
Src[0][1] = 118;
Src[1][1] = 13;
Texture Definition:
static texture<float4, 2, cudaReadModeElementType> texElnt;
texElnt.addressMode[0] = cudaAddressModeClamp;
texElnt.addressMode[1] = cudaAddressModeClamp;
texElnt.filterMode = cudaFilterModeLinear;
texElnt.normalized = false;
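The code binding texElnt to the Src array is not shown above; the following is only a minimal sketch of one possible setup, assuming a 2x2 cudaArray whose .x channel carries the Src values (the array name, layout and lack of error checking are my assumptions, not the original code):
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
cudaArray* srcArray = NULL;
cudaMallocArray(&srcArray, &desc, 2, 2);

// --- Row 0 holds Src[0][0], Src[1][0]; row 1 holds Src[0][1], Src[1][1]
float4 h_src[4];
h_src[0] = make_float4( 38.f, 0.f, 0.f, 0.f);
h_src[1] = make_float4( 39.f, 0.f, 0.f, 0.f);
h_src[2] = make_float4(118.f, 0.f, 0.f, 0.f);
h_src[3] = make_float4( 13.f, 0.f, 0.f, 0.f);
cudaMemcpyToArray(srcArray, 0, 0, h_src, sizeof(h_src), cudaMemcpyHostToDevice);

// --- Set the addressing/filtering modes listed above, then bind
cudaBindTextureToArray(texElnt, srcArray, desc);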
Kernel Function:
static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
    const int gx  = blockIdx.x*blockDim.x + threadIdx.x;
    const int gy  = blockIdx.y*blockDim.y + threadIdx.y;
    const int gw  = gridDim.x * blockDim.x;
    const int gid = gy*gw + gx;
    if (gx >= w || gy >= h) {
        return;
    }

    float2 pnt;
    pnt.x = (gx)*(stride)/*1/32*/;
    pnt.y = 0.0625f/*1/16*/;

    float4 result = tex2D(texElnt, pnt.x + 0.5f, pnt.y + 0.5f);
    pdata[gid*3 + 0] = pnt.x;
    pdata[gid*3 + 1] = pnt.y;
    pdata[gid*3 + 2] = result.x;
}
Bilinear Result of CUDA
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.0000000
1 0.03125 0.0625 42.6171875
2 0.06250 0.0625 42.6484375
3 0.09375 0.0625 42.2656250
4 0.12500 0.0625 42.2968750
5 0.15625 0.0625 41.9140625
6 0.18750 0.0625 41.9453125
7 0.21875 0.0625 41.5625000
8 0.25000 0.0625 41.5937500
9 0.28125 0.0625 41.2109375
10 0.31250 0.0625 41.2421875
11 0.34375 0.0625 40.8593750
12 0.37500 0.0625 40.8906250
13 0.40625 0.0625 40.5078125
14 0.43750 0.0625 40.5390625
15 0.46875 0.0625 40.1562500
16 0.50000 0.0625 40.1875000
17 0.53125 0.0625 39.8046875
18 0.56250 0.0625 39.8359375
19 0.59375 0.0625 39.4531250
20 0.62500 0.0625 39.4843750
21 0.65625 0.0625 39.1015625
22 0.68750 0.0625 39.1328125
23 0.71875 0.0625 38.7500000
24 0.75000 0.0625 38.7812500
25 0.78125 0.0625 38.3984375
26 0.81250 0.0625 38.4296875
27 0.84375 0.0625 38.0468750
28 0.87500 0.0625 38.0781250
29 0.90625 0.0625 37.6953125
30 0.93750 0.0625 37.7265625
31 0.96875 0.0625 37.3437500
32 1.00000 0.0625 37.3750000
CPU Result:
// convert coefficient ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed point format
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.00000000
1 0.03125 0.0625 43.23046875
2 0.06250 0.0625 42.64843750
3 0.09375 0.0625 42.87890625
4 0.12500 0.0625 42.29687500
5 0.15625 0.0625 42.52734375
6 0.18750 0.0625 41.94531250
7 0.21875 0.0625 42.17578125
8 0.25000 0.0625 41.59375000
9 0.28125 0.0625 41.82421875
10 0.31250 0.0625 41.24218750
11 0.34375 0.0625 41.47265625
12 0.37500 0.0625 40.89062500
13 0.40625 0.0625 41.12109375
14 0.43750 0.0625 40.53906250
15 0.46875 0.0625 40.76953125
16 0.50000 0.0625 40.18750000
17 0.53125 0.0625 40.41796875
18 0.56250 0.0625 39.83593750
19 0.59375 0.0625 40.06640625
20 0.62500 0.0625 39.48437500
21 0.65625 0.0625 39.71484375
22 0.68750 0.0625 39.13281250
23 0.71875 0.0625 39.36328125
24 0.75000 0.0625 38.78125000
25 0.78125 0.0625 39.01171875
26 0.81250 0.0625 38.42968750
27 0.84375 0.0625 38.66015625
28 0.87500 0.0625 38.07812500
29 0.90625 0.0625 38.30859375
30 0.93750 0.0625 37.72656250
31 0.96875 0.0625 37.95703125
32 1.00000 0.0625 37.37500000
I have left a simple program on my GitHub [3]; after running it you will get two files in D:\.
Edit 2014/01/20
I ran the program with different increments and arrived at the following characterization of tex2D: when alpha multiplied by beta is less than 0.00390625 (i.e. 1/256, the quantization step of the 9-bit fixed point format), the value returned by tex2D does not match the bilinear interpolation formula.
Satisfactory answers have already been provided to this question, so here I just want to give a compendium of hopefully useful information on bilinear interpolation, how it can be implemented in C++, and the different ways it can be done in CUDA.
Maths behind bilinear interpolation
Assume that the original function T(x, y) is sampled at the Cartesian regular grid of points (i, j), with 0 <= i < M1, 0 <= j < M2 and i and j integers. For each value of y, one can first use 0 <= a < 1 to represent an arbitrary point i + a comprised between i and i + 1. Then, a linear interpolation along the y = j axis (which is parallel to the x axis) at that point can be performed, obtaining

r(i + a, j) = (1 - a) * T[i, j] + a * T[i + 1, j]

where r(x, y) is the function interpolating the samples of T(x, y). The same can be done for the line y = j + 1, obtaining

r(i + a, j + 1) = (1 - a) * T[i, j + 1] + a * T[i + 1, j + 1]

Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i+a,j) and r(i+a,j+1). Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, then a linear interpolation along the x = i + a axis (which is parallel to the y axis) can be worked out, so getting the final result

r(i + a, j + b) = (1 - b) * r(i + a, j) + b * r(i + a, j + 1)

Note that the relations between i, j, a, b, x and y are the following

i = floor(x),  a = x - i,  j = floor(y),  b = y - j
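As a minimal, self-contained illustration of the formulas above, here is a plain scalar sketch (the float2 routine of the next section is the one actually used in the remainder of this answer; boundary handling is omitted here for brevity):
#include <math.h>

// --- Bilinear interpolation of a scalar M1 x M2 grid T (row-major, unit spacing) at (x, y)
float bilinear(const float* T, int M1, int M2, float x, float y)
{
    int   i = (int)floorf(x);
    int   j = (int)floorf(y);
    float a = x - i;
    float b = y - j;

    float r_j  = (1.f - a) * T[ j      * M1 + i] + a * T[ j      * M1 + i + 1];   // r(i + a, j)
    float r_j1 = (1.f - a) * T[(j + 1) * M1 + i] + a * T[(j + 1) * M1 + i + 1];   // r(i + a, j + 1)

    return (1.f - b) * r_j + b * r_j1;                                            // r(i + a, j + b)
}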
C/C++ implementation
Let me stress that this implementation, as well as the following CUDA ones, assumes, as done at the beginning, that the samples of T are located on the Cartesian regular grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers (unit spacing). Also, the routine is provided in single precision, complex (float2) arithmetic, but it can be easily cast to other arithmetic of interest.
void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data,
                                         float * __restrict__ h_xout, float * __restrict__ h_yout,
                                         const int M1, const int M2, const int N1, const int N2){

    float2 result_temp1, result_temp2;
    for(int k=0; k<N2; k++){
        for(int l=0; l<N1; l++){

            const int   ind_x = floor(h_xout[k*N1+l]);
            const float a     = h_xout[k*N1+l]-ind_x;

            const int   ind_y = floor(h_yout[k*N1+l]);
            const float b     = h_yout[k*N1+l]-ind_y;

            float2 h00, h01, h10, h11;
            if (((ind_x)   < M1)&&((ind_y)   < M2)) h00 = h_data[ind_y*M1+ind_x];       else h00 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y)   < M2)) h10 = h_data[ind_y*M1+ind_x+1];     else h10 = make_float2(0.f, 0.f);
            if (((ind_x)   < M1)&&((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x];   else h01 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);

            // --- Interpolation along x for the rows j and j + 1
            result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
            result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

            result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
            result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

            // --- Interpolation along y
            h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
            h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

        }
    }
}
The if/else statements within the above code are simply boundary checks. If the sample falls outside the [0, M1-1] x [0, M2-1] region, it is set to 0.
Standard CUDA implementation
This is a "standard" CUDA implementation tracing the above CPU one. No usage of texture memory.
__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data,
                                                  const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                  const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l]-ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l]-ind_y;

        float2 d00, d01, d10, d11;
        if (((ind_x)   < M1)&&((ind_y)   < M2)) d00 = d_data[ind_y*M1+ind_x];       else d00 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1)&&((ind_y)   < M2)) d10 = d_data[ind_y*M1+ind_x+1];     else d10 = make_float2(0.f, 0.f);
        if (((ind_x)   < M1)&&((ind_y+1) < M2)) d01 = d_data[(ind_y+1)*M1+ind_x];   else d01 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1)&&((ind_y+1) < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

    }
}
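A possible launch configuration for this kernel is sketched below; BLOCKSIZE_X/Y, iDivUp and the device pointers are assumptions of mine, not part of the original code, and the device arrays are assumed to be already allocated and filled:
#define BLOCKSIZE_X 16
#define BLOCKSIZE_Y 16

int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

// --- Inside the host code: one thread per output point (N1 x N2 points)
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y);
dim3 dimGrid(iDivUp(N1, BLOCKSIZE_X), iDivUp(N2, BLOCKSIZE_Y));
bilinear_interpolation_kernel_GPU<<<dimGrid, dimBlock>>>(d_result, d_data, d_xout, d_yout, M1, M2, N1, N2);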
CUDA implementation with texture fetch
This is the same implementation as above, but the global memory is now accessed by the texture cache. For example, T[i,j] is accessed as
tex2D(d_texture_fetch_float,ind_x,ind_y);
(where, of course ind_x = i and ind_y = j, and d_texture_fetch_float is assumed to be a global scope variable) instead of
d_data[ind_y*M1+ind_x];
Note that the hard-wired texture filtering capabilities are not exploited here. The routine below has the same precision as the one above and could turn out to be somewhat faster on old CUDA architectures.
__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result,
                                                                const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l]-ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l]-ind_y;

        const float2 d00 = tex2D(d_texture_fetch_float, ind_x,   ind_y);
        const float2 d10 = tex2D(d_texture_fetch_float, ind_x+1, ind_y);
        const float2 d11 = tex2D(d_texture_fetch_float, ind_x+1, ind_y+1);
        const float2 d01 = tex2D(d_texture_fetch_float, ind_x,   ind_y+1);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);

    }
}
Texture binding can be done according to
void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_fetch_float, data_d, &desc, M1, M2, pitch));
    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}
Note that now we need no if/else boundary checking, because out-of-range accesses are handled automatically by the texture addressing mode set by the instructions
d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
With cudaAddressModeClamp, coordinates falling outside the [0, M1-1] x [0, M2-1] sampling region are clamped to the valid range, i.e. the border samples are replicated; use cudaAddressModeBorder instead if out-of-range fetches should return zero, as the explicit checks of the previous versions do.
CUDA implementation with texture interpolation
This is the last implementation and uses the hard-wired capabilities of texture filtering.
__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result,
                                                                 const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                 const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}
Note that the interpolation formula implemented by this feature is the same as derived above, but now
i = floor(x_B),  a = frac(x_B),  j = floor(y_B),  b = frac(y_B)
where x_B = x - 0.5 and y_B = y - 0.5. This explains the 0.5 offset in the instruction
tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)
In this case, texture binding should be done as follows
void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_interp_float, data_d, &desc, M1, M2, pitch));
    d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
    d_texture_interp_float.filterMode = cudaFilterModeLinear;   // --- Enable linear filtering
    d_texture_interp_float.normalized = false;                  // --- Texture coordinates will NOT be normalized
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}
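For reference, a possible host-side sequence for this last approach is sketched below; the block size is arbitrary and the device pointers d_result, d_xout and d_yout are assumed to be already allocated and filled, so this is only an illustration of the call order, not part of the original answer:
// --- Copy the samples to the device and bind the filtering texture
TextureBindingBilinearInterp(h_data, M1, M2);

// --- Launch one thread per output point
dim3 dimBlock(16, 16);
dim3 dimGrid((N1 + dimBlock.x - 1) / dimBlock.x, (N2 + dimBlock.y - 1) / dimBlock.y);
bilinear_interpolation_kernel_GPU_texture_interp<<<dimGrid, dimBlock>>>(d_result, d_xout, d_yout, M1, M2, N1, N2);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());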
Note that, as already mentioned in the other answers, a and b are stored in 9-bit fixed point format with 8 bits of fractional value, so this approach will be very fast, but less accurate than those above.
The UV interpolants are truncated to 9 bits, not the participating texel values. In Chapter 10 (Texturing) of The CUDA Handbook, this is described in detail (including CPU emulation code) for the 1D case. Code is open source and may be found at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/tex1d_9bit.cu
Using the wrong form of the bilinear interpolation formula makes the result of texture fetching look weird.
Formula 1: you can find it easily in the CUDA appendix or on Wikipedia
tex(x,y)=(1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1]
Formula 2: rearranged to reduce the number of multiplications
tex(x,y)=T[i,j] + α(T[i+1,j]-T[i,j]) + β(T[i,j+1]-T[i,j]) + αβ(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1])
If you apply the 9-bit fixed point coefficients to Formula 1, the result will not match the texture fetch, but Formula 2 works fine.
Conclusion:
If you want to emulate the bilinear interpolation implemented by CUDA textures, you should use Formula 3. Try it!
Formula 3:
tex(x,y)=T[i,j] + frac(α)(T[i+1,j]-T[i,j]) + frac(β)(T[i,j+1]-T[i,j]) + frac(αβ)(T[i,j]+T[i+1,j+1] - T[i+1, j]-T[i,j+1])
// frac(x) converts the fractional part of x to 9-bit fixed point format with 8 bits of fractional value
float frac(float x) {
    float tmp     = x - (float)(int)x;                   // fractional part of x
    float frac256 = (float)(int)(tmp*256.0f + 0.5f);     // round to the nearest multiple of 1/256
    return frac256 / 256.0f;
}
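For concreteness, here is a minimal sketch of how frac() and Formula 3 might be combined to emulate a non-normalized, clamped, linearly filtered fetch on a single-channel float texture; the function name, the row-major layout and the clamping details are my assumptions, not part of the answer above:
#include <math.h>

float tex2D_emulated(const float* T, int W, int H, float x, float y)
{
    // --- The hardware subtracts 0.5 before splitting into integer and fractional parts
    float xB = x - 0.5f, yB = y - 0.5f;
    int   i  = (int)floorf(xB), j = (int)floorf(yB);
    float a  = xB - floorf(xB), b = yB - floorf(yB);

    // --- Emulate cudaAddressModeClamp for the four participating texels
    int i0 = i     < 0 ? 0 : (i     > W - 1 ? W - 1 : i);
    int i1 = i + 1 < 0 ? 0 : (i + 1 > W - 1 ? W - 1 : i + 1);
    int j0 = j     < 0 ? 0 : (j     > H - 1 ? H - 1 : j);
    int j1 = j + 1 < 0 ? 0 : (j + 1 > H - 1 ? H - 1 : j + 1);

    float T00 = T[j0 * W + i0], T10 = T[j0 * W + i1];
    float T01 = T[j1 * W + i0], T11 = T[j1 * W + i1];

    // --- Formula 3 with the 9-bit quantized weights
    return T00 + frac(a) * (T10 - T00) + frac(b) * (T01 - T00)
               + frac(a * b) * (T00 + T11 - T10 - T01);
}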
Related
Strange uint8_t conversion with OpenCV
I have encountered a strange behavior from the Matrix class in OpenCV regarding the conversion from float to uint8_t. It seems that OpenCV, with the Matrix class, converts float to uint8_t by doing a ceil instead of just truncating the decimals.

#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat m1(1, 1, CV_8UC1);
    cv::Mat m2(1, 1, CV_8UC1);
    cv::Mat m3(1, 1, CV_8UC1);

    m1.at<uint8_t>(0, 0) = 121;
    m2.at<uint8_t>(0, 0) = 105;
    m3.at<uint8_t>(0, 0) = 82;

    cv::Mat x = m1 * 0.5 + m2 * 0.25 + m3 * 0.25;
    printf("%d \n", x.at<uint8_t>(0, 0));

    uint8_t r = 121 * 0.5 + 105 * 0.25 + 82 * 0.25;
    printf("%d \n\n", r);

    return 0;
}

Output:
108
107

Do you know why this happens and how to correct this behavior? Thank you.
The strange behavior is a result of cv::MatExpr and the lazy-evaluation strategy described here. The actual result equals:

round(round(121*0.5 + 105*0.25) + 82*0.25) = 108

The rounding is used because the element type is UINT8 (an integer type). The computation order is a result of the "lazy evaluation" strategy.

Following the computation process with the debugger is challenging, because the OpenCV implementation includes operator overloading, templates, macros and pointers to functions...

The actual computation is performed in the static void scalar_loop function, in

dst[x] = op::r(src1[x], src2[x], scalar);

when, for example, src1[x] = 121, src2[x] = 105 and scalar = 0.5. It executes an inline function:

inline uchar c_add<uchar, float>(uchar a, uchar b, float alpha, float beta, float gamma) {
    return saturate_cast<uchar>(CV_8TO32F(a) * alpha + CV_8TO32F(b) * beta + gamma);
}

The actual rounding is in saturate_cast:

template<> inline uchar saturate_cast<uchar>(float v) {
    int iv = cvRound(v);
    return saturate_cast<uchar>(iv);
}

cvRound uses an SIMD intrinsic,

return _mm_cvtss_si32(t);

which is equivalent to:

return (int)(value + (value >= 0 ? 0.5f : -0.5f));

The lazy-evaluation stages build MatExpr objects with alpha and beta scalars:

cv::Mat x = m1 * 0.5 + m2 * 0.25 + m3 * 0.25;   // m1 = 121, m2 = 105, m3 = 82

The expression is built recursively (hard to follow). Following the "operator +" function (using the debugger):

MatExpr operator + (const MatExpr& e1, const MatExpr& e2) {
    MatExpr en;
    e1.op->add(e1, e2, en);
    return en;
}

Stage 1:
e1.a data = 121 (UINT8), e1.b (NULL),            e1.alpha = 0.5,  e1.beta = 0
e2.a data = 105 (UINT8), e2.b (NULL),            e2.alpha = 0.25, e2.beta = 0
Result:
en.a data = 121 (UINT8), en.b data = 105 (UINT8), en.alpha = 0.5, en.beta = 0.25

Stage 2:
e1.a data = 121 (UINT8), e1.b data = 105 (UINT8), e1.alpha = 0.5,  e1.beta = 0.25
e2.a data = 82 (UINT8),  e2.b (NULL),             e2.alpha = 0.25, e2.beta = 0
Result:
en.a data = 87 (UINT8)   <--- 121*0.5 + 105*0.25 = 86.75, rounded to 87
en.b data = 82 (UINT8),  en.alpha = 1, en.beta = 0.25

Stage 3 (in MatExpr::operator Mat() const):
m data = 108 (UINT8)     <--- 87*1 + 82*0.25 = 87 + 20.5 = 107.5, rounded to 108

You may try to follow the computation process using the debugger. It requires building OpenCV from sources in a Debug configuration, and a lot of patience...
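If the intermediate rounding is the problem, one possible workaround (a sketch of mine, not taken from the answer above) is to promote the operands to float, do the whole weighted sum there, and round only once at the end:

#include <cstdio>
#include <opencv2/core/core.hpp>

int main() {
    cv::Mat m1(1, 1, CV_8UC1, cv::Scalar(121));
    cv::Mat m2(1, 1, CV_8UC1, cv::Scalar(105));
    cv::Mat m3(1, 1, CV_8UC1, cv::Scalar(82));

    cv::Mat f1, f2, f3;
    m1.convertTo(f1, CV_32F);                       // promote before any arithmetic
    m2.convertTo(f2, CV_32F);
    m3.convertTo(f3, CV_32F);

    cv::Mat xf = f1 * 0.5 + f2 * 0.25 + f3 * 0.25;  // 107.25, computed in float
    cv::Mat x;
    xf.convertTo(x, CV_8U);                         // single rounding at the very end
    std::printf("%d\n", (int)x.at<unsigned char>(0, 0));   // prints 107
    return 0;
}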
Algorithm for converting Serial Date (Excel) to Year-Month-Day in C++
This post here provides a very neat & pure C++ algorithm for converting a serial date (Excel) to its explicit year-month-day representation (and back). Let me paste a compressed version for convenience:

void ExcelSerialDateToDMY(int nSerialDate, int& nDay, int& nMonth, int& nYear)
{
    // Modified Julian to DMY calculation with an addition of 2415019
    int l = nSerialDate + 68569 + 2415019;
    int n = int(( 4 * l ) / 146097);
    l = l - int(( 146097 * n + 3 ) / 4);
    int i = int(( 4000 * ( l + 1 ) ) / 1461001);
    l = l - int(( 1461 * i ) / 4) + 31;
    int j = int(( 80 * l ) / 2447);
    nDay = l - int(( 2447 * j ) / 80);
    l = int(j / 11);
    nMonth = j + 2 - ( 12 * l );
    nYear = 100 * ( n - 49 ) + i + l;
}

int DMYToExcelSerialDate(int nDay, int nMonth, int nYear)
{
    // DMY to Modified Julian calculated with an extra subtraction of 2415019.
    return int(( 1461 * ( nYear + 4800 + int(( nMonth - 14 ) / 12) ) ) / 4) +
           int(( 367 * ( nMonth - 2 - 12 * ( ( nMonth - 14 ) / 12 ) ) ) / 12) -
           int(( 3 * ( int(( nYear + 4900 + int(( nMonth - 14 ) / 12) ) / 100) ) ) / 4) +
           nDay - 2415019 - 32075;
}

For example:

2019-06-22 <--> 43638
2000-01-28 <--> 36553
1989-09-21 <--> 32772

The above post is from 2002, so I am wondering whether there are alternative implementations which are better. By "better" I mean e.g. faster, shorter or less obscure. Or even algorithms which perhaps allow a certain amount of pre-calculation (e.g. record the 1 Jan serial date for a desired range of years, say 1900 to 2200, and then perform a fast look-up).
The algorithms you show are very good. On my platform (clang++ -O3) they produce object code with no branches (pipeline stallers) and no accesses to far-away memory (cache misses). As a pair, there is a range of validity from -4800-03-01 to millions of years in the future (plenty of range). Throughout this range they model the Gregorian calendar.

Here are some alternative algorithms that are very similar. One difference is that yours have an epoch of 1900-01-01 and the ones I'm presenting have an epoch of 1970-01-01. However it is very easy to adjust the epoch by the difference of these epochs (25569 days), as shown below:

constexpr std::tuple<int, unsigned, unsigned> civil_from_days(int z) noexcept
{
    static_assert(std::numeric_limits<unsigned>::digits >= 18,
                  "This algorithm has not been ported to a 16 bit unsigned integer");
    static_assert(std::numeric_limits<int>::digits >= 20,
                  "This algorithm has not been ported to a 16 bit signed integer");
    z += 719468 - 25569;
    const int era = (z >= 0 ? z : z - 146096) / 146097;
    const unsigned doe = static_cast<unsigned>(z - era * 146097);          // [0, 146096]
    const unsigned yoe = (doe - doe/1460 + doe/36524 - doe/146096) / 365;  // [0, 399]
    const int y = static_cast<int>(yoe) + era * 400;
    const unsigned doy = doe - (365*yoe + yoe/4 - yoe/100);                // [0, 365]
    const unsigned mp = (5*doy + 2)/153;                                   // [0, 11]
    const unsigned d = doy - (153*mp+2)/5 + 1;                             // [1, 31]
    const unsigned m = mp + (mp < 10 ? 3 : -9);                            // [1, 12]
    return std::tuple<int, unsigned, unsigned>(y + (m <= 2), m, d);
}

constexpr int days_from_civil(int y, unsigned m, unsigned d) noexcept
{
    static_assert(std::numeric_limits<unsigned>::digits >= 18,
                  "This algorithm has not been ported to a 16 bit unsigned integer");
    static_assert(std::numeric_limits<int>::digits >= 20,
                  "This algorithm has not been ported to a 16 bit signed integer");
    y -= m <= 2;
    const int era = (y >= 0 ? y : y-399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);             // [0, 399]
    const unsigned doy = (153*(m + (m > 2 ? -3 : 9)) + 2)/5 + d-1;         // [0, 365]
    const unsigned doe = yoe * 365 + yoe/4 - yoe/100 + doy;                // [0, 146096]
    return era * 146097 + static_cast<int>(doe) - (719468 - 25569);
}

These algorithms are valid for millions of years both forward and backward (including prior to -4800-03-01), though that extra range won't buy you much because the Gregorian calendar didn't even start until 1582-10-15.

I compiled both pairs of algorithms on macOS using clang++ -O3 -S and the set I present produces slightly smaller object code (about 10%). Though they are all so small, branch-less and cache-miss-free that trying to verify that benefit by measuring performance would be a challenging exercise. I do not find the readability of either set superior to the other. However this pair of algorithms does come with an irritatingly exhaustive derivation for those who are curious how these algorithms work, and unit tests to ensure the algorithms are working over a range of +/- 1 million years.

One could gain a very slight bit of performance in the above algorithms by limiting the range of validity to [2000-03-01, 2400-02-29] by setting const int era = 5 in both algorithms. I have not performance-tested this option; I would expect such a gain to be in the noise level. Or there might be some minuscule performance advantage in limiting the range to [0000-03-01, millions of years forward] by not accounting for negative values of era:

In civil_from_days:  const int era = z / 146097;
In days_from_civil:  const int era = y / 400;
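As a quick sanity check of the epoch adjustment, here is a small sketch that verifies the example values from the question, assuming the two functions above (and the headers they need) are in scope:

#include <cassert>
#include <limits>
#include <tuple>

int main() {
    assert(days_from_civil(2019, 6, 22) == 43638);
    assert(days_from_civil(2000, 1, 28) == 36553);

    int y; unsigned m, d;
    std::tie(y, m, d) = civil_from_days(32772);     // should give 1989-09-21
    assert(y == 1989 && m == 9 && d == 21);
    return 0;
}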
How to convert triangular matrix indexes into row, column coordinates?
I have these indexes:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, etc...

which are indexes of nodes in a matrix (including diagonal elements):

1
2  3
4  5  6
7  8  9  10
11 12 13 14 15
16 17 18 19 20 21
etc...

and I need to get i,j coordinates from these indexes:

1,1
2,1  2,2
3,1  3,2  3,3
4,1  4,2  4,3  4,4
5,1  5,2  5,3  5,4  5,5
6,1  6,2  6,3  6,4  6,5  6,6
etc...

When I need to calculate the coordinates I have only one index and cannot access the others.
Not optimized at all:

int j = idx;
int i = 1;
while(j > i) {
    j -= i++;
}

Optimized:

int i = std::ceil(std::sqrt(2 * idx + 0.25) - 0.5);
int j = idx - (i-1) * i / 2;

And here is the demonstration. You're looking for i such that:

sumRange(1, i-1) < idx && idx <= sumRange(1, i)

where sumRange(min, max) sums the integers between min and max, both included. But since you know that:

sumRange(1, i) = i * (i + 1) / 2

you have:

idx <= i * (i+1) / 2
=> 2 * idx <= i * (i+1)
=> 2 * idx <= i² + i + 1/4 - 1/4
=> 2 * idx + 1/4 <= (i + 1/2)²
=> sqrt(2 * idx + 1/4) - 1/2 <= i
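For what it's worth, a quick check of the closed form against the one-based coordinates listed in the question (this test program is mine, not part of the answer above):

#include <cmath>
#include <cstdio>

int main() {
    for (int idx = 1; idx <= 10; ++idx) {
        int i = (int)std::ceil(std::sqrt(2.0 * idx + 0.25) - 0.5);
        int j = idx - (i - 1) * i / 2;
        std::printf("%2d -> (%d,%d)\n", idx, i, j);   // 1 -> (1,1), 2 -> (2,1), 3 -> (2,2), 4 -> (3,1), ...
    }
    return 0;
}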
In my case (a CUDA kernel implemented in standard C), I use zero-based indexing (and I want to exclude the diagonal), so I needed to make a few adjustments:

// idx is still one-based
unsigned long int idx = blockIdx.x * blockDim.x + threadIdx.x + 1;   // CUDA kernel launch parameters

// but the coordinates are now zero-based
unsigned long int x = ceil(sqrt((2.0 * idx) + 0.25) - 0.5);
unsigned long int y = idx - (x - 1) * x / 2 - 1;

Which results in:

[0]: (1, 0)
[1]: (2, 0)
[2]: (2, 1)
[3]: (3, 0)
[4]: (3, 1)
[5]: (3, 2)

I also re-derived the formula of Flórez-Rueda y Moreno 2001 and arrived at:

unsigned long int x = floor(sqrt(2.0 * pos + 0.25) + 0.5);

CUDA note: I tried everything I could think of to avoid using double-precision math, but the single-precision sqrt function in CUDA is simply not precise enough to convert positions greater than 121 million or so to x, y coordinates (when using 1,024 threads per block and indexing only along 1 block dimension). Some articles have employed a "correction" to bump the result in a particular direction, but this inevitably falls apart at a certain point.
2D rotation in OpenGL
Here is the code I am using:

#define ANGLETORADIANS 0.017453292519943295769236907684886f // PI / 180
#define RADIANSTOANGLE 57.295779513082320876798154814105f   // 180 / PI

rotation = rotation * ANGLETORADIANS;
cosRotation = cos(rotation);
sinRotation = sin(rotation);

for(int i = 0; i < 3; i++) {
    px[i] = (vec[i].x + centerX) * (cosRotation - (vec[i].y + centerY)) * sinRotation;
    py[i] = (vec[i].x + centerX) * (sinRotation + (vec[i].y + centerY)) * cosRotation;
    printf("num: %i, px: %f, py: %f\n", i, px[i], py[i]);
}

So far it seems my Y value is being flipped. Say I enter the value of X = 1 and Y = 1 with a 45 rotation: you should see about x = 0 and y = 1.25 ish, but I get x = 0, y = -1.25. Also my 90 degree rotation always returns x = 0 and y = 0.

P.S. I know I'm only centering my values and not putting them back where they came from. It's not needed to put them back, as all I need to know is the value I'm getting now.
Your bracket placement doesn't look right to me. I would expect:

px[i] = (vec[i].x + centerX) * cosRotation - (vec[i].y + centerY) * sinRotation;
py[i] = (vec[i].x + centerX) * sinRotation + (vec[i].y + centerY) * cosRotation;
Your brackets are wrong. It should be

px[i] = ((vec[i].x + centerX) * cosRotation) - ((vec[i].y + centerY) * sinRotation);
py[i] = ((vec[i].x + centerX) * sinRotation) + ((vec[i].y + centerY) * cosRotation);

instead.
Wrong pixel locations with glDrawPixels
I have been playing around with trying to draw a 320 by 240 full-screen-resolution image in OpenGL using Java and LWJGL. I set the resolution to 640 by 480 and doubled the size of the pixels to fill in the space. After a lot of Google searching I found some information about using the glDrawPixels function to speed up drawing to the screen. I wanted to test it by assigning random colors to all the pixels on the screen, but it wouldn't fill the screen. I divided the width into 4 sections of 80 pixels each and colored them red, green, blue, and white. I saw that I was interleaving the colors but I can't figure out how. Here is an image of the output:

Here is where I run the OpenGL code:

// init OpenGL
GL11.glMatrixMode(GL11.GL_PROJECTION);
GL11.glLoadIdentity();
GL11.glOrtho(0, 640, 0, 480, 1, -1);
GL11.glMatrixMode(GL11.GL_MODELVIEW);

while (!Display.isCloseRequested()) {
    pollInput();
    // Clear the screen and depth buffer
    GL11.glClear(GL11.GL_COLOR_BUFFER_BIT | GL11.GL_DEPTH_BUFFER_BIT);
    randomizePixels();
    GL11.glRasterPos2i(0, 0);
    GL11.glDrawPixels(320, 240, GL11.GL_RGBA, GL11.GL_UNSIGNED_BYTE, buff);
    GL11.glPixelZoom(2, 2);
    Display.update();
}
Display.destroy();

and here is where I create the pixel color data:

public void randomizePixels(){
    for(int y = 0; y < 240; y++){
        for(int x = 0; x < 320; x+=4){
            /*
            pixels[x * 320 + y]     = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 1] = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 2] = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 3] = (byte)(-128 + ran.nextInt(256));
            */
            if(x >= 0 && x < 80){
                pixels[y * 240 + x]     = (byte)128;
                pixels[y * 240 + x + 1] = (byte)0;
                pixels[y * 240 + x + 2] = (byte)0;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 80 && x < 160){
                pixels[y * 240 + x]     = (byte)0;
                pixels[y * 240 + x + 1] = (byte)128;
                pixels[y * 240 + x + 2] = (byte)0;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 160 && x < 240){
                pixels[y * 240 + x]     = (byte)0;
                pixels[y * 240 + x + 1] = (byte)0;
                pixels[y * 240 + x + 2] = (byte)128;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 240 && x < 320){
                pixels[y * 240 + x]     = (byte)128;
                pixels[y * 240 + x + 1] = (byte)128;
                pixels[y * 240 + x + 2] = (byte)128;
                pixels[y * 240 + x + 3] = (byte)128;
            }
        }
    }
    buff.put(pixels).flip();
}

If you can figure out why I can't get the pixels to line up at the x and y coordinates I want them to go to, that would be great. I have read that glDrawPixels probably isn't the best or fastest way to draw pixels to the screen, but I want to understand why I'm having this particular issue before I have to move on to some other method.
Just load your image (unscaled) into a texture and draw a textured quad. Don't use glDrawPixels: this function was never properly optimized in most drivers, has been deprecated since OpenGL-2, and got removed from OpenGL-3 core and later.
I spot 2 issues in your randomizePixels().

1. Indexing the pixel buffer

The total size of the pixel buffer is 320 x 240 x 4 bytes, because the pixel type is GL_RGBA. So, indexing each pixel with the subscript operator, [], would be:

for(int y = 0; y < 240; y++) {
    for(int x = 0; x < 320; x++) {
        pixels[y * 320 * 4 + x * 4 + 0] = ... // R
        pixels[y * 320 * 4 + x * 4 + 1] = ... // G
        pixels[y * 320 * 4 + x * 4 + 2] = ... // B
        pixels[y * 320 * 4 + x * 4 + 3] = ... // A
    }
}

2. Colour value

The max intensity of an 8-bit colour is 255; for example, an opaque red pixel would be (255, 0, 0, 255).
You're operating on the texture. Better to do it on a quad; it would yield good results.