I don't understand the error in my code. I'm trying to compare a buffer of unsigned char values to a constant, and then store 1 or 0 depending on the result of the comparison. Here is my code (inside a struct):
void operator()(const uint8* src, int32 swidth, int32 sheight, uint8* dst, uint8 value) {
    uint8 t[16];
    __m128i v_one = _mm_set1_epi8((uint8)1);
    __m128i v_value = _mm_set1_epi8(value);
    printf("value = %d\n", value);
    SHOW(t, v_one);
    SHOW(t, v_value);
    std::cout << "****" << std::endl;
    for (int32 i = 0; i < sheight; ++i) {
        const uint8* sdata = src + i * swidth;
        uint8* ddata = dst + i * swidth;
        int32 j = 0;
        for ( ; j <= swidth - 16; j += 16) {
            __m128i s = _mm_load_si128((const __m128i*)(sdata + j));
            __m128i mask = _mm_cmpgt_epi8(s, v_value);
            SHOW(t, s);
            SHOW(t, mask);
            std::cout << std::endl;
        }
    }
}
My first lines are what I would expect:
value = 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
But then my comparisons are wrong:
214 100 199 203 232 50 85 195 70 141 121 160 93 130 242 233
0 0 0 0 0 0 0 0 0 0 255 0 0 0 0 0
And I really don't see where the mistake is.
The SHOW macro is:
#define SHOW(t, r) \
    _mm_storeu_si128((__m128i*)t, r); \
    printf("%3d", (int32)t[0]); \
    for (int32 k = 1; k < 16; ++k) \
        printf(" %3d", (int32)t[k]); \
    printf("\n")
You are comparing the elements of your s vector with your value vector.
All the values in the value vector are 100, and you have a mix of values in your s vector.
However, _mm_cmpgt_epi8 operates on signed values, and since these are bytes it treats them as ranging from -128 to +127.
So the only possible values that are > 100 are those in the range 101 to 127.
As you've only got one value in that range (121), that's the only element whose mask is set.
To see this, change uint8 t[16]; to int8 t[16]; and you should get a more expected result.
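If what you actually need is an unsigned greater-than comparison, one common workaround (a sketch on my part, not something from your code) is to flip the sign bit of both operands with XOR 0x80 before the signed compare; this maps the unsigned range 0..255 onto the signed range -128..127 while preserving the ordering:

#include <emmintrin.h>   // SSE2 intrinsics

// Unsigned byte-wise "a > b": biasing both operands by 0x80 turns the unsigned
// ordering into the signed ordering that _mm_cmpgt_epi8 implements.
static inline __m128i cmpgt_epu8(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

In your inner loop, mask = cmpgt_epu8(s, v_value); would then flag every byte greater than 100 when interpreted as unsigned (flagged bytes are 0xFF), and _mm_and_si128(mask, v_one) turns the mask into the 0/1 values you want to store.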
After going over this tutorial
http://tommd.github.io/
which uses the HElib library:
https://github.com/shaih/HElib
I get the output shown below (after the code).
The output is getting corrupted. Given that the example uses L = 16 levels, there should be plenty of room to perform these operations.
Is there a problem with the parameters?
Code:
#include "FHE.h"
#include "EncryptedArray.h"
#include <NTL/lzz_pXFactoring.h>
#include <fstream>
#include <sstream>
#include <sys/time.h>
using namespace std;
/**
 *
 */
int main(int argc, char** argv) {
    /* On our trusted system we generate a new key
     * (or read one in) and encrypt the secret data set.
     */
    long m = 0, p = 2, r = 1;   // Native plaintext space
                                // Computations will be 'modulo p'
    long L = 16;                // Levels
    long c = 3;                 // Columns in key switching matrix
    long w = 64;                // Hamming weight of secret key
    long d = 0;
    long security = 128;
    ZZX G;
    m = FindM(security, L, c, p, d, 0, 0);
    FHEcontext context(m, p, r);
    // initialize context
    buildModChain(context, L, c);
    // modify the context, adding primes to the modulus chain
    FHESecKey secretKey(context);
    // construct a secret key structure
    const FHEPubKey& publicKey = secretKey;
    // an "upcast": FHESecKey is a subclass of FHEPubKey
    //if(0 == d)
    G = context.alMod.getFactorsOverZZ()[0];
    secretKey.GenSecKey(w);
    // actually generate a secret key with Hamming weight w
    addSome1DMatrices(secretKey);
    cout << "Generated key" << endl;
    EncryptedArray ea(context, G);
    // construct an Encrypted array object ea that is
    // associated with the given context and the polynomial G
    long nslots = ea.size();
    vector<long> v1;
    for(int i = 0; i < nslots; i++) {
        v1.push_back(i*2);
    }
    Ctxt ct1(publicKey);
    ea.encrypt(ct1, publicKey, v1);
    vector<long> v2;
    Ctxt ct2(publicKey);
    for(int i = 0; i < nslots; i++) {
        v2.push_back(i*3);
    }
    ea.encrypt(ct2, publicKey, v2);
    // On the public (untrusted) system we
    // can now perform our computation
    Ctxt ctSum = ct1;
    Ctxt ctProd = ct1;
    ctSum += ct2;
    ctProd *= ct2;
    vector<long> res;
    ea.decrypt(ctSum, secretKey, res);
    cout << "All computations are modulo " << p << "." << endl;
    for(int i = 0; i < res.size(); i++) {
        cout << v1[i] << " + " << v2[i] << " = " << res[i] << endl;
    }
    ea.decrypt(ctProd, secretKey, res);
    for(int i = 0; i < res.size(); i++) {
        cout << v1[i] << " * " << v2[i] << " = " << res[i] << endl;
    }
    return 0;
}
Generated key
All computations are modulo 2.
0 + 0 = 0
2 + 3 = 1
4 + 6 = 0
6 + 9 = 1
8 + 12 = 0
10 + 15 = 1
12 + 18 = 0
14 + 21 = 1
16 + 24 = 0
18 + 27 = 1
20 + 30 = 0
22 + 33 = 1
24 + 36 = 0
26 + 39 = 1
28 + 42 = 0
30 + 45 = 1
32 + 48 = 0
34 + 51 = 1
36 + 54 = 0
38 + 57 = 1
40 + 60 = 0
42 + 63 = 1
44 + 66 = 0
46 + 69 = 1
48 + 72 = 0
50 + 75 = 1
52 + 78 = 0
54 + 81 = 1
56 + 84 = 0
58 + 87 = 1
60 + 90 = 0
... Some sum output omitted
0 * 0 = 0
2 * 3 = 0
4 * 6 = 0
6 * 9 = 0
8 * 12 = 0
10 * 15 = 0
12 * 18 = 0
14 * 21 = 0
16 * 24 = 0
18 * 27 = 0
20 * 30 = 0
22 * 33 = 0
24 * 36 = 0
26 * 39 = 0
28 * 42 = 0
30 * 45 = 0
32 * 48 = 0
34 * 51 = 0
36 * 54 = 0
38 * 57 = 0
40 * 60 = 0
42 * 63 = 0
44 * 66 = 0
46 * 69 = 0
48 * 72 = 0
50 * 75 = 0
52 * 78 = 0
54 * 81 = 0
56 * 84 = 0
58 * 87 = 0
60 * 90 = 0
62 * 93 = 0
64 * 96 = 0
66 * 99 = 0
68 * 102 = 0
70 * 105 = 0
72 * 108 = 0
74 * 111 = 0
76 * 114 = 0
78 * 117 = 0
80 * 120 = 0
82 * 123 = 0
84 * 126 = 0
86 * 129 = 0
....
Ah, so this is a misunderstanding of the operations being performed. Notice the constant p = 2. I print the text "All computations are modulo 2." Perhaps also stating "All inputs are modulo 2" would help hammer the point home. Let's look at some of our computations:
0 + 0 mod 2 = 0
2 + 3 mod 2 = 1
4 + 6 mod 2 = 0
6 + 9 mod 2 = 1
That all looks good: addition in ring 2 is just exclusive OR. How about multiplication? In ring 2 (binary) that's just AND:
0 * 0 = 0
2 * 3 = 6 mod 2 = 0
4 * 6 = 24 mod 2 = 0
6 * 9 = 54 mod 2 = 0
So that all checks out as well. Finally, look back at the blog and see that I called this out and gave you a way to operate on something you might find more pleasing:
In this case, I am building for GF(2) - so my homomorphic addition
is XOR and multiplication is AND. Changing this is as easy as changing
the value of p. Folks wanting to see 2+2=4 should set p to something
that matches their desired domain, such as 257 to obtain 8 bit Ints.
However, HElib has regressed in this respect: setting p to anything larger than 2 did not work the last time I tried it. Shai confirmed this is a known regression.
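If it helps, the decrypted slots can be sanity-checked against plain modular arithmetic on the CPU. A minimal sketch (the helper name check_mod is mine, not part of HElib, and it assumes you keep separate result vectors for the sum and the product):

#include <cassert>
#include <cstddef>
#include <vector>

// Verify that each decrypted slot equals the plaintext result reduced mod p.
// With p = 2 this reproduces the XOR/AND behaviour shown in the output above.
void check_mod(const std::vector<long>& v1, const std::vector<long>& v2,
               const std::vector<long>& sum, const std::vector<long>& prod, long p) {
    for (std::size_t i = 0; i < sum.size(); ++i) {
        assert(sum[i]  == (v1[i] + v2[i]) % p);
        assert(prod[i] == (v1[i] * v2[i]) % p);
    }
}

Calling check_mod(v1, v2, decrypted_sum, decrypted_prod, p) after the two decrypts should pass for the output above.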
Hi all, I have two input files like this:
file1 :
#W #S #this line doesn't exist
110 170 Bias
110 200 Bias
110 215 Bias
110 320 Bias
125 170 Bias
125 200 Bias
125 215 Bias
125 320 Bias
135 170 Bias
135 200 Bias
135 215 Bias
135 320 Bias
140 170 Bias
140 200 Bias
140 215 Bias
140 320 Bias
file2 :
FUNCTION BIAS ( W, S )
Bias = 0
IF AND ( W >= 0, W < 120 ) THEN
IF ( S >= 0 ) THEN Bias = -1
IF ( S >= 180 ) THEN Bias = -2
IF ( S >= 190 ) THEN Bias = -3
IF ( S >= 200 ) THEN Bias = -4
IF ( S >= 210 ) THEN Bias = -5
IF ( S >= 220 ) THEN Bias = -6
IF ( S >= 240 ) THEN Bias = -7
ENDIF
IF AND ( W >= 120, W < 130 ) THEN
IF ( S >= 0 ) THEN Bias = -11
IF ( S >= 180 ) THEN Bias = -12
IF ( S >= 190 ) THEN Bias = -13
IF ( S >= 200 ) THEN Bias = -14
IF ( S >= 210 ) THEN Bias = -15
IF ( S >= 220 ) THEN Bias = -16
IF ( S >= 240 ) THEN Bias = -17
ENDIF
IF AND ( W >= 130, W < 140 ) THEN
IF ( S >= 0 ) THEN Bias = 1
IF ( S >= 180 ) THEN Bias = 2
IF ( S >= 190 ) THEN Bias = 3
IF ( S >= 200 ) THEN Bias = 4
IF ( S >= 210 ) THEN Bias = 5
IF ( S >= 220 ) THEN Bias = 6
IF ( S >= 240 ) THEN Bias = 7
ENDIF
IF ( W >= 140 ) THEN
IF ( S >= 0 ) THEN Bias = 11
IF ( S >= 180 ) THEN Bias = 12
IF ( S >= 190 ) THEN Bias = 13
IF ( S >= 200 ) THEN Bias = 14
IF ( S >= 210 ) THEN Bias = 15
IF ( S >= 220 ) THEN Bias = 16
IF ( S >= 240 ) THEN Bias = 17
ENDIF
RETURN (Bias)
What I want to do is find the corresponding value of the function BIAS(W, S) for each (W, S) pair from file1.
For example, with W/S = 135/195, W satisfies
IF AND ( W >= 130, W < 140 )
so we then check S:
IF ( S >= 0 ) THEN Bias = 1
IF ( S >= 180 ) THEN Bias = 2
IF ( S >= 190 ) THEN Bias = 3
IF ( S >= 200 ) THEN Bias = 4
IF ( S >= 210 ) THEN Bias = 5
IF ( S >= 220 ) THEN Bias = 6
IF ( S >= 240 ) THEN Bias = 7
Then, since S = 195 is between 190 and 200, the value of BIAS(W, S) is 3.
The output I want looks like this:
110 170 Bias -1
110 200 Bias -4
110 215 Bias -5
110 320 Bias -7
125 170 Bias -11
125 200 Bias -14
125 215 Bias -15
125 320 Bias -17
135 170 Bias 1
135 200 Bias 4
135 215 Bias 5
135 320 Bias 7
140 170 Bias 11
140 200 Bias 14
140 215 Bias 15
140 320 Bias 17
This is very easy to check by eye, but as you can see, file2 is basically a free-form text file rather than a regular 2D numerical array. How can I extract the corresponding values? Any hints?
I just translated your logic into awk:
script.awk:
{
    w = $1;
    s = $2;
    if (w >= 0 && w < 120) {
        if (s >= 0)   { bias = -1 }
        if (s >= 180) { bias = -2 }
        if (s >= 190) { bias = -3 }
        if (s >= 200) { bias = -4 }
        if (s >= 210) { bias = -5 }
        if (s >= 220) { bias = -6 }
        if (s >= 240) { bias = -7 }
    }
    if (w >= 120 && w < 130) {
        if (s >= 0)   { bias = -11 }
        if (s >= 180) { bias = -12 }
        if (s >= 190) { bias = -13 }
        if (s >= 200) { bias = -14 }
        if (s >= 210) { bias = -15 }
        if (s >= 220) { bias = -16 }
        if (s >= 240) { bias = -17 }
    }
    if (w >= 130 && w < 140) {
        if (s >= 0)   { bias = 1 }
        if (s >= 180) { bias = 2 }
        if (s >= 190) { bias = 3 }
        if (s >= 200) { bias = 4 }
        if (s >= 210) { bias = 5 }
        if (s >= 220) { bias = 6 }
        if (s >= 240) { bias = 7 }
    }
    if (w >= 140) {
        if (s >= 0)   { bias = 11 }
        if (s >= 180) { bias = 12 }
        if (s >= 190) { bias = 13 }
        if (s >= 200) { bias = 14 }
        if (s >= 210) { bias = 15 }
        if (s >= 220) { bias = 16 }
        if (s >= 240) { bias = 17 }
    }
    print $0 " " bias;
}
Execution:
awk -f script.awk file1
110 170 Bias -1
110 200 Bias -4
110 215 Bias -5
110 320 Bias -7
125 170 Bias -11
125 200 Bias -14
125 215 Bias -15
125 320 Bias -17
135 170 Bias 1
135 200 Bias 4
135 215 Bias 5
135 320 Bias 7
140 170 Bias 11
140 200 Bias 14
140 215 Bias 15
140 320 Bias 17
Run the tst.awk script below on "file2" to convert the script, in whatever language that is, to awk, and save its output to a new file named "getbias.awk". Then run:
awk -f getbias.awk -f <your script> file1
where <your script> parses file1 and calls the generated getbias() function below to get the bias values for each line.
$ cat tst.awk
{
    sub(/BIAS/,"getbias")
    sub(/ENDIF/,"}")
    sub(/ THEN/,"")
    $0 = tolower($0)
}
/^function/ { sub(/\)/,",\tbias )"); $0 = $0 " {" }
/^return/   { $0 = $0 ORS "}" }
/^if/       { sub(/ and/,""); sub(/,/," \\&\\&"); $0 = $0 " {" }
{ print }
$ awk -f tst.awk file2
function getbias ( w, s , bias ) {
bias = 0
if ( w >= 0 && w < 120 ) {
if ( s >= 0 ) bias = -1
if ( s >= 180 ) bias = -2
if ( s >= 190 ) bias = -3
if ( s >= 200 ) bias = -4
if ( s >= 210 ) bias = -5
if ( s >= 220 ) bias = -6
if ( s >= 240 ) bias = -7
}
if ( w >= 120 && w < 130 ) {
if ( s >= 0 ) bias = -11
if ( s >= 180 ) bias = -12
if ( s >= 190 ) bias = -13
if ( s >= 200 ) bias = -14
if ( s >= 210 ) bias = -15
if ( s >= 220 ) bias = -16
if ( s >= 240 ) bias = -17
}
if ( w >= 130 && w < 140 ) {
if ( s >= 0 ) bias = 1
if ( s >= 180 ) bias = 2
if ( s >= 190 ) bias = 3
if ( s >= 200 ) bias = 4
if ( s >= 210 ) bias = 5
if ( s >= 220 ) bias = 6
if ( s >= 240 ) bias = 7
}
if ( w >= 140 ) {
if ( s >= 0 ) bias = 11
if ( s >= 180 ) bias = 12
if ( s >= 190 ) bias = 13
if ( s >= 200 ) bias = 14
if ( s >= 210 ) bias = 15
if ( s >= 220 ) bias = 16
if ( s >= 240 ) bias = 17
}
return (bias)
}
I want to emulate the behavior of CUDA bilinear interpolation on the CPU, but I found that the return value of tex2D does not seem to fit the bilinear formula.
I guess that casting the interpolation coefficients from float to the 9-bit fixed-point format with 8 bits of fractional value [1] results in different values.
According to the conversion formula [2, line 106], the result of the conversion should be the same as the input float when the coefficient is 1/2^n, with n = 0, 1, ..., 8, but I still (not always) get weird values.
Below I report an example of the weird values. In this case, they always happen when id = 2*n+1; could anyone tell me why?
Src Array:
Src[0][0] = 38;
Src[1][0] = 39;
Src[0][1] = 118;
Src[1][1] = 13;
Texture Definition:
static texture<float4, 2, cudaReadModeElementType> texElnt;
texElnt.addressMode[0] = cudaAddressModeClamp;
texElnt.addressMode[1] = cudaAddressModeClamp;
texElnt.filterMode = cudaFilterModeLinear;
texElnt.normalized = false;
Kernel Function:
static __global__ void kernel_texElnt(float* pdata, int w, int h, int c, float stride/*0.03125f*/) {
    const int gx = blockIdx.x*blockDim.x + threadIdx.x;
    const int gy = blockIdx.y*blockDim.y + threadIdx.y;
    const int gw = gridDim.x * blockDim.x;
    const int gid = gy*gw + gx;
    if (gx >= w || gy >= h) {
        return;
    }
    float2 pnt;
    pnt.x = (gx)*(stride)/*1/32*/;
    pnt.y = 0.0625f/*1/16*/;
    float4 result = tex2D( texElnt, pnt.x + 0.5, pnt.y + 0.5f);
    pdata[gid*3 + 0] = pnt.x;
    pdata[gid*3 + 1] = pnt.y;
    pdata[gid*3 + 2] = result.x;
}
Bilinear Result of CUDA
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.0000000
1 0.03125 0.0625 42.6171875
2 0.06250 0.0625 42.6484375
3 0.09375 0.0625 42.2656250
4 0.12500 0.0625 42.2968750
5 0.15625 0.0625 41.9140625
6 0.18750 0.0625 41.9453125
7 0.21875 0.0625 41.5625000
8 0.25000 0.0625 41.5937500
9 0.28125 0.0625 41.2109375
0 0.31250 0.0625 41.2421875
10 0.34375 0.0625 40.8593750
11 0.37500 0.0625 40.8906250
12 0.40625 0.0625 40.5078125
13 0.43750 0.0625 40.5390625
14 0.46875 0.0625 40.1562500
15 0.50000 0.0625 40.1875000
16 0.53125 0.0625 39.8046875
17 0.56250 0.0625 39.8359375
18 0.59375 0.0625 39.4531250
19 0.62500 0.0625 39.4843750
20 0.65625 0.0625 39.1015625
21 0.68750 0.0625 39.1328125
22 0.71875 0.0625 38.7500000
23 0.75000 0.0625 38.7812500
24 0.78125 0.0625 38.3984375
25 0.81250 0.0625 38.4296875
26 0.84375 0.0625 38.0468750
27 0.87500 0.0625 38.0781250
28 0.90625 0.0625 37.6953125
29 0.93750 0.0625 37.7265625
30 0.96875 0.0625 37.3437500
31 1.00000 0.0625 37.3750000
CPU Result:
// convert coefficient ((1-α)*(1-β)), (α*(1-β)), ((1-α)*β), (α*β) to fixed point format
id pnt.x pnt.y tex2D
0 0.00000 0.0625 43.00000000
1 0.03125 0.0625 43.23046875
2 0.06250 0.0625 42.64843750
3 0.09375 0.0625 42.87890625
4 0.12500 0.0625 42.29687500
5 0.15625 0.0625 42.52734375
6 0.18750 0.0625 41.94531250
7 0.21875 0.0625 42.17578125
8 0.25000 0.0625 41.59375000
9 0.28125 0.0625 41.82421875
0 0.31250 0.0625 41.24218750
10 0.34375 0.0625 41.47265625
11 0.37500 0.0625 40.89062500
12 0.40625 0.0625 41.12109375
13 0.43750 0.0625 40.53906250
14 0.46875 0.0625 40.76953125
15 0.50000 0.0625 40.18750000
16 0.53125 0.0625 40.41796875
17 0.56250 0.0625 39.83593750
18 0.59375 0.0625 40.06640625
19 0.62500 0.0625 39.48437500
20 0.65625 0.0625 39.71484375
21 0.68750 0.0625 39.13281250
22 0.71875 0.0625 39.36328125
23 0.75000 0.0625 38.78125000
24 0.78125 0.0625 39.01171875
25 0.81250 0.0625 38.42968750
26 0.84375 0.0625 38.66015625
27 0.87500 0.0625 38.07812500
28 0.90625 0.0625 38.30859375
29 0.93750 0.0625 37.72656250
30 0.96875 0.0625 37.95703125
31 1.00000 0.0625 37.37500000
I left a simple program on my GitHub [3]; after running it you will get two files in D:\.
Edit 2014/01/20
I ran the program with different increments and observed the following behaviour of tex2D: when alpha multiplied by beta is less than 0.00390625 (note that 0.00390625 = 1/256), the return value of tex2D does not match the bilinear interpolation formula.
Satisfactory answers have already been provided to this question, so now I just want to give a compendium of hopefully useful information on bilinear interpolation, how it can be implemented in C++, and the different ways it can be done in CUDA.
Maths behind bilinear interpolation
Assume that the original function T(x, y) is sampled at the Cartesian regular grid of points (i, j), with 0 <= i < M1, 0 <= j < M2 and i and j integers. For each value of y, one can first use 0 <= a < 1 to represent an arbitrary point i + a comprised between i and i + 1. Then, a linear interpolation along the line y = j (which is parallel to the x axis) at that point gives
r(i + a, j) = T[i, j] + a * (T[i + 1, j] - T[i, j])
where r(x, y) is the function interpolating the samples of T(x, y). The same can be done for the line y = j + 1, obtaining
r(i + a, j + 1) = T[i, j + 1] + a * (T[i + 1, j + 1] - T[i, j + 1])
Now, for each i + a, an interpolation along the y axis can be performed on the samples r(i + a, j) and r(i + a, j + 1). Accordingly, if one uses 0 <= b < 1 to represent an arbitrary point j + b located between j and j + 1, then a linear interpolation along the line x = i + a (which is parallel to the y axis) can be worked out, giving the final result
T(i + a, j + b) ≈ r(i + a, j) + b * (r(i + a, j + 1) - r(i + a, j))
Note that the relations between i, j, a, b, x and y are the following:
i = floor(x),  a = x - i,  j = floor(y),  b = y - j,  so that x = i + a and y = j + b.
C/C++ implementation
Let me stress that this implementation, as well as the following CUDA ones, assumes, as stated at the beginning, that the samples of T are located on the Cartesian regular grid of points (i, j) with 0 <= i < M1, 0 <= j < M2 and i and j integers (unit spacing). Also, the routine is provided in single-precision, complex (float2) arithmetic, but it can be easily recast to other arithmetic of interest.
void bilinear_interpolation_function_CPU(float2 * __restrict__ h_result, float2 * __restrict__ h_data,
                                         float * __restrict__ h_xout, float * __restrict__ h_yout,
                                         const int M1, const int M2, const int N1, const int N2){

    float2 result_temp1, result_temp2;
    for(int k=0; k<N2; k++){
        for(int l=0; l<N1; l++){

            const int   ind_x = floor(h_xout[k*N1+l]);
            const float a     = h_xout[k*N1+l]-ind_x;

            const int   ind_y = floor(h_yout[k*N1+l]);
            const float b     = h_yout[k*N1+l]-ind_y;

            float2 h00, h01, h10, h11;
            if (((ind_x)   < M1)&&((ind_y)   < M2)) h00 = h_data[ind_y*M1+ind_x];       else h00 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y)   < M2)) h10 = h_data[ind_y*M1+ind_x+1];     else h10 = make_float2(0.f, 0.f);
            if (((ind_x)   < M1)&&((ind_y+1) < M2)) h01 = h_data[(ind_y+1)*M1+ind_x];   else h01 = make_float2(0.f, 0.f);
            if (((ind_x+1) < M1)&&((ind_y+1) < M2)) h11 = h_data[(ind_y+1)*M1+ind_x+1]; else h11 = make_float2(0.f, 0.f);

            result_temp1.x = a * h10.x + (-h00.x * a + h00.x);
            result_temp1.y = a * h10.y + (-h00.y * a + h00.y);

            result_temp2.x = a * h11.x + (-h01.x * a + h01.x);
            result_temp2.y = a * h11.y + (-h01.y * a + h01.y);

            h_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
            h_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
        }
    }
}
The if/else statements within the above code are simply boundary checks. If a sample falls outside the [0, M1-1] x [0, M2-1] region, it is set to 0.
Standard CUDA implementation
This is a "standard" CUDA implementation tracing the above CPU one. No usage of texture memory.
__global__ void bilinear_interpolation_kernel_GPU(float2 * __restrict__ d_result, const float2 * __restrict__ d_data,
                                                  const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                  const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l]-ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l]-ind_y;

        float2 d00, d01, d10, d11;
        if (((ind_x)   < M1)&&((ind_y)   < M2)) d00 = d_data[ind_y*M1+ind_x];       else d00 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1)&&((ind_y)   < M2)) d10 = d_data[ind_y*M1+ind_x+1];     else d10 = make_float2(0.f, 0.f);
        if (((ind_x)   < M1)&&((ind_y+1) < M2)) d01 = d_data[(ind_y+1)*M1+ind_x];   else d01 = make_float2(0.f, 0.f);
        if (((ind_x+1) < M1)&&((ind_y+1) < M2)) d11 = d_data[(ind_y+1)*M1+ind_x+1]; else d11 = make_float2(0.f, 0.f);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
    }
}
CUDA implementation with texture fetch
This is the same implementation as above, but the global memory is now accessed through the texture cache. For example, T[i,j] is accessed as
tex2D(d_texture_fetch_float,ind_x,ind_y);
(where, of course ind_x = i and ind_y = j, and d_texture_fetch_float is assumed to be a global scope variable) instead of
d_data[ind_y*M1+ind_x];
Note that the hard-wired texture filtering capabilities are not exploited here. The routine below has the same precision as the one above, and could turn out to be somewhat faster on old CUDA architectures.
__global__ void bilinear_interpolation_kernel_GPU_texture_fetch(float2 * __restrict__ d_result,
                                                                const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) {

        float2 result_temp1, result_temp2;

        const int   ind_x = floor(d_xout[k*N1+l]);
        const float a     = d_xout[k*N1+l]-ind_x;

        const int   ind_y = floor(d_yout[k*N1+l]);
        const float b     = d_yout[k*N1+l]-ind_y;

        const float2 d00 = tex2D(d_texture_fetch_float,ind_x,ind_y);
        const float2 d10 = tex2D(d_texture_fetch_float,ind_x+1,ind_y);
        const float2 d11 = tex2D(d_texture_fetch_float,ind_x+1,ind_y+1);
        const float2 d01 = tex2D(d_texture_fetch_float,ind_x,ind_y+1);

        result_temp1.x = a * d10.x + (-d00.x * a + d00.x);
        result_temp1.y = a * d10.y + (-d00.y * a + d00.y);

        result_temp2.x = a * d11.x + (-d01.x * a + d01.x);
        result_temp2.y = a * d11.y + (-d01.y * a + d01.y);

        d_result[k*N1+l].x = b * result_temp2.x + (-result_temp1.x * b + result_temp1.x);
        d_result[k*N1+l].y = b * result_temp2.y + (-result_temp1.y * b + result_temp1.y);
    }
}
Texture binding can be done according to
void TextureBindingBilinearFetch(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_fetch_float, data_d, &desc, M1, M2, pitch));
    d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}
Note that no if/else boundary checking is needed here, because the texture addressing mode automatically clamps out-of-range coordinates to the [0, M1-1] x [0, M2-1] sampling region, thanks to the instructions
d_texture_fetch_float.addressMode[0] = cudaAddressModeClamp;
d_texture_fetch_float.addressMode[1] = cudaAddressModeClamp;
CUDA implementation with texture interpolation
This is the last implementation and uses the hard-wired capabilities of texture filtering.
__global__ void bilinear_interpolation_kernel_GPU_texture_interp(float2 * __restrict__ d_result,
                                                                 const float * __restrict__ d_xout, const float * __restrict__ d_yout,
                                                                 const int M1, const int M2, const int N1, const int N2)
{
    const int l = threadIdx.x + blockDim.x * blockIdx.x;
    const int k = threadIdx.y + blockDim.y * blockIdx.y;

    if ((l<N1)&&(k<N2)) { d_result[k*N1+l] = tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f); }
}
Note that the interpolation formula implemented by this feature is the same as the one derived above, but now
i = floor(x_B),  a = frac(x_B),  j = floor(y_B),  b = frac(y_B)
where x_B = x - 0.5 and y_B = y - 0.5. This explains the 0.5 offset in the instruction
tex2D(d_texture_interp_float, d_xout[k*N1+l] + 0.5f, d_yout[k*N1+l] + 0.5f)
In this case, texture binding should be done as follows
void TextureBindingBilinearInterp(const float2 * __restrict__ data, const int M1, const int M2)
{
    size_t pitch;
    float* data_d;
    gpuErrchk(cudaMallocPitch((void**)&data_d, &pitch, M1 * sizeof(float2), M2));
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    gpuErrchk(cudaBindTexture2D(0, &d_texture_interp_float, data_d, &desc, M1, M2, pitch));
    d_texture_interp_float.addressMode[0] = cudaAddressModeClamp;
    d_texture_interp_float.addressMode[1] = cudaAddressModeClamp;
    d_texture_interp_float.filterMode = cudaFilterModeLinear;   // --- Enable linear filtering
    d_texture_interp_float.normalized = false;                  // --- Texture coordinates will NOT be normalized
    gpuErrchk(cudaMemcpy2D(data_d, pitch, data, sizeof(float2)*M1, sizeof(float2)*M1, M2, cudaMemcpyHostToDevice));
}
Note that, as already mentioned in the other answers, a and b are stored in 9-bit fixed point format with 8 bits of fractional value, so this approach will be very fast, but less accurate than those above.
The UV interpolants are truncated to 9 bits, not the participating texel values. In Chapter 10 (Texturing) of The CUDA Handbook, this is described in detail (including CPU emulation code) for the 1D case. Code is open source and may be found at https://github.com/ArchaeaSoftware/cudahandbook/blob/master/texturing/tex1d_9bit.cu
Using the wrong form of the bilinear interpolation formula makes the result of the texture fetch look weird.
Formula 1: the form you can easily find in the CUDA appendix or on the wiki
tex(x,y) = (1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1]
Formula 2: an algebraically equivalent form with fewer multiplications
tex(x,y) = T[i,j] + α(T[i+1,j]−T[i,j]) + β(T[i,j+1]−T[i,j]) + αβ(T[i,j]+T[i+1,j+1]−T[i+1,j]−T[i,j+1])
If you apply the 9-bit fixed-point format to Formula 1 you will get results that do not match the texture fetch, but Formula 2 works fine.
Conclusion:
If you want to emulate the bilinear interpolation implemented by the CUDA texture unit, you should use Formula 3. Try it!
Formula 3:
tex(x,y) = T[i,j] + frac(α)(T[i+1,j]−T[i,j]) + frac(β)(T[i,j+1]−T[i,j]) + frac(αβ)(T[i,j]+T[i+1,j+1]−T[i+1,j]−T[i,j+1])
// frac(x) converts a float to 9-bit fixed-point format with 8 bits of fractional value.
float frac( float x ) {
    float frac, tmp = x - (float)(int)(x);
    float frac256 = (float)(int)( tmp*256.0f + 0.5f );
    frac = frac256 / 256.0f;
    return frac;
}
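For completeness, here is a minimal sketch of how Formula 3 and frac() fit together in a CPU-side emulation (the names frac9, emulate_tex2D and at are mine; a single-channel float image, unnormalized coordinates and clamp addressing are assumed):

#include <algorithm>
#include <cmath>

// Same quantization as frac() above: keep 8 fractional bits.
static float frac9(float x) {
    float tmp = x - (float)(int)(x);
    return (float)(int)(tmp * 256.0f + 0.5f) / 256.0f;
}

// Emulation of tex2D() with cudaFilterModeLinear, unnormalized coordinates and
// cudaAddressModeClamp, applying Formula 3 to a single-channel M x N image T.
static float emulate_tex2D(const float* T, int M, int N, float x, float y) {
    auto at = [&](int i, int j) {                      // clamped texel read
        i = std::min(std::max(i, 0), M - 1);
        j = std::min(std::max(j, 0), N - 1);
        return T[j * M + i];
    };
    const float xB = x - 0.5f, yB = y - 0.5f;          // shift to texel-centre coordinates
    const int   i  = (int)std::floor(xB);
    const int   j  = (int)std::floor(yB);
    const float a  = xB - i, b = yB - j;               // raw interpolation weights
    return at(i, j)
         + frac9(a)     * (at(i + 1, j)     - at(i, j))
         + frac9(b)     * (at(i, j + 1)     - at(i, j))
         + frac9(a * b) * (at(i, j) + at(i + 1, j + 1) - at(i + 1, j) - at(i, j + 1));
}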
I have been playing around with drawing a 320 by 240 full-screen image in OpenGL using Java and LWJGL. I set the display to 640 by 480 and doubled the size of the pixels to fill the space. After a lot of Google searching I found some information about using the glDrawPixels function to speed up drawing to the screen. I wanted to test it by assigning random colors to all the pixels on the screen, but it wouldn't fill the screen. Then I divided the width into 4 sections of 80 pixels each and colored them red, green, blue, and white. The colors came out interleaved, and I can't figure out how.
Here is an image of the output:
Here is where I run the OpenGL code:
// init OpenGL
GL11.glMatrixMode(GL11.GL_PROJECTION);
GL11.glLoadIdentity();
GL11.glOrtho(0, 640, 0, 480, 1, -1);
GL11.glMatrixMode(GL11.GL_MODELVIEW);

while (!Display.isCloseRequested()) {
    pollInput();
    // Clear the screen and depth buffer
    GL11.glClear(GL11.GL_COLOR_BUFFER_BIT | GL11.GL_DEPTH_BUFFER_BIT);
    randomizePixels();
    GL11.glRasterPos2i(0, 0);
    GL11.glDrawPixels(320, 240, GL11.GL_RGBA, GL11.GL_UNSIGNED_BYTE, buff);
    GL11.glPixelZoom(2, 2);
    Display.update();
}
Display.destroy();
}
and here is where I create the pixel color data:
public void randomizePixels(){
    for(int y = 0; y < 240; y++){
        for(int x = 0; x < 320; x+=4){
            /*
            pixels[x * 320 + y] = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 1] = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 2] = (byte)(-128 + ran.nextInt(256));
            pixels[x * 320 + y + 3] = (byte)(-128 + ran.nextInt(256));
            */
            if(x >= 0 && x < 80){
                pixels[y * 240 + x] = (byte)128;
                pixels[y * 240 + x + 1] = (byte)0;
                pixels[y * 240 + x + 2] = (byte)0;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 80 && x < 160){
                pixels[y * 240 + x] = (byte)0;
                pixels[y * 240 + x + 1] = (byte)128;
                pixels[y * 240 + x + 2] = (byte)0;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 160 && x < 240){
                pixels[y * 240 + x] = (byte)0;
                pixels[y * 240 + x + 1] = (byte)0;
                pixels[y * 240 + x + 2] = (byte)128;
                pixels[y * 240 + x + 3] = (byte)128;
            }else if(x >= 240 && x < 320){
                pixels[y * 240 + x] = (byte)128;
                pixels[y * 240 + x + 1] = (byte)128;
                pixels[y * 240 + x + 2] = (byte)128;
                pixels[y * 240 + x + 3] = (byte)128;
            }
        }
    }
    buff.put(pixels).flip();
}
If you can figure out why I can't get the pixels to line up with the x and y coordinates I intend, that would be great. I have read that glDrawPixels probably isn't the best or fastest way to draw pixels to the screen, but I want to understand why I'm having this particular issue before I move on to some other method.
Just load your image (unscaled) into a texture and draw a textured quad.
Don't use glDrawPixels. This function was never properly optimized in most drivers; it has been deprecated since OpenGL-2 and was removed from OpenGL-3 core and later.
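A rough sketch of that approach (written here with C-style OpenGL 1.x calls; LWJGL's GL11 class exposes the same functions, e.g. GL11.glTexImage2D and GL11.glBegin), assuming a 320x240 RGBA byte buffer called pixels:

// One-time setup: upload the 320x240 RGBA buffer into a texture.
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);   // keep hard pixel edges
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 320, 240, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

// Every frame: update the texture and draw one quad covering the 640x480 window,
// which scales the 320x240 image up by a factor of two.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 320, 240, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glEnable(GL_TEXTURE_2D);
glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(0,   0);
    glTexCoord2f(1, 0); glVertex2f(640, 0);
    glTexCoord2f(1, 1); glVertex2f(640, 480);
    glTexCoord2f(0, 1); glVertex2f(0,   480);
glEnd();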
I spot 2 issues in your randomizePixels().
1. Indexing Pixel Buffer
The total size of the pixel buffer is 320 x 240 x 4 bytes because the pixel format is GL_RGBA. So, indexing each pixel with the subscript operator, [], would look like this:
for(int y = 0; y < 240; y++)
{
    for(int x = 0; x < 320; x++)
    {
        pixels[y * 320 * 4 + x * 4 + 0] = ... // R
        pixels[y * 320 * 4 + x * 4 + 1] = ... // G
        pixels[y * 320 * 4 + x * 4 + 2] = ... // B
        pixels[y * 320 * 4 + x * 4 + 3] = ... // A
    }
}
2. Colour Value
The maximum intensity of an 8-bit colour channel is 255; for example, an opaque red pixel would be (255, 0, 0, 255). (In Java, byte is signed, so 255 has to be written as (byte)255, i.e. the bit pattern 0xFF; OpenGL reads it as an unsigned value because the type passed to glDrawPixels is GL_UNSIGNED_BYTE.)
You're operating on raw pixels; better to do it with a textured quad instead, which would yield better results.