CUDA, NPP Filters - C++

The CUDA NPP library supports filtering of images using the nppiFilter_8u_C1R function, but I keep getting errors. I have no problem getting the boxFilterNPP sample code up and running:
eStatusNPP = nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                                  oDeviceDst.data(), oDeviceDst.pitch(),
                                  oSizeROI, oMaskSize, oAnchor);
But if I change it to use nppiFilter_8u_C1R instead, eStatusNPP returns error -24 (NPP_TEXTURE_BIND_ERROR). The code below shows the alterations I made to the original boxFilterNPP sample:
NppiSize oMaskSize = {5, 5};
npp::ImageCPU_32s_C1 hostKernel(5, 5);
for (int x = 0; x < 5; x++){
    for (int y = 0; y < 5; y++){
        hostKernel.pixels(x, y)[0].x = 1;
    }
}
npp::ImageNPP_32s_C1 pKernel(hostKernel);
Npp32s nDivisor = 1;
eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                               oDeviceDst.data(), oDeviceDst.pitch(),
                               oSizeROI,
                               pKernel.data(),
                               oMaskSize, oAnchor,
                               nDivisor);
This has been tried on CUDA 4.2 and 5.0, with the same result.
The code runs with the expected result when oMaskSize = {1, 1}.

nppiFilter applies the mask extending upward and to the left, following the mathematical convention that the convolution of two functions reverses the direction of the second function.
The box filter mask extends downward and to the right, which is probably more intuitive.
In any case, the problem is caused by the fact that the input image in the changed code would have to be sampled at what would effectively be SOURCE[-4, -4] in order to compute DESTINATION[0, 0]. Since the input image is accessed via a texture sampler, binding the source image pointer offset by (-4, -4) causes the texture-bind error you're seeing.
Workaround: The simplest workaround for this issue would be to set the anchor point to (4, 4), which would effectively move the mask down and to the right. You still need to be aware that you'd want to invert the weights in the kernel array (i.e. K[-4, -4] -> K[0, 0], K[0, 0] -> K[-4, -4], etc.).
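A minimal sketch of that workaround, assuming the kernel weights live in a plain 25-element host array (hostKernelWeights is an illustrative name, not from the sample):
NppiPoint oAnchor = {4, 4}; // bottom-right corner of the 5x5 mask, so the
                            // filter never samples above/left of the ROI origin

// Flip the kernel: K[0,0] <-> K[4,4], K[0,1] <-> K[4,3], etc.
// (a no-op for an all-ones box kernel, but required in general).
Npp32s flippedKernel[5 * 5];
for (int y = 0; y < 5; y++)
    for (int x = 0; x < 5; x++)
        flippedKernel[y * 5 + x] = hostKernelWeights[(4 - y) * 5 + (4 - x)];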

I had the same problem when I stored my kernel as an ImageCPU/ImageNPP.
A good solution is to store the kernel as a traditional 1D array on the device. I tried this, and it gave me good results (and none of those unpredictable or garbage images).
Thanks to Frank Jargstorff in this StackOverflow post for the 1D idea.
NppiSize oMaskSize = {5, 5};
Npp32s hostKernel[5*5];
for (int x = 0; x < 5; x++){
    for (int y = 0; y < 5; y++){
        hostKernel[x*5 + y] = 1;
    }
}
Npp32s* pKernel; // just a regular 1D array on the GPU
cudaMalloc((void**)&pKernel, 5 * 5 * sizeof(Npp32s));
cudaMemcpy(pKernel, hostKernel, 5 * 5 * sizeof(Npp32s), cudaMemcpyHostToDevice);
Using this original image, here's the blurred result that I get from your code with the 1D kernel array:
Other parameters that I used:
Npp32s nDivisor = 25;
NppiPoint oAnchor = {4, 4};
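Putting it together, the filter call itself is unchanged except that pKernel is now a plain device pointer; a sketch reusing oDeviceSrc, oDeviceDst, and oSizeROI from the boxFilterNPP sample:
eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                               oDeviceDst.data(), oDeviceDst.pitch(),
                               oSizeROI,
                               pKernel, // contiguous 1D device array, no pitch
                               oMaskSize, oAnchor,
                               nDivisor);
cudaFree(pKernel); // release the kernel buffer when finished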

Thank you for the help.
I got past the error, but I'm seeing some odd behavior: the output image changes depending on what program I ran just before, and it does not show what I am going for.
The example I am trying to mimic is nppiFilterBox_8u_C1R, using nppiFilter_8u_C1R with the kernel set to all ones and nDivisor set to the sum of the kernel.
This code is still an alteration of the boxFilterNPP sample code:
NppiSize oMaskSize = {5, 5};
npp::ImageCPU_32s_C1 hostKernel(5, 5);
for (int x = 0; x < 5; x++){
    for (int y = 0; y < 5; y++){
        hostKernel.pixels(x, y)[0].x = 1;
    }
}
npp::ImageNPP_32s_C1 pKernel(hostKernel);
Npp32s nDivisor = 25;
NppiPoint oAnchor = {4, 4};
eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                               oDeviceDst.data(), oDeviceDst.pitch(),
                               oSizeROI,
                               pKernel.data(),
                               oMaskSize, oAnchor,
                               nDivisor);
Since the kernel is all ones, the need to invert the weights should not be an issue.
The five different kinds of image this code returns are shown below; mostly the last one is returned.
http://1ordrup.dk/kasper/image/Lena_boxFilter1.jpg
http://1ordrup.dk/kasper/image/Lena_boxFilter2.jpg
http://1ordrup.dk/kasper/image/Lena_boxFilter3.jpg
http://1ordrup.dk/kasper/image/Lena_boxFilter4.jpg
http://1ordrup.dk/kasper/image/Lena_boxFilter5.jpg
I think the reason this happens is that the kernel is not initialised correctly, or not used at all, so data with pseudo-random content is used as the kernel.

Related

Modifying only the beginning of an image instead of the whole image

I currently have some code that reads an image stored in the TGA format, does something with it, and then stores it in a new TGA file.
The problem is that only the bottom third is being modified; the other two thirds stay equal to the original image. Here is the code:
int size = width*height*bpp;
char imageArray[size];
char *arrayPtr = &imageArray[0];
......
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        imageArray[x*height + 3*y]     = 255;
        imageArray[x*height + 3*y + 1] = 0;
        imageArray[x*height + 3*y + 2] = 0;
    }
}
fileWriter.write(arrayPtr, size);
As can be seen inside the loops, I am modifying each color value, in this case turning the picture into a single-color image. Unfortunately only the bottom third is modified, even though the number of loop iterations equals the number of pixels, and with three writes per iteration that amounts to the number of bytes in the original image.
I have no idea what I am doing wrong and would be thankful for any recommendations.
The whole offset has to be multiplied by bpp, not only y:
imageArray[bpp*(x*height + y)]     = 255;
imageArray[bpp*(x*height + y) + 1] = 0;
....
I think I understand your problem now, but it relies on some assumptions about how you are bringing in your data and what bpp means.
You are trying to loop over every pixel and update its 3 values.
You set size = width*height*bpp, where I can only assume bpp means bytes per pixel and is the 3 showing up in your loop. Try stepping through this with x=1 and y=0. If the data is laid out contiguously like
RGB @ x=0,y=0; RGB @ x=1,y=0; ...
then you can see you end up writing over your data from the first iteration of the loop. Every time you nest a loop, the index has to be multiplied entirely by the next level's dimension. Just replace x*height + 3*y with (x*height + y)*bpp, assuming bpp = 3.
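A sketch of the corrected loop under those assumptions (bpp = 3, contiguous layout):
for (int x = 0; x < width; x++) {
    for (int y = 0; y < height; y++) {
        int idx = (x*height + y)*bpp; // scale the whole pixel offset by bpp
        imageArray[idx]     = 255;    // first channel
        imageArray[idx + 1] = 0;      // second channel
        imageArray[idx + 2] = 0;      // third channel
    }
}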
It all depends on the order in which the bytes are stored in the image array.
Your formulation suggests by-column/by-row/by-color, but it could also be by-row/by-column/by-color, or even by-color/by-row/by-column.
The index formulation should be one of
x*(b*h) + y*b + c
y*(b*w) + x*b + c
c*(w*h) + y*h + x
(b, w, and h are color bytes, width, and height.)
Note how the indexes accumulate in the sums. You have forgotten at least one multiplication, assuming the order is correct.

Convolutional network filter always negative

I asked a question about a network which I'd been building last week, and I iterated on the suggestions, which led me to finding a few problems. I've come back to this project, fixed up all the issues, and learnt a lot more about CNNs in the process. Now I'm stuck on an issue where all of my weights move to massively negative values, which, coupled with the ReLU, ends in the output image always being completely black (making it impossible for the classifier to do its job).
On two labeled images:
These are passed into a two-layer network: a classifier (which gets 100% on its own) and a one-filter 3*3 convolutional layer.
On the first iteration the output from the conv layer looks like (images in the same order as above):
The filter is 3*3*3, due to the images being RGB. The weights are all random numbers between 0.0f and 1.0f. On the next iteration the images are completely black; printing the filters shows that they are now in the range of -49678.5f (the highest I can see) to -61932.3f.
This issue is in turn due to the gradients passed back from the Logistic Regression/Linear layer being crazy high for the cross (label 0, prediction 0). For the circle (label 1, prediction 0) the values are roughly between -12 and -5, but for the cross they are all positive, in the high-1000 to high-2000 range.
The code which sends these back looks something like (some parts omitted):
void LinearClassifier::Train(float* x, float output, float y)
{
    float h = output - y;
    float average = 0.0f;
    for (int i = 1; i < m_NumberOfWeights; ++i)
    {
        float error = h*x[i-1];
        m_pGradients[i-1] = error;
        average += error;
    }
    average /= static_cast<float>(m_NumberOfWeights - 1);
    for (int theta = 1; theta < m_NumberOfWeights; ++theta)
    {
        m_pWeights[theta] = m_pWeights[theta] - learningRate*m_pGradients[theta-1];
    }
    // Bias
    m_pWeights[0] -= learningRate*average;
}
This is passed back to the single convolution layer:
// This code is in three nested for loops (for layer, for outWidth, for outHeight)
float gradient = 0.0f;
// ReLU derivative
if (m_pOutputBuffer[outputIndex] > 0.0f)
{
    gradient = outputGradients[outputIndex];
}
for (int z = 0; z < m_InputDepth; ++z)
{
    for (int u = 0; u < m_FilterSize; ++u)
    {
        for (int v = 0; v < m_FilterSize; ++v)
        {
            int x = outX + u - 1;
            int y = outY + v - 1;
            int inputIndex  = x + y*m_OutputWidth + z*m_OutputWidth*m_OutputHeight;
            int kernelIndex = u + v*m_FilterSize + z*m_FilterSize*m_FilterSize;
            m_pGradients[inputIndex] += m_Filters[layer][kernelIndex]*gradient;
            m_GradientSum[layer][kernelIndex] += input[inputIndex]*gradient;
        }
    }
}
This code is iterated over by passing in each image one at a time. The gradients are obviously going in the right direction, but how do I stop the huge gradients from throwing off the prediction function?
ReLU activations are notorious for doing this. You usually have to use a low learning rate. The reasoning behind this is that when the ReLU returns positive numbers it can continue to learn freely, but if a unit gets into a position where the signal coming into it is always negative, it can become a "dead" neuron and never activate again.
Also, initializing your weights is more delicate with ReLU. It appears that you are initializing to the range 0-1, which creates a huge bias. Two tips here: use a range centered around 0, and a range that is much smaller. A normal distribution with mean 0 and std 0.02 usually works well.
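A minimal sketch of that initialization with the C++ standard library, reusing the member names from the snippets above (mean 0, std 0.02):
#include <random>

std::mt19937 rng{std::random_device{}()};
std::normal_distribution<float> dist(0.0f, 0.02f);

// one filter spans m_FilterSize * m_FilterSize * m_InputDepth weights
for (int i = 0; i < m_FilterSize * m_FilterSize * m_InputDepth; ++i)
    m_Filters[layer][i] = dist(rng);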
I fixed it by downscaling the gradients in the CNN layer, but now I'm confused as to why this works/is needed, so if anyone has any intuition as to why this works, that'd be great.

Multidimensional Integration - Coupled Limits

I need to calculate the value of a high-dimensional integral in C++. I have found numerous libraries capable of solving this task for fixed-limit integrals,
\int_{0}^{L} \int_{0}^{L} dx dy f(x,y) .
However, the integrals I am looking at have variable limits,
\int_{0}^{L} \int_{x}^{L} dx dy f(x,y) .
To clarify what I mean, here is a naive 2D Riemann sum implementation which returns the desired result:
int steps = 100;
double integral = 0;
double dl = L/((double) steps);
double x[2] = {0};
for (int i = 0; i < steps; i++){
    x[0] = dl*i;
    for (int j = i; j < steps; j++){
        x[1] = dl*j;
        double val = f(x);
        integral += val*val*dl*dl;
    }
}
where f is some arbitrary function and L the common upper integration limit. While this implementation works, it's slow and thus impractical for higher dimensions.
Effective algorithms for higher dimensions exist, but to my knowledge library implementations (e.g. Cuba) take a fixed value vector as the limit argument, which renders them useless for my problem.
Is there any reason for this, and/or is there any trick to circumvent the problem?
Your integration order is wrong; it should be dy dx.
You are integrating over the triangle
0 <= x <= y <= L
inside the square [0,L]x[0,L]. This can be simulated by integrating over the full square where the integrand f is defined as 0 outside of the triangle. In many cases, when f is defined on the full square, this can be accomplished by taking the product of f with the indicator function of the triangle as the new integrand.
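In code the extension is just a guard in the integrand; a minimal C++ sketch, written with a two-argument f for readability:
// f extended by zero outside the triangle 0 <= x <= y <= L
auto f_ext = [&](double x, double y) {
    return (x <= y) ? f(x, y) : 0.0;
};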
When integrating over a triangular region such as 0<=x<=y<=L, one can take advantage of symmetry: integrate f(min(x,y), max(x,y)) over the square 0<=x,y<=L and divide the result by 2. This has an advantage over extending f by zero (the method mentioned by LutzL) in that the extended function is continuous, which improves the performance of the integration routine.
I compared these on the example of the integral of 2x+y over 0<=x<=y<=1. The true value of the integral is 2/3. Let's compare the performance; for demonstration purposes I use a Matlab routine, but this is not specific to the language or library used.
Extending by zero
f = @(x,y) (2*x+y).*(x<=y);
result = integral2(f, 0, 1, 0, 1);
fprintf('%.9f\n',result);
Output:
Warning: Reached the maximum number of function evaluations
(10000). The result fails the global error test.
0.666727294
Extending by symmetry
g = @(x,y) (2*min(x,y)+max(x,y));
result2 = integral2(g, 0, 1, 0, 1)/2;
fprintf('%.9f\n',result2);
Output:
0.666666776
The second result is 500 times more accurate than the first.
Unfortunately, this symmetry trick is not available for general domains; but integration over a triangle comes up often enough so it's useful to keep it in mind.
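Translated back to the question's C++ Riemann sum, the symmetry trick looks roughly like this (a rough sketch, keeping the val*val integrand from the original code; diagonal points end up with half weight, which is negligible for a Riemann sum):
#include <algorithm>

double integral = 0.0;
double dl = L/((double) steps);
double x[2];
for (int i = 0; i < steps; i++){
    for (int j = 0; j < steps; j++){ // full square now
        double a = dl*i, b = dl*j;
        x[0] = std::min(a, b);       // fold (a, b) onto the triangle
        x[1] = std::max(a, b);
        double val = f(x);
        integral += val*val;
    }
}
integral *= 0.5*dl*dl;               // each off-diagonal point is counted twice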
I was a bit confused by your integral definition, but from your code I see it like this.
I just did some testing, so here is your code:
//---------------------------------------------------------------------------
double f(double *x) { return (x[0]+x[1]); }
void integral0()
{
    double L = 10.0;
    int steps = 10000;
    double integral = 0;
    double dl = L/((double) steps);
    double x[2] = {0};
    for (int i = 0; i < steps; i++){
        x[0] = dl*i;
        for (int j = i; j < steps; j++){
            x[1] = dl*j;
            double val = f(x);
            integral += val*val*dl*dl;
        }
    }
}
//---------------------------------------------------------------------------
Here is optimized code:
//---------------------------------------------------------------------------
void integral1()
{
    double L = 10.0;
    int i0, i1, steps = 10000;
    double x[2] = {0.0, 0.0};
    double integral, val, dl = L/((double) steps);
    #define f(x) (x[0]+x[1])
    integral = 0.0;
    for (x[0] = 0.0, i0 = 0; i0 < steps; i0++, x[0] += dl)
        for (x[1] = x[0], i1 = i0; i1 < steps; i1++, x[1] += dl)
        {
            val = f(x);
            integral += val*val;
        }
    integral *= dl*dl; // factor the constant area element out of the loop
    #undef f
}
//---------------------------------------------------------------------------
Results:
[ 452.639 ms] integral0
[ 336.268 ms] integral1
So the increase in speed is ~1.3 times (in a 32-bit app on WOW64, AMD 3.2GHz), and for higher dimensions it will multiply. But I still think this approach is slow. The only way I can think of to reduce complexity is to simplify things algebraically, either with integration tables or with Laplace or Z transforms, but for that f(*x) must be known ... A constant-factor reduction can of course be achieved by multi-threading and/or GPU usage; this can give you an N-times speed increase, because this is all directly parallelisable.
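For instance, a minimal OpenMP sketch of integral1 (assuming a compiler flag such as -fopenmp; each outer iteration is independent, and schedule(dynamic) balances the shrinking inner loop):
#include <omp.h>

double integral1_mt(double L, int steps)
{
    const double dl = L/((double) steps);
    double integral = 0.0;
    #pragma omp parallel for reduction(+:integral) schedule(dynamic)
    for (int i0 = 0; i0 < steps; i0++){
        double x0 = dl*i0;
        for (int i1 = i0; i1 < steps; i1++){
            double x1 = dl*i1;
            double val = x0 + x1; // the same f(x) = x[0]+x[1] as above
            integral += val*val;
        }
    }
    return integral*dl*dl;
}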

Generating incomplete iterated function systems

I am doing this assignment for fun.
http://groups.csail.mit.edu/graphics/classes/6.837/F04/assignments/assignment0/
There are sample outputs at the site if you want to see how it is supposed to look. It involves iterated function systems, whose algorithm, according to the assignment, is:
for "lots" of random points (x0, y0)
for k=0 to num_iters
pick a random transform fi
(xk+1, yk+1) = fi(xk, yk)
display a dot at (xk, yk)
I am running into trouble with my implementation, which is:
void IFS::render(Image& img, int numPoints, int numIterations){
    Vec3f color(0, 1, 0);
    float x, y;
    float u, v;
    Vec2f myVector;
    for (int i = 0; i < numPoints; i++){
        x = (float)(rand()%img.Width())/img.Width();
        y = (float)(rand()%img.Height())/img.Height();
        myVector.Set(x, y);
        for (int j = 0; j < numIterations; j++){
            float randomPercent = (float)(rand()%100)/100;
            for (int k = 0; k < num_transforms; k++){
                if (randomPercent < range[k]){
                    matrices[k].Transform(myVector);
                }
            }
        }
        u = myVector.x()*img.Width();
        v = myVector.y()*img.Height();
        img.SetPixel(u, v, color);
    }
}
This is how I pick a random transform from the input matrices:
fscanf(input, "%d", &num_transforms);
matrices = new Matrix[num_transforms];
probablility = new float[num_transforms];
range = new float[num_transforms+1];
for (int i = 0; i < num_transforms; i++) {
    fscanf(input, "%f", &probablility[i]);
    matrices[i].Read3x3(input);
    if (i == 0) range[i] = probablility[i];
    else range[i] = probablility[i] + range[i-1];
}
My output shows only the beginnings of a Sierpinski triangle (1000 points, 1000 iterations):
My dragon is better, but still needs some work (1000 points, 1000 iterations):
If you have RAND_MAX = 4 and picture width 3, an evenly distributed sequence like [0,1,2,3,4] from rand() will be mapped to [0,1,2,0,1] by your modulo code, i.e. some numbers will occur more often than others. You need to cut off those numbers that lie above the highest multiple of the target range that is below RAND_MAX, i.e. above ((RAND_MAX / 3) * 3). Just check for this limit and call rand() again.
Since you have to fix that error in several places, consider writing a utility function (see the sketch below). Then, reduce the scope of your variables: the u, v declaration makes it hard to see that these two are only used in three lines of code. Declare them as "unsigned const u = ..." to make this clear and additionally get the compiler to check that you don't accidentally modify them afterwards.
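A sketch of such a utility function (classic rejection sampling; with C++11 you could use std::uniform_int_distribution instead):
#include <cstdlib>

// Uniform integer in [0, n): reject rand() results at or above the largest
// multiple of n that fits, so every residue is equally likely.
int uniformRand(int n)
{
    int const limit = RAND_MAX - (RAND_MAX % n);
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return r % n;
}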

super mysterious logic error for a planar fitting image processing algorithm

So I have this image processing program where I am using a linear regression algorithm to find the plane that best fits all of the points (x, y, z), z being the pixel color intensity (0-255).
Simply speaking, I have this picture of ? x ? dimensions. I run this algorithm and get A, B, C values (3 float values).
Then I go over every pixel in the program and subtract mod_val from the pixel value, where
mod_val = (-A * x - B * y) / C
A, B, C are constants, while x, y is the pixel location in an x, y plane.
When the dimensions of the picture are divisible by 100 it's perfect, but when they are not, the picture fractures: the picture itself is the same as the original, but there is a diagonal line with color contrast that goes across it. The program is supposed to make the pixel color uniform from the center.
I tried running the picture with mod_val = 0 for dimensions not divisible by 100, and it copies the picture perfectly, so I doubt there is a problem with storing and writing the read data in terms of alignment. (FYI, this picture is a grayscale 8-bit .bmp.)
I have tried changing the A, B, C values, but the diagonal remains the same; only the color of the image fragments within the diagonals changes.
When I run a 1400 x 1100 picture it works perfectly with the mod_val equation written above, which is the most baffling part.
I spent a lot of time looking for rounding errors; virtually all values are floats. The dimensions I used for the breaking picture are 1490 x 1170.
Here is a fragment of the code where I think the error is occurring:
int img_row = row_length;
int img_col = col_length;
int i = 0;
float *pfAmultX = new float[img_row];
for (int x = 0; x < img_row; x++)
{
    pfAmultX[x] = (A * x)/C;
}
for (int y = 0; y < img_col; y++)
{
    float BmultY = B*y/C;
    for (int x = 0; x < img_row; x++, i++)
    {
        modify_val = pfAmultX[x] + BmultY;
        int temp = (int) data.data[i];
        data.data[i] += (unsigned char) modify_val;
        if (temp >= 250){
            data.data[i] = 255;
        }
        else if (temp < 0){
            data.data[i] = 0;
        }
    }
}
delete[] pfAmultX;
img_row and img_col are correct according to the VS debugger.
Any help would be greatly appreciated. I've been trying to find this bug for many hours now, and my boss is telling me that I can't go home until I find it...
Before the algorithm (1400 x 1100, works):
After:
Before (1490 x 1170, demonstrates the problem):
After:
UPDATE:
After extensive testing, I have boiled the problem down to something involving the x coordinate.
When I use large A or B values, or both (the C value is always ~.999), the 1400 x 1100 image does not develop diagonals.
However, for the other image, large B values do not create diagonals, but a fairly small to average A value does.
What's more, when I test a picture where x is divisible by 100 but y is only divisible by 10, the result is correct.
Well, in the end I found the solution. It was a problem with the padding in the bitmap: when the x dimension was not divisible by 4, the rows were padded, which threw off all of the x coordinates. This also meant that the row value I received from the BMP header matched the nominal dimension but not the actual row size in the file. I had to compute the real row size as 4 * ((row_value_from_bmp_header + 3) / 4), using integer division.
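For reference, a sketch of that stride computation for an 8-bit grayscale BMP (1 byte per pixel; BMP rows are padded up to a multiple of 4 bytes):
int rowStride = ((width + 3) / 4) * 4; // integer division truncates, so this
                                       // rounds width up to a multiple of 4
int pixelIndex = y*rowStride + x;      // index into the raw pixel data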