I recently took some code that tracked an object based on color in OpenCV C++ and rewrote it using the Python bindings.
The overall method and results were the same, syntax aside. But when I run the code below on each frame of a video, it takes almost 2-3 seconds to complete, whereas the C++ variant (also below) is instant in comparison: I can iterate between frames as fast as my finger can press a key.
Any ideas or comments?
cv.PyrDown(img, dsimg)
for i in range(0, dsimg.height):
    for j in range(0, dsimg.width):
        if dsimg[i, j][1] > (_RED_DIFF + dsimg[i, j][2]) and dsimg[i, j][1] > (_BLU_DIFF + dsimg[i, j][0]):
            res[i, j] = 255
        else:
            res[i, j] = 0
for( int i = 0; i < height; i++ )
{
    for( int j = 0; j < width; j++ )
    {
        if( ( (data[i * step + j * channels + 1]) > (RED_DIFF + data[i * step + j * channels + 2]) ) &&
            ( (data[i * step + j * channels + 1]) > (BLU_DIFF + data[i * step + j * channels]) ) )
            data_r[i * step_r + j * channels_r] = 255;
        else
            data_r[i * step_r + j * channels_r] = 0;
    }
}
Thanks
Try using numpy to do your calculation, rather than nested loops. You should get C-like performance for simple calculations like this from numpy.
For example, your nested for loops can be replaced with a couple of numpy expressions...
I'm not terribly familiar with OpenCV, but I think the Python bindings now have a numpy array interface, so your example above should be as simple as:
cv.PyrDown(img, dsimg)
data = np.asarray(dsimg)  # assumes "import numpy as np"
blue, green, red = data[..., 0], data[..., 1], data[..., 2]  # index channels directly so res keeps the (H, W) shape
res = (green > (_RED_DIFF + red)) & (green > (_BLU_DIFF + blue))
res = res.astype(np.uint8) * 255
res = cv.fromarray(res)
(Completely untested, of course...) Again, I'm really not terribly familiar with OpenCV, but nested Python for loops are not the way to modify an image element-wise, regardless...
Hope that helps a bit, anyway!
Here's the code snippet I'd like help understanding
for (i = 0; i < samplesX; i++)
    for (j = 0; j < samplesY; j++)
    {
        newI = DIM * i / samplesX;
        newJ = DIM * j / samplesY;
        idx = (round(newJ) * DIM) + round(newI);

        if (color_dir == 1 && draw_vecs == 1) {
            direction_to_color(vx[idx], vy[idx], color_dir);
        }
        if (color_dir == 1 && draw_vecs == 2) {
            direction_to_color(fx[idx], fy[idx], color_dir);
        }
        else if (color_dir == 2) {
            scalar = rho[idx];
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 3) {
            scalar = sqrt(vx[idx] * vx[idx] + vy[idx] * vy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }
        else if (color_dir == 4) {
            scalar = sqrt(fx[idx] * fx[idx] + fy[idx] * fy[idx]);
            set_colormap(scalar, min, max, clampLow, clampHigh);
        }

        /*if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * vx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)newI * wn, hn + (fftw_real)newJ * hn);
            glVertex2f((wn + (fftw_real)newI * wn) + vec_scale * fx[idx], (hn + (fftw_real)newJ * hn) + vec_scale * fy[idx]);
        }*/

        if (draw_vecs == 1) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * vx[idx], (hn + (fftw_real)j * hn) + vec_scale * vy[idx]);
        }
        else if (draw_vecs == 2) {
            glVertex2f(wn + (fftw_real)i * wn, hn + (fftw_real)j * hn);
            glVertex2f((wn + (fftw_real)i * wn) + vec_scale * fx[idx], (hn + (fftw_real)j * hn) + vec_scale * fy[idx]);
        }
    }
glEnd();
}
What this currently does, as far as I understand, is display two-dimensional lines/arrows (hedgehogs) that visualize force/velocity in 2D, as can be seen in the picture below.
Sadly, my understanding of linear algebra, calculus and computer graphics in general only goes so far, and I'm having trouble dissecting this piece.
Ideally, I'd like to understand this code and also learn how I can extend it to display two other glyph types that show a vector and/or scalar field, such as
three-dimensional cones
three-dimensional ellipsoids
If I'm missing anything here, please let me know!
Some of the variables included in the above snippet:
const int DIM = 50; //size of simulation grid
int color_dir = 0; //use direction color-coding or not
float scalar;
int newI, newJ;
float temp;
float vec_scale = 1000; //scaling of hedgehogs
int draw_vecs = 1; //draw the vector field or not
The code snippet you have there could have been written more simply (it also takes some educated guessing to work out what some of the variables and functions mean).
Let's break it down.
The first two lines are easy to understand; they're the standard stanza for iterating over a 2D array:
for (i = 0; i < samplesX; i++)
    for (j = 0; j < samplesY; j++)
i and j are running indices that iterate over every discrete coordinate tuple (i, j) ∈ [0, samplesX) × [0, samplesY). The next two lines remap the 2D indices into a new value range, specifically [0, samplesX) × [0, samplesY) → [0, DIM) × [0, DIM). A missing piece of information is what type DIM is; it would make sense for it to be some floating-point type.
newI = DIM * i / samplesX;
newJ = DIM * j / samplesY;
The next line is bug-prone. It collapses newI and newJ into a single running index for a flat 1D array, which is then addressed through idx.
Why is this problematic? Because information may have been lost in the conversion to DIM-space. This kind of information loss can lead to security bugs(!). As a matter of fact, Skia, the rendering library used by Google Chrome, Android and other projects, recently had exactly this kind of bug; the writeup is a worthwhile read: https://googleprojectzero.blogspot.com/2019/02/the-curious-case-of-convexity-confusion.html
The correct way to implement this is to keep DIM an integer and perform fixed-point arithmetic on it, truncating the fractional digits only at the very end.
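To make that concrete, here is a rough, untested sketch of a fixed-point remap (the FRAC constant and variable names are illustrative, not from the original code):

    // Keep 16 fractional bits through the remap; truncate only when
    // forming the final index, so no precision is lost along the way.
    const int FRAC = 16;
    long long fixI = (((long long)DIM * i) << FRAC) / samplesX;
    long long fixJ = (((long long)DIM * j) << FRAC) / samplesY;
    int newI = (int)(fixI >> FRAC);   // one truncation, at the very end
    int newJ = (int)(fixJ >> FRAC);
    int idx  = newJ * DIM + newI;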
But I digress. The next block is essentially a poor man's lookup table: vx/vy and fx/fy are flattened 2D arrays accessed through a 1D index, and direction_to_color presumably maps a direction to a color, likely via a call to glColor; the same probably goes for set_colormap. This is a bad use of OpenGL. The whole remapping from i and j to DIM-space followed by these lookups is just a poor implementation of a texture lookup. OpenGL already has textures: load the data as a texture, supply texture coordinates, and enable texturing.
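As an illustration of that point (untested, and fieldTex plus the quad corners x0/y0/x1/y1 are assumed names, not part of the original program), a single textured quad can replace all of the per-sample color lookups:

    // fieldTex: a precomputed DIM x DIM RGB texture holding the colormapped field
    glEnable(GL_TEXTURE_2D);
    glBindTexture(GL_TEXTURE_2D, fieldTex);
    glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(x0, y0);
        glTexCoord2f(1.0f, 0.0f); glVertex2f(x1, y0);
        glTexCoord2f(1.0f, 1.0f); glVertex2f(x1, y1);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(x0, y1);
    glEnd();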
Finally, for each spine two calls to glVertex are made: the first gives the starting point, which lies on the grid position (wn + i·wn, hn + j·hn); the second gives that point offset by vec_scale times the field vector at idx.
My verdict of that code: utter garbage! All of this could have been done far more elegantly, even back in 1994 with OpenGL 1.0, which this code seems to have been written for. If you want to implement your own vector-field plot, don't use this as a starting point.
These days we have programmable GPUs with shaders. All of that bulk up there can be done in a few lines of shader code.
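For a flavor of what that could look like, here is an untested sketch of a fragment shader, embedded as a C++ string (the field texture, the minVal/maxVal uniforms, the uv input and the color ramp are all illustrative assumptions):

    // The (vx, vy) field lives in the RG channels of a float texture;
    // the shader maps vector magnitude to a simple color ramp per pixel.
    const char* fragSrc = R"glsl(
        #version 330 core
        uniform sampler2D field;        // RG channels hold (vx, vy)
        uniform float minVal, maxVal;
        in vec2 uv;                     // supplied by the vertex shader
        out vec4 fragColor;
        void main() {
            vec2  v = texture(field, uv).rg;
            float s = clamp((length(v) - minVal) / (maxVal - minVal), 0.0, 1.0);
            fragColor = vec4(s, 1.0 - abs(2.0 * s - 1.0), 1.0 - s, 1.0);
        }
    )glsl";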
I'm using CUDA for the iterative Karatsuba algorithm, and I would like to ask why one line is always computed differently.
First, I implemented this function, which always computed the result correctly:
__global__ void kernel_res_main(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) / 2;
        for(TYPE inner = start; inner < end; inner++){
            result[i] += ( A[inner] + A[i - inner] ) * ( B[inner] + B[i - inner] );
            result[i] -= ( D[inner] + D[i - inner] );
        }
    }
}
Now I would like to use a 2D grid and let CUDA handle the for-loop, so I changed my function to this:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    TYPE rtmp = result[i];
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j <= end ){
            // WRONG
            rtmp += ( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] );
        }
    }
    result[i] = rtmp;
}
I am calling this function like this:
dim3 block( 32, 8 );
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
kernel_res_nested <<<grid, block>>> (devA, devB, devD, devResult, size, resultSize);
And the result is always wrong and always different. I can't understand why the second implementation is wrong and always computes wrong results. I can't see any logical problem connected with data dependency. Does anyone know how I can solve this problem?
For questions like this, you are supposed to provide an MCVE. (See item 1 here.) For example, I don't know what type is indicated by TYPE, and it does matter for the correctness of the solution I will propose.
In your first kernel, only one thread in your entire grid was reading and writing location result[i]. But in your second kernel, you now have multiple threads writing to the result[i] location. They are conflicting with each other. CUDA doesn't specify the order in which threads will run, and some may run before, after, or at the same time as, others. In this case, some threads may read result[i] at the same time as others. Then, when the threads write their results, they will be inconsistent. And it may vary from run-to-run. You have a race condition there (execution order dependency, not data dependency).
The canonical method to sort this out would be to employ a reduction technique (a rough sketch appears at the end of this answer).
However for simplicity, I will suggest that atomics could help you sort it out. This is easier to implement based on what you have shown, and will help confirm the race condition. After that, if you want to try a reduction method, there are plenty of tutorials for that (one is linked above) and plenty of questions here on the cuda tag about it.
You could modify your kernel to something like this, to sort out the race condition:
__global__ void kernel_res_nested(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;
    if( i > 0 && i < resultSize - 1){
        TYPE start = (i >= size) ? (i % size ) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        if(j >= start && j < end ){ // see note below
            atomicAdd(result+i, (( A[j] + A[i - j] ) * ( B[j] + B[i - j] ) - ( D[j] + D[i - j] )));
        }
    }
}
Note that depending on your GPU type, and the actual type of TYPE you are using, this may not work (may not compile) as-is. But since you had previously used TYPE as a loop variable, I am assuming it is an integer type, and the necessary atomicAdd for those should be available.
A few other comments:
This may not be giving you the grid size you expect:
dim3 grid( (resultSize+1/32) , (resultSize+7/8) );
I think the usual calculations there would be:
dim3 grid( (resultSize+31)/32, (resultSize+7)/8 );
I always recommend proper CUDA error checking and running your codes with cuda-memcheck, any time you are having trouble with a CUDA code, to make sure there are no runtime errors.
It also looks to me like this:
if(j >= start && j <= end ){
should be this:
if(j >= start && j < end ){
to match your for-loop range. I am also assuming that size is less than resultSize (again, an MCVE would help).
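Finally, if you want to try the reduction approach mentioned at the top, a rough, untested sketch might look like the following. It assumes TYPE is an integer type, one block of 256 threads per output element, and kernel_res_reduce is an illustrative name, not something from your code:

    __global__ void kernel_res_reduce(TYPE *A, TYPE *B, TYPE *D, TYPE *result, TYPE size, TYPE resultSize){
        __shared__ TYPE sdata[256];              // must match blockDim.x
        int i = blockIdx.x;                      // one block per result element
        if (i <= 0 || i >= resultSize - 1) return;
        TYPE start = (i >= size) ? (i % size) + 1 : 0;
        TYPE end = (i + 1) >> 1;
        TYPE sum = 0;
        // Each thread accumulates a strided slice of the inner loop:
        for (TYPE j = start + threadIdx.x; j < end; j += blockDim.x)
            sum += (A[j] + A[i - j]) * (B[j] + B[i - j]) - (D[j] + D[i - j]);
        sdata[threadIdx.x] = sum;
        __syncthreads();
        // Standard shared-memory tree reduction:
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) result[i] += sdata[0];
    }

    // launched as: kernel_res_reduce<<<resultSize, 256>>>(devA, devB, devD, devResult, size, resultSize);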
I would like to create my own nonlinear filter in OpenCV using C++, and if I understand correctly, I can use the FilterEngine class to do so. Unfortunately, I'm not really able to follow the documentation of this class. (Link: http://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#filterengine)
Could someone be so kind as to explain the class to me in a little more detail?
I'm grateful for every input and every example you can provide me with :-)
My specific needs:
1) I would like to learn how to create my own nonlinear filters in general.
2) I would like to apply a rank-transform filter to my images:
Meaning: I have a kernel/region, and I would like to flag every (neighbourhood) pixel inside that region with a one if its intensity is lower than the intensity of the center pixel. Next, I want to use a simple convolution to sum over the transformed region and store that value at the center pixel. Let's look at a simple example:
100 120 200   rank-trans.   1 0 0   convolution
110 120 220       -->       1 0 0       -->       2
180 200 200                 0 0 0
P.S.: I know that I can achieve the result of 2) by combining 255 threshold operations with 255 box-filter operations and then looping over every pixel to select the correct value. However, that seems quite inefficient to me ...
Code snippet [Edit]:
As I still struggle to understand the FilterEngine(), I started to write my own function for the use case described above. I would also be happy if you could comment on how to improve its efficiency, as it is quite slow at the moment (~2 sec for a 1080x1920 image on one CPU core).
void rankTransform(Mat& out, Mat in, int kernal_size, int borderType) {
    // Issue warning if necessary:
    if (kernal_size >= 17) {
        std::cout << "Warning, need to change Mat-type. Unsigned short only supports kernels up to the size of 15x15" << std::endl << std::endl;
    }

    // First: pad the image so the kernel can be evaluated at the borders
    // (copyMakeBorder allocates the destination itself):
    int border_size = (kernal_size - 1) / 2;
    Mat in_incl_border;
    copyMakeBorder(in, in_incl_border, border_size, border_size, border_size, border_size, borderType);

    // Second: loop through the image, conduct a rank transform and
    // then sum over the kernel window:
    int start_pixel = border_size + 1;
    int end_pixel_width = in.cols + border_size;
    int end_pixel_height = in.rows + border_size;
    int i, j;
    int x_1, x_2, y_1;
    for (i = start_pixel; i < end_pixel_height; ++i) {
        x_1 = i - border_size;       // top row of the kernel window
        x_2 = i + border_size + 1;   // one past the bottom row
        for (j = start_pixel; j < end_pixel_width; ++j) {
            y_1 = j - border_size;   // left column of the kernel window
            // Compare the window against the center pixel (each hit yields 255),
            // sum the hits and divide by 255 to get the rank:
            out.at<unsigned short>(x_1 - 1, y_1 - 1) = static_cast<unsigned short>(
                (sum(in_incl_border(Range(x_1, x_2), Range(y_1, j + border_size + 1))
                     < in_incl_border.at<unsigned short>(i, j))[0]) / 255);
        }
    }
}
I wrote C++ and MATLAB code to test speed. My C++ code is:
int nrow = dim[0], ncol = dim[1];
double tmp, ldot;
for (int k = ncol - 1; k >= 0; --k){
    grad[k] = 0;
    for (int j = nrow - 1; j >= 0; --j){
        tmp = exp(eta[j + nrow * k]);
        ldot = (-Z[j + nrow * k] + tmp / (1 + tmp));
        grad[k] += A[j] * ldot;
    }
}
My MATLAB code is:
prob = exp(eta);
prob = prob./(1+prob);
ldot = prob - Z;
grad=sum(repmat(A,1,nGWAS).*ldot);
I ran each code 100 times; it took over 5 seconds for C++ but only 1.2 seconds for MATLAB.
Can anyone help me here? Thanks.
The folks at MATLAB know very well how to optimize matrix access.
You chose to access it column by column. My initial guess is that the matrix is laid out in memory row by row, which causes your code to run over the whole matrix ncol times, with cache misses all over the place.
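To illustrate the general point (not necessarily your exact layout), in C/C++ a matrix stored row-major is fastest to traverse with the last index in the inner loop; m, nrow and ncol here are illustrative names:

    // m is a row-major nrow x ncol matrix. The inner loop walks
    // consecutive memory (stride 1), so the cache line fetched for
    // m[r * ncol + c] also serves c+1, c+2, ...
    double total = 0.0;
    for (int r = 0; r < nrow; ++r)
        for (int c = 0; c < ncol; ++c)
            total += m[r * ncol + c];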
I want to make an FIR filter using a window function. I have some sample data; the size variable is the count of samples, and the windowSize variable is the size of the window function.
First I create the window function (a Blackman window) in the variable window.
Then I need to multiply it by the sin(x)/x function and convolve it with the real data (variable data):
for (int i = 0; i < size; ++i) {
    for (j = 0; j < windowSize; ++j) {
        double arg = 2.0 * PI * ((double)j - (double)windowSize / 2.0) / (double)windowSize;
        if (i + j - windowSize / 2 < 0)
            continue;
        if (arg == 0) {
            filteredData[i] += data[i + j - windowSize / 2] * window[j] * 1.0 / (double)windowSize;
        } else
            filteredData[i] += data[i + j - windowSize / 2] * window[j] * (sin(arg) / arg) / (double)windowSize;
    }
}
The problem:
As a result I get filtered data whose average is very different from the average of the original data. Where is the mistake?
In the DSP book it is written that in order to make an FIR filter we should multiply the sin(x)/x function by a window function and then perform the convolution, but nothing is written about the x in sin(x)/x, so I used:
double arg = 2.0 * PI * ((double)j - (double)windowSize / 2.0) / (double)windowSize;
for x, the argument of the sine. Is that correct?
The sin(x)/x filter is a lowpass filter. That is, it suppresses all frequencies above a certain cutoff frequency.
If the sampling frequency is Fs (Hertz) and you want a cutoff frequency of fc (Hertz), you should be using x = 2*PI*fc/(2*Fs)*n, where n goes from -N to +N and N is large enough that the sin(x)/x function is close to zero. Don't forget that sin(x)/x is 1 when x is zero.
To maintain the average of the signal you have to normalize the filter coefficients by their sum, i.e. set f_norm[k] = f[k] / sum(f[k], k = ...).
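To make both points concrete, here is a rough, untested sketch of building normalized windowed-sinc coefficients (makeLowpass, fc, Fs and N are illustrative names and parameters, not from your question):

    #include <cmath>
    #include <vector>

    // Build 2N+1 lowpass taps: sinc(x) * Blackman window, then normalize
    // so the taps sum to 1 and the signal's average is preserved.
    std::vector<double> makeLowpass(double fc, double Fs, int N) {
        std::vector<double> h(2 * N + 1);
        double total = 0.0;
        for (int n = -N; n <= N; ++n) {
            double x = 2.0 * M_PI * fc / (2.0 * Fs) * n;
            double sinc = (n == 0) ? 1.0 : std::sin(x) / x;   // sin(x)/x is 1 at x = 0
            // Blackman window over the 2N+1 taps:
            double w = 0.42 - 0.5 * std::cos(2.0 * M_PI * (n + N) / (2.0 * N))
                            + 0.08 * std::cos(4.0 * M_PI * (n + N) / (2.0 * N));
            h[n + N] = sinc * w;
            total += h[n + N];
        }
        for (double &c : h) c /= total;   // normalize by the sum of the taps
        return h;
    }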
That's all I have to say at this point. It seems like you have a lot to learn. I suggest a good book on signal processing.
As far as the implementation is concerned, it looks like you need to initialise filteredData[i], e.g.
for (int i = 0; i < size; ++i) {
    filteredData[i] = 0;
    for (j = 0; j < windowSize; ++j) {
        ...