How to use Matlab's 512 element lookup table array in OpenCV? - c++

I am implementing morphological operations in OpenCV and trying to mimic the remove and bridge operations of Matlab's bwmorph. To do this I read the function definition in bwmorph.m, where I found the lookup table arrays for remove and bridge.
After that step the procedure is the same for both Matlab and OpenCV:
lut(img,lutarray,img)
The problem is that Matlab uses a 512-element (9-bit) lookup table scheme while OpenCV uses a 256-element (8-bit) one. How do I use the Matlab lutarray in OpenCV?
After doing some research I came across this post.
What does the person mean when they're saying that they "split" the image from 0-512 and then into two parts?
Is the above method even correct? Are there any alternatives?

MATLAB's bwlookup(bw,lut) (http://se.mathworks.com/help/images/ref/bwlookup.html) and, internally, applylut both perform a 2-by-2 or 3-by-3 neighborhood operation on a binary (black and white) image, whereas OpenCV's cv::LUT performs a per-pixel gray-level transform (closely related to intlut in MATLAB). An example of the latter is gamma correction on a gray-level image.
//! transforms array of numbers using a lookup table: dst(i)=lut(src(i))
CV_EXPORTS_W void LUT(InputArray src, InputArray lut, OutputArray dst,
int interpolation=0);
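To make the contrast concrete, here is a minimal sketch of a per-pixel lookup of the kind cv::LUT performs, written without OpenCV so it stands alone. The names make_gamma_lut and apply_lut are hypothetical, and the image is assumed to be a flat buffer of 8-bit gray levels. Note that each output value depends only on the input value at the same position, which is why 256 entries are enough.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a 256-entry table: lut[v] = 255 * (v / 255)^gamma, rounded to nearest.
// (Hypothetical helper, not part of OpenCV.)
std::array<std::uint8_t, 256> make_gamma_lut(double gamma)
{
    std::array<std::uint8_t, 256> lut{};
    for (int v = 0; v < 256; ++v) {
        const double scaled = 255.0 * std::pow(v / 255.0, gamma);
        lut[v] = static_cast<std::uint8_t>(std::lround(scaled));
    }
    return lut;
}

// Per-pixel transform dst[i] = lut[src[i]]; no neighbouring pixels are consulted.
std::vector<std::uint8_t> apply_lut(const std::vector<std::uint8_t>& src,
                                    const std::array<std::uint8_t, 256>& lut)
{
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = lut[src[i]];
    return dst;
}
```

A 3-by-3 neighborhood lookup cannot be expressed this way, because its index is built from nine pixels rather than one.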
To my knowledge, there is no neighborhood bwlookup implementation in OpenCV. However, following the description of MATLAB's bwlookup, you can write it yourself.
// performs 3-by-3 lookup on binary image
void bwlookup(
const cv::Mat & in,
cv::Mat & out,
const cv::Mat & lut,
int bordertype=cv::BORDER_CONSTANT,
cv::Scalar px = cv::Scalar(0) )
{
if ( in.type() != CV_8UC1 )
CV_Error(CV_StsError, "input image must be CV_8UC1");
if ( lut.type() != CV_8UC1 || lut.rows*lut.cols!=512 || !lut.isContinuous() )
CV_Error(CV_StsError, "lut must be a continuous CV_8UC1 array with 512 elements" );
if ( out.type() != in.type() || out.size() != in.size() )
out = cv::Mat( in.size(), in.type() );
const unsigned char * _lut = lut.data;
cv::Mat t;
cv::copyMakeBorder( in,t,1,1,1,1,bordertype,px);
const int rows=in.rows+1;
const int cols=in.cols+1;
for ( int y=1;y<rows;++y)
{
for ( int x=1;x<cols;++x)
{
int L = 0;
const int jmax=y+1;
#if 0 // row-major order
for ( int j=y-1, k=1; j<=jmax; ++j, k<<=3 )
{
const unsigned char * p = t.ptr<unsigned char>(j) + x-1;
for ( unsigned int u=0;u<3;++u )
{
if ( p[u] )
L += (k<<u);
#else // column-major order (MATLAB)
for ( int j=y-1, k=1; j<=jmax; ++j, k<<=1 )
{
const unsigned char * p = t.ptr<unsigned char>(j) + x-1;
for ( unsigned int u=0;u<3;++u )
{
if ( p[u] )
L += (k<<3*u);
#endif
}
}
out.at<unsigned char>(y-1,x-1)=_lut[ L ];
}
}
}
I tested it against remove and bridge, so it should work. Hope that helps.
Edit: After checking against a random lookup table,
lut = uint8( rand(512,1)>0.5 ); % #MATLAB
B = bwlookup( A, lut );
I flipped the order the indices appear in the lookup table (doesn't matter if the operation is symmetric).
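The difference between the two index orders in the function above can be sketched in isolation. This is a minimal illustration without OpenCV; Patch3x3 and the two function names are hypothetical. It assumes the same weighting as the code above: in the column-major branch the pixel at row r, column c of the neighborhood contributes bit r + 3*c, and in the row-major branch it contributes bit 3*r + c.

```cpp
#include <array>
#include <cstdint>

// patch[r][c]: r = row (0 = top), c = column (0 = left); nonzero means "on".
using Patch3x3 = std::array<std::array<std::uint8_t, 3>, 3>;

// Column-major weighting, as in the "#else" branch above:
// pixel (r, c) sets bit r + 3*c of the 9-bit index.
int lut_index_column_major(const Patch3x3& patch)
{
    int index = 0;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (patch[r][c])
                index |= 1 << (r + 3 * c);
    return index;
}

// Row-major weighting, as in the "#if 0" branch above:
// pixel (r, c) sets bit 3*r + c of the 9-bit index.
int lut_index_row_major(const Patch3x3& patch)
{
    int index = 0;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (patch[r][c])
                index |= 1 << (3 * r + c);
    return index;
}
```

The two functions agree whenever the neighborhood equals its own transpose, which is why a lookup table for a symmetric operation gives the same result with either order.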

Related

fftw + opencv inconsistent output

I recently tried to implement an FFT function for OpenCV's Mat.
I based my implementation mainly on FFTW's code samples and on:
FFTW-OpenCV
I paid close attention to adapting the size of the input image in order to speed up the processing.
It seems that I did something wrong, because the output is always a black image.
Here is my implementation:
void fft2_32f(const cv::Mat1f& _src, cv::Mat2f& dst)
{
cv::Mat2f src;
const int rows = cv::getOptimalDFTSize(_src.rows);
const int cols = cv::getOptimalDFTSize(_src.cols);
// const int total = cv::alignSize(rows*cols,steps);
if(_src.isContinuous() && _src.rows == rows && _src.cols == cols)
{
src = cv::Mat2f::zeros(_src.size()); // size taken from the input; src itself is still empty here
dst = cv::Mat2f::zeros(src.size());
// 1) copy the source into a complex matrix (the imaginary component is set to 0).
cblas_scopy(src.total(), _src.ptr<float>(), 1, src.ptr<float>(), 2);
// 2) prepare and apply the transform.
fftwf_complex* ptr_in = reinterpret_cast<fftwf_complex*>(src.ptr<float>());
fftwf_complex* ptr_out = reinterpret_cast<fftwf_complex*>(dst.ptr<float>());
// fftwf_plan fft = fftwf_plan_dft_1d(src.total(), ptr_in, ptr_out, FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_plan fft = fftwf_plan_dft_2d(src.rows, src.cols, ptr_in, ptr_out, FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_execute(fft);
fftwf_destroy_plan(fft);
// 3) normalize: scale every component by 1/(rows*cols)
cblas_sscal(dst.rows * dst.step1(), 1.f/dst.total(), dst.ptr<float>(), 1);
}
else
{
src = cv::Mat2f::zeros(rows, cols);
dst = cv::Mat2f::zeros(rows, cols);
// 1) copy the source into a complex matrix (the imaginary component is set to 0).
support::parallel_for(cv::Range(0, _src.rows), [&src, &_src](const cv::Range& range)->void
{
for(int r=range.start; r<range.end; r++)
{
int c=0;
const float* it_src = _src[r];
float* it_dst = src.ptr<float>(r);
#if CV_ENABLE_UNROLLED
for(;c<=_src.cols-4; c+=4, it_src+=4, it_dst+=8)
{
*it_dst = *it_src;
*(it_dst+2) = *(it_src+1);
*(it_dst+4) = *(it_src+2);
*(it_dst+6) = *(it_src+3);
}
#endif
for(; c<_src.cols; c++, it_src++, it_dst+=2)
*it_dst = *it_src;
}
}, 0x80);
// 2) prepare and apply the transform.
fftwf_complex* ptr_in = reinterpret_cast<fftwf_complex*>(src.ptr<float>());
fftwf_complex* ptr_out = reinterpret_cast<fftwf_complex*>(dst.ptr<float>());
fftwf_plan fft = fftwf_plan_dft_2d(src.rows, src.cols, ptr_in, ptr_out, FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_execute(fft);
fftwf_destroy_plan(fft);
// 3) normalize: scale every component by 1/(rows*cols)
cblas_sscal(dst.rows * dst.step1(), 1.f/dst.total(), dst.ptr<float>(), 1);
}
}
Note:
The parallel_for implementation is inspired by: How to use lambda as a parameter to parallel_for_
Thanks in advance for any help.
I figured out my issue.
The function as written works for the purpose I wrote it for.
My issue was in the calling code:
cv::Mat dst = cv::Mat::zeros(src.size(), CV_32FC2);
cv::Mat1f srcw = src;
cv::Mat2f dstw = dst; // shares its data with dst for now
fft2_32f(srcw, dstw); // reallocates dstw to the optimal size for the output, depending on the size of srcw, so dstw is reallocated but dst is not
dst.copyTo(_outputVariable);
In that case the correct information is stored in dstw but not in dst, because of the reallocation inside the function.
So when I tried to visualize my data I got a black image.
The proper call should be:
cv::Mat dst;
cv::Mat1f srcw = src;
cv::Mat2f dstw;
fft2_32f(srcw, dstw); // allocates dstw at the optimal size for the output, depending on the size of srcw
dst = dstw;
dst.copyTo(_outputVariable); // or dstw.copyTo(_outputVariable);
With that code I got the proper output.
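The pitfall can be reproduced without OpenCV. The sketch below is only an analogy: it uses std::shared_ptr as a hypothetical stand-in for a cv::Mat header, on the assumption that copying a header shares the pixel buffer while assigning a new matrix to one header rebinds only that header. Buffer and produce are made-up names.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for a cv::Mat header: copies share one buffer.
using Buffer = std::shared_ptr<std::vector<float>>;

// Like fft2_32f, this rebinds its output argument to a freshly
// allocated buffer instead of writing into the existing one.
void produce(Buffer& out)
{
    out = std::make_shared<std::vector<float>>(4, 1.0f);
}

// Mirrors the faulty call sequence: returns the first value seen through
// the original header, which the callee never touched.
float first_value_seen_by_original_header()
{
    Buffer dst = std::make_shared<std::vector<float>>(4, 0.0f);
    Buffer dstw = dst;   // both headers share the zero-filled buffer
    produce(dstw);       // dstw now points at a new buffer; dst does not
    return (*dst)[0];    // still 0: the "black image"
}
```

Reading the result through the header that was passed to the function, or assigning it back afterwards as in the corrected call, avoids the stale buffer.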
Note: depending on the application, an ROI (see operator()(const cv::Rect&) of OpenCV's Mat container) corresponding to the size of the input may be useful in order to preserve the dimensions.
Thank you for your help :).
Can someone please help me mark this topic as closed?

How to write the expression "img[markers == -1] = [255,0,0]" in C++ OpenCV?

I'm trying to convert the OpenCV Python example here to C++.
I'm stuck in this line:
img[markers == -1] = [255,0,0]
where both img and markers are matrices.
What is the efficient way to write this in C++ OpenCV?
Since I've already written some code to back my comments up, it would be a waste not to write it up.
NB: Testing it on an i7-4930k, with MSVC 2013, OpenCV 3.1, 64bit. Using a randomly generated input image and mask (~9% is set to -1).
As Miki stated, the simplest way to do this in C++ is to use:
cv::MatExpr operator== (const cv::Mat& a, double s) to create a mask
which you then use in cv::Mat::setTo(...)
For example:
void set_where_markers_match(cv::Mat3b img
, cv::Vec3b value
, cv::Mat1i markers
, int32_t target)
{
img.setTo(value, markers == target);
}
Even though this creates an intermediate mask Mat, it's still efficient enough for the vast majority of cases (roughly 2.9 ms per 2^20 pixels).
So what if you feel this is really not good enough and you want to have a shot at writing something faster?
Let's begin with something simple -- iterate rows and columns and use cv::Mat::at.
void set_where_markers_match(cv::Mat3b img
, cv::Vec3b value
, cv::Mat1i markers
, int32_t target)
{
CV_Assert(img.size == markers.size);
for (int32_t r(0); r < img.rows; ++r) {
for (int32_t c(0); c < img.cols; ++c) {
if (markers.at<int32_t>(r, c) == target) {
img.at<cv::Vec3b>(r, c) = value;
}
}
}
}
A little better, ~2.4 ms per iteration.
Let's try using Mat iterators instead.
void set_where_markers_match(cv::Mat3b img
, cv::Vec3b value
, cv::Mat1i markers
, int32_t target)
{
CV_Assert(img.size == markers.size);
cv::Mat3b::iterator it_img(img.begin());
cv::Mat1i::const_iterator it_mark(markers.begin());
cv::Mat1i::const_iterator it_mark_end(markers.end());
for (; it_mark != it_mark_end; ++it_mark, ++it_img) {
if (*it_mark == target) {
*it_img = value;
}
}
}
This doesn't seem to help in my case, ~3.1 ms per iteration.
Time to drop the gloves -- let's use pointers to the pixel data. We've got to be careful and account for discontinuous Mats (e.g. when you have ROI from a larger Mat) -- let's do processing row at a time.
void set_where_markers_match(cv::Mat3b img
, cv::Vec3b value
, cv::Mat1i markers
, int32_t target)
{
CV_Assert(img.size == markers.size);
for (int32_t r(0); r < img.rows; ++r) {
uint8_t* it_img(img.ptr<uint8_t>(r));
int32_t const* it_mark(markers.ptr<int32_t>(r));
int32_t const* it_mark_end(it_mark + markers.cols);
for (; it_mark != it_mark_end; ++it_mark, it_img += 3) {
if (*it_mark == target) {
it_img[0] = value[0];
it_img[1] = value[1];
it_img[2] = value[2];
}
}
}
}
This is a step forward, ~1.9 ms per iteration.
The next easiest step with OpenCV could be parallelizing this -- we can take advantage of cv::parallel_for_. Let's split the work by rows, so we can reuse the previous algorithm.
class ParallelSWMM : public cv::ParallelLoopBody
{
public:
ParallelSWMM(cv::Mat3b& img
, cv::Vec3b value
, cv::Mat1i const& markers
, int32_t target)
: img_(img)
, value_(value)
, markers_(markers)
, target_(target)
{
CV_Assert(img.size == markers.size);
}
virtual void operator()(cv::Range const& range) const
{
for (int32_t r(range.start); r < range.end; ++r) {
uint8_t* it_img(img_.ptr<uint8_t>(r));
int32_t const* it_mark(markers_.ptr<int32_t>(r));
int32_t const* it_mark_end(it_mark + markers_.cols);
for (; it_mark != it_mark_end; ++it_mark, it_img += 3) {
if (*it_mark == target_) {
it_img[0] = value_[0];
it_img[1] = value_[1];
it_img[2] = value_[2];
}
}
}
}
ParallelSWMM& operator=(ParallelSWMM const&)
{
return *this;
};
private:
cv::Mat3b& img_;
cv::Vec3b value_;
cv::Mat1i const& markers_;
int32_t target_;
};
void set_where_markers_match(cv::Mat3b img
, cv::Vec3b value
, cv::Mat1i markers
, int32_t target)
{
ParallelSWMM impl(img, value, markers, target);
cv::parallel_for_(cv::Range(0, img.rows), impl);
}
This one runs at 0.5 ms.
Let's take a step back -- in my case the original approach runs single-threaded. What if we parallelized that? We can just replace the operator() in the above code with the following:
virtual void operator()(cv::Range const& range) const
{
img_.rowRange(range).setTo(value_, markers_.rowRange(range) == target_);
}
That runs at around 0.9 ms.
That seems about it for the reasonable implementations. We could have a shot at vectorizing this, but this is far from trivial (pixels are 3 bytes, we have to deal with alignment, etc.) -- let's not go into that, although it could be a nice exercise for the curious reader. However, since we're around 10 clock cycles per pixel even for the worst approach, there's not much potential for improvement.
Make your pick. In general I'd go with the first approach, and worry about it only once measurements identify this particular operation as a bottleneck.
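For readers who want to see the row-pointer idea without any OpenCV types, here is a minimal sketch on raw buffers. The function name set_where_match_raw and its parameters are hypothetical. It assumes interleaved 3-byte pixels and takes explicit row strides, which play the role that ptr(r) plays above for Mats that are not continuous.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// img:     interleaved 3-byte pixels, img_stride bytes per row (>= 3 * cols)
// markers: one int32 per pixel, marker_stride elements per row (>= cols)
// Every pixel whose marker equals target is overwritten with value;
// padding bytes at the end of each row are left untouched.
void set_where_match_raw(std::vector<std::uint8_t>& img, std::size_t img_stride,
                         const std::vector<std::int32_t>& markers,
                         std::size_t marker_stride,
                         int rows, int cols,
                         const std::array<std::uint8_t, 3>& value,
                         std::int32_t target)
{
    for (int r = 0; r < rows; ++r) {
        std::uint8_t* px = img.data() + r * img_stride;
        const std::int32_t* m = markers.data() + r * marker_stride;
        for (int c = 0; c < cols; ++c, px += 3) {
            if (m[c] == target) {
                px[0] = value[0];
                px[1] = value[1];
                px[2] = value[2];
            }
        }
    }
}
```

Processing one row at a time keeps the inner loop a simple linear scan while still tolerating padded rows.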

Cuda efficient insertion of data into unsorted populated array

I have two arrays within Cuda;
int *main; // unsorted
int *source; // sorted
Part of my algorithm requires that I regularly insert new data into the main array from the source array. If a position within the main array is zero, it is assumed to be empty and can therefore be populated with a value from the source array.
I'm wondering what the most efficient method of doing this is. I've tried a couple of approaches but still think there are some more performance gains to be made here.
Currently I'm using a modified version of a radix sort, to "shuffle" the contents of the main array to the very end of the main array, leaving all zero values at the beginning of the array, making the insertion from source trivial. The sort has been modified to iterate over a single bit, rather than 32 bits, this works with a simple switch on the input;
input[i] = source[i] > 1 ? 1 : 0
Is this already quite an efficient way of doing it? Or would I gain something by using a tactically deployed atomicAdd, such as:
__global__ void find(int *destination, int *indices, const int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if((idx < N)&&(destination[idx] == 0))
{
// atomicAdd returns the previous value, so each thread gets its own slot
int slot = atomicAdd(&count, 1);
if(slot < elements_to_add)
indices[slot] = idx;
}
}
__global__ void insert(int *destination, int *indices, int *source, const int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if((source[idx] > 0)&&(indices[idx] > 0))
{
destination[indices[idx]] = source[idx];
}
}
find<<<G,T>>>(...);
insert<<<G,T>>>(...);
I'm not inserting that many items via the source array at the moment, but that could change in the future.
This feels like it should be a common problem that has been solved before. I'm wondering if the thrust library may help, but having browsed for appropriate functions, it doesn't quite feel right for what I'm trying to accomplish (it doesn't fit very neatly with the code I already have).
Thoughts from experienced Cuda developers appreciated!
You can decouple your finding algorithm, which is a stream compaction procedure, from your insertion, which is a scatter procedure. However, you can also merge the functionality of the two.
Assume srcPtr points to a counter that resides in global memory and is set to zero before the kernel launch.
__global__ void find_and_insert( int* destination, int const* source, int const N, int* srcPtr ) { // Assuming N is the length of the destination buffer and also the length of the source buffer is less than N.
int const idx = blockIdx.x * blockDim.x + threadIdx.x;
// Get the assigned element.
int const dstElem = destination[ idx ];
bool const pred = ( dstElem == 0 );
// Intra-warp binary reduction to count the total number of lanes with empty elements.
int const predBallot = __ballot( pred );
int const intraWarpRed = __popc( predBallot );
// Warp-aggregated atomics to reduce the contention over the srcPtr content.
unsigned int laneID; asm( "mov.u32 %0, %laneid;" : "=r"(laneID) ); //const uint laneID = tidWithinCTA & ( WARP_SIZE - 1 );
int posW;
if( laneID == 0 )
posW = atomicAdd( srcPtr, intraWarpRed );
posW = __shfl( posW, 0 );
// Threads that have found empty elements can fill out their assigned positions from the src. Intra-warp binary prefix sum is used here.
uint laneMask; asm( "mov.u32 %0, %lanemask_lt;" : "=r"(laneMask) ); //const uint laneMask = 0xFFFFFFFF >> ( WARP_SIZE - laneID ) ;
int const positionToRead = posW + __popc( predBallot & laneMask );
if( pred )
destination[ idx ] = source[ positionToRead ];
}
A few things:
This kernel is just a suggestion on how you can do it. Here threads inside the warps collaborate on the task. You can extend the binary reduction and prefix sum over the thread-block.
I wrote this kernel inside the browser and haven't tested it. So be careful.
The whole design is not something new. Similar approaches have been implemented (for example this paper) and is mostly based on the work done by Mark Harris and Michael Garland.
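The indexing logic of the kernel can be checked on the host. The following is a sequential C++ emulation, not CUDA code: each group of 32 consecutive elements plays the role of one warp, a plain counter stands in for the atomicAdd on srcPtr, and the function name fill_empty_slots is hypothetical. The bounds check against the source length is an added assumption that the kernel above does not contain.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count (number of set bits).
static int popcount32(std::uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Fills zero entries of destination with consecutive values from source.
// Returns how many source elements were consumed.
std::size_t fill_empty_slots(std::vector<int>& destination,
                             const std::vector<int>& source)
{
    const std::size_t warp_size = 32;
    std::size_t src_ptr = 0;  // stands in for the global counter *srcPtr
    for (std::size_t base = 0; base < destination.size(); base += warp_size) {
        const std::size_t lanes = std::min(warp_size, destination.size() - base);
        // "Ballot": bit i is set if lane i holds an empty element.
        std::uint32_t ballot = 0;
        for (std::size_t lane = 0; lane < lanes; ++lane)
            if (destination[base + lane] == 0)
                ballot |= 1u << lane;
        // One reservation per warp, as lane 0 does with atomicAdd.
        const std::size_t warp_base = src_ptr;
        src_ptr += popcount32(ballot);
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            if ((ballot >> lane) & 1u) {
                // Bits of all lanes below this one: the binary prefix sum.
                const std::uint32_t lanemask_lt = (1u << lane) - 1u;
                const std::size_t pos = warp_base + popcount32(ballot & lanemask_lt);
                if (pos < source.size())  // added guard, see lead-in
                    destination[base + lane] = source[pos];
            }
        }
    }
    return std::min(src_ptr, source.size());
}
```

On the GPU the order in which warps reserve their ranges is not deterministic, so the mapping from source positions to empty slots may differ between runs; the sequential version above always fills them in ascending order.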

conversion between c++ class and OpenCV matrix operation

I am trying to convert the following c++ line into OpenCV matrix operation (which is also c++):
double myCode::calculate ( int i, int au )
{
double k = 0;
for ( int j = 0; j < N; j ++ )
{
k += fabs(data[i][j] - means[au][j]);
}
return k;
}
I want to define "data" and "means" as openCV matrix type, like:
cv::Mat data ( NUMBER_OF_OBSERVATIONS, N, CV_8UC3 );
cv::Mat means = cv::Mat::zeros ( 5, N, CV_8UC3 );
and then rewrite the function above for these cv::Mat versions of "data" and "means". How can I do that? In particular, I don't know how to write the line:
k += fabs(data[i][j] - means[au][j]);
Thanks a lot.
You can simply write
double myCode::calculate ( int i, int au )
{
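cv::Mat diff;
cv::absdiff(data.row(i), means.row(au), diff); // per-element |data - means| for the two rows
cv::Scalar res = cv::sum(diff); // one sum per channel
return res[0] + res[1] + res[2]; // sum all the channels together
}
Here Mat::row() selects a single row without copying the data, and cv::absdiff is used instead of a plain subtraction because subtracting CV_8U matrices saturates negative differences to zero.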
A simple way to access pixels in OpenCV Mat objects is with the at() method.
If your data type were 1-channel unsigned char (CV_8UC1), you could simply do this:
k += fabs(data.at<uchar>(i,j) - means.at<uchar>(au,j)); //works for CV_8UC1 type
However, you have 3 channels (B, G, R), as dictated by the C3 in your CV_8UC3 datatype. So, here's how to do your k += fabs(...) on each channel individually:
//for CV_8UC3 type
k += fabs(data.at<cv::Vec3b>(i,j)[0] - means.at<cv::Vec3b>(au,j)[0]); // Blue Channel
k += fabs(data.at<cv::Vec3b>(i,j)[1] - means.at<cv::Vec3b>(au,j)[1]); // Green Channel
k += fabs(data.at<cv::Vec3b>(i,j)[2] - means.at<cv::Vec3b>(au,j)[2]); // Red Channel
This post offers further explanation about pixel access.
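As a cross-check of what the per-channel loop computes, here is a minimal sketch of the same sum of absolute differences without OpenCV. Pixel, Row and the free function calculate are hypothetical stand-ins for the question's class members; rows are assumed to hold 3-channel 8-bit pixels.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Pixel = std::array<std::uint8_t, 3>;  // three 8-bit channels
using Row = std::vector<Pixel>;

// Sum over all columns j and channels ch of |data[i][j][ch] - means[au][j][ch]|.
double calculate(const std::vector<Row>& data, const std::vector<Row>& means,
                 int i, int au)
{
    double k = 0.0;
    const Row& d = data[i];
    const Row& m = means[au];
    for (std::size_t j = 0; j < d.size() && j < m.size(); ++j)
        for (int ch = 0; ch < 3; ++ch)
            // Widen to int before subtracting so negative differences survive.
            k += std::abs(static_cast<int>(d[j][ch]) - static_cast<int>(m[j][ch]));
    return k;
}
```

The widening to int matters for the same reason absolute differences need care with 8-bit matrices: an unsigned 8-bit result cannot represent a negative difference.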

Fastest Conversion of Row-Ordered data to Column-Ordered data

I have an IplImage from openCV, which stores its data in a row-ordered format;
image data is stored in a one dimensional array char *data; the element at position x,y is given by
elem(x,y) = data[y*width + x] // see note at end
I would like to convert this image as quickly as possible to and from a second image format that stores its data in column-ordered format; that is
elem(x,y) = data[x*height + y]
Obviously, one way to do this conversion is simply element-by-element through a double for loop.
Is there a faster way?
Note for OpenCV aficionados: the actual location of elem(x,y) is given by data + y*widthstep + x*sizeof(element), but this gives the general idea. For char data sizeof(element) = 1, and we can make widthstep = width, so the formula is exact.
It is called "matrix transposition"
Optimal methods try to minimise the number of cache misses, swapping small tiles
with the size of one or a few cache slots. For a multi-level cache this will get difficult.
start reading here
this one is a bit more advanced
BTW the urls deal with "in place" transposition. Creating a transposed copy will be different (it uses twice as many cache slots, duh!)
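For the out-of-place case, the tiling idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name transpose_tiled and the default tile size of 32 are assumptions, and the best tile size depends on the element size and the cache in question.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Out-of-place conversion from row-ordered to column-ordered storage:
// dst[x * height + y] = src[y * width + x].
// The image is walked in tile x tile blocks so that the reads and the
// writes of one block both stay within a small region of memory.
template <typename T>
std::vector<T> transpose_tiled(const std::vector<T>& src,
                               std::size_t width, std::size_t height,
                               std::size_t tile = 32)
{
    if (tile == 0) tile = 1;  // guard against a zero step
    std::vector<T> dst(src.size());
    for (std::size_t y0 = 0; y0 < height; y0 += tile) {
        const std::size_t y1 = std::min(y0 + tile, height);
        for (std::size_t x0 = 0; x0 < width; x0 += tile) {
            const std::size_t x1 = std::min(x0 + tile, width);
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = x0; x < x1; ++x)
                    dst[x * height + y] = src[y * width + x];
        }
    }
    return dst;
}
```

The result is identical to the plain double loop; only the traversal order changes, so any speed difference should be confirmed by measurement on the target machine.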
Assuming you need a new array that has the elements all moved, the fastest you can manage in algorithmic speed is O(N) on the number of elements (i.e. width * height).
For actual time taken, it is possible to spawn multiple threads where each one copies some of the elements. This is only worthwhile of course if you really do have a lot of them.
If the threads are already created and they accept the tasks in queues, or whatever, this would be most efficient if you are going to process lots of these images.
within your individual "loops" you can avoid doing the same multiplication multiple times, of course, and pointer arithmetic is likely to be a bit faster than random-access.
You've more or less answered the question yourself, but without code. I think you need something like:
#include <stdio.h>
typedef struct
{
unsigned char r;
unsigned char g;
unsigned char b;
}somePixelFormat;
#define HEIGHT 2
#define WIDTH 4
// let's say this is the original image (width=4, height=2) expressed as a one-dimensional
// array of structs that adhere to your pixel format
somePixelFormat src[ WIDTH * HEIGHT ] =
{
{0,0,0}, {1,1,1}, {2,2,2}, {3,3,3},
{4,4,4}, {5,5,5}, {6,6,6}, {7,7,7}
};
somePixelFormat dst[ WIDTH * HEIGHT ];
void printImage( void *img, int width, int height, int pixelByteCount )
{
for ( int row = 0; row < height; row++ )
{
for ( int col = 0; col < width; col++ )
{
printf( "(%02d,%02d,%02d) ", ((somePixelFormat*)img + width * row + col)->r,
((somePixelFormat*)img + width * row + col)->g,
((somePixelFormat*)img + width * row + col)->b );
}
printf ( "\n" );
}
printf("\n\n");
}
void flip( void *dstImg, void *srcImg, int srcWidth, int srcHeight, int pixelByteCount )
{
for ( int row = 0; row < srcHeight; row++ )
{
for ( int col = 0; col < srcWidth; col++ )
{
*((somePixelFormat*)dstImg + srcHeight * col + row) = *((somePixelFormat*)srcImg + srcWidth * row + col);
}
}
}
int main()
{
printImage( src, 4, 2, sizeof(somePixelFormat) );
flip( dst, src, 4, 2, sizeof(somePixelFormat) );
printImage( dst, 2, 4, sizeof(somePixelFormat) );
getchar();
return 0;
}
And here's example output:
(00,00,00) (01,01,01) (02,02,02) (03,03,03)
(04,04,04) (05,05,05) (06,06,06) (07,07,07)
(00,00,00) (04,04,04)
(01,01,01) (05,05,05)
(02,02,02) (06,06,06)
(03,03,03) (07,07,07)