I'm trying to perform very simple (LUT-like) operations on a 16-bit grayscale OpenCV Mat in a way that is efficient and doesn't slow down the debugger.
While there is a very detailed page in the documentation addressing exactly this issue, it fails to point out that most of those methods are only available for 8-bit images (including the otherwise perfect, optimized LUT function).
I tried the following methods:
uchar* p = mat_depth.data;
for (unsigned int i = 0; i < depth_width * depth_height * sizeof(unsigned short); ++i)
{
    *p = ...;
    ++p;
}
Really fast, but unfortunately it only supports uchar (just like LUT).
int i = 0;
for (int row = 0; row < depth_height; row++)
{
    for (int col = 0; col < depth_width; col++)
    {
        i = mat_depth.at<short>(row, col);
        i = ...;
        mat_depth.at<short>(row, col) = i;
    }
}
Adapted from this answer: https://stackoverflow.com/a/27225293/518169. It didn't work for me, and it was very slow.
cv::MatIterator_<ushort> it, end;
for (it = mat_depth.begin<ushort>(), end = mat_depth.end<ushort>(); it != end; ++it)
{
    *it = ...;
}
Works well; however, it uses a lot of CPU and makes the debugger super slow.
This answer https://stackoverflow.com/a/27099697/518169 points to the source code of the built-in LUT function, but it only covers advanced optimization techniques like IPP and OpenCL.
What I'm looking for is a very simple loop like the first snippet, but for ushort.
What method do you recommend for solving this problem? I'm not looking for extreme optimization, just something on par with the performance of the single-for-loop on .data.
I implemented Michael's and Kornel's suggestions and benchmarked both in release and debug modes.
code:
cv::Mat LUT_16(cv::Mat &mat, ushort table[])
{
    int limit = mat.rows * mat.cols;
    ushort* p = mat.ptr<ushort>(0);
    for (int i = 0; i < limit; ++i)
    {
        p[i] = table[p[i]];
    }
    return mat;
}
cv::Mat LUT_16_reinterpret_cast(cv::Mat &mat, ushort table[])
{
    int limit = mat.rows * mat.cols;
    ushort* ptr = reinterpret_cast<ushort*>(mat.data);
    for (int i = 0; i < limit; i++, ptr++)
    {
        *ptr = table[*ptr];
    }
    return mat;
}
cv::Mat LUT_16_if(cv::Mat &mat)
{
    int limit = mat.rows * mat.cols;
    ushort* ptr = reinterpret_cast<ushort*>(mat.data);
    for (int i = 0; i < limit; i++, ptr++)
    {
        if (*ptr == 0)
        {
            *ptr = 65535;
        }
        else
        {
            *ptr *= 100;
        }
    }
    return mat;
}
ushort* tablegen_zero()
{
    static ushort table[65536];
    for (int i = 0; i < 65536; ++i)
    {
        if (i == 0)
        {
            table[i] = 65535;
        }
        else
        {
            table[i] = i;
        }
    }
    return table;
}
The results are the following (release/debug):
LUT_16: 0.202 ms / 0.773 ms
LUT_16_reinterpret_cast: 0.184 ms / 0.801 ms
LUT_16_if: 0.249 ms / 0.860 ms
So the conclusion is that reinterpret_cast is faster by 9% in release mode, while the ptr version is faster by 4% in debug mode.
It's also interesting that directly applying the if logic instead of a LUT only makes it slower by 0.065 ms.
Specs: streaming 640x480x16-bit grayscale image, Visual Studio 2013, i7 4750HQ.
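For reference, a minimal harness along these lines could reproduce the measurement; this is a sketch, not the exact setup used for the numbers above (the test pattern and timing method are my assumptions):

#include <opencv2/opencv.hpp>
#include <chrono>
#include <iostream>

int main()
{
    // Hypothetical 640x480 16-bit depth frame filled with a test value.
    cv::Mat depth(480, 640, CV_16UC1, cv::Scalar(1000));

    ushort* table = tablegen_zero(); // from the listing above

    const auto start = std::chrono::high_resolution_clock::now();
    LUT_16(depth, table);            // from the listing above
    const auto stop = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration<double, std::milli>(stop - start).count()
              << " ms" << std::endl;
    return 0;
}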
The OpenCV implementation is based on polymorphism and runtime dispatching over templates. In OpenCV, the use of templates is limited to a fixed set of primitive data types. That is, array elements should have one of the following types (a short sketch matching type constants to element types follows the list):
8-bit unsigned integer (uchar)
8-bit signed integer (schar)
16-bit unsigned integer (ushort)
16-bit signed integer (short)
32-bit signed integer (int)
32-bit floating-point number (float)
64-bit floating-point number (double)
a tuple of several elements where all elements have the same type (one of the above).
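As a quick illustration (a minimal sketch, not exhaustive), the type constant you create a Mat with determines which element type is valid in the templated accessors:

cv::Mat m8  = cv::Mat::zeros(4, 4, CV_8UC1);   // elements are uchar
cv::Mat m16 = cv::Mat::zeros(4, 4, CV_16UC1);  // elements are ushort
cv::Mat m32 = cv::Mat::zeros(4, 4, CV_32FC1);  // elements are float

// The template argument must match the Mat's element type:
uchar  a = m8.at<uchar>(0, 0);
ushort b = m16.at<ushort>(0, 0);
float  c = m32.at<float>(0, 0);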
If your cv::Mat is continuous, you can use pointer arithmetic to walk the whole data pointer; just make sure the pointer type matches your cv::Mat's element type. Furthermore, keep in mind that cv::Mats are not always continuous (a Mat can be a ROI, padded, or created from a pixel pointer), and iterating over such a Mat with raw pointers will read the padding and produce wrong results or crash.
An example loop:
cv::Mat cvmat16sc1 = cv::Mat::eye(10, 10, CV_16SC1);
if (cvmat16sc1.data)
{
    if (!cvmat16sc1.isContinuous())
    {
        cvmat16sc1 = cvmat16sc1.clone();
    }
    short* ptr = reinterpret_cast<short*>(cvmat16sc1.data);
    for (int i = 0; i < cvmat16sc1.cols * cvmat16sc1.rows; i++, ptr++)
    {
        if (*ptr == 1)
            std::cout << i << ": " << *ptr << std::endl;
    }
}
The best solution for your problem is already written in the tutorial you mentioned, in the chapter named "The efficient way". All you need is to replace every instance of uchar with ushort; no other changes are needed.
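For instance, a sketch of what that substitution yields (assuming a single-channel CV_16U Mat and a 65536-entry table; the function name is mine):

cv::Mat& ScanImageAndReduce16(cv::Mat& I, const ushort* const table)
{
    CV_Assert(I.depth() == CV_16U);
    int nRows = I.rows;
    int nCols = I.cols * I.channels();
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    for (int i = 0; i < nRows; ++i)
    {
        ushort* p = I.ptr<ushort>(i);
        for (int j = 0; j < nCols; ++j)
            p[j] = table[p[j]];
    }
    return I;
}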
Related
I am learning C++ at the moment and currently I am experimenting with pointers and structures. In the following code, I copy vector A into a buffer of size 100 bytes. Then I copy vector B into the same buffer with an offset, so that the two vectors sit right next to each other in the buffer. Afterwards, I want to find the vectors in the buffer again and calculate the dot product between them.
#include <iostream>

const short SIZE = 5;

typedef struct vector {
    float vals[SIZE];
} vector;

void vector_copy (vector* v, vector* target) {
    for (int i=0; i<SIZE; i++) {
        target->vals[i] = v->vals[i];
    }
}

float buffered_vector_product (char buffer[]) {
    float scalar_product = 0;
    int offset = SIZE * 4;
    for (int i=0; i<SIZE; i=i+4) {
        scalar_product += buffer[i] * buffer[i+offset];
    }
    return scalar_product;
}

int main() {
    char buffer[100] = {};
    vector A = {{1, 1.5, 2, 2.5, 3}};
    vector B = {{0.5, -1, 1.5, -2, 2.5}};
    vector_copy(&A, (vector*) buffer);
    vector_copy(&B, (vector*) (buffer + sizeof(vector)));
    float prod = buffered_vector_product(buffer);
    std::cout << prod << std::endl;
    return 0;
}
Unfortunately this doesn't work yet. The problem lies within the function buffered_vector_product. I am unable to get the float values back from the buffer. Each float value should occupy 4 bytes, but I don't know how to access these 4 bytes and convert them back into a float value. Can anyone help me out? Thanks a lot!
In the function buffered_vector_product, change the lines
int offset = SIZE * 4;
for (int i=0; i<SIZE; i=i+4) {
    scalar_product += buffer[i] * buffer[i+offset];
}
to
for ( int i=0; i<SIZE; i++ ) {
    scalar_product += ((float*)buffer)[i] * ((float*)buffer)[i+SIZE];
}
If you want to calculate the offsets manually, you can instead replace it with the following:
size_t offset = SIZE * sizeof(float);
for ( int i=0; i<SIZE; i++ ) {
    scalar_product += *(float*)(buffer+i*sizeof(float)) * *(float*)(buffer+i*sizeof(float)+offset);
}
However, with both solutions, you should beware of both the alignment restrictions and the strict aliasing rule.
The problem with the alignment restrictions can be solved by changing the line
char buffer[100] = {};
to the following:
alignas(float) char buffer[100] = {};
The strict aliasing rule is a much more complex issue, because the exact rule has changed significantly between C++ standards and is (or at least was) different from the strict aliasing rule in C. See the link in the comments section for further information on this issue.
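If you would rather avoid casts entirely, a sketch of the same function using memcpy (which compilers optimize to plain loads for small fixed sizes) could look like this:

#include <cstring> // std::memcpy

float buffered_vector_product(const char buffer[]) {
    float scalar_product = 0;
    for (int i = 0; i < SIZE; i++) {
        float a, b;
        // memcpy sidesteps both alignment and strict aliasing issues
        std::memcpy(&a, buffer + i * sizeof(float), sizeof(float));
        std::memcpy(&b, buffer + (i + SIZE) * sizeof(float), sizeof(float));
        scalar_product += a * b;
    }
    return scalar_product;
}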
I need to read a binary file which is made of many basic types such as int, double, UTF8 strings, etc. For instance, think about one file containing n pairs of (int, double) one after the other, without any alignment with n being in the order of tens of millions. I need to get very fast access to that file. I read the file using fread calls and my own buffer which is about 16 kB long.
A profiler shows that my main bottleneck is copying from the memory buffer to its final destination. The most obvious way to write a function that copies from the buffer to a double would be:
// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
void f0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (int i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}
If I use the following code instead, I get a huge speedup on x86-64:
void f1(double* x, const unsigned char* p) {
    const double* r = reinterpret_cast<const double*>(p);
    *x = *r;
}
But, as I understand it, the program could crash on ARM if p is not 8-byte aligned.
Here are my questions:
Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it to be as fast as possible?
Here is a small benchmark to test on your machine
#include <chrono>
#include <iostream>

void copy_int_0(int* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 4; ++i) {
        q[i] = p[i];
    }
}

void copy_double_0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}

void copy_int_1(int* x, const unsigned char* p) {
    *x = *reinterpret_cast<const int*>(p);
}

void copy_double_1(double* x, const unsigned char* p) {
    *x = *reinterpret_cast<const double*>(p);
}

int main() {
    const std::size_t n = 10000000;
    const std::size_t nb_times = 200;
    unsigned char* p = new unsigned char[12 * n];
    for (std::size_t i = 0; i < 12 * n; ++i) {
        p[i] = 0;
    }
    int* q0 = new int[n];
    for (std::size_t i = 0; i < n; ++i) {
        q0[i] = 0;
    }
    double* q1 = new double[n];
    for (std::size_t i = 0; i < n; ++i) {
        q1[i] = 0.0;
    }

    const auto begin_0 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_0(q0 + i, p + 12 * i);
            copy_double_0(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_0 = std::chrono::high_resolution_clock::now();
    const double time_0 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
            .count();
    std::cout << "Time 0: " << time_0 << " s" << std::endl;

    const auto begin_1 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_1(q0 + i, p + 12 * i);
            copy_double_1(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_1 = std::chrono::high_resolution_clock::now();
    const double time_1 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
            .count();
    std::cout << "Time 1: " << time_1 << " s" << std::endl;

    std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;
    delete[] q1;
    delete[] q0;
    delete[] p;
    return 0;
}
The results I get are
clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s
g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s
icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s
I have not checked the assembly yet, but I guess the Intel compiler is fooling my benchmark here.
Is the second program guaranteed to work on both x86 and x86-64?
No.
When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).
A lot of x86 instructions are safe to use on unaligned data, but not all of them. Specifically, there are SIMD instructions that require proper alignment, and your compiler is free to use them.
This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.
In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.
How would you write such a function on ARM if you need it as fast as you can?
The answer isn't really ARM-specific. After the LZ4 incident, Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which will generate optimal code with every compiler on every architecture.
Using memcpy() is the safest option. If the size is known at compile time, the compiler will generally optimize the memcpy() call away entirely. For larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of calling memcpy().
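For instance, the copy function from the question could be written like this (a sketch; the function name is mine):

#include <cstring> // std::memcpy

// On x86-64, compilers typically reduce this fixed-size memcpy to a
// single (possibly unaligned) load and store; no function call remains.
void copy_double_memcpy(double* x, const unsigned char* p) {
    std::memcpy(x, p, sizeof(double));
}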
If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().
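A sketch of the union approach, modeled on what LZ4 does; the packed attribute is a GCC/Clang extension, so this assumes that toolchain and is not portable C++:

// GCC/Clang only: 'packed' tells the compiler this object may live at
// any address, so it must emit unaligned-safe loads.
typedef union {
    int    i;
    double d;
} __attribute__((packed)) unalign;

void copy_double_union(double* x, const unsigned char* p) {
    *x = reinterpret_cast<const unalign*>(p)->d;
}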
FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.
I'm trying to convert an array of 8 bits into a number from 0-255 by adding values depending on the position in the field.
If I use
int array[8] = {
    0,1,1,0,0,0,0,1
};
int *p = array;
int i;
for (i = 0; i < 8; i++) {
    if (p[i] != 0) {
        a = pow(2, i);
        printf("%i\n", a);
    }
}
I get:
2
4
128
as results, which would be right so far.
But if I use
int array[8] = {
    0,1,1,0,0,0,0,1
};
int *p = array;
int i;
for (i = 0; i < 8; i++) {
    if (p[i] != 0) {
        a = a + pow(2, i);
        printf("%i\n", a);
    }
}
I instead get:
2686758
2686762
2686890
when I expect:
134
What am I doing wrong?
You have not initialised a to 0.
The following should work
int array[8] = {
    0,1,1,0,0,0,0,1
};
int *p = array;
int i;
int a = 0; // << Initialise a
for (i = 0; i < 8; i++) {
    if (p[i] != 0) {
        a = a + pow(2, i);
        printf("%i\n", a);
    }
}
You always need to provide an initial value for your variables. Otherwise, you can expect them to start with ANY value.
In the second piece of code you accumulate a = a + pow(2,i), and the first time a is read it still contains some indeterminate value.
The problem is in the statement a = a + pow(2,i):
a has an indeterminate value because it has not been initialized, and you are using it in an arithmetic operation.
pow is intended to compute the exponential function of two real (not integer) values, so you could use it to compute, say, π^(3/2). It is really not ideal for computing integer powers of 2. Much simpler and faster (though possibly less readable until you get used to it) is to write 2^i as (1UL << i). However, in this particular case you don't need either of those. You could just do the following:
int a = 0;
for (int index = 0, value = 1; index < 8; ++index, value *= 2)
    if (p[index]) a += value;
or even more directly
int a = 0;
for (int value = 1, *p = array; value < 256; value *= 2, ++p)
    if (*p) a += value;
(As has been mentioned, the problem in your original was not actually the use of pow but rather the absence of the initialization int a = 0.)
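For completeness, a sketch of the shift form mentioned above:

int a = 0;
for (int i = 0; i < 8; ++i)
    if (p[i]) a += (1 << i);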
Alright! I fixed it by adding = 0 to the int a; initialization.
Thanks!
I am trying to make a fast image threshold function. Currently what I do is:
void threshold(const cv::Mat &input, cv::Mat &output, uchar threshold) {
    int rows = input.rows;
    int cols = input.cols;
    // cv::Mat for result
    output.create(rows, cols, CV_8U);
    if (input.isContinuous()) { // we have to make sure that we are dealing with a continuous memory chunk
        const uchar* p;
        for (int r = 0; r < rows; ++r) {
            p = input.ptr<uchar>(r);
            for (int c = 0; c < cols; ++c) {
                if (p[c] >= threshold)
                    // how to access output faster??
                    output.at<uchar>(r, c) = 255;
                else
                    output.at<uchar>(r, c) = 0;
            }
        }
    }
}
I know that the at() function is quite slow. How can I set the output faster, or in other words how to relate the pointer which I get from the input to the output?
You are thinking of at as the C++ standard library documents it for some containers: performing a range check and throwing if out of bounds. However, this is not the standard library; this is OpenCV.
According to the cv::Mat::at documentation:
The template methods return a reference to the specified array element. For the sake of higher performance, the index range checks are only performed in the Debug configuration.
So there's no range check as you may be thinking.
Comparing both cv::Mat::at and cv::Mat::ptr in the source code we can see they are almost identical.
So cv::Mat::ptr<>(row) is as expensive as
return (_Tp*)(data + step.p[0] * y);
While cv::Mat::at<>(row, column) is as expensive as:
return ((_Tp*)(data + step.p[0] * i0))[i1];
You might want to grab the row pointer with cv::Mat::ptr once per row instead of calling cv::Mat::at for every column, avoiding the repeated data + step.p[0] * i0 computation and doing the [i1] indexing yourself.
So you would do:
/* output.create and stuff */
const uchar* p, o;
for (int r = 0; r < rows; ++r) {
p = input.ptr<uchar>(r);
o = output.ptr<uchar>(r); // <-----
for (int c = 0; c < cols; ++c) {
if(p[c] >= threshold)
o[c] = 255;
else
o[c] = 0;
}
}
As a side note, you don't need to (and shouldn't) check cv::Mat::isContinuous here: the gaps are between rows, and since you take a pointer to a single row at a time, you never touch the gaps.
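As another aside, OpenCV ships a vectorized binary threshold that does this in one call; note that cv::threshold uses a strict greater-than comparison where your loop uses >=:

// Built-in equivalent; pixels strictly greater than thresh become 255,
// so pass (threshold - 1) to reproduce the >= semantics of the loop above.
cv::threshold(input, output, threshold - 1, 255, cv::THRESH_BINARY);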
How would I be able to cycle through an image using opencv as if it were a 2d array to get the rgb values of each pixel? Also, would a mat be preferable over an iplimage for this operation?
cv::Mat is preferred over IplImage because it simplifies your code:
cv::Mat img = cv::imread("lenna.png");
for (int i = 0; i < img.rows; i++)
    for (int j = 0; j < img.cols; j++)
        // You can now access the pixel value with cv::Vec3b
        // (cast to int so the uchar values print as numbers, not characters)
        std::cout << (int)img.at<cv::Vec3b>(i,j)[0] << " "
                  << (int)img.at<cv::Vec3b>(i,j)[1] << " "
                  << (int)img.at<cv::Vec3b>(i,j)[2] << std::endl;
This assumes that you need to use the RGB values together. If you don't, you can use cv::split to get each channel separately. See etarion's answer for the link with an example.
Also, in many cases you simply need the image in grayscale. Then you can load the image in grayscale and access it as an array of uchar.
cv::Mat img = cv::imread("lenna.png", 0);
for (int i = 0; i < img.rows; i++)
    for (int j = 0; j < img.cols; j++)
        std::cout << (int)img.at<uchar>(i,j) << std::endl;
UPDATE: Using split to get the 3 channels
cv::Mat img = cv::imread("lenna.png");
std::vector<cv::Mat> three_channels;
cv::split(img, three_channels); // cv::split fills the vector; it has no return value
// Now I can access each channel separately
for (int i = 0; i < img.rows; i++)
    for (int j = 0; j < img.cols; j++)
        std::cout << (int)three_channels[0].at<uchar>(i,j) << " "
                  << (int)three_channels[1].at<uchar>(i,j) << " "
                  << (int)three_channels[2].at<uchar>(i,j) << std::endl;
UPDATE: Thanks to entarion for spotting the error I introduced when copying and pasting from the cv::Vec3b example.
Since OpenCV 3.0, there is an official (and fast) way to run a function over all pixels of a cv::Mat:
void cv::Mat::forEach (const Functor& operation)
If you use this function, the operation runs on multiple cores automatically.
Disclosure: I'm a contributor to this feature.
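For example, a minimal sketch inverting an 8-bit grayscale image in parallel (the file name is just a placeholder):

cv::Mat img = cv::imread("lenna.png", cv::IMREAD_GRAYSCALE);
img.forEach<uchar>([](uchar& pixel, const int* position) {
    // position[0] is the row index, position[1] is the column index
    pixel = 255 - pixel;
});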
If you use C++, use the C++ interface of OpenCV; then you can access the elements via http://docs.opencv.org/2.4/doc/tutorials/core/how_to_scan_images/how_to_scan_images.html#the-efficient-way or using cv::Mat::at(), for example.
This is an old question but needs an update, since OpenCV is being actively developed. OpenCV has since introduced parallel_for_, which works with C++11 lambda functions. Here is an example:
parallel_for_(Range(0, img.rows * img.cols), [&](const Range& range) {
    for (int r = range.start; r < range.end; r++)
    {
        int i = r / img.cols;
        int j = r % img.cols;
        img.ptr<uchar>(i)[j] = doSomethingWithPixel(img.at<uchar>(i, j));
    }
});
It is worth mentioning that this method takes advantage of the CPU cores in modern computer architectures.
Since OpenCV 3.3 (see changelog) it is also possible to use C++11 style for loops:
// Example 1
Mat_<Vec3b> img = imread("lena.jpg");
for (auto& pixel : img) {
    pixel[0] = gamma_lut[pixel[0]];
    pixel[1] = gamma_lut[pixel[1]];
    pixel[2] = gamma_lut[pixel[2]];
}

// Example 2
Mat_<float> img2 = imread("float_image.exr", cv::IMREAD_UNCHANGED);
for (auto& p : img2) p *= 2;
The docs show a well written comparison of different ways to iterate over a Mat image here.
The fastest way is to use C style pointers. Here is the code copied from the docs:
Mat& ScanImageAndReduceC(Mat& I, const uchar* const table)
{
    // accept only 8-bit matrices (the docs' original check,
    // I.depth() != sizeof(uchar), is a known typo that never fires)
    CV_Assert(I.depth() == CV_8U);
    int channels = I.channels();
    int nRows = I.rows;
    int nCols = I.cols * channels;
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    int i, j;
    uchar* p;
    for (i = 0; i < nRows; ++i)
    {
        p = I.ptr<uchar>(i);
        for (j = 0; j < nCols; ++j)
        {
            p[j] = table[p[j]];
        }
    }
    return I;
}
Accessing the elements with at() is quite slow.
Note that if your operation can be performed using a lookup table, the built-in function LUT is by far the fastest (also described in the docs).
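For instance, a minimal sketch of using the built-in LUT (8-bit inputs only):

// Build a 256-entry table (here: inversion) and apply it in one call.
cv::Mat lut(1, 256, CV_8U);
for (int i = 0; i < 256; ++i)
    lut.at<uchar>(i) = static_cast<uchar>(255 - i);
cv::Mat result;
cv::LUT(I, lut, result); // I is the 8-bit Mat from the listing above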
If you want to modify RGB pixels one by one, the example below will help!
void LoopPixels(cv::Mat &img) {
    // Accept only char type matrices
    CV_Assert(img.depth() == CV_8U);
    // Get the channel count (3 = rgb, 4 = rgba, etc.)
    const int channels = img.channels();
    switch (channels) {
    case 1:
    {
        // Single colour
        cv::MatIterator_<uchar> it, end;
        for (it = img.begin<uchar>(), end = img.end<uchar>(); it != end; ++it)
            *it = 255;
        break;
    }
    case 3:
    {
        // RGB Color
        cv::MatIterator_<cv::Vec3b> it, end;
        for (it = img.begin<cv::Vec3b>(), end = img.end<cv::Vec3b>(); it != end; ++it) {
            uchar &r = (*it)[2];
            uchar &g = (*it)[1];
            uchar &b = (*it)[0];
            // Modify r, g, b values
            // E.g. r = 255; g = 0; b = 0;
        }
        break;
    }
    }
}