I am using OpenCV 3.2 on C++ and I cannot find a way to do the following task:
Suppose I have 10 pointers to double arrays, say row_0,...,row_9, where each array contains 20 elements. I want to create a cv::Mat object having 10 rows and 20 columns such that its 0th row starts at address row_0, 1st row starts at address row_1 and so on. In other words I already have each row stored contiguously in the memory (however the entire block of 10 rows may not be contiguous) and I want to 'gather' them into a Mat object. How can I accomplish this?
Of course I can declare a 10*20 array, copy the rows successively into it and then call the cv::Mat(int rows, int cols, int type, void *data) constructor, but this requires unnecessary copying of the data. The matrix I need is actually much bigger than 10x20. Moreover, I need to do this many times in my application, so copying would make my program slow.
My goal is to create an array of vectors (with a capacity of 10 integers for each vector) on the heap. I can create my array easily enough:
vector<int>* test = new vector<int>[NUM_VERTS];
However, this creates an array of empty vectors. I know each vector will store at least 10 ints and so I want to create the vectors with size 10 to start with to avoid having them re-size themselves multiple times (I'm reading in a big file and so efficiency is important). How can I modify the statement above so that it does what I want?
As a side note, I'm using VS 2013 and struggling with its debugger. When I run the debugger and look at the contents of test above, it shows me the memory address of the area it points to but not the contents stored at the address. Does anyone know how I can view the contents instead of the address?
Thanks
PS I'm creating the array of vectors on the heap instead of the stack because the array is extremely large (just under a million entries). When I tried to create it on the stack I got a stack overflow error.
You can create a vector of vectors that is not the same as an array but for your use case it should be equivalent:
std::vector< std::vector<int> > test(NUM_VERTS, std::vector<int>(10));
there's no need to allocate it with new because vector elements are already on the heap.
If you need the pointer to the first contained vector you can just use
std::vector<int> *p = &test[0];
and use p as if it were a heap-allocated array of vectors (all the elements of a vector are guaranteed to be consecutive in memory).
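The question asks for a capacity of 10, while the constructor above gives each inner vector a size of 10 (ten zero-initialised elements). A minimal sketch of both variants follows; the helper names make_sized and make_reserved are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `count` inner vectors, each already holding
// `initial_size` zero-initialised ints (size == initial_size).
std::vector<std::vector<int>> make_sized(std::size_t count, std::size_t initial_size) {
    assert(count > 0);
    return std::vector<std::vector<int>>(count, std::vector<int>(initial_size));
}

// Hypothetical helper: `count` empty inner vectors whose storage is
// pre-allocated, so the first `capacity` push_back calls do not reallocate.
std::vector<std::vector<int>> make_reserved(std::size_t count, std::size_t capacity) {
    assert(count > 0);
    std::vector<std::vector<int>> result(count);
    for (auto& inner : result) {
        inner.reserve(capacity);
    }
    return result;
}
```

With make_sized you overwrite elements by index (test[i][j] = value); with make_reserved you append with push_back and size() reflects how many values were actually read.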
I have a function that saves data in a 2D array. The problem is that the array is sent over the internet 50 times a second, so I want the array to be as small as possible, with no waste. So, let's say that my program manages to update 3 times in this 50th of a second: I want the second dimension of the array to be exactly 3. If it updates 10 times, it'll be 10; if just 1, it'll be 1. I am trying to achieve this by increasing the second dimension size by 1 before every update.
char* a[254];
/* every tick */
for (int i=0; i<254; i++)
{
a[i]=new char[counter]; //counter is incremented by one every update
}
Would this work, or would it just move the bounds of the array and mess up the already existing data?
EDIT: I'll try to explain a little better. If I have a [2][2] array, the third position is [2][1]. Let's say the whole array is 0 except for [2][1], which is 1. Now, if I increment the second dimension with new, will the 1 be moved to the new [2][1] position, or will it remain in the third position, which is now [1][3]?
No, this would not work: every "new" call allocates fresh memory without copying the already existing data. Moreover, a memory leak will occur, because the previously allocated block is never deleted.
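One way to get the growing behaviour the question asks for, sketched under the assumption that std::vector is acceptable in place of the raw char* array: push_back extends a row by one slot and keeps the values already stored. The names Table, make_table and grow_by_one are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the asker's `char* a[254]`.
using Table = std::vector<std::vector<char>>;

// Creates `rows` empty rows.
Table make_table(std::size_t rows) {
    assert(rows > 0);
    return Table(rows);
}

// Called once per update: extends every row by one zero-initialised slot.
// Values written before the call stay at the same [row][column] index.
void grow_by_one(Table& table) {
    for (auto& row : table) {
        row.push_back(0);
    }
}
```

At the end of each tick the rows can be sent and then emptied with clear(), which keeps the allocated capacity for the next tick.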
I have quite a good understanding about how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially concerning pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good for providing a couple examples on how to use them but it assumes that I am familiar with the notion of padding and pitch, which I am not.
I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:
What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?
Here is an explanation about pitched pointer and padding in CUDA.
Linear memory vs padded memory
First, let's start with the reason non-linear memory exists. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we have a contiguous memory chunk of the size specified and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:
float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));
and then access ith element of myVector by classic indexing:
float element = myVector[i];
and if we want to access the next element, we just do:
float next_element = myVector[i+1];
This works well because accessing an element right next to the first one is (for reasons I am not aware of and don't wish to be for now) cheap.
Things become a little different when we use our memory as a 2D array. Let's say our 10000-float vector is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:
float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // one row; 100 is the number of columns
for (int j=0; j<100; ++j)
row[j] = myArray[i*100+j];
Word alignment
So we have to read memory from myArray+100*i to myArray+100*(i+1)-1. The number of memory access operations this takes depends on the number of memory words the row spans. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the start of a word, hence we must pad the memory of every row up to the start of a new word.
Bank conflicts
Another reason for padding arrays is the bank mechanism in CUDA, which concerns shared memory access. When the array is in shared memory, it is split across several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same memory bank. Since we usually want to treat each row in parallel, we can ensure that rows can be accessed simultaneously by padding each row to the start of a new bank.
Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:
size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes by height
Note that the pitch here is an output of the function, written through the pointer we pass in: cudaMallocPitch checks what it should be on your system and stores the appropriate value. What cudaMallocPitch does is the following:
Allocate the first row.
Check if the number of bytes allocated makes it correctly aligned. For example that it is a multiple of 128.
If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra bytes (padding bytes).
Reiterate for each row.
At the end, we have typically allocated more memory than necessary because each row now has the size of pitch, and not the size of 100*sizeof(float).
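The rounding described in the steps above can be sketched in plain host C++. This is only an illustration of the arithmetic: the real pitch is chosen by cudaMallocPitch, the helper name padded_pitch is hypothetical, and the 128-byte alignment is just the example value used above.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a row width in bytes up to the next multiple of `alignment`.
// Illustrative only: the actual pitch comes from cudaMallocPitch.
std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}
```

For a row of 100 floats of 4 bytes each (400 bytes) and an alignment of 128, this gives 512 bytes per row, so 112 bytes of every row would be padding.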
But now, when we want to access an element in a column, we must do:
float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];
The offset in bytes between two successive rows can no longer be deduced from the size of our array, which is why we want to keep the pitch returned by cudaMallocPitch. And since pitch is a multiple of the padding size (typically, the larger of the word size and the bank size), it works great. Yay.
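The same pointer arithmetic can be tried on the host without a GPU. The sketch below assumes a pitch that is a multiple of sizeof(float) and uses an ordinary std::vector<float> as a stand-in for device memory; make_pitched and row_ptr are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host stand-in for a pitched allocation: `rows` rows of `pitch_bytes` each.
std::vector<float> make_pitched(std::size_t pitch_bytes, std::size_t rows) {
    assert(pitch_bytes % sizeof(float) == 0);
    return std::vector<float>(pitch_bytes / sizeof(float) * rows, 0.0f);
}

// Pointer to the first element of `row`: step in bytes, then view as float.
float* row_ptr(float* base, std::size_t pitch_bytes, std::size_t row) {
    assert(pitch_bytes % sizeof(float) == 0);
    return reinterpret_cast<float*>(reinterpret_cast<char*>(base) + row * pitch_bytes);
}
```

The cast to char* matters because the pitch is expressed in bytes; adding row * pitch directly to a float* would step four times too far on a platform with 4-byte floats.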
Copying data to/from pitched memory
Now that we know how to create an array with cudaMallocPitch and access a single element in it, we might want to copy all or part of it to and from other memory, linear or not.
Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:
float* host_memory = (float*)malloc(100*100*sizeof(float));
If we use a single cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. What we must do to avoid copying the padding is copy each row one by one. We can do it manually:
for (size_t i=0; i<100; ++i) {
cudaMemcpy(host_memory + i*100, (char*)myArray + pitch*i, // pitch is in bytes
100*sizeof(float), cudaMemcpyDeviceToHost);
}
Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:
cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
myArray, pitch/*CUDA pitch*/,
100*sizeof(float)/*width in bytes*/, 100/*height*/,
cudaMemcpyDeviceToHost);
Now the copy will be done automatically. It will copy the number of bytes specified in width (here: 100*sizeof(float)), height times (here: 100), advancing by pitch bytes in the source every time it moves to the next row. Note that we must still provide the pitch for the destination memory because it could be padded, too. Here it is not, so the pitch is equal to the pitch of a non-padded array: it is the size of a row in bytes. Note also that the width parameter in the memcpy function is expressed in bytes, but the height parameter is expressed in number of rows. That is because of the way the copy is done, much like the manual copy I wrote above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be performed. (These inconsistencies in units, as a physicist, annoy me very much.)
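To make the role of the two pitches concrete, here is a host-only emulation of that row-by-row copy in plain C++. It is not the CUDA implementation: copy_2d is a hypothetical function whose parameter names merely mirror those of cudaMemcpy2D, and it uses std::memcpy on ordinary memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `height` rows of `width_bytes` each. Rows start every `spitch`
// bytes in the source and every `dpitch` bytes in the destination, so any
// padding at the end of a source row is skipped.
void copy_2d(void* dst, std::size_t dpitch,
             const void* src, std::size_t spitch,
             std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width_bytes);
    }
}
```

For example, a source with 3 useful floats per row and one padding float (spitch of 4*sizeof(float)) can be packed into a dense destination by passing a dpitch of 3*sizeof(float).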
Dealing with 3D arrays
3D arrays are actually no different from 2D arrays: there is no additional padding involved. A 3D array is just a classical 2D array of padded rows. That is why, when allocating a 3D array, you only get one pitch, which is the difference in bytes between two successive rows. If you want to step between successive points along the depth dimension, you can safely multiply the pitch by the number of rows in a slice (the height), which gives you the slicePitch.
The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:
When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use one of the other copy functions provided by the CUDA API that take the pitch into account.
When you copy your data to/from linear memory, you must provide a pitch to your pointer even though it is irrelevant: this pitch is the size of a row, expressed in bytes.
The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimension.
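The addressing implied by these points can be written down as a small host-side function. This is a sketch of the arithmetic only, assuming a slice pitch of pitch times the number of rows in a slice; offset_3d and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of element (x, y, z) in a pitched 3D allocation:
// x counts elements along a row, y counts rows, z counts slices.
std::size_t offset_3d(std::size_t x, std::size_t y, std::size_t z,
                      std::size_t pitch, std::size_t height,
                      std::size_t elem_size) {
    assert(elem_size > 0);
    assert((x + 1) * elem_size <= pitch);  // element lies inside its row
    assert(y < height);                    // row lies inside its slice
    const std::size_t slice_pitch = pitch * height;  // bytes per 2D slice
    return z * slice_pitch + y * pitch + x * elem_size;
}
```

Adding this offset to the base pointer viewed as char* and casting the result back to the element type gives the address of the element, exactly as in the 2D case.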
I have the following situation solved with a vector, but one of my older colleagues told me in a discussion that it would be much faster with an array.
I calculate lots (and I mean lots!) of 12-dimensional vectors from lots of audio files and have to store them for processing. I really need all those vectors before I can start my calculation. However, I cannot predict how many audio files there will be, nor how many vectors are extracted from each one. Therefore I need a structure that holds the vectors dynamically.
Therefore I create a new double array for each vector and push it to a vector.
I now want to test whether my colleague is really right that the calculation can be sped up by also using an array instead of a vector for storage.
vector<double*>* Features = new vector<double*>();
double* feature = new double[12];
// adding elements
Features->push_back(feature);
As far as I know, to create a 2D array dynamically I need to know the number of rows.
double** container = new double*[rows];
container[0] = new double[12];
// and so on..
I only know the number of rows after processing all the audio files, and I don't want to process the audio twice.
Does anyone have an idea how to solve this and append to the array, or is it just not possible that way, so that I should either use vector or create my own structure (which I assume may be slower than vector)?
Unless you have any strong reasons not to, I would suggest something like this:
std::vector<std::array<double, 12>> Features;
You get all the memory locality you could want, and all of the automagic memory management you need.
You can certainly do this, but it would be much better to do it with std::vector. For dynamic growth of a 2D array, you would have to perform all of these steps:
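A short sketch of how such a container might be filled, assuming each extracted feature arrives as a raw buffer of 12 doubles; the alias Feature and the helper append_feature are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Feature = std::array<double, 12>;

// Hypothetical helper: copies 12 doubles from `raw` and appends them as one
// feature. The vector owns the copy, so no new/delete is needed.
void append_feature(std::vector<Feature>& features, const double* raw) {
    assert(raw != nullptr);
    Feature feature;
    for (std::size_t i = 0; i < feature.size(); ++i) {
        feature[i] = raw[i];
    }
    features.push_back(feature);
}
```

Because each std::array stores its 12 doubles inline, the features sit next to each other in one allocation, whereas a vector<double*> scatters them across the heap.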
Create a temporary 2D Array
Allocate memory to it.
Allocate memory to its each component array.
Copy data into its component arrays.
Delete each component array of the original 2D Array.
Delete the 2D Array.
Take new Input.
Add new item to the temporary 2D array.
Create the original 2D Array and allocate memory to it.
Allocate memory to its component arrays.
Copy temporary data into it again.
After doing this at each step, it is hardly plausible that arrays would be any faster. Use std::vector. The answers above explain that.
Using vector will make the problem easier because it makes growing the data automatic. Unfortunately, because of how vectors grow, they may not be the best solution: a large data set can trigger many reallocations. On the other hand, if you set the initial size of the vector quite large but only end up needing a small number of 12-element arrays, you have wasted a large amount of memory. If there is some way to estimate the size required, you might use that estimate to dynamically allocate arrays or to size the vector initially.
If you are only going to calculate with the data once or twice, then maybe you should consider using map or list. For large data sets, these two structures allocate memory that matches your exact needs and avoid the extra time spent growing arrays. On the other hand, calculations with these data structures will be slower.
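If a rough estimate is available, reserve applies it without changing the logical size of the vector. The sketch below assumes the 12-element rows from the question; Row and make_store are hypothetical names. It is also worth keeping in mind that push_back has amortized constant cost even without reserve, so the estimate is an optimisation rather than a requirement.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::array<double, 12>;

// Hypothetical helper: an empty store with room for `estimated_count` rows.
// No reallocation happens until more than `estimated_count` rows are pushed.
std::vector<Row> make_store(std::size_t estimated_count) {
    assert(estimated_count > 0);
    std::vector<Row> store;
    store.reserve(estimated_count);
    return store;
}
```

An estimate that is too low still works, since the vector simply grows again; one that is too high can be trimmed afterwards with shrink_to_fit, which is a non-binding request.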
I hope these thoughts add some alternative solutions to this discussion.
I'm profiling the code that I have developed, and I see a bottleneck in my code when I use cvSet2D.
Is there some alternative to cvSet2D more efficient?
How can I write those pixels?
I recommend that you use the C++ functions rather than the old C-style functions.
The most efficient way to write to pixels is the following.
cv::Mat frame;
// load your img to your Mat
uchar* p = frame.ptr(row); // returns a row pointer
p[column] = x; // accesses value in the given column
One thing to note is that a row may hold more elements than it has pixel columns: on a 3-channel image each row has 3 times as many accessible elements as pixels, so channel k of pixel column c lives at index 3*c + k.
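The indexing of interleaved channels can be sketched without OpenCV. The Image struct below is a hypothetical stand-in for an 8-bit image with a row-pointer accessor; it is not the cv::Mat API, and set_channel is likewise only an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an 8-bit interleaved image (not cv::Mat).
struct Image {
    std::size_t rows;
    std::size_t cols;
    std::size_t channels;
    std::size_t step;  // bytes per row, here without padding
    std::vector<unsigned char> data;

    Image(std::size_t r, std::size_t c, std::size_t ch)
        : rows(r), cols(c), channels(ch), step(c * ch), data(r * c * ch, 0) {}

    // Pointer to the first byte of `row`, in the spirit of a row pointer.
    unsigned char* ptr(std::size_t row) {
        assert(row < rows);
        return data.data() + row * step;
    }
};

// Writes one channel of one pixel: the element index within the row is
// col * channels + channel.
void set_channel(Image& img, std::size_t row, std::size_t col,
                 std::size_t channel, unsigned char value) {
    assert(col < img.cols && channel < img.channels);
    img.ptr(row)[col * img.channels + channel] = value;
}
```

Fetching the row pointer once per row and then looping over the columns keeps the per-pixel work down to a single indexed store.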
For more information on different ways to iterate over pixels, check this tutorial.
You need to get a pointer to the data field of the structure.
(C API) The IplImage structure has a char* field called imageData. Access (your_type*)image->imageData for the first element, and then use it like a regular C 1D array, but be careful to use the field widthStep to jump from one line to the next (because lines of data may be padded to an alignment boundary for memory access optimization).
(C++ API) Use T* cv::Mat::ptr<T>(int i) to get a pointer to the first element of the line you want to access. Then use as a regular C++ 1D array.
This should be the fastest access pattern (see the book OpenCV2 Cookbook for a comparison of the different access patterns).