Advantages of flattening to 1D - C++

I have a question about the flattening operation I see recommended on forums. People often suggest flattening a multi-dimensional vector or array into a single-dimensional one.
For example:
int height = 10;
int width = 10;

// Flattened version: one vector holding height * width values.
std::vector<int> grid;
for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
        grid.push_back(rand() % (i + 1) + j); // (i + 1): rand() % i is undefined when i == 0
    }
}

// Nested version: one vector per row.
std::vector<std::vector<int>> another_grid;
for (int i = 0; i < height; i++) {
    std::vector<int> row;
    for (int j = 0; j < width; j++) {
        row.push_back(rand() % (i + 1) + j);
    }
    another_grid.push_back(row);
}
I can guess that it's less memory-consuming to have a single vector instead of many, but what about a multi-dimensional array of int? Are there real advantages to flattening multi-dimensional data structures?

I can think of multiple reasons to do this, in no particular order, and there might be more that I missed:
Slightly less memory use: each vector takes 24 bytes*, so if you have 1000 rows, that's 24 KB more memory. Not that important, but it's there.
Fewer allocations: Again, not very important, but allocations can be slow, and if this is happening in real time, for instance while you're allocating buffers for images coming from a camera, having one allocation is better than potentially thousands.
Locality: This is the most important one. With a single allocation, all the data is very close together, so accessing nearby data is much faster, either because it's already in the cache or because the prefetching hardware can accurately pull in the next cache line.
Easier serialization/deserialization: For instance, if this is texture data, it can be passed to a GPU with a single copy. The same applies to writing to disk or the network, though you may want some compression in those cases.
The downside is that it's less comfortable to write and use, but with a proper class abstracting this away (see the sketch below), it's pretty much a must-have if performance matters. It may also be less efficient for certain operations: with the vector<vector<>> version you can swap entire rows with a single pointer swap, whereas the single-vector version needs to copy a bunch of data around.
*: This depends on your implementation, but on 64-bit platforms, this is common.
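As an illustration of the kind of wrapper class mentioned above, here is a minimal sketch (the Grid2D name and its interface are made up for this example, not taken from the post): it keeps the data in one flat std::vector and maps (row, col) to a single index in row-major order.
#include <cstddef>
#include <vector>

template <typename T>
class Grid2D {
public:
    Grid2D(std::size_t height, std::size_t width, T value = T{})
        : width_(width), data_(height * width, value) {}

    // Row-major indexing: element (row, col) lives at row * width + col.
    T&       at(std::size_t row, std::size_t col)       { return data_[row * width_ + col]; }
    const T& at(std::size_t row, std::size_t col) const { return data_[row * width_ + col]; }

    std::size_t width() const  { return width_; }
    std::size_t height() const { return data_.size() / width_; }

private:
    std::size_t width_;
    std::vector<T> data_;
};
Usage would look like grid.at(i, j) instead of grid[i][j]; the whole buffer stays contiguous, which is what gives the locality and single-copy benefits described above.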

Related

CUDA: accessing a matrix stored in RAM, and whether it can be implemented

Recently I started working with numerical computation, solving mathematical problems numerically, programming in C++ with OpenMP. But now my problem is too big and takes days to solve even when parallelized. So I'm thinking of learning CUDA to reduce the time, but I have some doubts.
The heart of my code is the following function. The arguments are two pointers to vectors. N_mesh_points_x,y,z are pre-defined integers, weights_x,y,z are column matrices, kern_1 is an exponential function, and table_kernel is a function that accesses a pre-calculated 50 GB matrix stored in RAM.
void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2)
{
    double r, sum_1 = 0, sum_2 = 0;
    double phir;
    for (int l = 0; l < N_mesh_points_x; l++){
        for (int m = 0; m < N_mesh_points_y; m++){
            for (int p = 0; p < N_mesh_points_z; p++){
                sum_1 = 0;
                sum_2 = 0;
                #pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1,sum_2)
                for (int i = 0; i < N_mesh_points_x; i++){
                    for (int j = 0; j < N_mesh_points_y; j++){
                        for (int k = 0; k < N_mesh_points_z; k++){
                            if (!(i==l) || !(j==m) || !(k==p)){
                                phir = weights_x[i]*weights_y[j]*weights_z[k]*kern_1(i,j,k,l,m,p);
                                sum_1 += phir * (*K1)[position(i,j,k)];
                                sum_2 += phir;
                            }
                        }
                    }
                }
                (*K2)[position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * (*K1)[position(l,m,p)];
            }
        }
    }
    return;
}
My questions are:
Can I program, at least the central part of this function, in CUDA? I only parallelized the inner loops with OpenMP, because I was getting the wrong answer when I parallelized all the loops.
The function table_kernel accesses a big matrix; the matrix is too big to fit in the memory of my video card, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM? Or can this not be done, so that everything needs to be stored inside the video card's memory?
Can I program, at least the central part of this function, in CUDA? I only parallelized the inner loops with OpenMP, because I was getting the wrong answer when I parallelized all the loops.
Yes, you should be able to program the portion that you currently have in the OpenMP scope as a CUDA kernel.
The function table_kernel accesses a big matrix; the matrix is too big to fit in the memory of my video card, so the data will stay in RAM. Is this a problem? Can CUDA easily access data in RAM? Or can this not be done, so that everything needs to be stored inside the video card's memory?
Since you only access this outside the OpenMP scope, if you only use a CUDA kernel for the work that you are currently doing with OpenMP, it should not be necessary to access table_kernel from the GPU, so this should not be an issue. If you attempt to move additional loops onto the GPU, then it may become one. Since the access would be relatively infrequent (compared to the processing going on in the inner loops), if you wanted to pursue this, you could try making the table_kernel data available to the GPU via cudaHostAlloc, which basically maps host memory into the GPU address space. This is normally a significant performance hazard, but with accesses as infrequent as these, it may or may not be a serious problem.
Note that you won't be able to use or access std::vector in device code, so those types of data containers would probably have to be realized as ordinary double arrays.
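For what it's worth, here is a rough host-side sketch of the cudaHostAlloc approach described above (sizes are placeholders, the real table is about 50 GB, and the kernel launch is omitted); it only uses the standard CUDA runtime calls for mapped pinned memory.
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const std::size_t n = 1 << 20;              // placeholder element count
    double* h_table = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);      // allow mapping host memory into the device address space
    cudaHostAlloc(reinterpret_cast<void**>(&h_table), n * sizeof(double), cudaHostAllocMapped);

    for (std::size_t i = 0; i < n; ++i)         // fill with the precomputed table values
        h_table[i] = 0.0;

    double* d_table = nullptr;                  // device-visible alias of the same memory
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_table), h_table, 0);

    // d_table can now be passed to a kernel; every access goes over the PCIe bus,
    // so it should only be read as infrequently as the answer suggests.

    cudaFreeHost(h_table);
    return 0;
}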

How can I make it faster in C++11 with std::vector?

I have cv::Mat Mat_A and cv::Mat Mat_B; both are (800000 x 512) floats, and the code below looks slow.
int rows = Mat_B.rows;
double dis;                                  // minimum value found by minMaxLoc
cv::Point point;                             // location of that minimum
cv::Mat Mat_A = cv::repeat(img, rows, 1);    // tile img so it has the same number of rows as Mat_B
Mat_A = Mat_A - Mat_B;
cv::pow(Mat_A, 2, Mat_A);
cv::reduce(Mat_A, Mat_A, 1, CV_REDUCE_SUM);
cv::minMaxLoc(Mat_A, &dis, 0, &point, 0);
How can I do this with std::vector?
I think it should be faster.
On my 2.4 GHz MacBook Pro it took 4 seconds, which is very slow.
I don't think you should use std::vector to do these operations. Image processing (CV, as in computer vision) algorithms tend to be computationally heavy because there is so much data to deal with. The OpenCV 2.0 C++ API is highly optimized for this kind of operation; for example, cv::Mat has a header, and whenever a cv::Mat is copied with the copy assignment operator or copy constructor, only the header and a pointer to the data are copied. Reference counting is used to keep track of instances, so memory management is done for you, and that's a good thing.
https://docs.opencv.org/2.4/doc/tutorials/core/mat_the_basic_image_container/mat_the_basic_image_container.html
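As a small, hedged illustration of that header-copy behaviour (sizes and values are made up for the example):
#include <opencv2/core/core.hpp>

int main() {
    cv::Mat a = cv::Mat::ones(4, 4, CV_32F);
    cv::Mat b = a;             // copies only the header; b shares a's pixel data
    cv::Mat c = a.clone();     // deep copy; c gets its own buffer

    b.at<float>(0, 0) = 42.f;  // visible through a as well, because the data is shared
    // At this point a.at<float>(0, 0) == 42.f, while c.at<float>(0, 0) is still 1.f.
    return 0;
}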
You could try to compile without debug symbols, i.e. release instead of debug, and you can also try to compile with optimization flags, e.g. -O3 for gcc, which should reduce the size of your binary and speed up runtime operations. It might make a difference.
https://www.rapidtables.com/code/linux/gcc/gcc-o.html
Another thing you could try is to give your process a higher priority, i.e. the higher the priority, the less the process yields the CPU. Again, that might not make a lot of difference; it all depends on the other processes and their priorities, etc.
https://superuser.com/questions/42817/is-there-any-way-to-set-the-priority-of-a-process-in-mac-os-x
I hope that helps a bit.
Well, your thinking is wrong.
Why your program is slow:
Your CPU has to loop through a lot of numbers and do the calculations, so the computational cost is high. That's why it's slow. Your program's speed is proportional to the size of Mat_A and Mat_B; you can check this by reducing or increasing their size.
Can we accelerate it with std::vector?
Sorry, but no. Using std::vector will not reduce the computational cost. OpenCV's matrix arithmetic is already about as good as it gets; rewriting it yourself will only lead to slower code.
How to accelerate the calculation: you need to enable the acceleration options for OpenCV.
You can see them at https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options. Intel provides the MKL library to accelerate matrix calculations; you could try that first.
Personally, the easiest approach is to use the GPU, but your machine doesn't have one, so that's out of scope here.
You keep iterating over the data again and again to do independent operations on it.
Something like this iterates over the data only once.
// assumes Mat_B and img are cv::Mat
using px_t = float; // you mentioned float, so I'll assume both img and Mat_B use floats
int rows = Mat_B.rows;
cv::Mat output(1, rows, Mat_B.type());
auto output_ptr = output.ptr<px_t>(0);
auto img_ptr = img.ptr<px_t>(0);
int min_idx = 0;
int max_idx = 0;
px_t min_ele = std::numeric_limits<px_t>::max();
px_t max_ele = std::numeric_limits<px_t>::lowest(); // lowest(), not min(): min() is the smallest positive float
for (int i = 0; i < rows; ++i)
{
    output_ptr[i] = 0;
    auto mat_row = Mat_B.ptr<px_t>(i);
    for (int j = 0; j < Mat_B.cols; ++j)
    {
        output_ptr[i] += (img_ptr[j] - mat_row[j]) * (img_ptr[j] - mat_row[j]);
    }
    if (output_ptr[i] < min_ele)
    {
        min_idx = i;
        min_ele = output_ptr[i];
    }
    if (output_ptr[i] > max_ele)
    {
        max_idx = i;
        max_ele = output_ptr[i];
    }
}
While I am also not sure whether it is faster, you can do this, assuming Mat_B contains uchar:
std::vector<uchar> array_B(Mat_B.rows * Mat_B.cols);
if (Mat_B.isContinuous())
    array_B.assign(Mat_B.data, Mat_B.data + Mat_B.rows * Mat_B.cols); // copy the pixels; assigning the raw pointer would not compile

Is using for loops faster than saving things in a vector in C++?

Sorry for the bad title, but I actually cannot think of a better one (open to suggestions).
I have a big grid (1000*1000*1000).
for (int k = 0; k < dims.nz; k++) // note: the stray ';' that was here would make the loop body empty
{
    for (int i = 0; i < dims.nx; i++)
    {
        for (int j = 0; j < dims.ny; j++)
        {
            if (inputLabel->evalReg(i, j, 0) == 0)
            {
                sum = sum + anotherField->evalReg(i, j, 0);
            }
        }
    }
}
I go through all grid points to find which ones have the value 0 in my label field and sum up the corresponding values of another field.
After this I want to set all the points that I detected above to a certain value.
Would it be faster to basically do the same for loop again (this time setting values instead of reading them), or should I write all the positions that I found into a separate vector (which would have to change size in every step of the loop in which we detect something) and simply build a for loop like
for (int p = 0; p < size_vec_1; p++)
{
    anotherField->set(vec_1[p], vec_2[p], vec_3[p], random_value); // random_value = whatever value you want to assign
}
The point is that I do not know how much of the grid will be affected by my routine, because it depends on the data. It might be half of the grid or something completely different. Can I make a general estimate of the speed of the two methods, or does it depend solely on the distribution of my values?
The point is that I do not know how much of the grid will be affected by my routine, because it depends on the data.
Here's a trick which may work: sample inputLabel randomly to approximate how many entries are 0. If there are few, go the "put indices into a vector" way; if there are many, go the "scan the array again" way.
It needs fine-tuning for a specific computer: what the threshold between the two cases should be, how many samples to take (not too many, or the approximation itself takes too much time, but not too few, or the approximation is poor), etc.
Bonus trick: take cache-line-aligned and cache-line-sized samples. This way the approximation takes about the same amount of time (because it is memory bound), but it will be more accurate.
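A rough sketch of the sampling idea, under the assumption that the label field can be read as inputLabel->evalReg(i, j, 0) like in the question (passed in here as a callable so the sketch is self-contained; the sample count, seed, and threshold are arbitrary placeholders to tune):
#include <functional>
#include <random>

// Estimate the fraction of zero labels by sampling random grid points.
double estimate_zero_fraction(const std::function<int(int, int)>& label,
                              int nx, int ny, int samples = 1000)
{
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> di(0, nx - 1), dj(0, ny - 1);
    int zeros = 0;
    for (int s = 0; s < samples; ++s)
        if (label(di(rng), dj(rng)) == 0)
            ++zeros;
    return static_cast<double>(zeros) / samples;
}

// Usage, with the threshold left as something to tune per machine:
// double frac = estimate_zero_fraction(
//     [&](int i, int j) { return inputLabel->evalReg(i, j, 0); }, dims.nx, dims.ny);
// if (frac < 0.1) { /* collect indices into vectors */ } else { /* just rescan the grid */ }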

Clean and efficient way of iterating over ranges of indices in a single array

I have a contiguous array of particles in 3D space for a fluid simulation and I need to do a lot of neighbor searches on it. I've found that partitioning the search space into cubic cells and sorting the particles in place by the cell they are in works well for my problem. That way, for any given cell, its particles occupy a contiguous span, so you can iterate over them all easily if you know the begin and end indices, e.g. cell N might occupy [N_begin, N_end) of the array used to store the particles.
However, no matter how you divide the space, a particle might have neighbors not only in its own cell but also in every neighboring cell (imagine a particle that is almost touching the boundary between cells to understand why). Because of this, a neighbor search needs to return all particles in the same cell as well as all of its neighbors in 3D space, a total of up to 27 cells when not on the edge of the simulation space. There is no ordering of the cells into the array (which is by its nature 1D) that can get all 27 of those spans to be adjacent for any requested cell. To work around this, I keep track of where each cell begins and ends in the particle array and I have a function that determines which cells hold potential neighbors. To represent multiple ranges of indices, it has to return up to 27 pairs of indices signifying the begin and end of those ranges.
std::vector<std::pair<int, int>> get_neighbor_indices(const Vec3f &position);
The index is actually required at some point so it works better this way than a pair of iterators or some other abstraction. The problem is that it forces me to use a loop structure that is tied to the implementation. I would like something like the following, using some pseudocode and omissions to simplify.
for (int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (int j : neighbor_indices) {
        // do stuff with particle[i] and particle[j]
    }
}
That would only work if neighbor_indices was a complete list of all indices, but that is a significant number of indices that are trivial to calculate on the fly and therefore a huge waste of memory. So the best I can get without compromising the performance of the code is the following.
for (int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (const auto& indices_pair : neighbor_indices) {
        for (int j = indices_pair.first; j < indices_pair.second; ++j) {
            // do stuff with particle[i] and particle[j]
        }
    }
}
The loss of genericity is a setback for my project because I have to test and measure performance a lot and make adjustments when I come across a performance problem. Implementation-specific code significantly slows down this process.
I'm looking for something like an iterator, except it will return an index instead of a reference. It will allow me to use it as follows.
for (int i = 0; i < num_of_particles; ++i) {
    auto neighbor_indices = get_neighbor_indices(particle[i].position);
    for (int j : neighbor_indices) {
        // do stuff with particle[i] and particle[j]
    }
}
The only issue with this iterator-like approach is that incrementing it is cumbersome: I have to manually keep track of the range I'm in and continuously check whether I'm at its end so I can switch to the next. That's a lot of code just to get rid of the one line that breaks the genericity of the iteration loop. So I'm looking for either a cleaner way to implement the "iterator" or simply a better way to iterate over a number of ranges as if they were one.
Keep in mind that this is in a bottleneck computation loop so abstractions have to be zero or negligible cost.
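One possibility, sketched here purely as an illustration (only get_neighbor_indices and its return type are taken from the question), is to hide the two-level loop behind a small helper that hands each index to a callback; with the lambda inlined, it should cost about the same as the hand-written nested loops:
#include <utility>
#include <vector>

template <typename Func>
void for_each_index(const std::vector<std::pair<int, int>>& ranges, Func&& f)
{
    for (const auto& range : ranges)
        for (int j = range.first; j < range.second; ++j)
            f(j); // hand each index in each [begin, end) range to the caller
}

// Usage inside the particle loop:
// for (int i = 0; i < num_of_particles; ++i) {
//     auto neighbor_indices = get_neighbor_indices(particle[i].position);
//     for_each_index(neighbor_indices, [&](int j) {
//         // do stuff with particle[i] and particle[j]
//     });
// }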

C++ memory reduction with arrays

I'm creating a block-based game and I would like to improve its memory usage.
I'm currently creating blocks with a sizeof() of 8, and I can't reduce their size.
Block:
...
bool front, back, left, right, top, bottom;//If the face is active, true
BYTE blockType;
BYTE data;
I came up with a solution, but I have no idea how to implement it properly because I'm quite new to C++. The solution:
All air blocks are exactly equal, but each takes up 8 bytes of memory. If I could make all air blocks point to the same piece of memory, this should (I guess) use less memory (unless the pointer itself takes up 8 bytes?).
Currently my array looks like this:
Chunk:
Block*** m_pBlocks;

Chunk::Chunk()
{
    m_pBlocks = new Block**[CHUNK_SIZE];
    for (int x = 0; x < CHUNK_SIZE; x++){
        m_pBlocks[x] = new Block*[CHUNK_HEIGHT];
        for (int y = 0; y < CHUNK_HEIGHT; y++){
            m_pBlocks[x][y] = new Block[CHUNK_SIZE];
        }
    }
}
I know you can't make these point to null or point to something else so how should I do this?
Using bitfields to reduce the size of Block.
class Block {
    // bit fields, reduce 6 bytes to 1
    unsigned char front:1, back:1, left:1, right:1, top:1, bottom:1; // If the face is active, true
    BYTE blockType;
    BYTE data;
    // optional alignment to size 4:
    // BYTE pad;
};

Block m_pBlocks[32][64][32]; // 32*64*32 = 64K elements * sizeof(Block) = 256K, that is a lot.
And yes, using a pointer that is 8 bytes is not really a saving.
But there are several methods that help save more. If you have a heightmap, everything above the heightmap is air, so you only need a 2D array to check that, where the elements are the heights. Only the air voxels below the heightmap have to be stored explicitly, along with the non-air elements.
Other data structures are often more space-efficient, like an octree or a sparse voxel octree.
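A tiny sketch of that heightmap idea (the CHUNK_SIZE value and the convention that y is the vertical axis are assumptions for this example): one height per (x, z) column, and anything above it counts as air without storing a Block for it.
#include <cstdint>

constexpr int CHUNK_SIZE = 32;

struct HeightMap {
    std::uint8_t height[CHUNK_SIZE][CHUNK_SIZE]; // highest non-air block in each (x, z) column

    bool isAir(int x, int y, int z) const {
        return y > height[x][z]; // above the column's top: air, no Block object needed
    }
};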
If you don't need to modify those blocks, you could create a lookup map.
Please do not use new, and avoid pointers as much as you can.
bool lookup[CHUNK_SIZE][CHUNK_HEIGHT];

Chunk::Chunk()
{
    for (int x = 0; x < CHUNK_SIZE; x++)
        for (int y = 0; y < CHUNK_HEIGHT; y++)
            lookup[x][y] = true;
}
Now you can just query the lookup table to see whether that particular block is set.
Moreover, you now have all those values close together, which is beneficial for performance.