C++: Time for filling an array is too long

We are writing a method (myFunc) that writes some data to an array. The array must be a member of the class (MyClass).
Example:
class MyClass {
public:
MyClass(int dimension);
~MyClass();
void myFunc();
protected:
float* _nodes;
int _dimension;
};
MyClass::MyClass(int dimension){
_nodes = new float[dimension];
}
void MyClass::myFunc(){
for (int i = 0; i < _dimension; ++i)
_nodes[i] = (i % 2 == 0) ? 0 : 1;
}
The method myFunc is called about 10,000 times, and together with the other methods it takes about 9-10 seconds.
But if we define myFunc as:
void MyClass::myFunc(){
float* test = new float[_dimension];
for (int i = 0; i < _dimension; ++i)
test[i] = (i % 2 == 0) ? 0 : 1;
}
our program works much faster - it takes about 2-3 seconds (when it is called about 10,000 times).
Thanks in advance!

This may help (in either case):
for (int i = 0; i < _dimension; )
{
test[i++] = 0.0f;
test[i++] = 1.0f;
}
I'm assuming _dimension is even, but easy to fix if it is not.

If you want to speed up Debug mode, it may help to give the compiler a hand by caching the members in locals:
void MyClass::myFunc(){
float* const nodes = _nodes;
const int dimension = _dimension;
for (int i = 0; i < dimension; ++i)
nodes[i] = (i % 2 == 0) ? 0.0f : 1.0f;
}
Of course, in reality you should focus on using Release mode for everything performance-related.

In your example code, you do not initialise _dimension in the constructor, but you use it in myFunc. So you might be filling millions of entries in the array even though you have only allocated a few thousand. In the example that works, you use the same dimension for creating and filling the array, so you are probably initialising it correctly in that case.
Just make sure that _dimension is properly initialised.
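A minimal sketch of the fix (assuming the rest of the class is unchanged): store the constructor argument in the member before allocating.
MyClass::MyClass(int dimension){
_dimension = dimension; // remember the size, so myFunc fills exactly what was allocated
_nodes = new float[_dimension];
}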

This is faster on most machines: fill a small prefix by hand, then repeatedly double it with memcpy (from <cstring>) until the array is full.
void MyClass::myFunc(){
float* const nodes = _nodes;
const int dimension = _dimension;
if(dimension < 2){
if(dimension < 1)
return;
nodes[0] = 0.0f;
return;
}
nodes[0] = 0.0f;
nodes[1] = 1.0f;
for (int i = 2; ; i <<= 1){
if( (i << 1) < dimension ){
// double the filled prefix: copy the first i floats into [i, 2*i)
memcpy(nodes + i, nodes, i * sizeof(float));
}else{
// last step: copy only as many floats as remain
memcpy(nodes + i, nodes, (dimension - i) * sizeof(float));
break;
}
}
}

Try this (memset is declared in <cstring>; zero-filling the bytes is safe here because an all-zero bit pattern is 0.0f for IEEE-754 floats):
memset(test, 0, sizeof(float) * _dimension);
for (int i = 1; i < _dimension; i += 2)
{
test[i] = 1.0f;
}
You can also run this piece once and store the resulting array in a static location.
Each subsequent call can then reuse the stored data without any recomputation.
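A minimal sketch of that caching idea (assuming <vector> and <algorithm> are included and every instance uses the same _dimension):
void MyClass::myFunc(){
    static std::vector<float> cache; // filled once, shared by all instances
    if (cache.empty()){
        cache.resize(_dimension);
        for (int i = 0; i < _dimension; ++i)
            cache[i] = (i % 2 == 0) ? 0.0f : 1.0f;
    }
    std::copy(cache.begin(), cache.end(), _nodes); // reuse the precomputed pattern
}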

Related

Setting argument for kernel extremely slow (OpenCL)

In my OpenCL Dijkstra's algorithm implementation, the slowest part by far is writing the 1D reduced graph matrix to the kernel argument, which is global memory.
My graph is a two-dimensional array; for OpenCL it gets reduced to a 1D array like so:
for (int q = 0; q < numberOfVertices; q++)
{
for (int t = 0; t < numberOfVertices; t++)
{
reducedGraph[q * numberOfVertices + t] = graph[q][t];
}
}
Put into a buffer:
cl::Buffer graphBuffer = cl::Buffer(context, CL_MEM_READ_WRITE, numberOfVertices * numberOfVertices * sizeof(int));
Setting the argument then takes an extremely long time. For my test with 5,760,000 vertices, writing the data to the argument takes more than 3 seconds while the algorithm itself takes less than a millisecond:
kernel_dijkstra.setArg(5, graphBuffer);
The kernel uses the graph as a global argument:
void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices)
Is there any way to speed this up? Thank you!
Edit: My Kernel's code:
// Kernel source, calculates minimum distance in segment and relaxes graph.
std::string kernel_code =
void kernel min_distance(global int* dist, global bool* verticesSet, const int sizeOfChunks, global int* result, const int huge_int, global int* graph, const int numberOfVertices) {
for (int b = 0; b < numberOfVertices; b++) {
int gid = get_global_id(0);
int min = huge_int, min_index = -1;
for (int v = gid * sizeOfChunks; v < sizeOfChunks * gid + sizeOfChunks; v++) {
if (verticesSet[v] == false && dist[v] < min && dist[v] != 0) {
min = dist[v];
min_index = v;
}
}
result[gid] = min_index;
if (gid != 0) continue;
min = huge_int;
min_index = -1;
int current_min;
for (int a = 0; a < numberOfVertices; a++) {
current_min = dist[result[a]];
if (current_min < min && current_min != -1 && current_min != 0) { min = current_min; min_index = result[a]; }
}
verticesSet[min_index] = true;
// relax graph with found global min.
int a = 0;
int min_dist = dist[min_index];
int current_dist;
int compare_dist;
for (int i = min_index * numberOfVertices; i < min_index * numberOfVertices + numberOfVertices; i++) {
current_dist = dist[a];
compare_dist = graph[min_index * numberOfVertices + a];
if (current_dist > min_dist + compare_dist && !verticesSet[a] && compare_dist != 0) {
dist[a] = min_dist + compare_dist;
}
a++;
}
}
};
How I enqueue it:
numberOfComputeUnits = default_device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
queue.enqueueNDRangeKernel(kernel_dijkstra, 0, cl::NDRange(numberOfVertices), numberOfComputeUnits);
The error here is that your memory allocation is way too large: 5.76M vertices need a 133 TB buffer, because the buffer size is quadratic in the number of vertices. Neither the C++ compiler nor OpenCL will report this as an error, and your kernel will apparently even start and run just fine, but in reality it does not compute anything because there is not enough memory, and you will get random, undefined results.
Generally, .setArg(...) should not take longer than a few milliseconds. It is also beneficial to do the initialization part (buffer allocation, .setArg(...), etc.) only once at the beginning, and then repeatedly run the kernel or exchange data in the buffers without reallocation.
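A minimal sketch of that structure, assuming the context, queue, and kernel have already been created (bufferBytes and iterations are illustrative names, not from the question):
// one-time initialization: allocate, upload, and bind the buffer once
cl::Buffer graphBuffer(context, CL_MEM_READ_WRITE, bufferBytes);
queue.enqueueWriteBuffer(graphBuffer, CL_TRUE, 0, bufferBytes, reducedGraph);
kernel_dijkstra.setArg(5, graphBuffer);
// per-run work: only enqueue the kernel, no reallocation and no setArg
for (int it = 0; it < iterations; ++it)
    queue.enqueueNDRangeKernel(kernel_dijkstra, cl::NullRange,
                               cl::NDRange(numberOfVertices), cl::NullRange);
queue.finish(); // block until the device has drained the queue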

How to fill the middle of a dynamic 2D array with a smaller one?

I am working with a dynamic square 2D array that I sometimes need to enlarge. The enlarging step consists of adding one new cell along each border of the array.
To achieve this, I first copy the content of my current 2D array into a temporary 2D array of the same size. Then I create the new 2D array with the correct size and copy the original content into the middle of the new one.
Is there any quick way to copy the content of the old array into the middle of my new array? The only way I have found so far uses two nested for loops:
for(int i = 1; i < arraySize-1; i++)
{
for(int j = 1; j < arraySize-1; j++)
{
array[i][j] = oldArray[i-1][j-1];
}
}
But I'm wondering if there is a quicker way to achieve this. I thought about using std::fill, but I don't see how it could be applied in this particular case.
My EnlargeArray function:
template< typename T >
void MyClass<T>::EnlargeArray()
{
const int oldArraySize = arraySize;
// Create temporary array
T** oldArray = new T*[oldArraySize];
for(int i = 0; i < oldArraySize; i++)
{
oldArray[i] = new T[oldArraySize];
}
// Copy old array content in the temporary array
for(int i = 0; i < arraySize; i++)
{
for(int j = 0; j < arraySize; j++)
{
oldArray[i][j] = array[i][j];
}
}
arraySize += 2;
const int newArraySize = arraySize;
// Free the old storage (its content was saved in oldArray above) to avoid a leak
for(int i = 0; i < oldArraySize; i++)
delete [] array[i];
delete [] array;
// Enlarge the array
array = new T*[newArraySize];
for(int i = 0; i < newArraySize; i++)
{
array[i] = new T[newArraySize] {0};
}
// Copy of the old array in the center of the new array
for(int i = 1; i < arraySize-1; i++)
{
for(int j = 1; j < arraySize-1; j++)
{
array[i][j] = oldArray[i-1][j-1];
}
}
for(int i = 0; i < oldArraySize; i++)
{
delete [] oldArray[i];
}
delete [] oldArray;
}
Is there any quick way to copy the content of the old array in the middle of my new array?
(Assuming the question is "can I do better than a 2D for-loop?".)
Short answer: no - if your array has R rows and C columns you will have to iterate over all of them, performing R*C operations.
std::fill and similar algorithms still have to go through every element internally.
Alternative answer: if your array is huge and you make sure to avoid false sharing, splitting the copy operation across multiple threads, each dealing with an independent subset of the array, could be beneficial (this depends on many factors and on the hardware - research/experimentation/profiling would be required).
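A minimal sketch of that idea (illustrative only; it assumes the row-pointer layout from the question and that dst already has rows+2 valid rows):
#include <algorithm>
#include <thread>
#include <vector>
template <typename T>
void ParallelCenterCopy(T** dst, T** src, int rows, int cols, unsigned numThreads)
{
    std::vector<std::thread> workers;
    const int band = (rows + numThreads - 1) / numThreads; // rows per thread
    for (unsigned t = 0; t < numThreads; ++t){
        const int begin = t * band;
        const int end = std::min(rows, begin + band);
        workers.emplace_back([=]{
            for (int r = begin; r < end; ++r)
                std::copy(src[r], src[r] + cols, dst[r + 1] + 1); // old row r lands in the centre
        });
    }
    for (auto& w : workers) w.join();
}
Each thread writes a contiguous band of rows, so no two threads touch the same destination cache lines.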
First, you can use std::make_unique<T[]> to manage the lifetime of your arrays. You can make your array contiguous if you allocate a single array of size row_count * col_count and perform some simple arithmetic to convert (col, row) pairs into array indices. Then, assuming row-major order:
Use std::fill to fill the first and last rows with zeros.
Use std::copy to copy the old rows into the middle of the middle rows.
Fill the cells at the start and end of the middle rows with zero using simple assignment (see the sketch below).
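A minimal sketch of that contiguous layout (the function name and parameters are illustrative, not from the question):
#include <algorithm>
#include <cstddef>
#include <memory>
template <typename T>
std::unique_ptr<T[]> EnlargeContiguous(const T* old, int rows, int cols)
{
    const int newRows = rows + 2, newCols = cols + 2;
    // make_unique<T[]> value-initializes, so every border cell is already T{}
    auto grown = std::make_unique<T[]>(static_cast<std::size_t>(newRows) * newCols);
    for (int r = 0; r < rows; ++r)
        std::copy(old + r * cols, old + (r + 1) * cols,        // old row r ...
                  grown.get() + (r + 1) * newCols + 1);        // ... into the centre of new row r+1
    return grown;
}
Because value-initialization already zeroes the buffer, the explicit std::fill of the first and last rows is only needed if you reuse an existing buffer instead of allocating a fresh one.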
Do not enlarge the array. Keep it as it is and allocate new memory only for the borders. Then, in the public interface of your class, adapt the calculation of the offsets.
To the client of the class, it will appear as if the array had been enlarged, when in fact it wasn't really touched by the supposed enlargement. The drawback is that the storage for the array contents is no longer contiguous.
Here is a toy example, using std::vector because I cannot see any reason to use new[] and delete[]:
#include <vector>
#include <iostream>
#include <cassert>
template <class T>
class MyClass
{
public:
MyClass(int width, int height) :
inner_data(width * height),
border_data(),
width(width),
height(height)
{
}
void Enlarge()
{
assert(border_data.empty()); // enlarge only once
border_data.resize((width + 2) * 2 + (height * 2));
width += 2;
height += 2;
}
int Width() const
{
return width;
}
int Height() const
{
return height;
}
T& operator()(int x, int y)
{
assert(x >= 0);
assert(y >= 0);
assert(x < width);
assert(y < height);
if (border_data.empty())
{
return inner_data[y * width + x];
}
else
{
if (y == 0)
{
return border_data[x]; // top border
}
else if (y == height - 1)
{
return border_data[width + x]; // bottom border
}
else if (x == 0)
{
return border_data[2 * width + (y - 1)]; // left border
}
else if (x == width - 1)
{
return border_data[2 * width + (height - 2) + (y - 1)]; // right border
}
else
{
return inner_data[(y - 1) * (width - 2) + (x - 1)]; // inner matrix
}
}
}
private:
std::vector<T> inner_data;
std::vector<T> border_data;
int width;
int height;
};
int main()
{
MyClass<int> test(2, 2);
test(0, 0) = 10;
test(1, 0) = 20;
test(0, 1) = 30;
test(1, 1) = 40;
for (auto y = 0; y < test.Height(); ++y)
{
for (auto x = 0; x < test.Width(); ++x)
{
std::cout << test(x, y) << '\t';
}
std::cout << '\n';
}
std::cout << '\n';
test.Enlarge();
test(2, 0) = 50;
test(1, 1) += 1;
test(3, 3) = 60;
for (auto y = 0; y < test.Height(); ++y)
{
for (auto x = 0; x < test.Width(); ++x)
{
std::cout << test(x, y) << '\t';
}
std::cout << '\n';
}
}
Output:
10 20
30 40
0 0 50 0
0 11 20 0
0 30 40 0
0 0 0 60
The key point is that the physical representation of the enlarged "array" no longer matches the logical one.

Pointer to array of structs element field in a function

I have a function that searches for extrema in a data array. It takes a pointer to the data, some parameters, and a pointer to a struct array where the results are stored. The function returns the length of the resulting struct array.
int FindPeaks(float *data, int WindowWidth, float threshold, struct result *p)
{
int RightBorder = 0;
int LeftBorder = 0;
bool flag = 1;
int i = 0;
int k = 0;
while(1)
{
flag = 1;
if (WindowWidth >= 200) cout << "Your window is larger than the signal!" << endl;
if (i >= 200) break;
if ((i + WindowWidth) < 200) RightBorder = i + WindowWidth;
if ((i - WindowWidth) >= 0) LeftBorder = i - WindowWidth;
for(int j = LeftBorder; j <= RightBorder; j ++)
{
if (*(data + i) < *(data + j))
{
flag = 0;
break;
}
}
if (flag && *(data + i) >= threshold && i != 0 && i != 199)
{
struct result pointer = p + k;
pointer.amplitude = *(data + i);
pointer.position = i;
i = i + WindowWidth;
k++;
}
else
{
i ++;
}
}
return k;
}
I'm confused with a reference to i-th struct field to put the result in there. Am I doing something wrong?
You're trying to be too smart with use of pointers, so your code won't even compile.
Instead of using *(data + i) or *(data+j) everywhere, use data[i] or data[j]. They're equivalent, and the second is often more readable when working with an array (assuming the data passed by the caller is actually (the address of the first element of) an array of float).
The problem you asked about is this code
struct result pointer = p + k;
pointer.amplitude = *(data + i);
pointer.position = i;
where p is a pointer to struct result passed to the function as argument. In this case, pointer actually needs to be a true pointer. Assuming you want it to point at p[k] (rather than creating a separate struct result) you might need to do
struct result *pointer = p + k; /* equivalently &p[k] */
pointer->amplitude = data[i];
pointer->position = i;
This will get the code to compile. Note that you have not described what the function is actually supposed to achieve, so I have not bothered to check if the code actually does anything sensible.
Note that you're actually (mis)using C techniques in C++. There are much better alternatives in modern C++, such as using standard containers.
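For illustration, here is a sketch of what the container-based alternative could look like (the struct name Result and the use of data.size() instead of the hard-coded 200 are my assumptions):
#include <algorithm>
#include <vector>
struct Result { float amplitude; int position; };
std::vector<Result> FindPeaks(const std::vector<float>& data,
                              int windowWidth, float threshold)
{
    std::vector<Result> peaks;
    const int n = static_cast<int>(data.size());
    for (int i = 1; i + 1 < n; ){
        const int left  = std::max(0, i - windowWidth);
        const int right = std::min(n - 1, i + windowWidth);
        // a peak is a window maximum that also clears the threshold
        const bool isMax = std::all_of(data.begin() + left, data.begin() + right + 1,
                                       [&](float v){ return data[i] >= v; });
        if (isMax && data[i] >= threshold){
            peaks.push_back({data[i], i}); // the vector manages growth and ownership
            i += windowWidth;
        }else{
            ++i;
        }
    }
    return peaks;
}
The caller gets the count from peaks.size(), so the separate return value and the raw output pointer disappear entirely.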

Optimization of C++ code - std::vector operations

I have this function (RotateSlownessTop), and it is called about 800 times to compute the corresponding values. The calculation is slow; is there a way I can make it faster?
The number of elements in X/Y is 7202 (a fairly large set).
I did a performance analysis (the profiler screenshot was attached to the original question).
void RotateSlownessTop(vector <double> &XR1, vector <double> &YR1, float theta = 0.0)
{
Matrix2d a;
a(0,0) = cos(theta);
a(0,1) = -sin(theta);
a(1, 0) = sin(theta);
a(1, 1) = cos(theta);
vector <double> XR2(7202), YR2(7202);
for (size_t i = 0; i < X.size(); ++i)
{
XR2[i] = (a(0, 0)*X[i] + a(0, 1)*Y[i]);
YR2[i] = (a(1, 0)*X[i] + a(1, 1)*Y[i]);
}
size_t i = 0;
size_t j = 0;
while (i < YR2.size())
{
if (i > 0)
if ((XR2[i]>0) && (XR2[i-1]<0))
j = i;
if (YR2[i] > (-1e-10) && YR2[i]<0.0)
YR2[i] = 0.0;
if (YR2[i] < (1e-10) && YR2[i]>0.0)
YR2[i] = -YR2[i];
if ( YR2[i]<0.0)
{
YR2.erase(YR2.begin() + i);
XR2.erase(XR2.begin() + i);
--i;
}
++i;
}
size_t k = 0;
while (j < YR2.size())
{
YR1[k] = (YR2[j]);
XR1[k] = (XR2[j]);
YR2.erase(YR2.begin() + j);
XR2.erase(XR2.begin() + j);
++k;
}
size_t l = 0;
for (; k < XR1.size(); ++k)
{
XR1[k] = XR2[l];
YR1[k] = YR2[l];
l++;
}
}
Edit 1: I have updated the code, replacing all push_back() calls with operator[], since I read somewhere that this is much faster.
However, the whole program is still slow. Any suggestions are appreciated.
If the size is large, you can improve the push_back by pre-allocating the space needed. Add this before the loop:
XR2.reserve(X.size());
YR2.reserve(X.size());
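Separately from the reservation, a large share of the time in the posted function likely goes into the per-element erase() calls, each of which shifts the whole tail of the vector. A sketch of one common alternative (my suggestion, not part of the original answer): compact both vectors in a single pass with a write index, then truncate once.
size_t w = 0;
for (size_t r = 0; r < YR2.size(); ++r){
    if (YR2[r] >= 0.0){ // keep condition mirrors the original deletion test
        XR2[w] = XR2[r];
        YR2[w] = YR2[r];
        ++w;
    }
}
XR2.resize(w); // one O(n) pass instead of O(n^2) worth of erases
YR2.resize(w);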

C++ time spent allocating vectors

I am trying to speed up a piece of code that is run a total of 150,000,000 times.
I have analysed it using "Very Sleepy", which indicates that the code spends most of its time in three areas (shown in the profiler screenshot attached to the original question).
The code is as follows:
double nonLocalAtPixel(int ymax, int xmax, int y, int x , vector<nodeStructure> &nodeMST, int squareDimension, Mat &inputImage) {
vector<double> nodeWeights(8,0);
vector<double> nodeIntensities(8,0);
bool allZeroWeights = true;
int numberEitherside = (squareDimension - 1) / 2;
int index = 0;
for (int j = y - numberEitherside; j < y + numberEitherside + 1; j++) {
for (int i = x - numberEitherside; i < x + numberEitherside + 1; i++) {
// out of range or the centre pixel
if (j<0 || i<0 || j>ymax || i>xmax || (j == y && i == x)) {
index++;
continue;
}
else {
int centreNodeIndex = y*(xmax+1) + x;
int thisNodeIndex = j*(xmax+1) + i;
// add to intensity list
Scalar pixelIntensityScalar = inputImage.at<uchar>(j, i);
nodeIntensities[index] = ((double)*pixelIntensityScalar.val);
// find weight from p to q
float weight = findWeight(nodeMST, thisNodeIndex, centreNodeIndex);
if (weight!=0 && allZeroWeights) {
allZeroWeights = false;
}
nodeWeights[index] = (weight);
index++;
}
}
}
// find min b
int minb = -1;
int bCost = -1;
if (allZeroWeights) {
return 0;
}
else {
// iteratate all b values
for (int i = 0; i < nodeWeights.size(); i++) {
if (nodeWeights[i]==0) {
continue;
}
double thisbCost = nonLocalWithb(nodeIntensities[i], nodeIntensities, nodeWeights);
if (bCost<0 || thisbCost<bCost) {
bCost = thisbCost;
minb = nodeIntensities[i];
}
}
}
return minb;
}
Firstly, I assume the time indicated by Very Sleepy means that the majority of it is spent allocating and deleting the two vectors?
Secondly, are there any suggestions to speed this code up?
Thanks
Use std::array instead (a sketch follows this list).
Reuse the vectors by passing them as arguments to the function, or as a global variable if possible (I am not aware of the structure of the code, so I need more info).
Allocate one vector of size 16 instead of two vectors of size 8; this will make your memory less fragmented.
Use parallelism if findWeight is thread-safe (you need to provide more details on that too).
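A minimal sketch of the first suggestion (only the declarations change; the rest of the function can stay the same):
#include <array>
// fixed-size, stack-allocated buffers: nothing is allocated or freed per call
std::array<double, 8> nodeWeights{};      // {} zero-initializes, like vector(8, 0)
std::array<double, 8> nodeIntensities{};
// operator[] and .size() are then used exactly as with the vectors above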