Shift vector in thrust - c++

I'm looking at a project involving online (streaming) data. I want to work with a sliding window of that data. For example, say that I want to hold 10 values in my vector. When value 11 comes in, I want to drop value 1, shift everything over, and then place value 11 where value 10 was.
The long way would be something like the following:
int n = 9;
thrust::device_vector<float> val;
val.resize(n + 1, 0);
// Shift everything left by one position
for (int i = 0; i < n; i++) {
    val[i] = val[i + 1];
}
// add the new value to the last position
val[n] = newValue;
Is there a "fast" way to do this with thrust? The project I'm looking at will have around 500 vectors that will need this operation done simultaneously.
Thanks!

As I have said, a ring buffer is what you need: no shifting, just one counter and a fixed-size array.
Now let's think about how to deal with 500 ring buffers.
If you want 500 (let's say 512) sliding windows and want to process them all on the GPU, you can pack them into one big 2D texture, where each column holds the samples for the same moment across all windows.
If you get one new sample for each of the 512 buffers per processing step, this "ring texture" (think of a cylinder) only needs a single update per step (upload the array of new samples), and you need just one counter.
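A hedged host-side sketch of that idea (the real thing would live in a 2D texture or device buffer; the window and stream counts are placeholders): a fixed window-by-streams buffer, one shared write counter, and no shifting.
#include <algorithm>
#include <vector>

struct RingWindows
{
    int window;                  // samples kept per stream
    int streams;                 // number of parallel buffers
    int head = 0;                // row currently holding the oldest samples
    std::vector<float> data;     // row-major: `window` rows of `streams` samples

    RingWindows(int window_, int streams_)
        : window(window_), streams(streams_), data(window_ * streams_, 0.0f) {}

    // one new sample per stream arrives each step: overwrite the oldest row
    void push(const std::vector<float>& newSamples)
    {
        std::copy(newSamples.begin(), newSamples.end(), data.begin() + head * streams);
        head = (head + 1) % window;          // advance the single shared counter
    }

    // sample s (0 = oldest, window-1 = newest) of stream v
    float at(int s, int v) const
    {
        return data[((head + s) % window) * streams + v];
    }
};
Usage would simply be RingWindows rw(10, 512); followed by one rw.push(newSamples) per step.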

I highly recommend using a different, yet still free, library for this problem. In 4 lines of ArrayFire code, you can do all 500 vectors, as follows:
array val = array(window_width, num_vectors);
val = shift(val, 0, 1);
array newValue = array(1,num_vectors);
val(span,end) = newValue;
I benchmarked against Thrust code for the same and ArrayFire is getting about a 10X speedup over Thrust.
Downside is that ArrayFire is not open source, but it is still free for this sort of problem.

What you want is simply thrust::copy. You can't do the shift in place in parallel, because you can't guarantee a value is read before it is overwritten.
int n = 9;
thrust::device_vector<float> val_in(n + 1);
thrust::device_vector<float> val_out(n + 1);
// copy val_in[1..n] into val_out[0..n-1]
thrust::copy(val_in.begin() + 1, val_in.end(), val_out.begin());
// add the new value to the last position
val_out[n] = newValue;
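For the ~500 windows in the question, a hedged extension of this approach (not from the original answer) is to keep all windows in one device_vector laid out sample-major, so that element [s * num_vectors + v] holds sample s of window v. One bulk thrust::copy then shifts every window at once, and the batch of new samples lands in a single contiguous block:
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// window, num_vectors and where new_samples come from are assumptions of this sketch
void slide_all(thrust::device_vector<float>& val,                  // size: window * num_vectors
               thrust::device_vector<float>& scratch,              // same size, reused every step
               const thrust::device_vector<float>& new_samples,    // one new sample per window
               int num_vectors)
{
    // drop the oldest sample of every window in one bulk copy (out of place,
    // for the read-before-write reason given above)
    thrust::copy(val.begin() + num_vectors, val.end(), scratch.begin());
    // append the newest sample of every window as one contiguous block
    thrust::copy(new_samples.begin(), new_samples.end(), scratch.end() - num_vectors);
    val.swap(scratch);
}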

Related

Sparse matrix without knowing size

Does Eigen support element insertion into a sparse matrix if the size is not known?
I have a stream of data coming in, and I am attempting to store it sparsely, but I don't know the maximum value of the indices (row/column) of the data ahead of time (I can guess, but not guarantee). Looking at Eigen's insert code, it has an assertion (line 1130 of SparseMatrix.h) that the index you wish to insert into is <= rows() and <= cols().
Do I really need to wait until I have all the data before I can start using Eigen's sparse matrix code? The design I would have to go for then would require me to wait for all the data and then scan to find the maximum index, which is not ideal for my application. I currently don't need the full matrix to start working - a limited one with the currently available data would be fine.
Please don't close this question unless you have an answer; the linked answer was for dense matrices, not sparse ones, which have different internal storage.
I'm also looking for information on the case where the matrix size is not available at run time (rather than at compile time), and only for sparse matrices.
The recommendation is still to store the values in an intermediate triplet container and build the sparse matrix at the end. If you don't want to read the whole stream, just read the first N triplets, until your desired condition is met, and then call setFromTriplets() with the partial list of triplets.
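A minimal sketch of that triplet route, assuming the same (row, col, value) text stream as the example further below and an arbitrary early-stop condition:
#include <Eigen/Sparse>
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

Eigen::SparseMatrix<double> loadPartial(const std::string& path)
{
    std::vector<Eigen::Triplet<double>> triplets;
    Eigen::Index rows = 0, cols = 0;
    std::ifstream in(path);
    unsigned i, j;
    double v;
    while (in >> i >> j >> v)
    {
        triplets.emplace_back(i, j, v);
        rows = std::max<Eigen::Index>(rows, i + 1);   // track the largest index seen so far
        cols = std::max<Eigen::Index>(cols, j + 1);
        if (triplets.size() >= 1000)                  // stop early on your own condition
            break;
    }
    Eigen::SparseMatrix<double> mat(rows, cols);
    mat.setFromTriplets(triplets.begin(), triplets.end());
    return mat;
}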
But if you still don't want to read the full stream before starting to work, you can guess a size for your matrix and make it grow with conservativeResize() whenever you read an entry that does not fit in the current size:
#include <Eigen/Sparse>
#include <algorithm>
#include <fstream>
#include <iostream>

int main()
{
    Eigen::SparseMatrix<double> mat;
    mat.resize(100, 100); // initial size guess; could be 1, 10, 1000, etc.

    std::ifstream inputStream("filename.txt");

    // read position and value from the stream
    unsigned i, j;
    double v;
    while (inputStream >> i >> j >> v)
    {
        // check the current size of the matrix and make it grow if necessary
        if (static_cast<Eigen::Index>(i) >= mat.rows() || static_cast<Eigen::Index>(j) >= mat.cols())
            mat.conservativeResize(std::max<Eigen::Index>(i + 1, mat.rows()),
                                   std::max<Eigen::Index>(j + 1, mat.cols()));

        // store the value in the matrix
        mat.coeffRef(i, j) = v;

        // insert here your condition to stop before reading the whole stream
        if (mat.nonZeros() > 150)
            break;
    }

    // do some clean-up if you think it is necessary
    mat.makeCompressed();
}

Is using for loops faster than saving things in a vector in C++?

Sorry for the bad title, but I honestly can't think of a better one (open to suggestions).
I have a big grid (1000*1000*1000).
for (int k = 0; k < dims.nz; k++)
{
    for (int i = 0; i < dims.nx; i++)
    {
        for (int j = 0; j < dims.ny; j++)
        {
            if (inputLabel->evalReg(i, j, 0) == 0)
            {
                sum = sum + anotherField->evalReg(i, j, 0);
            }
        }
    }
}
I go through all grid points to find which grid points have the value 0 in my labelfield and sum up the corresponding values of another field.
After this I want to set all the points that I detected above to a certain value.
Would it be faster to basically run the same for loop again (this time setting values instead of reading them), or should I write all the positions that I found into a separate vector (which would have to grow in every step of the loop in which we detect something) and then simply build a for loop like
for (int p = 0; p < size_vec_1; p++)
{
    anotherField->set(vec_1[p], vec_2[p], vec_3[p], random value);
}
The point is that I do not know how much of the grid will be affected by my routine, since it depends on the data. It might be half of the grid or something completely different. Can I make a general estimate of the speed of the two methods, or does it depend solely on the distribution of my values?
The point is that I do not know how much of the grid will be affected by my routine, since it depends on the data.
Here's a trick which may work: sample inputLabel randomly to estimate how many entries are 0. If only a few are, go the "put indices into a vector" way; if a lot are, go the "scan the array again" way.
This needs fine-tuning for a specific computer: where the threshold between the two cases should lie, how many samples to take (not too many, or the estimate itself takes too long, but not too few, or the estimate is poor), etc.
Bonus trick: take cache-line-aligned and cache-line-sized samples. The estimate will take about the same amount of time (because it is memory bound), but it will be more accurate.
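A hedged sketch of the sampling step, assuming the inputLabel->evalReg(i, j, 0) accessor from the question; the sample count and the decision threshold are placeholders to be tuned:
#include <random>

// estimate the fraction of zero labels from a random sample of grid points
template <class LabelField>
double estimateZeroFraction(const LabelField* inputLabel, int nx, int ny, int samples = 4096)
{
    std::mt19937 rng(12345);   // any seed will do for a rough estimate
    std::uniform_int_distribution<int> di(0, nx - 1), dj(0, ny - 1);
    int zeros = 0;
    for (int s = 0; s < samples; ++s)
        if (inputLabel->evalReg(di(rng), dj(rng), 0) == 0)
            ++zeros;
    return double(zeros) / samples;
}

// e.g. if the estimated fraction is small, collect indices into vectors;
// otherwise just rescan the whole grid a second time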

How can I pixelate a 1d array

I want to pixelate an image stored in a 1D array, although I am not sure how to do it; this is what I have come up with so far.
The value of pixelation is currently 3 for testing purposes.
Currently it just creates a section of randomly coloured pixels along the left third of the image. If I increase the value of pixelation, the amount of randomly coloured pixels decreases, and vice versa. So what am I doing wrong?
I have also already implemented the rotation, reading of the image and saving of a new image; this is just a separate function which I need assistance with.
picture pixelate( const std::string& file_name, picture& tempImage, int& pixelation /* TODO: OTHER PARAMETERS HERE */)
{
    picture pixelated = tempImage;

    RGB tempPixel;
    tempPixel.r = 0;
    tempPixel.g = 0;
    tempPixel.b = 0;

    int counter = 0;
    int numtimesrun = 0;

    for (int x = 1; x < tempImage.width; x += pixelation)
    {
        for (int y = 1; y < tempImage.height; y += pixelation)
        {
            //RGB tempcol;
            //tempcol for pixelate
            for (int i = 1; i < pixelation; i++)
            {
                for (int j = 1; j < pixelation; j++)
                {
                    tempPixel.r += tempImage.pixel[counter + pixelation * numtimesrun].colour.r;
                    tempPixel.g += tempImage.pixel[counter + pixelation * numtimesrun].colour.g;
                    tempPixel.b += tempImage.pixel[counter + pixelation * numtimesrun].colour.b;
                    counter++;
                    //read colour
                }
            }
            for (int k = 1; k < pixelation; k++)
            {
                for (int l = 1; l < pixelation; l++)
                {
                    pixelated.pixel[numtimesrun].colour.r = tempPixel.r / pixelation;
                    pixelated.pixel[numtimesrun].colour.g = tempPixel.g / pixelation;
                    pixelated.pixel[numtimesrun].colour.b = tempPixel.b / pixelation;
                    //set colour
                }
            }
            counter = 0;
            numtimesrun++;
        }
        cout << x << endl;
    }
    cout << "Image successfully pixelated." << endl;
    return pixelated;
}
I'm not too sure what you really want to do with your code, but I can see a few problems.
For one, you use for() loops with variables starting at 1. That's certainly wrong. Arrays in C/C++ start at 0.
The other main problem I can see is the pixelation parameter. You use it to increase x and y without knowing (at least in that function) whether it divides width and height evenly. If it does not, you will definitely be missing pixels on the right edge and at the bottom (which edges depends on the orientation, of course). Again, it very much depends on what you're trying to achieve.
Also, the i and j loops read at positions defined by counter and numtimesrun, which means the last element you touch is not bounded by tempImage.width or tempImage.height. With that you are rather likely to get many out-of-bounds accesses. Actually, that would also explain the problems you see on the edges. (See the update below.)
Another potential problem, which I cannot tell for sure without seeing the structure declaration: the sums tempPixel.r/g/b += <value> may overflow. If the RGB components are defined as unsigned char (rather common), then you will definitely get overflows, and your averaging is broken. If that structure uses floats, then you're good.
Note also that your average is wrong. You are adding pixelation x pixelation source values, but your average is calculated as sum / pixelation, so you get a result that is pixelation times too large. You probably wanted sum / (pixelation * pixelation).
Your first loop with i and j computes a sum. The math is most certainly wrong: the counter + pixelation * numtimesrun expression starts reading at the second line, it seems, yet you are reading i * j values. That being said, it may be what you are trying to do (i.e. a moving average), in which case it could be optimized, but I'll leave that out for now.
Update
If I understand what you are doing, a representation would be something like a filter. Here is a picture of a 3x3 one:
.+. *
+*+ =>
.+.
What is on the left is what you read. This means the source needs to be at least 3x3. What I show on the right is the result: as we can see, the result is 1x1. From what I see in your code, you do not take that into account at all. (The varied characters represent varied weights; in your case all weights are 1.0.)
You have two ways to handle that problem:
The resulting image has a size of width - pixelation * 2 + 1 by height - pixelation * 2 + 1; in this case you keep one result and do not care about the edges...
You rewrite the code to handle edges. This means you use less source data to compute the resulting edges. Another way is to compute the edge cases and save that in several output pixels (i.e. duplicate the pixels on the edges).
Update 2
Hmmm... looking at your code again, it seems that you compute the average of the 3x3 and save it in the 3x3:
.+. ***
+*+ => ***
.+. ***
Then the problem is different: the numtimesrun is wrong. In your k and l loops you write pixelation * pixelation times into the SAME output pixel, and that index advances by only one per block... so you are doing what I showed in my first update, while it looks like you were trying to do what is shown in this second update.
The numtimesrun could be increased by pixelation each time:
numtimesrun += pixelation;
However, that's not enough to fix your k and l loops. There you probably need to calculate the correct destination. Maybe something like this (also requires a reset of the counter before the loop):
counter = 0;
... for loops ...
pixelated.pixel[counter+pixelation*numtimesrun].colour.r = ...;
... (take care of g and b)
++counter;
Yet again, I cannot tell for sure what you are trying to do, so I do not know why you'd want to copy the same pixel pixelation x pixelation times. But that explains why you get data only at the left (or top) of the image (which side depends on the orientation; one side for sure, and if it is one third of the image, then pixelation is probably 3).
WARNING: if you implement the save properly, you'll experience crashes if you do not take care of the overflows mentioned earlier.
Update 3
As explained by Mark in the comment below, you have an array representing a 2D image. In that case, your counter variable is completely wrong, since it is 100% linear whereas the 2D image is not: the second line starts width elements further on. At this point, you read the first 3 pixels at the top left, then the next 3 pixels on the same line, and finally the next 3 pixels still on the same line. Of course, it could be that your image is defined that way and these pixels really are one after another, although it is not very likely...
Mark's answer is concise and gives you the information necessary to access the correct pixels. However, you will still be hit by the overflow, and possibly by the fact that the width and height parameters are not multiples of pixelation...
I don't do a lot of C++, but here's a pixelate function I wrote for Processing. It takes an argument of the width/height of the pixels you want to create.
void pixelateImage(int pxSize) {
  // use ratio of height/width...
  float ratio;
  if (width < height) {
    ratio = (float) height / width;   // cast to avoid integer division
  }
  else {
    ratio = (float) width / height;
  }
  // ... to set pixel height
  int pxH = int(pxSize * ratio);

  noStroke();
  for (int x = 0; x < width; x += pxSize) {
    for (int y = 0; y < height; y += pxH) {
      fill(p.get(x, y));
      rect(x, y, pxSize, pxH);
    }
  }
}
Without the built-in rect() function you'd have to write pixel-by-pixel using another two for loops:
for (int px = 0; px < pxSize; px++) {
    for (int py = 0; py < pxH; py++) {
        // offset by the block's top-left corner (x, y)
        pixelated.pixel[(y + py) * tempImage.width + (x + px)].colour.r = tempPixel.r;
        pixelated.pixel[(y + py) * tempImage.width + (x + px)].colour.g = tempPixel.g;
        pixelated.pixel[(y + py) * tempImage.width + (x + px)].colour.b = tempPixel.b;
    }
}
Generally when accessing an image stored in a 1D buffer, each row of the image will be stored as consecutive pixels and the next row will follow immediately after. The way to address into such a buffer is:
image[y*width+x]
For your purposes you want both inner loops to generate coordinates that go from the top and left of the pixelation square to the bottom right.
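Putting the two answers together, a hedged sketch of a block-averaging pixelate over the 1D buffer might look like the following. It assumes the picture/RGB types from the question, uses wide accumulators so the sums cannot overflow, and clamps the blocks so that images whose width or height is not a multiple of pixelation still work.
#include <algorithm>   // std::min

// a sketch, not the original function; `picture` and `RGB` come from the question
picture pixelateAveraged(const picture& src, int pixelation)
{
    picture out = src;
    for (int by = 0; by < src.height; by += pixelation)
    {
        for (int bx = 0; bx < src.width; bx += pixelation)
        {
            // clamp so partial blocks at the right/bottom edges are handled
            int bw = std::min(pixelation, src.width - bx);
            int bh = std::min(pixelation, src.height - by);

            long r = 0, g = 0, b = 0;                             // wide accumulators
            for (int y = by; y < by + bh; ++y)
                for (int x = bx; x < bx + bw; ++x)
                {
                    r += src.pixel[y * src.width + x].colour.r;   // row-major: y*width + x
                    g += src.pixel[y * src.width + x].colour.g;
                    b += src.pixel[y * src.width + x].colour.b;
                }

            int count = bw * bh;                                  // divide by the real block size
            for (int y = by; y < by + bh; ++y)
                for (int x = bx; x < bx + bw; ++x)
                {
                    out.pixel[y * out.width + x].colour.r = r / count;
                    out.pixel[y * out.width + x].colour.g = g / count;
                    out.pixel[y * out.width + x].colour.b = b / count;
                }
        }
    }
    return out;
}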

How to copy elements of 2D matrix to 1D array vertically using c++

I have a 2D matrix and I want to copy its values to a 1D array vertically, in an efficient way, as shown below.
Matrix (3x3):
[1 2 3;
4 5 6;
7 8 9]
myarray:
{1,4,7,2,5,8,3,6,9}
Brute force takes 0.25 s for a 1000x750x3 image. I don't want to use a vector because I pass myarray to another function (which I didn't write) as input. So, is there a C++ or OpenCV function that I can use? Note that I'm using the OpenCV library.
Copying the matrix to an array is also fine: I can first take the transpose of the Mat and then copy it to the array.
cv::Mat transposed = myMat.t();
uchar* X = transposed.reshape(1,1).ptr<uchar>(0);
or
int* X = transposed.reshape(1,1).ptr<int>(0);
depending on your matrix type. It might copy data though.
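For completeness, a hedged usage sketch of that reshape route for a CV_8UC3 image, copying explicitly into a caller-owned buffer (since, as noted, the reshape path might copy anyway); the function name and the downstream consumer are placeholders:
#include <opencv2/core.hpp>
#include <cstring>
#include <vector>

std::vector<uchar> toVerticalArray(const cv::Mat& img)   // img assumed to be CV_8UC3
{
    cv::Mat transposed = img.t();                 // rows become columns
    cv::Mat flat = transposed.reshape(1, 1);      // 1 channel, 1 row over the same data
    std::vector<uchar> myarray(flat.total());
    std::memcpy(myarray.data(), flat.ptr<uchar>(0), flat.total());
    return myarray;                               // hand myarray.data() to the existing function
}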
You can optimize this to make it more cache friendly, i.e. you can copy blockwise, keeping track of the positions in myarray where the data should go. The point is that the brute-force approach most likely makes many of the accesses miss the cache, which has a tremendous performance impact. Hence it is better to copy vertically/horizontally while taking the cache line size into account.
The idea is sketched below (I didn't test it, so it most likely has bugs, but it should make the idea clear).
#include <array>
#include <vector>
using namespace std;

struct pixel
{
    char r;
    char g;
    char b;
};

const size_t cachelinesize = 128 / sizeof(pixel); // assumed cache line size of 128 bytes

array<array<pixel, 1000>, 750> matrice;
vector<pixel> vec(1000 * 750);

void copyBlockwise()
{
    for (size_t row = 0; row < matrice.size(); ++row)
        for (size_t col = 0; col < matrice[0].size(); col += cachelinesize)
            for (size_t i = 0; i < cachelinesize && col + i < matrice[0].size(); ++i)
                // destination is in column-major ("vertical") order, as the question asks;
                // check the copy order here, I didn't test it
                vec[(col + i) * matrice.size() + row] = matrice[row][col + i];
}
If you are already processing the matrix before the vertical assignment/query, you can cache the necessary columns as you touch their elements.
//Multiplies and caches
doCalcButCacheVerticalsByTheWay(myMatrix,calcType,myMatrix2,cachedColumns);
instead of
doCalc(myMatrix,calcType,myMatrix2); //Multiplies
then use it like this:
...
tmpVariable=cachedColumns[i];
...
For example, the first function above multiplies the matrix with another one and, whenever the necessary columns are reached, caches them into a temporary array so that you can access their elements in contiguous order later.
I think Mat::reshape is what you want. It does not copy data.

improving performance for graph connectedness computation

I am writing a program to generate a graph and check whether it is connected or not. Below is the code. Here is some explanation: I generate a number of points on the plane at random locations. I then connect the nodes, NOT based on proximity only. By that I mean that a node is more likely to be connected to nodes that are closer, and this is determined by a random variable that I use in the code (h_sq) together with the distance. I generate all links (symmetric, i.e., if i can talk to j then j can talk to i) and then check with a BFS whether the graph is connected.
My problem is that the code seems to work correctly. However, when the number of nodes grows beyond ~2000 it becomes terribly slow, and I need to run this function many times for simulation purposes. I even tried other graph libraries, but the performance is the same.
Does anybody know how could I possibly speed everything up?
Thanks,
int Graph::gen_links() {
    if( save == true ) { // in case I want to store the structure of the graph
        links.clear();
        links.resize(xy.size());
    }
    double h_sq, d;
    vector< vector<luint> > neighbors(xy.size());
    // generate links
    double tmp = snr_lin / gamma_0_lin;
    // xy is a std vector of pairs containing the nodes' locations
    for(luint i = 0; i < xy.size(); i++) {
        for(luint j = i+1; j < xy.size(); j++) {
            // generate |h|^2
            d = distance(i, j);
            if( d < d_crit ) // for sim purposes
                d = 1.0;
            h_sq = pow(mrand.randNorm(0, 1), 2.0) + pow(mrand.randNorm(0, 1), 2.0);
            if( h_sq * tmp >= pow(d, alpha) ) {
                // there exists a link between i and j
                neighbors[i].push_back(j);
                neighbors[j].push_back(i);
                // options
                if( save == true )
                    links.push_back( make_pair(i, j) );
            }
        }
        if( neighbors[i].empty() && save == false ) {
            // graph not connected. since save=false i dont need to store the structure,
            // hence I exit
            connected = 0;
            return 1;
        }
    }
    // here I do BFS to check whether the graph is connected or not, using neighbors
    // BFS code...
    return 1;
}
UPDATE:
the main problem seems to be the push_back calls within the inner for loops. It's the part that takes most of the time in this case. Shall I use reserve() to increase efficiency?
Are you sure the slowness is caused by the generation and not by your search algorithm?
The graph generation is O(n^2) and you can't do much about that. However, you can trade memory for some of the time if the point locations are fixed for at least some of the experiments.
First, the distances of all node pairs, and pow(d, alpha), can be precomputed and saved in memory so that you don't compute them again and again. The extra memory cost for 10,000 nodes is about 800 MB for double and 400 MB for float.
In addition, the sum of the squares of two standard normal variables follows a chi-square distribution with two degrees of freedom, which is an exponential distribution with mean 2. So you could draw h_sq directly from that distribution, or use a precomputed lookup table if the accuracy allows.
Finally, if the probability that two nodes are connected becomes negligible once the distance exceeds some value, then you don't need the full O(n^2): you could consider only node pairs whose distance is below that limit.
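As a hedged illustration of the distribution shortcut above, using C++'s <random> in place of the question's mrand generator:
#include <random>

std::mt19937 rng(42);                                  // seed as appropriate
std::exponential_distribution<double> h_sq_dist(0.5);  // chi-square with 2 dof == Exp(rate 1/2), mean 2

// inside the pair loop, replacing the two randNorm calls and their pow()s:
// h_sq = h_sq_dist(rng);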
As a first step, you should try to use reserve() for both the inner and the outer vectors.
If this does not bring performance up to your expectations, I believe that is because of the memory allocations that are still happening.
There is a handy class I've used in similar situations, llvm::SmallVector (find it via Google). It provides a vector with a few pre-allocated items, so you can avoid one heap allocation per vector.
It can still grow when it runs out of the pre-allocated space.
So:
1) Examine how many items your vectors hold on average during runs (both the inner and the outer vectors).
2) Switch to llvm::SmallVector with a pre-allocation of that size (since the pre-allocated storage lives inside the object, which may sit on the stack, you might need to increase the stack size, or reduce the pre-allocation if available stack memory is tight).
Another good thing about SmallVector is that it has almost the same interface as std::vector, so it can easily be dropped in as a replacement.
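A minimal sketch of the reserve() step applied to the question's code, with the expected average degree as an assumed tuning parameter (llvm::SmallVector<luint, N> would replace the inner std::vector in the same spirit):
// inside Graph::gen_links(), when creating the adjacency structure
const std::size_t expected_degree = 16;                 // measure this from typical runs
vector< vector<luint> > neighbors(xy.size());
for (auto& nb : neighbors)
    nb.reserve(expected_degree);                        // one up-front allocation per node
if (save)
    links.reserve(xy.size() * expected_degree / 2);     // undirected: each link stored once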